* [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support
@ 2020-03-19  9:12 Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                   ` (69 more replies)
  0 siblings, 70 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:12 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

Hi,

here is an updated version of the patch-set to enable Linux to run as a
guest in an SEV-ES enabled Hypervisor. The first version can be found
here:

	https://lore.kernel.org/lkml/20200211135256.24617-1-joro@8bytes.org/

The first post also includes a more elaborate description of the
implementation requirements and details.  A branch containing these
patches is here:

	https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=sev-es-client-v5.6-rc6

There are lots of small changes since the first version. Here is a list
of the major ones, which address most of the valuable review comments I
received. Thanks for those!

Changes since v1:

	- Rebased to v5.6-rc6

	- Factored out instruction decoding part of the UMIP handler and
	  re-used it in the SEV-ES code.

	- Several enhancements of the instruction decoder as needed by
	  SEV-ES

	- The instruction fetch and memory access code for instruction
	  emulation now handles different user execution modes as well
	  as segment bases.

	- Added emulation of (REP) MOVS instructions

	- Added handling for nesting #VC handlers - which fixed the NMI
	  issues.

	- Pass error_code as a parameter to the #VC exception handlers

	- Reworked early exception dispatch function

	- Moved the GHCB pages out of the per-cpu areas and only
	  allocate them when they are actually needed. The per-cpu areas
	  only store a pointer now.

	- Removed emulation for INVD, now it will just cause an error if
	  used.

	- Added prefixes to the function names.

	- Fixed a bug which broke bare-metal boot with mem_encrypt=on

The last missing change I have on my list is to rework the NMI handling
patch. I decided to postpone this until Thomas Gleixner's rework of the
x86 entry code is ready and merged, because the NMI handling will
conflict with these changes.

Please review.

Thanks,

	Joerg

Doug Covelli (1):
  x86/vmware: Add VMware specific handling for VMMCALL under SEV-ES

Joerg Roedel (51):
  KVM: SVM: Add GHCB Accessor functions
  x86/traps: Move some definitions to <asm/trap_defs.h>
  x86/insn: Make inat-tables.c suitable for pre-decompression code
  x86/umip: Factor out instruction fetch
  x86/umip: Factor out instruction decoding
  x86/insn: Add insn_get_modrm_reg_off()
  x86/insn: Add insn_rep_prefix() helper
  x86/boot/compressed: Fix debug_puthex() parameter type
  x86/boot/compressed/64: Disable red-zone usage
  x86/boot/compressed/64: Add IDT Infrastructure
  x86/boot/compressed/64: Rename kaslr_64.c to ident_map_64.c
  x86/boot/compressed/64: Add page-fault handler
  x86/boot/compressed/64: Always switch to own page-table
  x86/boot/compressed/64: Don't pre-map memory in KASLR code
  x86/boot/compressed/64: Change add_identity_map() to take start and
    end
  x86/boot/compressed/64: Add stage1 #VC handler
  x86/boot/compressed/64: Call set_sev_encryption_mask earlier
  x86/boot/compressed/64: Check return value of
    kernel_ident_mapping_init()
  x86/boot/compressed/64: Add function to map a page unencrypted
  x86/boot/compressed/64: Setup GHCB Based VC Exception handler
  x86/fpu: Move xgetbv()/xsetbv() into separate header
  x86/idt: Move IDT to data segment
  x86/idt: Split idt_data setup out of set_intr_gate()
  x86/idt: Move two function from k/idt.c to i/a/desc.h
  x86/head/64: Install boot GDT
  x86/head/64: Reload GDT after switch to virtual addresses
  x86/head/64: Load segment registers earlier
  x86/head/64: Switch to initial stack earlier
  x86/head/64: Build k/head64.c with -fno-stack-protector
  x86/head/64: Load IDT earlier
  x86/head/64: Move early exception dispatch to C code
  x86/sev-es: Add SEV-ES Feature Detection
  x86/sev-es: Compile early handler code into kernel image
  x86/sev-es: Setup early #VC handler
  x86/sev-es: Setup GHCB based boot #VC handler
  x86/sev-es: Support nested #VC exceptions
  x86/sev-es: Wire up existing #VC exit-code handlers
  x86/sev-es: Handle instruction fetches from user-space
  x86/sev-es: Harden runtime #VC handler for exceptions from user-space
  x86/sev-es: Filter exceptions not supported from user-space
  x86/sev-es: Handle MMIO String Instructions
  x86/sev-es: Handle RDTSCP Events
  x86/sev-es: Handle #AC Events
  x86/sev-es: Handle #DB Events
  x86/paravirt: Allow hypervisor specific VMMCALL handling under SEV-ES
  x86/realmode: Add SEV-ES specific trampoline entry point
  x86/head/64: Don't call verify_cpu() on starting APs
  x86/head/64: Rename start_cpu0
  x86/sev-es: Support CPU offline/online
  x86/cpufeature: Add SEV_ES_GUEST CPU Feature
  x86/sev-es: Add NMI state tracking

Tom Lendacky (18):
  KVM: SVM: Add GHCB definitions
  x86/cpufeatures: Add SEV-ES CPU feature
  x86/sev-es: Add support for handling IOIO exceptions
  x86/sev-es: Add CPUID handling to #VC handler
  x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  x86/sev-es: Add Runtime #VC Exception Handler
  x86/sev-es: Handle MMIO events
  x86/sev-es: Handle MSR events
  x86/sev-es: Handle DR7 read/write events
  x86/sev-es: Handle WBINVD Events
  x86/sev-es: Handle RDTSC Events
  x86/sev-es: Handle RDPMC Events
  x86/sev-es: Handle INVD Events
  x86/sev-es: Handle MONITOR/MONITORX Events
  x86/sev-es: Handle MWAIT/MWAITX Events
  x86/sev-es: Handle VMMCALL Events
  x86/kvm: Add KVM specific VMMCALL handling under SEV-ES
  x86/realmode: Setup AP jump table

 arch/x86/Kconfig                           |    1 +
 arch/x86/boot/Makefile                     |    2 +-
 arch/x86/boot/compressed/Makefile          |    8 +-
 arch/x86/boot/compressed/head_64.S         |   41 +
 arch/x86/boot/compressed/ident_map_64.c    |  320 ++++++
 arch/x86/boot/compressed/idt_64.c          |   53 +
 arch/x86/boot/compressed/idt_handlers_64.S |   82 ++
 arch/x86/boot/compressed/kaslr.c           |   36 +-
 arch/x86/boot/compressed/kaslr_64.c        |  153 ---
 arch/x86/boot/compressed/misc.h            |   34 +-
 arch/x86/boot/compressed/sev-es.c          |  177 +++
 arch/x86/entry/entry_64.S                  |   52 +
 arch/x86/include/asm/cpu.h                 |    2 +-
 arch/x86/include/asm/cpufeatures.h         |    2 +
 arch/x86/include/asm/desc.h                |   28 +
 arch/x86/include/asm/desc_defs.h           |   10 +
 arch/x86/include/asm/fpu/internal.h        |   29 +-
 arch/x86/include/asm/fpu/xcr.h             |   32 +
 arch/x86/include/asm/insn-eval.h           |    6 +
 arch/x86/include/asm/mem_encrypt.h         |    5 +
 arch/x86/include/asm/msr-index.h           |    3 +
 arch/x86/include/asm/pgtable.h             |    2 +-
 arch/x86/include/asm/processor.h           |    1 +
 arch/x86/include/asm/realmode.h            |    4 +
 arch/x86/include/asm/segment.h             |    2 +-
 arch/x86/include/asm/setup.h               |    1 -
 arch/x86/include/asm/sev-es.h              |  119 ++
 arch/x86/include/asm/svm.h                 |  103 ++
 arch/x86/include/asm/trap_defs.h           |   50 +
 arch/x86/include/asm/traps.h               |   51 +-
 arch/x86/include/asm/x86_init.h            |   16 +-
 arch/x86/include/uapi/asm/svm.h            |   11 +
 arch/x86/kernel/Makefile                   |    5 +
 arch/x86/kernel/cpu/amd.c                  |    9 +-
 arch/x86/kernel/cpu/scattered.c            |    1 +
 arch/x86/kernel/cpu/vmware.c               |   50 +-
 arch/x86/kernel/head64.c                   |   57 +-
 arch/x86/kernel/head_32.S                  |    4 +-
 arch/x86/kernel/head_64.S                  |  169 ++-
 arch/x86/kernel/idt.c                      |   52 +-
 arch/x86/kernel/kvm.c                      |   35 +-
 arch/x86/kernel/nmi.c                      |    8 +
 arch/x86/kernel/sev-es-shared.c            |  444 ++++++++
 arch/x86/kernel/sev-es.c                   | 1165 ++++++++++++++++++++
 arch/x86/kernel/smpboot.c                  |    4 +-
 arch/x86/kernel/traps.c                    |    3 +
 arch/x86/kernel/umip.c                     |   49 +-
 arch/x86/lib/insn-eval.c                   |  130 +++
 arch/x86/mm/extable.c                      |    1 +
 arch/x86/mm/mem_encrypt.c                  |   11 +-
 arch/x86/mm/mem_encrypt_identity.c         |    3 +
 arch/x86/realmode/init.c                   |   12 +
 arch/x86/realmode/rm/header.S              |    3 +
 arch/x86/realmode/rm/trampoline_64.S       |   20 +
 arch/x86/tools/gen-insn-attr-x86.awk       |   50 +-
 tools/arch/x86/tools/gen-insn-attr-x86.awk |   50 +-
 56 files changed, 3352 insertions(+), 419 deletions(-)
 create mode 100644 arch/x86/boot/compressed/ident_map_64.c
 create mode 100644 arch/x86/boot/compressed/idt_64.c
 create mode 100644 arch/x86/boot/compressed/idt_handlers_64.S
 delete mode 100644 arch/x86/boot/compressed/kaslr_64.c
 create mode 100644 arch/x86/boot/compressed/sev-es.c
 create mode 100644 arch/x86/include/asm/fpu/xcr.h
 create mode 100644 arch/x86/include/asm/sev-es.h
 create mode 100644 arch/x86/include/asm/trap_defs.h
 create mode 100644 arch/x86/kernel/sev-es-shared.c
 create mode 100644 arch/x86/kernel/sev-es.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 01/70] KVM: SVM: Add GHCB definitions
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:12   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:12 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Extend the vmcb_save_area with SEV-ES fields and add a new
'struct ghcb' which will be used for guest-hypervisor communication.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/svm.h | 42 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 6ece8561ba66..f36288c659b5 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -201,6 +201,48 @@ struct __attribute__ ((__packed__)) vmcb_save_area {
 	u64 br_to;
 	u64 last_excp_from;
 	u64 last_excp_to;
+
+	/*
+	 * The following part of the save area is valid only for
+	 * SEV-ES guests when referenced through the GHCB.
+	 */
+	u8 reserved_7[104];
+	u64 reserved_8;		/* rax already available at 0x01f8 */
+	u64 rcx;
+	u64 rdx;
+	u64 rbx;
+	u64 reserved_9;		/* rsp already available at 0x01d8 */
+	u64 rbp;
+	u64 rsi;
+	u64 rdi;
+	u64 r8;
+	u64 r9;
+	u64 r10;
+	u64 r11;
+	u64 r12;
+	u64 r13;
+	u64 r14;
+	u64 r15;
+	u8 reserved_10[16];
+	u64 sw_exit_code;
+	u64 sw_exit_info_1;
+	u64 sw_exit_info_2;
+	u64 sw_scratch;
+	u8 reserved_11[56];
+	u64 xcr0;
+	u8 valid_bitmap[16];
+	u64 x87_state_gpa;
+	u8 reserved_12[1016];
+};
+
+struct __attribute__ ((__packed__)) ghcb {
+	struct vmcb_save_area save;
+
+	u8 shared_buffer[2032];
+
+	u8 reserved_1[10];
+	u16 protocol_version;	/* negotiated SEV-ES/GHCB protocol version */
+	u32 ghcb_usage;
 };
 
 struct __attribute__ ((__packed__)) vmcb {
-- 
2.17.1
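
As a rough illustration of the layout above (not part of the patch): the
fields that follow the save area in 'struct ghcb' add up to 2048 bytes.
If the extended save area is also padded to 2048 bytes, the whole GHCB
fits one 4096-byte page. That padding is an assumption here, since the
leading fields of vmcb_save_area are not shown in this hunk. The stand-in
type 'fake_save_area' and the name 'ghcb_sketch' are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: ASSUMES the extended save area is 2048 bytes. */
struct __attribute__((__packed__)) fake_save_area {
	uint8_t bytes[2048];
};

/* Mirrors the field order of 'struct ghcb' from the patch. */
struct __attribute__((__packed__)) ghcb_sketch {
	struct fake_save_area save;

	uint8_t  shared_buffer[2032];

	uint8_t  reserved_1[10];
	uint16_t protocol_version;
	uint32_t ghcb_usage;
};

/* 2048 + 2032 + 10 + 2 + 4 = 4096, checked at compile time. */
static_assert(offsetof(struct ghcb_sketch, shared_buffer) == 0x800,
	      "shared buffer starts right after the save area");
static_assert(offsetof(struct ghcb_sketch, protocol_version) == 0xffa,
	      "protocol_version follows 10 reserved bytes");
static_assert(offsetof(struct ghcb_sketch, ghcb_usage) == 0xffc,
	      "ghcb_usage is the last 4 bytes");
static_assert(sizeof(struct ghcb_sketch) == 4096,
	      "sketch occupies exactly one 4k page");
```

The packed attribute matters for the sketch in the same way as for the
patch: without it the compiler would be free to insert padding before
the 16-bit and 32-bit members.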


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 02/70] KVM: SVM: Add GHCB Accessor functions
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:12   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:12 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Building a correct GHCB for the hypervisor requires setting valid bits
in the GHCB. Simplify that process by providing accessor functions to
set values and to update the valid bitmap.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/svm.h | 61 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index f36288c659b5..e4e9f6bacfaa 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -333,4 +333,65 @@ struct __attribute__ ((__packed__)) vmcb {
 
 #define SVM_CR0_SELECTIVE_MASK (X86_CR0_TS | X86_CR0_MP)
 
+/* GHCB Accessor functions */
+
+#define DEFINE_GHCB_INDICES(field)					\
+	u16 idx = offsetof(struct vmcb_save_area, field) / 8;		\
+	u16 byte_idx  = idx / 8;					\
+	u16 bit_idx   = idx % 8;					\
+	BUILD_BUG_ON(byte_idx > ARRAY_SIZE(ghcb->save.valid_bitmap));
+
+#define GHCB_SET_VALID(ghcb, field)					\
+	{								\
+		DEFINE_GHCB_INDICES(field)				\
+		(ghcb)->save.valid_bitmap[byte_idx] |= BIT(bit_idx);	\
+	}
+
+#define DEFINE_GHCB_SETTER(field)					\
+	static inline void						\
+	ghcb_set_##field(struct ghcb *ghcb, u64 value)			\
+	{								\
+		GHCB_SET_VALID(ghcb, field)				\
+		(ghcb)->save.field = value;				\
+	}
+
+#define DEFINE_GHCB_ACCESSORS(field)					\
+	static inline bool ghcb_is_valid_##field(const struct ghcb *ghcb)	\
+	{								\
+		DEFINE_GHCB_INDICES(field)				\
+		return !!((ghcb)->save.valid_bitmap[byte_idx]		\
+						& BIT(bit_idx));	\
+	}								\
+									\
+	static inline void						\
+	ghcb_set_##field(struct ghcb *ghcb, u64 value)			\
+	{								\
+		GHCB_SET_VALID(ghcb, field)				\
+		(ghcb)->save.field = value;				\
+	}
+
+DEFINE_GHCB_ACCESSORS(cpl)
+DEFINE_GHCB_ACCESSORS(rip)
+DEFINE_GHCB_ACCESSORS(rsp)
+DEFINE_GHCB_ACCESSORS(rax)
+DEFINE_GHCB_ACCESSORS(rcx)
+DEFINE_GHCB_ACCESSORS(rdx)
+DEFINE_GHCB_ACCESSORS(rbx)
+DEFINE_GHCB_ACCESSORS(rbp)
+DEFINE_GHCB_ACCESSORS(rsi)
+DEFINE_GHCB_ACCESSORS(rdi)
+DEFINE_GHCB_ACCESSORS(r8)
+DEFINE_GHCB_ACCESSORS(r9)
+DEFINE_GHCB_ACCESSORS(r10)
+DEFINE_GHCB_ACCESSORS(r11)
+DEFINE_GHCB_ACCESSORS(r12)
+DEFINE_GHCB_ACCESSORS(r13)
+DEFINE_GHCB_ACCESSORS(r14)
+DEFINE_GHCB_ACCESSORS(r15)
+DEFINE_GHCB_ACCESSORS(sw_exit_code)
+DEFINE_GHCB_ACCESSORS(sw_exit_info_1)
+DEFINE_GHCB_ACCESSORS(sw_exit_info_2)
+DEFINE_GHCB_ACCESSORS(sw_scratch)
+DEFINE_GHCB_ACCESSORS(xcr0)
+
 #endif
-- 
2.17.1
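
The index arithmetic behind these accessors can be tried outside the
kernel. The sketch below is a user-space illustration, not the code from
the patch: 'mini_save_area' is a hypothetical four-field stand-in for
vmcb_save_area, and set_valid(), is_valid() and mini_set_rdx() are
made-up helper names. Each 8-byte slot of the save area maps to one bit
of the bitmap. The bound check in the sketch uses '<' so that only
indices inside the array are accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature save area: four quadwords plus a bitmap. */
struct mini_save_area {
	uint64_t rax;		/* offset  0 -> quadword 0 */
	uint64_t rcx;		/* offset  8 -> quadword 1 */
	uint64_t rdx;		/* offset 16 -> quadword 2 */
	uint64_t sw_exit_code;	/* offset 24 -> quadword 3 */
	uint8_t  valid_bitmap[16];
};

/* One bit per 8-byte slot: byte = slot / 8, bit = slot % 8. */
static void set_valid(struct mini_save_area *s, size_t offset)
{
	size_t idx = offset / 8;

	assert(idx / 8 < sizeof(s->valid_bitmap));
	s->valid_bitmap[idx / 8] |= (uint8_t)(1u << (idx % 8));
}

static bool is_valid(const struct mini_save_area *s, size_t offset)
{
	size_t idx = offset / 8;

	assert(idx / 8 < sizeof(s->valid_bitmap));
	return (s->valid_bitmap[idx / 8] >> (idx % 8)) & 1u;
}

/* Setter in the spirit of ghcb_set_##field(): mark valid, then store. */
static void mini_set_rdx(struct mini_save_area *s, uint64_t value)
{
	set_valid(s, offsetof(struct mini_save_area, rdx));
	s->rdx = value;
}
```

Calling mini_set_rdx() on a zeroed structure sets bit 2 of
valid_bitmap[0], because rdx is the third quadword in this miniature
layout.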


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 03/70] x86/cpufeatures: Add SEV-ES CPU feature
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Add CPU feature detection for Secure Encrypted Virtualization with
Encrypted State. This feature enhances SEV by also encrypting the
guest register state, making it inaccessible to the hypervisor.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/kernel/cpu/amd.c          | 3 ++-
 arch/x86/kernel/cpu/scattered.c    | 1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index f3327cb56edf..2fee1a2cac2f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -234,6 +234,7 @@
 #define X86_FEATURE_EPT_AD		( 8*32+17) /* Intel Extended Page Table access-dirty bit */
 #define X86_FEATURE_VMCALL		( 8*32+18) /* "" Hypervisor supports the VMCALL instruction */
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
+#define X86_FEATURE_SEV_ES		( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 1f875fbe1384..523a6a76c6c1 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -581,7 +581,7 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 	 *	      If BIOS has not enabled SME then don't advertise the
 	 *	      SME feature (set in scattered.c).
 	 *   For SEV: If BIOS has not enabled SEV then don't advertise the
-	 *            SEV feature (set in scattered.c).
+	 *            SEV and SEV_ES feature (set in scattered.c).
 	 *
 	 *   In all cases, since support for SME and SEV requires long mode,
 	 *   don't advertise the feature under CONFIG_X86_32.
@@ -612,6 +612,7 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 		setup_clear_cpu_cap(X86_FEATURE_SME);
 clear_sev:
 		setup_clear_cpu_cap(X86_FEATURE_SEV);
+		setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
 	}
 }
 
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 62b137c3c97a..30f354989cf1 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -41,6 +41,7 @@ static const struct cpuid_bit cpuid_bits[] = {
 	{ X86_FEATURE_MBA,		CPUID_EBX,  6, 0x80000008, 0 },
 	{ X86_FEATURE_SME,		CPUID_EAX,  0, 0x8000001f, 0 },
 	{ X86_FEATURE_SEV,		CPUID_EAX,  1, 0x8000001f, 0 },
+	{ X86_FEATURE_SEV_ES,		CPUID_EAX,  3, 0x8000001f, 0 },
 	{ 0, 0, 0, 0, 0 }
 };
 
-- 
2.17.1
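
To illustrate how the new table entry is read: the cpuid_bits[] rows in
the patch place SME, SEV and SEV-ES at bits 0, 1 and 3 of EAX for CPUID
leaf 0x8000001f. The sketch below only decodes an EAX value passed in by
the caller and does not execute CPUID itself. The names 'sev_caps',
decode_sev_caps(), feature_word() and feature_bit() are hypothetical.
The BIOS-enablement check from early_detect_mem_encrypt() is not
modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in CPUID 0x8000001f EAX, taken from the cpuid_bits[] rows. */
#define SME_BIT		0
#define SEV_BIT		1
#define SEV_ES_BIT	3

struct sev_caps {
	bool sme;
	bool sev;
	bool sev_es;
};

/* Decode a raw EAX value; the caller is responsible for obtaining it. */
static struct sev_caps decode_sev_caps(uint32_t eax)
{
	struct sev_caps caps = {
		.sme    = (eax >> SME_BIT) & 1u,
		.sev    = (eax >> SEV_BIT) & 1u,
		.sev_es = (eax >> SEV_ES_BIT) & 1u,
	};

	return caps;
}

/* Feature numbers pack word and bit as word*32 + bit, e.g. ( 8*32+20). */
static unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

static unsigned int feature_bit(unsigned int feature)
{
	return feature % 32;
}
```

For example, an EAX value of 0x0b has bits 0, 1 and 3 set, so all three
flags decode as true, and ( 8*32+20) splits into word 8, bit 20.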


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 04/70] x86/traps: Move some definitions to <asm/trap_defs.h>
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (2 preceding siblings ...)
  2020-03-19  9:13   ` Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13   ` Joerg Roedel
                   ` (65 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Move the definition of x86 trap vector numbers and the page-fault
error code bits to the new header file asm/trap_defs.h. This makes it
easier to include them into pre-decompression boot code. No functional
changes.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/trap_defs.h | 49 ++++++++++++++++++++++++++++++++
 arch/x86/include/asm/traps.h     | 44 +---------------------------
 2 files changed, 50 insertions(+), 43 deletions(-)
 create mode 100644 arch/x86/include/asm/trap_defs.h

diff --git a/arch/x86/include/asm/trap_defs.h b/arch/x86/include/asm/trap_defs.h
new file mode 100644
index 000000000000..488f82ac36da
--- /dev/null
+++ b/arch/x86/include/asm/trap_defs.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_TRAP_DEFS_H
+#define _ASM_X86_TRAP_DEFS_H
+
+/* Interrupts/Exceptions */
+enum {
+	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
+	X86_TRAP_DB,		/*  1, Debug */
+	X86_TRAP_NMI,		/*  2, Non-maskable Interrupt */
+	X86_TRAP_BP,		/*  3, Breakpoint */
+	X86_TRAP_OF,		/*  4, Overflow */
+	X86_TRAP_BR,		/*  5, Bound Range Exceeded */
+	X86_TRAP_UD,		/*  6, Invalid Opcode */
+	X86_TRAP_NM,		/*  7, Device Not Available */
+	X86_TRAP_DF,		/*  8, Double Fault */
+	X86_TRAP_OLD_MF,	/*  9, Coprocessor Segment Overrun */
+	X86_TRAP_TS,		/* 10, Invalid TSS */
+	X86_TRAP_NP,		/* 11, Segment Not Present */
+	X86_TRAP_SS,		/* 12, Stack Segment Fault */
+	X86_TRAP_GP,		/* 13, General Protection Fault */
+	X86_TRAP_PF,		/* 14, Page Fault */
+	X86_TRAP_SPURIOUS,	/* 15, Spurious Interrupt */
+	X86_TRAP_MF,		/* 16, x87 Floating-Point Exception */
+	X86_TRAP_AC,		/* 17, Alignment Check */
+	X86_TRAP_MC,		/* 18, Machine Check */
+	X86_TRAP_XF,		/* 19, SIMD Floating-Point Exception */
+	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
+};
+
+/*
+ * Page fault error code bits:
+ *
+ *   bit 0 ==	 0: no page found	1: protection fault
+ *   bit 1 ==	 0: read access		1: write access
+ *   bit 2 ==	 0: kernel-mode access	1: user-mode access
+ *   bit 3 ==				1: use of reserved bit detected
+ *   bit 4 ==				1: fault was an instruction fetch
+ *   bit 5 ==				1: protection keys block access
+ */
+enum x86_pf_error_code {
+	X86_PF_PROT	=		1 << 0,
+	X86_PF_WRITE	=		1 << 1,
+	X86_PF_USER	=		1 << 2,
+	X86_PF_RSVD	=		1 << 3,
+	X86_PF_INSTR	=		1 << 4,
+	X86_PF_PK	=		1 << 5,
+};
+
+#endif /* _ASM_X86_TRAP_DEFS_H */
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index ffa0dc8a535e..2aa786484bb1 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -5,6 +5,7 @@
 #include <linux/context_tracking_state.h>
 #include <linux/kprobes.h>
 
+#include <asm/trap_defs.h>
 #include <asm/debugreg.h>
 #include <asm/siginfo.h>			/* TRAP_TRACE, ... */
 
@@ -132,47 +133,4 @@ void __noreturn handle_stack_overflow(const char *message,
 				      unsigned long fault_address);
 #endif
 
-/* Interrupts/Exceptions */
-enum {
-	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
-	X86_TRAP_DB,		/*  1, Debug */
-	X86_TRAP_NMI,		/*  2, Non-maskable Interrupt */
-	X86_TRAP_BP,		/*  3, Breakpoint */
-	X86_TRAP_OF,		/*  4, Overflow */
-	X86_TRAP_BR,		/*  5, Bound Range Exceeded */
-	X86_TRAP_UD,		/*  6, Invalid Opcode */
-	X86_TRAP_NM,		/*  7, Device Not Available */
-	X86_TRAP_DF,		/*  8, Double Fault */
-	X86_TRAP_OLD_MF,	/*  9, Coprocessor Segment Overrun */
-	X86_TRAP_TS,		/* 10, Invalid TSS */
-	X86_TRAP_NP,		/* 11, Segment Not Present */
-	X86_TRAP_SS,		/* 12, Stack Segment Fault */
-	X86_TRAP_GP,		/* 13, General Protection Fault */
-	X86_TRAP_PF,		/* 14, Page Fault */
-	X86_TRAP_SPURIOUS,	/* 15, Spurious Interrupt */
-	X86_TRAP_MF,		/* 16, x87 Floating-Point Exception */
-	X86_TRAP_AC,		/* 17, Alignment Check */
-	X86_TRAP_MC,		/* 18, Machine Check */
-	X86_TRAP_XF,		/* 19, SIMD Floating-Point Exception */
-	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
-};
-
-/*
- * Page fault error code bits:
- *
- *   bit 0 ==	 0: no page found	1: protection fault
- *   bit 1 ==	 0: read access		1: write access
- *   bit 2 ==	 0: kernel-mode access	1: user-mode access
- *   bit 3 ==				1: use of reserved bit detected
- *   bit 4 ==				1: fault was an instruction fetch
- *   bit 5 ==				1: protection keys block access
- */
-enum x86_pf_error_code {
-	X86_PF_PROT	=		1 << 0,
-	X86_PF_WRITE	=		1 << 1,
-	X86_PF_USER	=		1 << 2,
-	X86_PF_RSVD	=		1 << 3,
-	X86_PF_INSTR	=		1 << 4,
-	X86_PF_PK	=		1 << 5,
-};
 #endif /* _ASM_X86_TRAPS_H */
-- 
2.17.1
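
A small stand-alone sketch of how the page-fault error code bits moved
by this patch can be tested. The enum values are copied from the patch,
with an SK_ prefix to mark them as local copies. The helper names
pf_is_user_write() and pf_is_kernel_ifetch() are hypothetical and exist
only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the X86_PF_* bits from the patch. */
enum sk_pf_error_code {
	SK_PF_PROT	= 1 << 0,	/* 0: no page found, 1: protection fault */
	SK_PF_WRITE	= 1 << 1,	/* 0: read access,   1: write access */
	SK_PF_USER	= 1 << 2,	/* 0: kernel-mode,   1: user-mode */
	SK_PF_RSVD	= 1 << 3,	/* reserved bit detected */
	SK_PF_INSTR	= 1 << 4,	/* fault was an instruction fetch */
	SK_PF_PK	= 1 << 5,	/* protection keys block access */
};

/* True if the fault was caused by a write from user mode. */
static bool pf_is_user_write(unsigned long error_code)
{
	unsigned long mask = SK_PF_USER | SK_PF_WRITE;

	return (error_code & mask) == mask;
}

/* True if the fault was an instruction fetch in kernel mode. */
static bool pf_is_kernel_ifetch(unsigned long error_code)
{
	return (error_code & SK_PF_INSTR) && !(error_code & SK_PF_USER);
}
```

Because the bits live in a header with no other dependencies, code of
this kind can be built in environments that cannot include the full
asm/traps.h, which is the motivation given in the commit message.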


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 05/70] x86/insn: Make inat-tables.c suitable for pre-decompression code
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The inat-tables.c file has some arrays in it that contain pointers to
other arrays. These pointers need to be relocated when the kernel
image is moved to a different location.

The pre-decompression boot-code has no support for applying ELF
relocations, so initialize these arrays at runtime in the
pre-decompression code to make sure all pointers are correctly
initialized.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
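Note (illustration only, not part of the patch): the difference between
the two generated forms can be sketched in plain user-space C as below.
All demo_* names are hypothetical stand-ins and are not what the awk
script emits. The sketch assumes the usual situation where a pointer
stored in a static initializer needs a relocation entry, while an
address taken in code (typically built position-independent) is
computed at run time.

```c
#include <stddef.h>

typedef unsigned int demo_attr_t;	/* hypothetical stand-in for insn_attr_t */

#define DEMO_ESC_MAX	1
#define DEMO_LSTPFX_MAX	1

/* Leaf tables: plain data, no pointers inside, nothing to relocate. */
static const demo_attr_t demo_escape_table_1[4]   = { 1, 2, 3, 4 };
static const demo_attr_t demo_escape_table_1_1[4] = { 5, 6, 7, 8 };

/*
 * Regular form: the pointers live in the initializer, so the stored
 * addresses have to be fixed up if the image is moved.
 */
static const demo_attr_t * const
demo_tables_static[DEMO_ESC_MAX + 1][DEMO_LSTPFX_MAX + 1] = {
	[1][0] = demo_escape_table_1,
	[1][1] = demo_escape_table_1_1,
};

/*
 * Pre-decompression form: the array starts out zeroed and is filled in
 * by code, so the addresses are computed at run time.
 */
static const demo_attr_t *
demo_tables_runtime[DEMO_ESC_MAX + 1][DEMO_LSTPFX_MAX + 1];

static void demo_init_tables(void)
{
	demo_tables_runtime[1][0] = demo_escape_table_1;
	demo_tables_runtime[1][1] = demo_escape_table_1_1;
}
```

After demo_init_tables() has run, both arrays hold the same pointers;
slots that were never assigned stay NULL in either form.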
 arch/x86/tools/gen-insn-attr-x86.awk       | 50 +++++++++++++++++++++-
 tools/arch/x86/tools/gen-insn-attr-x86.awk | 50 +++++++++++++++++++++-
 2 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/arch/x86/tools/gen-insn-attr-x86.awk b/arch/x86/tools/gen-insn-attr-x86.awk
index a42015b305f4..af38469afd14 100644
--- a/arch/x86/tools/gen-insn-attr-x86.awk
+++ b/arch/x86/tools/gen-insn-attr-x86.awk
@@ -362,6 +362,9 @@ function convert_operands(count,opnd,       i,j,imm,mod)
 END {
 	if (awkchecked != "")
 		exit 1
+
+	print "#ifndef __BOOT_COMPRESSED\n"
+
 	# print escape opcode map's array
 	print "/* Escape opcode map array */"
 	print "const insn_attr_t * const inat_escape_tables[INAT_ESC_MAX + 1]" \
@@ -388,6 +391,51 @@ END {
 		for (j = 0; j < max_lprefix; j++)
 			if (atable[i,j])
 				print "	["i"]["j"] = "atable[i,j]","
-	print "};"
+	print "};\n"
+
+	print "#else /* !__BOOT_COMPRESSED */\n"
+
+	print "/* Escape opcode map array */"
+	print "static const insn_attr_t *inat_escape_tables[INAT_ESC_MAX + 1]" \
+	      "[INAT_LSTPFX_MAX + 1];"
+	print ""
+
+	print "/* Group opcode map array */"
+	print "static const insn_attr_t *inat_group_tables[INAT_GRP_MAX + 1]"\
+	      "[INAT_LSTPFX_MAX + 1];"
+	print ""
+
+	print "/* AVX opcode map array */"
+	print "static const insn_attr_t *inat_avx_tables[X86_VEX_M_MAX + 1]"\
+	      "[INAT_LSTPFX_MAX + 1];"
+	print ""
+
+	print "static void inat_init_tables(void)"
+	print "{"
+
+	# print escape opcode map's array
+	print "\t/* Print Escape opcode map array */"
+	for (i = 0; i < geid; i++)
+		for (j = 0; j < max_lprefix; j++)
+			if (etable[i,j])
+				print "\tinat_escape_tables["i"]["j"] = "etable[i,j]";"
+	print ""
+
+	# print group opcode map's array
+	print "\t/* Print Group opcode map array */"
+	for (i = 0; i < ggid; i++)
+		for (j = 0; j < max_lprefix; j++)
+			if (gtable[i,j])
+				print "\tinat_group_tables["i"]["j"] = "gtable[i,j]";"
+	print ""
+	# print AVX opcode map's array
+	print "\t/* Print AVX opcode map array */"
+	for (i = 0; i < gaid; i++)
+		for (j = 0; j < max_lprefix; j++)
+			if (atable[i,j])
+				print "\tinat_avx_tables["i"]["j"] = "atable[i,j]";"
+
+	print "}"
+	print "#endif"
 }
 
diff --git a/tools/arch/x86/tools/gen-insn-attr-x86.awk b/tools/arch/x86/tools/gen-insn-attr-x86.awk
index a42015b305f4..af38469afd14 100644
--- a/tools/arch/x86/tools/gen-insn-attr-x86.awk
+++ b/tools/arch/x86/tools/gen-insn-attr-x86.awk
@@ -362,6 +362,9 @@ function convert_operands(count,opnd,       i,j,imm,mod)
 END {
 	if (awkchecked != "")
 		exit 1
+
+	print "#ifndef __BOOT_COMPRESSED\n"
+
 	# print escape opcode map's array
 	print "/* Escape opcode map array */"
 	print "const insn_attr_t * const inat_escape_tables[INAT_ESC_MAX + 1]" \
@@ -388,6 +391,51 @@ END {
 		for (j = 0; j < max_lprefix; j++)
 			if (atable[i,j])
 				print "	["i"]["j"] = "atable[i,j]","
-	print "};"
+	print "};\n"
+
+	print "#else /* !__BOOT_COMPRESSED */\n"
+
+	print "/* Escape opcode map array */"
+	print "static const insn_attr_t *inat_escape_tables[INAT_ESC_MAX + 1]" \
+	      "[INAT_LSTPFX_MAX + 1];"
+	print ""
+
+	print "/* Group opcode map array */"
+	print "static const insn_attr_t *inat_group_tables[INAT_GRP_MAX + 1]"\
+	      "[INAT_LSTPFX_MAX + 1];"
+	print ""
+
+	print "/* AVX opcode map array */"
+	print "static const insn_attr_t *inat_avx_tables[X86_VEX_M_MAX + 1]"\
+	      "[INAT_LSTPFX_MAX + 1];"
+	print ""
+
+	print "static void inat_init_tables(void)"
+	print "{"
+
+	# print escape opcode map's array
+	print "\t/* Print Escape opcode map array */"
+	for (i = 0; i < geid; i++)
+		for (j = 0; j < max_lprefix; j++)
+			if (etable[i,j])
+				print "\tinat_escape_tables["i"]["j"] = "etable[i,j]";"
+	print ""
+
+	# print group opcode map's array
+	print "\t/* Print Group opcode map array */"
+	for (i = 0; i < ggid; i++)
+		for (j = 0; j < max_lprefix; j++)
+			if (gtable[i,j])
+				print "\tinat_group_tables["i"]["j"] = "gtable[i,j]";"
+	print ""
+	# print AVX opcode map's array
+	print "\t/* Print AVX opcode map array */"
+	for (i = 0; i < gaid; i++)
+		for (j = 0; j < max_lprefix; j++)
+			if (atable[i,j])
+				print "\tinat_avx_tables["i"]["j"] = "atable[i,j]";"
+
+	print "}"
+	print "#endif"
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 06/70] x86/umip: Factor out instruction fetch
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (4 preceding siblings ...)
  2020-03-19  9:13   ` Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-26 17:21   ` Borislav Petkov
  2020-03-19  9:13   ` Joerg Roedel
                   ` (63 subsequent siblings)
  69 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Factor out the code to fetch the instruction from user-space to a helper
function.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
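Note (illustration only, not part of the patch): a user-space sketch of
the contract the helper follows. The linear address is segment base
plus IP, up to a fixed number of bytes is copied, and the return value
is the number of bytes that actually arrived. The demo_* names, the
flat "user memory" buffer and the size of 15 are assumptions made for
this sketch; the real helper uses copy_from_user() and MAX_INSN_SIZE.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_INSN_SIZE 15	/* assumed value for this sketch */

/* Hypothetical flat "user memory" used instead of a real address space. */
struct demo_mem {
	const unsigned char *data;
	size_t size;
};

/* Mimics the copy_from_user() convention: returns bytes NOT copied. */
static size_t demo_copy_from_user(void *dst, const struct demo_mem *mem,
				  unsigned long addr, size_t len)
{
	size_t avail = addr < mem->size ? mem->size - addr : 0;
	size_t n = len < avail ? len : avail;

	if (n)
		memcpy(dst, mem->data + addr, n);

	return len - n;
}

/* Returns the number of instruction bytes copied, 0 if nothing was copied. */
static int demo_fetch_insn(const struct demo_mem *mem,
			   unsigned long seg_base, unsigned long ip,
			   unsigned char buf[DEMO_MAX_INSN_SIZE])
{
	size_t not_copied;

	/* A base of -1 signals that the segment lookup failed. */
	if (seg_base == -1UL)
		return 0;

	not_copied = demo_copy_from_user(buf, mem, seg_base + ip,
					 DEMO_MAX_INSN_SIZE);

	return (int)(DEMO_MAX_INSN_SIZE - not_copied);
}
```

A short read is not an error here. The caller gets fewer bytes and has
to check later whether they cover the whole instruction.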
 arch/x86/include/asm/insn-eval.h |  2 ++
 arch/x86/kernel/umip.c           | 26 +++++-----------------
 arch/x86/lib/insn-eval.c         | 38 ++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 2b6ccf2c49f1..b8b9ef1bbd06 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -19,5 +19,7 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
 unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx);
 int insn_get_code_seg_params(struct pt_regs *regs);
+int insn_fetch_from_user(struct pt_regs *regs,
+			 unsigned char buf[MAX_INSN_SIZE]);
 
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
index 4d732a444711..00cb157673b1 100644
--- a/arch/x86/kernel/umip.c
+++ b/arch/x86/kernel/umip.c
@@ -317,11 +317,11 @@ static void force_sig_info_umip_fault(void __user *addr, struct pt_regs *regs)
  */
 bool fixup_umip_exception(struct pt_regs *regs)
 {
-	int not_copied, nr_copied, reg_offset, dummy_data_size, umip_inst;
-	unsigned long seg_base = 0, *reg_addr;
+	int nr_copied, reg_offset, dummy_data_size, umip_inst;
 	/* 10 bytes is the maximum size of the result of UMIP instructions */
 	unsigned char dummy_data[10] = { 0 };
 	unsigned char buf[MAX_INSN_SIZE];
+	unsigned long *reg_addr;
 	void __user *uaddr;
 	struct insn insn;
 	int seg_defs;
@@ -329,26 +329,12 @@ bool fixup_umip_exception(struct pt_regs *regs)
 	if (!regs)
 		return false;
 
-	/*
-	 * If not in user-space long mode, a custom code segment could be in
-	 * use. This is true in protected mode (if the process defined a local
-	 * descriptor table), or virtual-8086 mode. In most of the cases
-	 * seg_base will be zero as in USER_CS.
-	 */
-	if (!user_64bit_mode(regs))
-		seg_base = insn_get_seg_base(regs, INAT_SEG_REG_CS);
-
-	if (seg_base == -1L)
-		return false;
-
-	not_copied = copy_from_user(buf, (void __user *)(seg_base + regs->ip),
-				    sizeof(buf));
-	nr_copied = sizeof(buf) - not_copied;
+	nr_copied = insn_fetch_from_user(regs, buf);
 
 	/*
-	 * The copy_from_user above could have failed if user code is protected
-	 * by a memory protection key. Give up on emulation in such a case.
-	 * Should we issue a page fault?
+	 * The insn_fetch_from_user above could have failed if user code
+	 * is protected by a memory protection key. Give up on emulation
+	 * in such a case.  Should we issue a page fault?
 	 */
 	if (!nr_copied)
 		return false;
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 31600d851fd8..95ae3953e2a2 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -1369,3 +1369,41 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 		return (void __user *)-1L;
 	}
 }
+
+/**
+ * insn_fetch_from_user() - Copy instruction bytes from user-space memory
+ * @regs:	Structure with register values as seen when entering kernel mode
+ * @buf:	Array to store the fetched instruction
+ *
+ * Gets the linear address of the instruction and copies the instruction bytes
+ * into @buf.
+ *
+ * Returns:
+ *
+ * Number of instruction bytes copied.
+ *
+ * 0 if nothing was copied.
+ */
+int insn_fetch_from_user(struct pt_regs *regs,
+			 unsigned char buf[MAX_INSN_SIZE])
+{
+	unsigned long seg_base = 0;
+	int not_copied;
+
+	/*
+	 * If not in user-space long mode, a custom code segment could be in
+	 * use. This is true in protected mode (if the process defined a local
+	 * descriptor table), or virtual-8086 mode. In most of the cases
+	 * seg_base will be zero as in USER_CS.
+	 */
+	if (!user_64bit_mode(regs))
+		seg_base = insn_get_seg_base(regs, INAT_SEG_REG_CS);
+
+	if (seg_base == -1L)
+		return 0;
+
+	not_copied = copy_from_user(buf, (void __user *)(seg_base + regs->ip),
+				    MAX_INSN_SIZE);
+
+	return MAX_INSN_SIZE - not_copied;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 07/70] x86/umip: Factor out instruction decoding
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Factor out the code used to decode an instruction with the correct
address and operand sizes to a helper function.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
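Note (illustration only, not part of the patch): a user-space sketch of
how the packed code-segment parameters are unpacked into the address
and operand sizes, followed by the final length check. The demo_* names
and struct are hypothetical. The PARAMS and OPND_SZ macros follow the
shape of the definitions in insn-eval.h; the ADDR_SZ counterpart is my
assumption, and the instruction length is taken as already known here
instead of being computed by the decoder.

```c
#include <errno.h>
#include <stdbool.h>

/* Operand size in the low nibble, address size in the next nibble. */
#define DEMO_CODE_SEG_PARAMS(oper_sz, addr_sz)	((oper_sz) | ((addr_sz) << 4))
#define DEMO_CODE_SEG_OPND_SZ(params)		((params) & 0xf)
#define DEMO_CODE_SEG_ADDR_SZ(params)		(((params) >> 4) & 0xf)

/* Hypothetical, heavily reduced stand-in for struct insn. */
struct demo_insn {
	int addr_bytes;
	int opnd_bytes;
	int length;	/* assumed to be filled in by a decoder */
};

/* Returns true if the sizes could be set and the buffer covers the insn. */
static bool demo_finish_decode(struct demo_insn *insn, int seg_defs,
			       int buf_size)
{
	/* The segment parameters could not be determined. */
	if (seg_defs == -EINVAL)
		return false;

	insn->addr_bytes = DEMO_CODE_SEG_ADDR_SZ(seg_defs);
	insn->opnd_bytes = DEMO_CODE_SEG_OPND_SZ(seg_defs);

	/* Fewer bytes were fetched than the instruction needs. */
	if (buf_size < insn->length)
		return false;

	return true;
}
```

With a 16-bit code segment the parameters would be built from 2 and 2,
so both sizes come out as 2 regardless of the decoder defaults.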
 arch/x86/include/asm/insn-eval.h |  2 ++
 arch/x86/kernel/umip.c           | 23 +---------------
 arch/x86/lib/insn-eval.c         | 45 ++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index b8b9ef1bbd06..b4ff3e3316d1 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -21,5 +21,7 @@ unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx);
 int insn_get_code_seg_params(struct pt_regs *regs);
 int insn_fetch_from_user(struct pt_regs *regs,
 			 unsigned char buf[MAX_INSN_SIZE]);
+bool insn_decode(struct pt_regs *regs, struct insn *insn,
+		 unsigned char buf[MAX_INSN_SIZE], int buf_size);
 
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
index 00cb157673b1..ff6d67242eee 100644
--- a/arch/x86/kernel/umip.c
+++ b/arch/x86/kernel/umip.c
@@ -324,7 +324,6 @@ bool fixup_umip_exception(struct pt_regs *regs)
 	unsigned long *reg_addr;
 	void __user *uaddr;
 	struct insn insn;
-	int seg_defs;
 
 	if (!regs)
 		return false;
@@ -339,27 +338,7 @@ bool fixup_umip_exception(struct pt_regs *regs)
 	if (!nr_copied)
 		return false;
 
-	insn_init(&insn, buf, nr_copied, user_64bit_mode(regs));
-
-	/*
-	 * Override the default operand and address sizes with what is specified
-	 * in the code segment descriptor. The instruction decoder only sets
-	 * the address size it to either 4 or 8 address bytes and does nothing
-	 * for the operand bytes. This OK for most of the cases, but we could
-	 * have special cases where, for instance, a 16-bit code segment
-	 * descriptor is used.
-	 * If there is an address override prefix, the instruction decoder
-	 * correctly updates these values, even for 16-bit defaults.
-	 */
-	seg_defs = insn_get_code_seg_params(regs);
-	if (seg_defs == -EINVAL)
-		return false;
-
-	insn.addr_bytes = INSN_CODE_SEG_ADDR_SZ(seg_defs);
-	insn.opnd_bytes = INSN_CODE_SEG_OPND_SZ(seg_defs);
-
-	insn_get_length(&insn);
-	if (nr_copied < insn.length)
+	if (!insn_decode(regs, &insn, buf, nr_copied))
 		return false;
 
 	umip_inst = identify_insn(&insn);
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 95ae3953e2a2..1949f5258f9e 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -1407,3 +1407,48 @@ int insn_fetch_from_user(struct pt_regs *regs,
 
 	return MAX_INSN_SIZE - not_copied;
 }
+
+/**
+ * insn_decode() - Decode an instruction
+ * @regs:	Structure with register values as seen when entering kernel mode
+ * @insn:	Structure to store decoded instruction
+ * @buf:	Buffer containing the instruction bytes
+ * @buf_size:	Number of instruction bytes available in buf
+ *
+ * Decodes the instruction provided in buf and stores the decoding results in
+ * insn. Also determines the correct address and operand sizes.
+ *
+ * Returns:
+ *
+ * True if instruction was decoded, False otherwise.
+ */
+bool insn_decode(struct pt_regs *regs, struct insn *insn,
+		 unsigned char buf[MAX_INSN_SIZE], int buf_size)
+{
+	int seg_defs;
+
+	insn_init(insn, buf, buf_size, user_64bit_mode(regs));
+
+	/*
+	 * Override the default operand and address sizes with what is specified
+	 * in the code segment descriptor. The instruction decoder only sets
+	 * the address size to either 4 or 8 address bytes and does nothing
+	 * for the operand bytes. This is OK for most cases, but we could
+	 * have special cases where, for instance, a 16-bit code segment
+	 * descriptor is used.
+	 * If there is an address override prefix, the instruction decoder
+	 * correctly updates these values, even for 16-bit defaults.
+	 */
+	seg_defs = insn_get_code_seg_params(regs);
+	if (seg_defs == -EINVAL)
+		return false;
+
+	insn->addr_bytes = INSN_CODE_SEG_ADDR_SZ(seg_defs);
+	insn->opnd_bytes = INSN_CODE_SEG_OPND_SZ(seg_defs);
+
+	insn_get_length(insn);
+	if (buf_size < insn->length)
+		return false;
+
+	return true;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 08/70] x86/insn: Add insn_get_modrm_reg_off()
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add a function to the instruction decoder which returns the pt_regs
offset of the register specified in the reg field of the modrm byte.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
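Note (illustration only, not part of the patch): a user-space sketch of
the lookup the new case performs. The reg field is bits 5:3 of the
ModRM byte, REX.R extends it to registers 8-15, and the resulting
register number is translated into an offset into a register structure.
struct demo_regs, its field order and all demo_* names are hypothetical
and do not match the layout of struct pt_regs.

```c
#include <stddef.h>

#define DEMO_MODRM_REG(modrm)	(((modrm) & 0x38) >> 3)	/* bits 5:3 */
#define DEMO_REX_R(rex)		((rex) & 0x04)		/* extends reg */

/* Hypothetical register file, laid out in hardware register order. */
struct demo_regs {
	unsigned long ax, cx, dx, bx, sp, bp, si, di;
	unsigned long r8, r9, r10, r11, r12, r13, r14, r15;
};

/* Offset of each register, indexed by hardware register number. */
static const int demo_regoff[16] = {
	offsetof(struct demo_regs, ax),  offsetof(struct demo_regs, cx),
	offsetof(struct demo_regs, dx),  offsetof(struct demo_regs, bx),
	offsetof(struct demo_regs, sp),  offsetof(struct demo_regs, bp),
	offsetof(struct demo_regs, si),  offsetof(struct demo_regs, di),
	offsetof(struct demo_regs, r8),  offsetof(struct demo_regs, r9),
	offsetof(struct demo_regs, r10), offsetof(struct demo_regs, r11),
	offsetof(struct demo_regs, r12), offsetof(struct demo_regs, r13),
	offsetof(struct demo_regs, r14), offsetof(struct demo_regs, r15),
};

/* Offset of the register named by ModRM.reg (plus REX.R, if set). */
static int demo_modrm_reg_off(unsigned char modrm, unsigned char rex)
{
	int regno = DEMO_MODRM_REG(modrm);

	if (DEMO_REX_R(rex))
		regno += 8;

	return demo_regoff[regno];
}
```

For the byte sequence 4c 89 c8 (mov %r9,%rax) the REX prefix 0x4c has
the R bit set and the ModRM byte 0xc8 carries reg = 1, so the lookup
yields the offset of r9 in the demo structure.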
 arch/x86/include/asm/insn-eval.h |  1 +
 arch/x86/lib/insn-eval.c         | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index b4ff3e3316d1..1e343010129e 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -17,6 +17,7 @@
 
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
+int insn_get_modrm_reg_off(struct insn *insn, struct pt_regs *regs);
 unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx);
 int insn_get_code_seg_params(struct pt_regs *regs);
 int insn_fetch_from_user(struct pt_regs *regs,
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 1949f5258f9e..f18260a19960 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -20,6 +20,7 @@
 
 enum reg_type {
 	REG_TYPE_RM = 0,
+	REG_TYPE_REG,
 	REG_TYPE_INDEX,
 	REG_TYPE_BASE,
 };
@@ -441,6 +442,13 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 			regno += 8;
 		break;
 
+	case REG_TYPE_REG:
+		regno = X86_MODRM_REG(insn->modrm.value);
+
+		if (X86_REX_R(insn->rex_prefix.value))
+			regno += 8;
+		break;
+
 	case REG_TYPE_INDEX:
 		regno = X86_SIB_INDEX(insn->sib.value);
 		if (X86_REX_X(insn->rex_prefix.value))
@@ -809,6 +817,21 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs)
 	return get_reg_offset(insn, regs, REG_TYPE_RM);
 }
 
+/**
+ * insn_get_modrm_reg_off() - Obtain register in reg part of the ModRM byte
+ * @insn:	Instruction containing the ModRM byte
+ * @regs:	Register values as seen when entering kernel mode
+ *
+ * Returns:
+ *
+ * The register indicated by the reg part of the ModRM byte. The
+ * register is obtained as an offset from the base of pt_regs.
+ */
+int insn_get_modrm_reg_off(struct insn *insn, struct pt_regs *regs)
+{
+	return get_reg_offset(insn, regs, REG_TYPE_REG);
+}
+
 /**
  * get_seg_base_limit() - obtain base address and limit of a segment
  * @insn:	Instruction. Must be valid.
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 09/70] x86/insn: Add insn_rep_prefix() helper
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add a function to check whether an instruction has a REP prefix.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
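Note (illustration only, not part of the patch): a user-space sketch of
the prefix scan. It takes the already collected legacy prefix bytes and
looks for 0xf3 (REP/REPE) or 0xf2 (REPNE), which are the same two
values the helper tests. demo_has_rep_prefix() is a hypothetical name
and works on a raw byte array instead of struct insn.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if any of the given prefix bytes is REP or REPNE. */
static bool demo_has_rep_prefix(const unsigned char *prefixes, size_t nbytes)
{
	size_t i;

	for (i = 0; i < nbytes; i++) {
		unsigned char p = prefixes[i];

		/* 0xf2 = REPNE, 0xf3 = REP/REPE */
		if (p == 0xf2 || p == 0xf3)
			return true;
	}

	return false;
}
```

For "rep movsb" (bytes f3 a4) the only prefix byte is 0xf3, so the scan
reports a REP prefix. An operand-size prefix (0x66) alone does not.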
 arch/x86/include/asm/insn-eval.h |  1 +
 arch/x86/lib/insn-eval.c         | 24 ++++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 1e343010129e..41dee0faae97 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -15,6 +15,7 @@
 #define INSN_CODE_SEG_OPND_SZ(params) (params & 0xf)
 #define INSN_CODE_SEG_PARAMS(oper_sz, addr_sz) (oper_sz | (addr_sz << 4))
 
+bool insn_rep_prefix(struct insn *insn);
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_reg_off(struct insn *insn, struct pt_regs *regs);
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index f18260a19960..5d98dff5a2d7 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -53,6 +53,30 @@ static bool is_string_insn(struct insn *insn)
 	}
 }
 
+/**
+ * insn_rep_prefix() - Determine if instruction has a REP prefix
+ * @insn:	Instruction containing the prefix to inspect
+ *
+ * Returns:
+ *
+ * true if the instruction has a REP prefix, false if not.
+ */
+bool insn_rep_prefix(struct insn *insn)
+{
+	int i;
+
+	insn_get_prefixes(insn);
+
+	for (i = 0; i < insn->prefixes.nbytes; i++) {
+		insn_byte_t p = insn->prefixes.bytes[i];
+
+		if (p == 0xf2 || p == 0xf3)
+			return true;
+	}
+
+	return false;
+}
+
 /**
  * get_seg_reg_override_idx() - obtain segment register override index
  * @insn:	Valid instruction with segment override prefixes
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 10/70] x86/boot/compressed: Fix debug_puthex() parameter type
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (8 preceding siblings ...)
  2020-03-19  9:13   ` Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-28 11:23   ` [tip: x86/boot] " tip-bot2 for Joerg Roedel
  2020-03-19  9:13   ` Joerg Roedel
                   ` (59 subsequent siblings)
  69 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

In the CONFIG_X86_VERBOSE_BOOTUP=Y case the debug_puthex() macro just
turns into __puthex(), which takes 'unsigned long' as its parameter. But
in the CONFIG_X86_VERBOSE_BOOTUP=N case it is a function which takes
'const char *', causing compile warnings when the function is used.
Fix the parameter type to get rid of the warnings.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/misc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index c8181392f70d..726e264410ff 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -59,7 +59,7 @@ void __puthex(unsigned long value);
 
 static inline void debug_putstr(const char *s)
 { }
-static inline void debug_puthex(const char *s)
+static inline void debug_puthex(unsigned long value)
 { }
 #define debug_putaddr(x) /* */
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 11/70] x86/boot/compressed/64: Disable red-zone usage
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The x86-64 ABI defines a red-zone on the stack:

  The 128-byte area beyond the location pointed to by %rsp is
  considered to be reserved and shall not be modified by signal or
  interrupt handlers. Therefore, functions may use this area for
  temporary data that is not needed across function calls. In
  particular, leaf functions may use this area for their entire stack
  frame, rather than adjusting the stack pointer in the prologue and
  epilogue. This area is known as the red zone.

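This is not compatible with exception handling: an exception taken on
the current stack pushes its frame right below %rsp and overwrites the
red zone. So disable it for the pre-decompression boot code.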

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/Makefile            | 2 +-
 arch/x86/boot/compressed/Makefile | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
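
To illustrate the problem, here is a toy model in C. It is a sketch
under simplifying assumptions, not real stack handling: struct
toy_stack and all function names are hypothetical, the stack is an
array indexed in qwords, and the 16-byte alignment the CPU applies
before pushing the frame is left out.

```c
#include <stdint.h>

#define STACK_QWORDS	64

/* Toy model of a downward-growing stack; 'sp' is an index in qwords. */
struct toy_stack {
	uint64_t mem[STACK_QWORDS];
	unsigned int sp;
};

/*
 * A leaf function built with the red zone enabled may keep a local
 * below the stack pointer without adjusting the pointer first.
 */
static void leaf_store_local(struct toy_stack *s, uint64_t value)
{
	s->mem[s->sp - 1] = value;	/* lives in the red zone */
}

static uint64_t leaf_load_local(const struct toy_stack *s)
{
	return s->mem[s->sp - 1];
}

/*
 * Model of exception delivery on the current stack: SS, RSP, RFLAGS,
 * CS and RIP are pushed below the interrupted stack pointer. The
 * values are placeholders.
 */
static void deliver_exception(struct toy_stack *s)
{
	static const uint64_t frame[5] = { 0x18, 0, 0x2, 0x10, 0xdeadbeef };
	unsigned int i;

	for (i = 0; i < 5; i++)
		s->mem[--s->sp] = frame[i];
}

/* Returning from the exception pops the five-qword frame again. */
static void return_from_exception(struct toy_stack *s)
{
	s->sp += 5;
}
```

In this model a value stored by leaf_store_local() does not survive a
deliver_exception()/return_from_exception() pair, even though the stack
pointer is back where it was. With -mno-red-zone the compiler moves the
stack pointer before using the space, so the frame lands below the
local instead of on top of it.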

diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
index 012b82fc8617..8f55e4ce1ccc 100644
--- a/arch/x86/boot/Makefile
+++ b/arch/x86/boot/Makefile
@@ -65,7 +65,7 @@ clean-files += cpustr.h
 
 # ---------------------------------------------------------------------------
 
-KBUILD_CFLAGS	:= $(REALMODE_CFLAGS) -D_SETUP
+KBUILD_CFLAGS	:= $(REALMODE_CFLAGS) -D_SETUP -mno-red-zone
 KBUILD_AFLAGS	:= $(KBUILD_CFLAGS) -D__ASSEMBLY__
 KBUILD_CFLAGS	+= $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
 GCOV_PROFILE := n
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 26050ae0b27e..e186cc0b628d 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -30,7 +30,7 @@ KBUILD_CFLAGS := -m$(BITS) -O2
 KBUILD_CFLAGS += -fno-strict-aliasing $(call cc-option, -fPIE, -fPIC)
 KBUILD_CFLAGS += -DDISABLE_BRANCH_PROFILING
 cflags-$(CONFIG_X86_32) := -march=i386
-cflags-$(CONFIG_X86_64) := -mcmodel=small
+cflags-$(CONFIG_X86_64) := -mcmodel=small -mno-red-zone
 KBUILD_CFLAGS += $(cflags-y)
 KBUILD_CFLAGS += -mno-mmx -mno-sse
 KBUILD_CFLAGS += $(call cc-option,-ffreestanding)
@@ -87,7 +87,7 @@ endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 
-$(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar -mno-red-zone
+$(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar
 
 vmlinux-objs-$(CONFIG_EFI_STUB) += $(obj)/eboot.o \
 	$(objtree)/drivers/firmware/efi/libstub/lib.a
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 12/70] x86/boot/compressed/64: Add IDT Infrastructure
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (10 preceding siblings ...)
  2020-03-19  9:13   ` Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-04-07  2:21   ` Arvind Sankar
  2020-03-19  9:13 ` [PATCH 13/70] x86/boot/compressed/64: Rename kaslr_64.c to ident_map_64.c Joerg Roedel
                   ` (57 subsequent siblings)
  69 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add the code needed to set up an IDT in the early pre-decompression
boot code. The IDT is loaded first in startup_64, which is after
EfiExitBootServices() has been called, and later reloaded when the
kernel image has been relocated to the end of the decompression area.

This allows setting up different IDT handlers before and after the
relocation.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/Makefile          |  1 +
 arch/x86/boot/compressed/head_64.S         | 34 ++++++++++
 arch/x86/boot/compressed/idt_64.c          | 43 +++++++++++++
 arch/x86/boot/compressed/idt_handlers_64.S | 75 ++++++++++++++++++++++
 arch/x86/boot/compressed/misc.h            |  5 ++
 arch/x86/include/asm/desc_defs.h           |  3 +
 6 files changed, 161 insertions(+)
 create mode 100644 arch/x86/boot/compressed/idt_64.c
 create mode 100644 arch/x86/boot/compressed/idt_handlers_64.S
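
As a reading aid for set_idt_entry(), the following sketch shows how a
64-bit handler address is split across the three offset fields of a
gate and how it is put back together. struct toy_gate, pack_gate() and
gate_address() are hypothetical names for this illustration; the single
'bits' word stands in for the kernel's bitfield and only encodes the
gate type and the present bit.

```c
#include <stdint.h>

/* Illustrative layout of a 16-byte IDT gate in 64-bit mode. */
struct toy_gate {
	uint16_t offset_low;	/* handler address bits 15:0  */
	uint16_t segment;	/* code segment selector      */
	uint16_t bits;		/* IST, type, DPL, present    */
	uint16_t offset_middle;	/* handler address bits 31:16 */
	uint32_t offset_high;	/* handler address bits 63:32 */
	uint32_t reserved;
};

#define TOY_GATE_TRAP	0xfu		/* 64-bit trap gate type */
#define TOY_GATE_P	(1u << 15)	/* present bit */

/* Build a present trap gate with DPL 0 and no IST for 'address'. */
static struct toy_gate pack_gate(uint64_t address, uint16_t selector)
{
	struct toy_gate g = { 0 };

	g.offset_low    = (uint16_t)(address & 0xffff);
	g.segment       = selector;
	g.bits          = (uint16_t)((TOY_GATE_TRAP << 8) | TOY_GATE_P);
	g.offset_middle = (uint16_t)((address >> 16) & 0xffff);
	g.offset_high   = (uint32_t)(address >> 32);

	return g;
}

/* Reassemble the handler address from the three offset fields. */
static uint64_t gate_address(const struct toy_gate *g)
{
	return (uint64_t)g->offset_low |
	       ((uint64_t)g->offset_middle << 16) |
	       ((uint64_t)g->offset_high << 32);
}
```

With BOOT_IDT_ENTRIES set to 32 and 16 bytes per gate, the table in
head_64.S occupies 512 bytes.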

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index e186cc0b628d..54d63526e856 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -81,6 +81,7 @@ vmlinux-objs-$(CONFIG_EARLY_PRINTK) += $(obj)/early_serial_console.o
 vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr.o
 ifdef CONFIG_X86_64
 	vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr_64.o
+	vmlinux-objs-y += $(obj)/idt_64.o $(obj)/idt_handlers_64.o
 	vmlinux-objs-y += $(obj)/mem_encrypt.o
 	vmlinux-objs-y += $(obj)/pgtable_64.o
 endif
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 1f1f6c8139b3..d27a9ce1bcb0 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -33,6 +33,7 @@
 #include <asm/processor-flags.h>
 #include <asm/asm-offsets.h>
 #include <asm/bootparam.h>
+#include <asm/desc_defs.h>
 #include "pgtable.h"
 
 /*
@@ -358,6 +359,10 @@ SYM_CODE_START(startup_64)
 	movq	%rax, gdt64+2(%rip)
 	lgdt	gdt64(%rip)
 
+	pushq	%rsi
+	call	load_stage1_idt
+	popq	%rsi
+
 	/*
 	 * paging_prepare() sets up the trampoline and checks if we need to
 	 * enable 5-level paging.
@@ -465,6 +470,16 @@ SYM_FUNC_END_ALIAS(efi_stub_entry)
 	.text
 SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
 
+/*
+ * Reload GDT after relocation - The GDT at the non-relocated position
+ * might be overwritten soon by the in-place decompression, so reload
+ * GDT at the relocated address. The GDT is referenced by exception
+ * handling and needs to be set up correctly.
+ */
+	leaq	gdt(%rip), %rax
+	movq	%rax, gdt64+2(%rip)
+	lgdt	gdt64(%rip)
+
 /*
  * Clear BSS (stack is currently empty)
  */
@@ -475,6 +490,13 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
 	shrq	$3, %rcx
 	rep	stosq
 
+/*
+ * Load stage2 IDT
+ */
+	pushq	%rsi
+	call	load_stage2_idt
+	popq	%rsi
+
 /*
  * Do the extraction, and jump to the new kernel..
  */
@@ -628,6 +650,18 @@ SYM_DATA_START_LOCAL(gdt)
 	.quad   0x0000000000000000	/* TS continued */
 SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
 
+SYM_DATA_START(boot_idt_desc)
+	.word	boot_idt_end - boot_idt - 1
+	.quad	0
+SYM_DATA_END(boot_idt_desc)
+	.balign 8
+SYM_DATA_START(boot_idt)
+	.rept	BOOT_IDT_ENTRIES
+	.quad	0
+	.quad	0
+	.endr
+SYM_DATA_END_LABEL(boot_idt, SYM_L_GLOBAL, boot_idt_end)
+
 #ifdef CONFIG_EFI_MIXED
 SYM_DATA_LOCAL(efi32_boot_args, .long 0, 0)
 SYM_DATA(efi_is64, .byte 1)
diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
new file mode 100644
index 000000000000..46ecea671b90
--- /dev/null
+++ b/arch/x86/boot/compressed/idt_64.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <asm/trap_defs.h>
+#include <asm/segment.h>
+#include "misc.h"
+
+static void set_idt_entry(int vector, void (*handler)(void))
+{
+	unsigned long address = (unsigned long)handler;
+	gate_desc entry;
+
+	memset(&entry, 0, sizeof(entry));
+
+	entry.offset_low    = (u16)(address & 0xffff);
+	entry.segment       = __KERNEL_CS;
+	entry.bits.type     = GATE_TRAP;
+	entry.bits.p        = 1;
+	entry.offset_middle = (u16)((address >> 16) & 0xffff);
+	entry.offset_high   = (u32)(address >> 32);
+
+	memcpy(&boot_idt[vector], &entry, sizeof(entry));
+}
+
+/* Have this here so we don't need to include <asm/desc.h> */
+static void load_boot_idt(const struct desc_ptr *dtr)
+{
+	asm volatile("lidt %0"::"m" (*dtr));
+}
+
+/* Set up the IDT before the kernel jumps to .Lrelocated */
+void load_stage1_idt(void)
+{
+	boot_idt_desc.address = (unsigned long)boot_idt;
+
+	load_boot_idt(&boot_idt_desc);
+}
+
+/* Set up the IDT after the kernel has jumped to .Lrelocated */
+void load_stage2_idt(void)
+{
+	boot_idt_desc.address = (unsigned long)boot_idt;
+
+	load_boot_idt(&boot_idt_desc);
+}
diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
new file mode 100644
index 000000000000..3d86ab35ef52
--- /dev/null
+++ b/arch/x86/boot/compressed/idt_handlers_64.S
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Early IDT handler entry points
+ *
+ * Copyright (C) 2019 SUSE
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ */
+
+#include <asm/segment.h>
+
+#include "../../entry/calling.h"
+
+.macro EXCEPTION_HANDLER name function error_code=0
+SYM_FUNC_START(\name)
+
+	/* Build pt_regs */
+	.if \error_code == 0
+	pushq   $0
+	.endif
+
+	pushq   %rdi
+	pushq   %rsi
+	pushq   %rdx
+	pushq   %rcx
+	pushq   %rax
+	pushq   %r8
+	pushq   %r9
+	pushq   %r10
+	pushq   %r11
+	pushq   %rbx
+	pushq   %rbp
+	pushq   %r12
+	pushq   %r13
+	pushq   %r14
+	pushq   %r15
+
+	/* Call handler with pt_regs */
+	movq    %rsp, %rdi
+	/* Error code is second parameter */
+	movq	ORIG_RAX(%rsp), %rsi
+	call    \function
+
+	/* Restore regs */
+	popq    %r15
+	popq    %r14
+	popq    %r13
+	popq    %r12
+	popq    %rbp
+	popq    %rbx
+	popq    %r11
+	popq    %r10
+	popq    %r9
+	popq    %r8
+	popq    %rax
+	popq    %rcx
+	popq    %rdx
+	popq    %rsi
+	popq    %rdi
+
+	/* Remove error code and return */
+	addq    $8, %rsp
+
+	/*
+	 * Make sure we return to __KERNEL_CS - the CS selector on
+	 * the IRET frame might still be from an old BIOS GDT
+	 */
+	movq	$__KERNEL_CS, 8(%rsp)
+
+	iretq
+SYM_FUNC_END(\name)
+	.endm
+
+	.text
+	.code64
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 726e264410ff..062ae3ae6930 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -23,6 +23,7 @@
 #include <asm/page.h>
 #include <asm/boot.h>
 #include <asm/bootparam.h>
+#include <asm/desc_defs.h>
 
 #define BOOT_CTYPE_H
 #include <linux/acpi.h>
@@ -133,4 +134,8 @@ int count_immovable_mem_regions(void);
 static inline int count_immovable_mem_regions(void) { return 0; }
 #endif
 
+/* idt_64.c */
+extern gate_desc boot_idt[BOOT_IDT_ENTRIES];
+extern struct desc_ptr boot_idt_desc;
+
 #endif /* BOOT_COMPRESSED_MISC_H */
diff --git a/arch/x86/include/asm/desc_defs.h b/arch/x86/include/asm/desc_defs.h
index a91f3b6e4f2a..5621fb3f2d1a 100644
--- a/arch/x86/include/asm/desc_defs.h
+++ b/arch/x86/include/asm/desc_defs.h
@@ -109,6 +109,9 @@ struct desc_ptr {
 
 #endif /* !__ASSEMBLY__ */
 
+/* Boot IDT definitions */
+#define	BOOT_IDT_ENTRIES	32
+
 /* Access rights as returned by LAR */
 #define AR_TYPE_RODATA		(0 * (1 << 9))
 #define AR_TYPE_RWDATA		(1 * (1 << 9))
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 13/70] x86/boot/compressed/64: Rename kaslr_64.c to ident_map_64.c
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (11 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 12/70] x86/boot/compressed/64: Add IDT Infrastructure Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13   ` Joerg Roedel
                   ` (56 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The file contains only code related to identity-mapped page-tables.
Rename the file and always compile it in.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/Makefile                    |  2 +-
 .../boot/compressed/{kaslr_64.c => ident_map_64.c}   | 12 ++++++++++++
 arch/x86/boot/compressed/kaslr.c                     |  9 ---------
 arch/x86/boot/compressed/misc.h                      |  8 ++++++++
 4 files changed, 21 insertions(+), 10 deletions(-)
 rename arch/x86/boot/compressed/{kaslr_64.c => ident_map_64.c} (93%)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 54d63526e856..e6b3e0fc48de 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -80,7 +80,7 @@ vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/kernel_info.o $(obj)/head_$(BITS).o
 vmlinux-objs-$(CONFIG_EARLY_PRINTK) += $(obj)/early_serial_console.o
 vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr.o
 ifdef CONFIG_X86_64
-	vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr_64.o
+	vmlinux-objs-y += $(obj)/ident_map_64.o
 	vmlinux-objs-y += $(obj)/idt_64.o $(obj)/idt_handlers_64.o
 	vmlinux-objs-y += $(obj)/mem_encrypt.o
 	vmlinux-objs-y += $(obj)/pgtable_64.o
diff --git a/arch/x86/boot/compressed/kaslr_64.c b/arch/x86/boot/compressed/ident_map_64.c
similarity index 93%
rename from arch/x86/boot/compressed/kaslr_64.c
rename to arch/x86/boot/compressed/ident_map_64.c
index 9557c5a15b91..3a2115582920 100644
--- a/arch/x86/boot/compressed/kaslr_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -29,6 +29,18 @@
 #define __PAGE_OFFSET __PAGE_OFFSET_BASE
 #include "../../mm/ident_map.c"
 
+#ifdef CONFIG_X86_5LEVEL
+unsigned int __pgtable_l5_enabled;
+unsigned int pgdir_shift = 39;
+unsigned int ptrs_per_p4d = 1;
+#endif
+
+/* Used by PAGE_KERN* macros: */
+pteval_t __default_kernel_pte_mask __read_mostly = ~0;
+
+/* Used by pgtable.h asm code to force instruction serialization. */
+unsigned long __force_order;
+
 /* Used to track our page table allocation area. */
 struct alloc_pgt_data {
 	unsigned char *pgt_buf;
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index d7408af55738..7c61a8c5b9cf 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -43,17 +43,8 @@
 #define STATIC
 #include <linux/decompress/mm.h>
 
-#ifdef CONFIG_X86_5LEVEL
-unsigned int __pgtable_l5_enabled;
-unsigned int pgdir_shift __ro_after_init = 39;
-unsigned int ptrs_per_p4d __ro_after_init = 1;
-#endif
-
 extern unsigned long get_cmd_line_ptr(void);
 
-/* Used by PAGE_KERN* macros: */
-pteval_t __default_kernel_pte_mask __read_mostly = ~0;
-
 /* Simplified build-specific string for starting entropy. */
 static const char build_str[] = UTS_RELEASE " (" LINUX_COMPILE_BY "@"
 		LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION;
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 062ae3ae6930..3a030a878d53 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -134,6 +134,14 @@ int count_immovable_mem_regions(void);
 static inline int count_immovable_mem_regions(void) { return 0; }
 #endif
 
+/* ident_map_64.c */
+#ifdef CONFIG_X86_5LEVEL
+extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
+#endif
+
+/* Used by PAGE_KERN* macros: */
+extern pteval_t __default_kernel_pte_mask;
+
 /* idt_64.c */
 extern gate_desc boot_idt[BOOT_IDT_ENTRIES];
 extern struct desc_ptr boot_idt_desc;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 14/70] x86/boot/compressed/64: Add page-fault handler
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Install a page-fault handler that adds an identity mapping for
addresses not yet mapped. Also check whether the error code is sane.

This makes non-SEV-ES machines use the exception handling
infrastructure in the pre-decompression boot code too, making it less
likely to break in the future.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/ident_map_64.c    | 38 ++++++++++++++++++++++
 arch/x86/boot/compressed/idt_64.c          |  2 ++
 arch/x86/boot/compressed/idt_handlers_64.S |  2 ++
 arch/x86/boot/compressed/misc.h            |  6 ++++
 4 files changed, 48 insertions(+)
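
The two decisions the handler makes can be sketched stand-alone as
follows. This is an illustration only: the PF_* and TOY_PMD_* macros
and both function names are hypothetical stand-ins for the kernel's
X86_PF_* and PMD_* definitions, with the bit values taken from the
architectural page-fault error code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Page-fault error code bits as defined by the x86 architecture. */
#define PF_PROT		(UINT64_C(1) << 0)	/* fault on a present page */
#define PF_WRITE	(UINT64_C(1) << 1)	/* write access */
#define PF_USER		(UINT64_C(1) << 2)	/* access from user mode */
#define PF_RSVD		(UINT64_C(1) << 3)	/* reserved bit set in a paging entry */

#define TOY_PMD_SIZE	(UINT64_C(1) << 21)	/* 2 MiB */
#define TOY_PMD_MASK	(~(TOY_PMD_SIZE - 1))

/* True if the early handler may resolve the fault by mapping memory. */
static bool pf_error_code_is_sane(uint64_t error_code)
{
	return !(error_code & (PF_PROT | PF_USER | PF_RSVD));
}

/* Start of the 2 MiB region that contains the faulting address. */
static uint64_t pf_region_start(uint64_t address)
{
	return address & TOY_PMD_MASK;
}
```

A plain not-present read or write fault passes the check; the handler
then maps TOY_PMD_SIZE bytes starting at pf_region_start(address), so
later accesses to the same 2 MiB region do not fault again.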

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index 3a2115582920..0865d181b85d 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -19,11 +19,13 @@
 /* No PAGE_TABLE_ISOLATION support needed either: */
 #undef CONFIG_PAGE_TABLE_ISOLATION
 
+#include "error.h"
 #include "misc.h"
 
 /* These actually do the work of building the kernel identity maps. */
 #include <asm/init.h>
 #include <asm/pgtable.h>
+#include <asm/trap_defs.h>
 /* Use the static base for this part of the boot process */
 #undef __PAGE_OFFSET
 #define __PAGE_OFFSET __PAGE_OFFSET_BASE
@@ -163,3 +165,39 @@ void finalize_identity_maps(void)
 {
 	write_cr3(top_level_pgt);
 }
+
+static void pf_error(unsigned long error_code, unsigned long address,
+		     struct pt_regs *regs)
+{
+	error_putstr("Unexpected page-fault:");
+	error_putstr("\nError Code: ");
+	error_puthex(error_code);
+	error_putstr("\nCR2: 0x");
+	error_puthex(address);
+	error_putstr("\nRIP relative to _head: 0x");
+	error_puthex(regs->ip - (unsigned long)_head);
+	error_putstr("\n");
+
+	error("Stopping.\n");
+}
+
+void do_boot_page_fault(struct pt_regs *regs)
+{
+	unsigned long address = native_read_cr2();
+	unsigned long error_code = regs->orig_ax;
+
+	/*
+	 * Check for unexpected error codes. Unexpected are:
+	 *	- Faults on present pages
+	 *	- User faults
+	 *	- Reserved bits set
+	 */
+	if (error_code & (X86_PF_PROT | X86_PF_USER | X86_PF_RSVD))
+		pf_error(error_code, address, regs);
+
+	/*
+	 * Error code is sane - now identity map the 2M region around
+	 * the faulting address.
+	 */
+	add_identity_map(address & PMD_MASK, PMD_SIZE);
+}
diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
index 46ecea671b90..84ba57d9d436 100644
--- a/arch/x86/boot/compressed/idt_64.c
+++ b/arch/x86/boot/compressed/idt_64.c
@@ -39,5 +39,7 @@ void load_stage2_idt(void)
 {
 	boot_idt_desc.address = (unsigned long)boot_idt;
 
+	set_idt_entry(X86_TRAP_PF, boot_pf_handler);
+
 	load_boot_idt(&boot_idt_desc);
 }
diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
index 3d86ab35ef52..bfb3fc5aa144 100644
--- a/arch/x86/boot/compressed/idt_handlers_64.S
+++ b/arch/x86/boot/compressed/idt_handlers_64.S
@@ -73,3 +73,5 @@ SYM_FUNC_END(\name)
 
 	.text
 	.code64
+
+EXCEPTION_HANDLER	boot_pf_handler do_boot_page_fault error_code=1
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 3a030a878d53..eff4ed0b1cea 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -37,6 +37,9 @@
 #define memptr unsigned
 #endif
 
+/* boot/compressed/vmlinux start and end markers */
+extern char _head[], _end[];
+
 /* misc.c */
 extern memptr free_mem_ptr;
 extern memptr free_mem_end_ptr;
@@ -146,4 +149,7 @@ extern pteval_t __default_kernel_pte_mask;
 extern gate_desc boot_idt[BOOT_IDT_ENTRIES];
 extern struct desc_ptr boot_idt_desc;
 
+/* IDT Entry Points */
+void boot_pf_handler(void);
+
 #endif /* BOOT_COMPRESSED_MISC_H */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 15/70] x86/boot/compressed/64: Always switch to own page-table
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (13 preceding siblings ...)
  2020-03-19  9:13   ` Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-04-06 11:56   ` Borislav Petkov
  2020-03-19  9:13 ` [PATCH 16/70] x86/boot/compressed/64: Don't pre-map memory in KASLR code Joerg Roedel
                   ` (54 subsequent siblings)
  69 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

When booted through startup_64 the kernel keeps running on the EFI
page-table until the KASLR code sets up its own page-table. Without
KASLR the pre-decompression boot code never switches off the EFI
page-table. Change that by unconditionally switching to our own
page-table once the kernel is relocated.

This makes sure we can make changes to the mapping when necessary, for
example to map pages unencrypted in SEV and SEV-ES guests.

Also remove the debug_putstr() calls in initialize_identity_maps()
because the function now runs before console_init() is called.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/head_64.S      |  3 +-
 arch/x86/boot/compressed/ident_map_64.c | 51 +++++++++++++++----------
 arch/x86/boot/compressed/kaslr.c        |  3 --
 3 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d27a9ce1bcb0..5164d2e8631a 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -491,10 +491,11 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
 	rep	stosq
 
 /*
- * Load stage2 IDT
+ * Load stage2 IDT and switch to our own page-table
  */
 	pushq	%rsi
 	call	load_stage2_idt
+	call	initialize_identity_maps
 	popq	%rsi
 
 /*
diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index 0865d181b85d..6a3890caaa19 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -88,9 +88,31 @@ phys_addr_t physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
  */
 static struct x86_mapping_info mapping_info;
 
+/*
+ * Adds the specified range to what will become the new identity mappings.
+ * Once all ranges have been added, the new mapping is activated by calling
+ * finalize_identity_maps() below.
+ */
+void add_identity_map(unsigned long start, unsigned long size)
+{
+	unsigned long end = start + size;
+
+	/* Align boundary to 2M. */
+	start = round_down(start, PMD_SIZE);
+	end = round_up(end, PMD_SIZE);
+	if (start >= end)
+		return;
+
+	/* Build the mapping. */
+	kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
+				  start, end);
+}
+
 /* Locates and clears a region for a new top level page table. */
 void initialize_identity_maps(void)
 {
+	unsigned long start, size;
+
 	/* If running as an SEV guest, the encryption mask is required. */
 	set_sev_encryption_mask();
 
@@ -123,37 +145,24 @@ void initialize_identity_maps(void)
 	 */
 	top_level_pgt = read_cr3_pa();
 	if (p4d_offset((pgd_t *)top_level_pgt, 0) == (p4d_t *)_pgtable) {
-		debug_putstr("booted via startup_32()\n");
 		pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
 		pgt_data.pgt_buf_size = BOOT_PGT_SIZE - BOOT_INIT_PGT_SIZE;
 		memset(pgt_data.pgt_buf, 0, pgt_data.pgt_buf_size);
 	} else {
-		debug_putstr("booted via startup_64()\n");
 		pgt_data.pgt_buf = _pgtable;
 		pgt_data.pgt_buf_size = BOOT_PGT_SIZE;
 		memset(pgt_data.pgt_buf, 0, pgt_data.pgt_buf_size);
 		top_level_pgt = (unsigned long)alloc_pgt_page(&pgt_data);
 	}
-}
 
-/*
- * Adds the specified range to what will become the new identity mappings.
- * Once all ranges have been added, the new mapping is activated by calling
- * finalize_identity_maps() below.
- */
-void add_identity_map(unsigned long start, unsigned long size)
-{
-	unsigned long end = start + size;
-
-	/* Align boundary to 2M. */
-	start = round_down(start, PMD_SIZE);
-	end = round_up(end, PMD_SIZE);
-	if (start >= end)
-		return;
-
-	/* Build the mapping. */
-	kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
-				  start, end);
+	/*
+	 * New page-table is set up - map the kernel image and load it
+	 * into cr3.
+	 */
+	start = (unsigned long)_head;
+	size  = _end - _head;
+	add_identity_map(start, size);
+	write_cr3(top_level_pgt);
 }
 
 /*
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 7c61a8c5b9cf..856dc1c9bb0d 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -903,9 +903,6 @@ void choose_random_location(unsigned long input,
 
 	boot_params->hdr.loadflags |= KASLR_FLAG;
 
-	/* Prepare to add new identity pagetables on demand. */
-	initialize_identity_maps();
-
 	/* Record the various known unsafe memory ranges. */
 	mem_avoid_init(input, input_size, *output);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 16/70] x86/boot/compressed/64: Don't pre-map memory in KASLR code
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (14 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 15/70] x86/boot/compressed/64: Always switch to own page-table Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13   ` Joerg Roedel
                   ` (53 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

With the page-fault handler in place the identity mapping can be built
on-demand. So remove the code which manually creates the mappings and
unexport/remove the functions used for it.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/ident_map_64.c | 16 ++--------------
 arch/x86/boot/compressed/kaslr.c        | 24 +-----------------------
 arch/x86/boot/compressed/misc.h         | 10 ----------
 3 files changed, 3 insertions(+), 47 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index 6a3890caaa19..ab7a3d9705c0 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -89,11 +89,9 @@ phys_addr_t physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
 static struct x86_mapping_info mapping_info;
 
 /*
- * Adds the specified range to what will become the new identity mappings.
- * Once all ranges have been added, the new mapping is activated by calling
- * finalize_identity_maps() below.
+ * Adds the specified range to the identity mappings.
  */
-void add_identity_map(unsigned long start, unsigned long size)
+static void add_identity_map(unsigned long start, unsigned long size)
 {
 	unsigned long end = start + size;
 
@@ -165,16 +163,6 @@ void initialize_identity_maps(void)
 	write_cr3(top_level_pgt);
 }
 
-/*
- * This switches the page tables to the new level4 that has been built
- * via calls to add_identity_map() above. If booted via startup_32(),
- * this is effectively a no-op.
- */
-void finalize_identity_maps(void)
-{
-	write_cr3(top_level_pgt);
-}
-
 static void pf_error(unsigned long error_code, unsigned long address,
 		     struct pt_regs *regs)
 {
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 856dc1c9bb0d..c466fb738de0 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -399,8 +399,6 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
 	 */
 	mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
 	mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;
-	add_identity_map(mem_avoid[MEM_AVOID_ZO_RANGE].start,
-			 mem_avoid[MEM_AVOID_ZO_RANGE].size);
 
 	/* Avoid initrd. */
 	initrd_start  = (u64)boot_params->ext_ramdisk_image << 32;
@@ -420,14 +418,10 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
 		;
 	mem_avoid[MEM_AVOID_CMDLINE].start = cmd_line;
 	mem_avoid[MEM_AVOID_CMDLINE].size = cmd_line_size;
-	add_identity_map(mem_avoid[MEM_AVOID_CMDLINE].start,
-			 mem_avoid[MEM_AVOID_CMDLINE].size);
 
 	/* Avoid boot parameters. */
 	mem_avoid[MEM_AVOID_BOOTPARAMS].start = (unsigned long)boot_params;
 	mem_avoid[MEM_AVOID_BOOTPARAMS].size = sizeof(*boot_params);
-	add_identity_map(mem_avoid[MEM_AVOID_BOOTPARAMS].start,
-			 mem_avoid[MEM_AVOID_BOOTPARAMS].size);
 
 	/* We don't need to set a mapping for setup_data. */
 
@@ -436,11 +430,6 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
 
 	/* Enumerate the immovable memory regions */
 	num_immovable_mem = count_immovable_mem_regions();
-
-#ifdef CONFIG_X86_VERBOSE_BOOTUP
-	/* Make sure video RAM can be used. */
-	add_identity_map(0, PMD_SIZE);
-#endif
 }
 
 /*
@@ -919,19 +908,8 @@ void choose_random_location(unsigned long input,
 		warn("Physical KASLR disabled: no suitable memory region!");
 	} else {
 		/* Update the new physical address location. */
-		if (*output != random_addr) {
-			add_identity_map(random_addr, output_size);
+		if (*output != random_addr)
 			*output = random_addr;
-		}
-
-		/*
-		 * This loads the identity mapping page table.
-		 * This should only be done if a new physical address
-		 * is found for the kernel, otherwise we should keep
-		 * the old page table to make it be like the "nokaslr"
-		 * case.
-		 */
-		finalize_identity_maps();
 	}
 
 
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index eff4ed0b1cea..4e5bc688f467 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -98,17 +98,7 @@ static inline void choose_random_location(unsigned long input,
 #endif
 
 #ifdef CONFIG_X86_64
-void initialize_identity_maps(void);
-void add_identity_map(unsigned long start, unsigned long size);
-void finalize_identity_maps(void);
 extern unsigned char _pgtable[];
-#else
-static inline void initialize_identity_maps(void)
-{ }
-static inline void add_identity_map(unsigned long start, unsigned long size)
-{ }
-static inline void finalize_identity_maps(void)
-{ }
 #endif
 
 #ifdef CONFIG_EARLY_PRINTK
-- 
2.17.1
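
The on-demand scheme this patch relies on can be illustrated in plain
user-space C. The following is only a sketch and not code from the series:
it mirrors the check and the 2M rounding done by do_boot_page_fault(), but
fault_to_region() and struct region are hypothetical names, and the PF_*
constants are local copies of the architectural x86 page-fault error-code
bits (the kernel calls them X86_PF_PROT, X86_PF_WRITE, X86_PF_USER and
X86_PF_RSVD).

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the architectural x86 page-fault error-code bits. */
#define PF_PROT		(1ULL << 0)	/* fault on a present page */
#define PF_WRITE	(1ULL << 1)	/* write access */
#define PF_USER		(1ULL << 2)	/* fault came from user mode */
#define PF_RSVD		(1ULL << 3)	/* reserved bit set in a page-table entry */

/* A PMD entry maps 2M on x86-64. */
#define PMD_SIZE	(1ULL << 21)
#define PMD_MASK	(~(PMD_SIZE - 1))

/* Hypothetical type for illustration: half-open range [start, end). */
struct region {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: decide whether a boot-time fault may be fixed up
 * by identity mapping, and if so compute the 2M region to map.
 * Returns false for the error codes the handler treats as fatal.
 */
static bool fault_to_region(uint64_t error_code, uint64_t address,
			    struct region *out)
{
	if (error_code & (PF_PROT | PF_USER | PF_RSVD))
		return false;

	out->start = address & PMD_MASK;
	out->end   = out->start + PMD_SIZE;
	return true;
}
```

With this shape a fault at 0x201234 with error code 0 yields the region
[0x200000, 0x400000), while any fault with PF_PROT, PF_USER or PF_RSVD set
is rejected, which corresponds to the pf_error() path in the patch.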


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 17/70] x86/boot/compressed/64: Change add_identity_map() to take start and end
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Changing the function to take start and end as parameters instead of
start and size simplifies the callers, which don't need to calculate
the size if they already have start and end.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/ident_map_64.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index ab7a3d9705c0..ba5b88189220 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -91,10 +91,8 @@ static struct x86_mapping_info mapping_info;
 /*
  * Adds the specified range to the identity mappings.
  */
-static void add_identity_map(unsigned long start, unsigned long size)
+static void add_identity_map(unsigned long start, unsigned long end)
 {
-	unsigned long end = start + size;
-
 	/* Align boundary to 2M. */
 	start = round_down(start, PMD_SIZE);
 	end = round_up(end, PMD_SIZE);
@@ -109,8 +107,6 @@ static void add_identity_map(unsigned long start, unsigned long size)
 /* Locates and clears a region for a new top level page table. */
 void initialize_identity_maps(void)
 {
-	unsigned long start, size;
-
 	/* If running as an SEV guest, the encryption mask is required. */
 	set_sev_encryption_mask();
 
@@ -157,9 +153,7 @@ void initialize_identity_maps(void)
 	 * New page-table is set up - map the kernel image and load it
 	 * into cr3.
 	 */
-	start = (unsigned long)_head;
-	size  = _end - _head;
-	add_identity_map(start, size);
+	add_identity_map((unsigned long)_head, (unsigned long)_end);
 	write_cr3(top_level_pgt);
 }
 
@@ -180,7 +174,8 @@ static void pf_error(unsigned long error_code, unsigned long address,
 
 void do_boot_page_fault(struct pt_regs *regs)
 {
-	unsigned long address = native_read_cr2();
+	unsigned long address = native_read_cr2() & PMD_MASK;
+	unsigned long end = address + PMD_SIZE;
 	unsigned long error_code = regs->orig_ax;
 
 	/*
@@ -196,5 +191,5 @@ void do_boot_page_fault(struct pt_regs *regs)
 	 * Error code is sane - now identity map the 2M region around
 	 * the faulting address.
 	 */
-	add_identity_map(address & PMD_MASK, PMD_SIZE);
+	add_identity_map(address, end);
 }
-- 
2.17.1
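
As a rough illustration of the start/end convention, the 2M alignment that
add_identity_map() applies can be reproduced in user-space C. This is a
sketch under assumptions, not the kernel code: align_to_pmd() and
struct range are hypothetical names, and the rounding is written out by
hand instead of using the kernel's round_down()/round_up() macros.

```c
#include <stdint.h>

/* A PMD entry maps 2M on x86-64. */
#define PMD_SIZE	(1ULL << 21)

/* Hypothetical type for illustration: half-open range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: widen [start, end) to 2M boundaries, the same
 * rounding the patch performs before building the mapping.
 */
static struct range align_to_pmd(uint64_t start, uint64_t end)
{
	struct range r;

	r.start = start & ~(PMD_SIZE - 1);			/* round down */
	r.end   = (end + PMD_SIZE - 1) & ~(PMD_SIZE - 1);	/* round up */

	return r;
}
```

A caller that already has both ends, such as the kernel image between
_head and _end, passes them directly. For example, the range
[0x1000000, 0x1234567) widens to [0x1000000, 0x1400000), and a range that
is already 2M aligned is returned unchanged.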


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 18/70] x86/boot/compressed/64: Add stage1 #VC handler
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add the first handler for #VC exceptions. At stage 1 there is no GHCB
yet because we might still be on the EFI page table and thus can't map
memory unencrypted.

The stage 1 handler is limited to the MSR-based protocol to talk to
the hypervisor and can only support CPUID exit-codes, but that is
enough to get to stage 2.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/Makefile          |  1 +
 arch/x86/boot/compressed/idt_64.c          |  4 ++
 arch/x86/boot/compressed/idt_handlers_64.S |  4 ++
 arch/x86/boot/compressed/misc.h            |  1 +
 arch/x86/boot/compressed/sev-es.c          | 42 ++++++++++++++
 arch/x86/include/asm/msr-index.h           |  1 +
 arch/x86/include/asm/sev-es.h              | 45 +++++++++++++++
 arch/x86/include/asm/trap_defs.h           |  1 +
 arch/x86/kernel/sev-es-shared.c            | 65 ++++++++++++++++++++++
 9 files changed, 164 insertions(+)
 create mode 100644 arch/x86/boot/compressed/sev-es.c
 create mode 100644 arch/x86/include/asm/sev-es.h
 create mode 100644 arch/x86/kernel/sev-es-shared.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index e6b3e0fc48de..583678c78e1b 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -84,6 +84,7 @@ ifdef CONFIG_X86_64
 	vmlinux-objs-y += $(obj)/idt_64.o $(obj)/idt_handlers_64.o
 	vmlinux-objs-y += $(obj)/mem_encrypt.o
 	vmlinux-objs-y += $(obj)/pgtable_64.o
+	vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev-es.o
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
index 84ba57d9d436..bdd20dfd1fd0 100644
--- a/arch/x86/boot/compressed/idt_64.c
+++ b/arch/x86/boot/compressed/idt_64.c
@@ -31,6 +31,10 @@ void load_stage1_idt(void)
 {
 	boot_idt_desc.address = (unsigned long)boot_idt;
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	set_idt_entry(X86_TRAP_VC, boot_stage1_vc_handler);
+#endif
+
 	load_boot_idt(&boot_idt_desc);
 }
 
diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
index bfb3fc5aa144..67ddafab2943 100644
--- a/arch/x86/boot/compressed/idt_handlers_64.S
+++ b/arch/x86/boot/compressed/idt_handlers_64.S
@@ -75,3 +75,7 @@ SYM_FUNC_END(\name)
 	.code64
 
 EXCEPTION_HANDLER	boot_pf_handler do_boot_page_fault error_code=1
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+EXCEPTION_HANDLER	boot_stage1_vc_handler vc_no_ghcb_handler error_code=1
+#endif
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 4e5bc688f467..0e3508c5c15c 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -141,5 +141,6 @@ extern struct desc_ptr boot_idt_desc;
 
 /* IDT Entry Points */
 void boot_pf_handler(void);
+void boot_stage1_vc_handler(void);
 
 #endif /* BOOT_COMPRESSED_MISC_H */
diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
new file mode 100644
index 000000000000..eeeb3553547c
--- /dev/null
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * AMD Encrypted Register State Support
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ */
+
+#include <linux/kernel.h>
+
+#include <asm/sev-es.h>
+#include <asm/msr-index.h>
+#include <asm/ptrace.h>
+#include <asm/svm.h>
+
+#include "misc.h"
+
+static inline u64 sev_es_rd_ghcb_msr(void)
+{
+	unsigned long low, high;
+
+	asm volatile("rdmsr\n" : "=a" (low), "=d" (high) :
+			"c" (MSR_AMD64_SEV_ES_GHCB));
+
+	return ((high << 32) | low);
+}
+
+static inline void sev_es_wr_ghcb_msr(u64 val)
+{
+	u32 low, high;
+
+	low  = val & 0xffffffffUL;
+	high = val >> 32;
+
+	asm volatile("wrmsr\n" : : "c" (MSR_AMD64_SEV_ES_GHCB),
+			"a"(low), "d" (high) : "memory");
+}
+
+#undef __init
+#define __init
+
+/* Include code for early handlers */
+#include "../../kernel/sev-es-shared.c"
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d5e517d1c3dd..9eb279927fc2 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -432,6 +432,7 @@
 #define MSR_AMD64_IBSBRTARGET		0xc001103b
 #define MSR_AMD64_IBSOPDATA4		0xc001103d
 #define MSR_AMD64_IBS_REG_COUNT_MAX	8 /* includes MSR_AMD64_IBSBRTARGET */
+#define MSR_AMD64_SEV_ES_GHCB		0xc0010130
 #define MSR_AMD64_SEV			0xc0010131
 #define MSR_AMD64_SEV_ENABLED_BIT	0
 #define MSR_AMD64_SEV_ENABLED		BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
new file mode 100644
index 000000000000..f524b40aef07
--- /dev/null
+++ b/arch/x86/include/asm/sev-es.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD Encrypted Register State Support
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ */
+
+#ifndef __ASM_ENCRYPTED_STATE_H
+#define __ASM_ENCRYPTED_STATE_H
+
+#include <linux/types.h>
+
+#define GHCB_SEV_CPUID_REQ	0x004UL
+#define		GHCB_CPUID_REQ_EAX	0
+#define		GHCB_CPUID_REQ_EBX	1
+#define		GHCB_CPUID_REQ_ECX	2
+#define		GHCB_CPUID_REQ_EDX	3
+#define		GHCB_CPUID_REQ(fn, reg) (GHCB_SEV_CPUID_REQ | \
+					(((unsigned long)(reg) & 3) << 30) | \
+					(((unsigned long)(fn)) << 32))
+
+#define GHCB_SEV_CPUID_RESP	0x005UL
+#define GHCB_SEV_TERMINATE	0x100UL
+
+#define	GHCB_SEV_GHCB_RESP_CODE(v)	((v) & 0xfff)
+#define	VMGEXIT()			{ asm volatile("rep; vmmcall\n\r"); }
+
+static inline u64 lower_bits(u64 val, unsigned int bits)
+{
+	u64 mask = (1ULL << bits) - 1;
+
+	return (val & mask);
+}
+
+static inline u64 copy_lower_bits(u64 out, u64 in, unsigned int bits)
+{
+	u64 mask = (1ULL << bits) - 1;
+
+	out &= ~mask;
+	out |= lower_bits(in, bits);
+
+	return out;
+}
+
+#endif
diff --git a/arch/x86/include/asm/trap_defs.h b/arch/x86/include/asm/trap_defs.h
index 488f82ac36da..af45d65f0458 100644
--- a/arch/x86/include/asm/trap_defs.h
+++ b/arch/x86/include/asm/trap_defs.h
@@ -24,6 +24,7 @@ enum {
 	X86_TRAP_AC,		/* 17, Alignment Check */
 	X86_TRAP_MC,		/* 18, Machine Check */
 	X86_TRAP_XF,		/* 19, SIMD Floating-Point Exception */
+	X86_TRAP_VC = 29,	/* 29, VMM Communication Exception */
 	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
 };
 
diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
new file mode 100644
index 000000000000..e963b48d3e86
--- /dev/null
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * AMD Encrypted Register State Support
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ *
+ * This file is not compiled stand-alone. It contains code shared
+ * between the pre-decompression boot code and the running Linux kernel
+ * and is included directly into both code-bases.
+ */
+
+/*
+ * Boot VC Handler - This is the first VC handler during boot, there is no GHCB
+ * page yet, so it only supports the MSR based communication with the
+ * hypervisor and only the CPUID exit-code.
+ */
+void __init vc_no_ghcb_handler(struct pt_regs *regs, unsigned long exit_code)
+{
+	unsigned int fn = lower_bits(regs->ax, 32);
+	unsigned long val;
+
+	/* Only CPUID is supported via MSR protocol */
+	if (exit_code != SVM_EXIT_CPUID)
+		goto fail;
+
+	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EAX));
+	VMGEXIT();
+	val = sev_es_rd_ghcb_msr();
+	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
+		goto fail;
+	regs->ax = val >> 32;
+
+	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EBX));
+	VMGEXIT();
+	val = sev_es_rd_ghcb_msr();
+	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
+		goto fail;
+	regs->bx = val >> 32;
+
+	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_ECX));
+	VMGEXIT();
+	val = sev_es_rd_ghcb_msr();
+	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
+		goto fail;
+	regs->cx = val >> 32;
+
+	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EDX));
+	VMGEXIT();
+	val = sev_es_rd_ghcb_msr();
+	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
+		goto fail;
+	regs->dx = val >> 32;
+
+	regs->ip += 2;
+
+	return;
+
+fail:
+	sev_es_wr_ghcb_msr(GHCB_SEV_TERMINATE);
+	VMGEXIT();
+
+	/* Shouldn't get here - if we do halt the machine */
+	while (true)
+		asm volatile("hlt\n");
+}
-- 
2.17.1
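
The MSR-based CPUID request and response layout used by the stage 1
handler can be exercised in user-space C. The sketch below only mirrors
the bit layout of the GHCB_CPUID_REQ() and GHCB_SEV_GHCB_RESP_CODE()
macros from this patch: the function names and the enum are hypothetical,
and no MSR access or VMGEXIT takes place.

```c
#include <stdint.h>

/* Request and response codes, as defined in the patch's sev-es.h. */
#define GHCB_SEV_CPUID_REQ	0x004ULL
#define GHCB_SEV_CPUID_RESP	0x005ULL

/* Register selector values, matching GHCB_CPUID_REQ_E{A,B,C,D}X. */
enum cpuid_reg {
	REQ_EAX = 0,
	REQ_EBX = 1,
	REQ_ECX = 2,
	REQ_EDX = 3,
};

/*
 * Hypothetical helper: build the value written to the GHCB MSR.
 * Bits 0-11 hold the request code, bits 30-31 select the register and
 * bits 32-63 carry the CPUID function.
 */
static uint64_t ghcb_cpuid_req(uint32_t fn, enum cpuid_reg reg)
{
	return GHCB_SEV_CPUID_REQ |
	       (((uint64_t)reg & 3) << 30) |
	       ((uint64_t)fn << 32);
}

/*
 * Hypothetical helper: check the response code in bits 0-11 and extract
 * the register value from bits 32-63. Returns 0 on success, -1 otherwise.
 */
static int ghcb_cpuid_resp(uint64_t msr, uint32_t *value)
{
	if ((msr & 0xfff) != GHCB_SEV_CPUID_RESP)
		return -1;

	*value = (uint32_t)(msr >> 32);
	return 0;
}
```

One request fetches a single register, which is why vc_no_ghcb_handler()
repeats the write/VMGEXIT/read sequence four times. In this sketch a
request for EBX of function 0x8000001f encodes to 0x8000001f40000004.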


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 18/70] x86/boot/compressed/64: Add stage1 #VC handler
@ 2020-03-19  9:13   ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: Juergen Gross, Tom Lendacky, Thomas Hellstrom, Joerg Roedel,
	Kees Cook, kvm, Peter Zijlstra, Joerg Roedel, Dave Hansen,
	linux-kernel, virtualization, Andy Lutomirski, hpa, Dan Williams,
	Jiri Slaby

From: Joerg Roedel <jroedel@suse.de>

Add the first handler for #VC exceptions. At stage 1 there is no GHCB
yet becaue we might still be on the EFI page table and thus can't map
memory unencrypted.

The stage 1 handler is limited to the MSR based protocol to talk to
the hypervisor and can only support CPUID exit-codes, but that is
enough to get to stage 2.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/Makefile          |  1 +
 arch/x86/boot/compressed/idt_64.c          |  4 ++
 arch/x86/boot/compressed/idt_handlers_64.S |  4 ++
 arch/x86/boot/compressed/misc.h            |  1 +
 arch/x86/boot/compressed/sev-es.c          | 42 ++++++++++++++
 arch/x86/include/asm/msr-index.h           |  1 +
 arch/x86/include/asm/sev-es.h              | 45 +++++++++++++++
 arch/x86/include/asm/trap_defs.h           |  1 +
 arch/x86/kernel/sev-es-shared.c            | 65 ++++++++++++++++++++++
 9 files changed, 164 insertions(+)
 create mode 100644 arch/x86/boot/compressed/sev-es.c
 create mode 100644 arch/x86/include/asm/sev-es.h
 create mode 100644 arch/x86/kernel/sev-es-shared.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index e6b3e0fc48de..583678c78e1b 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -84,6 +84,7 @@ ifdef CONFIG_X86_64
 	vmlinux-objs-y += $(obj)/idt_64.o $(obj)/idt_handlers_64.o
 	vmlinux-objs-y += $(obj)/mem_encrypt.o
 	vmlinux-objs-y += $(obj)/pgtable_64.o
+	vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev-es.o
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
index 84ba57d9d436..bdd20dfd1fd0 100644
--- a/arch/x86/boot/compressed/idt_64.c
+++ b/arch/x86/boot/compressed/idt_64.c
@@ -31,6 +31,10 @@ void load_stage1_idt(void)
 {
 	boot_idt_desc.address = (unsigned long)boot_idt;
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	set_idt_entry(X86_TRAP_VC, boot_stage1_vc_handler);
+#endif
+
 	load_boot_idt(&boot_idt_desc);
 }
 
diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
index bfb3fc5aa144..67ddafab2943 100644
--- a/arch/x86/boot/compressed/idt_handlers_64.S
+++ b/arch/x86/boot/compressed/idt_handlers_64.S
@@ -75,3 +75,7 @@ SYM_FUNC_END(\name)
 	.code64
 
 EXCEPTION_HANDLER	boot_pf_handler do_boot_page_fault error_code=1
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+EXCEPTION_HANDLER	boot_stage1_vc_handler vc_no_ghcb_handler error_code=1
+#endif
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 4e5bc688f467..0e3508c5c15c 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -141,5 +141,6 @@ extern struct desc_ptr boot_idt_desc;
 
 /* IDT Entry Points */
 void boot_pf_handler(void);
+void boot_stage1_vc_handler(void);
 
 #endif /* BOOT_COMPRESSED_MISC_H */
diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
new file mode 100644
index 000000000000..eeeb3553547c
--- /dev/null
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * AMD Encrypted Register State Support
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ */
+
+#include <linux/kernel.h>
+
+#include <asm/sev-es.h>
+#include <asm/msr-index.h>
+#include <asm/ptrace.h>
+#include <asm/svm.h>
+
+#include "misc.h"
+
+static inline u64 sev_es_rd_ghcb_msr(void)
+{
+	unsigned long low, high;
+
+	asm volatile("rdmsr\n" : "=a" (low), "=d" (high) :
+			"c" (MSR_AMD64_SEV_ES_GHCB));
+
+	return ((high << 32) | low);
+}
+
+static inline void sev_es_wr_ghcb_msr(u64 val)
+{
+	u32 low, high;
+
+	low  = val & 0xffffffffUL;
+	high = val >> 32;
+
+	asm volatile("wrmsr\n" : : "c" (MSR_AMD64_SEV_ES_GHCB),
+			"a"(low), "d" (high) : "memory");
+}
+
+#undef __init
+#define __init
+
+/* Include code for early handlers */
+#include "../../kernel/sev-es-shared.c"
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d5e517d1c3dd..9eb279927fc2 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -432,6 +432,7 @@
 #define MSR_AMD64_IBSBRTARGET		0xc001103b
 #define MSR_AMD64_IBSOPDATA4		0xc001103d
 #define MSR_AMD64_IBS_REG_COUNT_MAX	8 /* includes MSR_AMD64_IBSBRTARGET */
+#define MSR_AMD64_SEV_ES_GHCB		0xc0010130
 #define MSR_AMD64_SEV			0xc0010131
 #define MSR_AMD64_SEV_ENABLED_BIT	0
 #define MSR_AMD64_SEV_ENABLED		BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
new file mode 100644
index 000000000000..f524b40aef07
--- /dev/null
+++ b/arch/x86/include/asm/sev-es.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD Encrypted Register State Support
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ */
+
+#ifndef __ASM_ENCRYPTED_STATE_H
+#define __ASM_ENCRYPTED_STATE_H
+
+#include <linux/types.h>
+
+#define GHCB_SEV_CPUID_REQ	0x004UL
+#define		GHCB_CPUID_REQ_EAX	0
+#define		GHCB_CPUID_REQ_EBX	1
+#define		GHCB_CPUID_REQ_ECX	2
+#define		GHCB_CPUID_REQ_EDX	3
+#define		GHCB_CPUID_REQ(fn, reg) (GHCB_SEV_CPUID_REQ | \
+					(((unsigned long)reg & 3) << 30) | \
+					(((unsigned long)fn) << 32))
+
+#define GHCB_SEV_CPUID_RESP	0x005UL
+#define GHCB_SEV_TERMINATE	0x100UL
+
+#define	GHCB_SEV_GHCB_RESP_CODE(v)	((v) & 0xfff)
+#define	VMGEXIT()			{ asm volatile("rep; vmmcall\n\r"); }
+
+static inline u64 lower_bits(u64 val, unsigned int bits)
+{
+	u64 mask = (1ULL << bits) - 1;
+
+	return (val & mask);
+}
+
+static inline u64 copy_lower_bits(u64 out, u64 in, unsigned int bits)
+{
+	u64 mask = (1ULL << bits) - 1;
+
+	out &= ~mask;
+	out |= lower_bits(in, bits);
+
+	return out;
+}
+
+#endif
diff --git a/arch/x86/include/asm/trap_defs.h b/arch/x86/include/asm/trap_defs.h
index 488f82ac36da..af45d65f0458 100644
--- a/arch/x86/include/asm/trap_defs.h
+++ b/arch/x86/include/asm/trap_defs.h
@@ -24,6 +24,7 @@ enum {
 	X86_TRAP_AC,		/* 17, Alignment Check */
 	X86_TRAP_MC,		/* 18, Machine Check */
 	X86_TRAP_XF,		/* 19, SIMD Floating-Point Exception */
+	X86_TRAP_VC = 29,	/* 29, VMM Communication Exception */
 	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
 };
 
diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
new file mode 100644
index 000000000000..e963b48d3e86
--- /dev/null
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * AMD Encrypted Register State Support
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ *
+ * This file is not compiled stand-alone. It contains code shared
+ * between the pre-decompression boot code and the running Linux kernel
+ * and is included directly into both code-bases.
+ */
+
+/*
+ * Boot VC Handler - This is the first VC handler during boot, there is no GHCB
+ * page yet, so it only supports the MSR based communication with the
+ * hypervisor and only the CPUID exit-code.
+ */
+void __init vc_no_ghcb_handler(struct pt_regs *regs, unsigned long exit_code)
+{
+	unsigned int fn = lower_bits(regs->ax, 32);
+	unsigned long val;
+
+	/* Only CPUID is supported via MSR protocol */
+	if (exit_code != SVM_EXIT_CPUID)
+		goto fail;
+
+	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EAX));
+	VMGEXIT();
+	val = sev_es_rd_ghcb_msr();
+	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
+		goto fail;
+	regs->ax = val >> 32;
+
+	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EBX));
+	VMGEXIT();
+	val = sev_es_rd_ghcb_msr();
+	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
+		goto fail;
+	regs->bx = val >> 32;
+
+	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_ECX));
+	VMGEXIT();
+	val = sev_es_rd_ghcb_msr();
+	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
+		goto fail;
+	regs->cx = val >> 32;
+
+	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EDX));
+	VMGEXIT();
+	val = sev_es_rd_ghcb_msr();
+	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
+		goto fail;
+	regs->dx = val >> 32;
+
+	regs->ip += 2;
+
+	return;
+
+fail:
+	sev_es_wr_ghcb_msr(GHCB_SEV_TERMINATE);
+	VMGEXIT();
+
+	/* Shouldn't get here - if we do, halt the machine */
+	while (true)
+		asm volatile("hlt\n");
+}
-- 
2.17.1
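
For readers following the MSR-based protocol used by
vc_no_ghcb_handler() above, here is a minimal user-space sketch of how
the CPUID request value is packed and how the response is unpacked. It
only mirrors the bit layout of GHCB_CPUID_REQ() and
GHCB_SEV_GHCB_RESP_CODE() from this patch; the names ghcb_cpuid_req(),
ghcb_cpuid_resp() and enum cpuid_reg are made up for illustration, and
no MSR access or VMGEXIT is performed.

```c
#include <stdint.h>

/* Values copied from the sev-es.h added by this patch */
#define GHCB_SEV_CPUID_REQ	0x004ULL
#define GHCB_SEV_CPUID_RESP	0x005ULL

/* Hypothetical names, matching GHCB_CPUID_REQ_EAX..EDX (0..3) */
enum cpuid_reg { REG_EAX = 0, REG_EBX = 1, REG_ECX = 2, REG_EDX = 3 };

/*
 * Pack a request: request code in bits 11:0, register selector in
 * bits 31:30, CPUID function number in bits 63:32.
 */
static uint64_t ghcb_cpuid_req(uint32_t fn, enum cpuid_reg reg)
{
	return GHCB_SEV_CPUID_REQ |
	       (((uint64_t)reg & 3) << 30) |
	       ((uint64_t)fn << 32);
}

/*
 * Unpack a response: the low 12 bits must carry the CPUID response
 * code, the register value is returned in bits 63:32.
 * Returns 0 on success, -1 if the response code does not match.
 */
static int ghcb_cpuid_resp(uint64_t msr, uint32_t *value)
{
	if ((msr & 0xfff) != GHCB_SEV_CPUID_RESP)
		return -1;

	*value = (uint32_t)(msr >> 32);
	return 0;
}
```

One request/response round trip is needed per register, which is why
the handler repeats the write/VMGEXIT/read sequence four times.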

^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 19/70] x86/boot/compressed/64: Call set_sev_encryption_mask earlier
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Call set_sev_encryption_mask() while still on the stage 1 #VC handler.
The stage 2 handler needs our own page-tables to be set up, and setting
up those page-tables requires the encryption mask to be known first.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/head_64.S      | 8 +++++++-
 arch/x86/boot/compressed/ident_map_64.c | 3 ---
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 5164d2e8631a..fdebbfafe5a2 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -491,9 +491,15 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
 	rep	stosq
 
 /*
- * Load stage2 IDT and switch to our own page-table
+ * If running as an SEV guest, the encryption mask is required in the
+ * page-table setup code below. When the guest also has SEV-ES enabled
+ * set_sev_encryption_mask() will cause #VC exceptions, but the stage2
+ * handler can't map its GHCB because the page-table is not set up yet.
+ * So set up the encryption mask here while still on the stage1 #VC
+ * handler. Then load stage2 IDT and switch to our own page-table.
  */
 	pushq	%rsi
+	call	set_sev_encryption_mask
 	call	load_stage2_idt
 	call	initialize_identity_maps
 	popq	%rsi
diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index ba5b88189220..5b720736a789 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -107,9 +107,6 @@ static void add_identity_map(unsigned long start, unsigned long end)
 /* Locates and clears a region for a new top level page table. */
 void initialize_identity_maps(void)
 {
-	/* If running as an SEV guest, the encryption mask is required. */
-	set_sev_encryption_mask();
-
 	/* Exclude the encryption mask from __PHYSICAL_MASK */
 	physical_mask &= ~sme_me_mask;
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 20/70] x86/boot/compressed/64: Check return value of kernel_ident_mapping_init()
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (18 preceding siblings ...)
  2020-03-19  9:13   ` Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13   ` Joerg Roedel
                   ` (49 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

kernel_ident_mapping_init() can fail to create an identity mapping.
Check its return value and bail out if it fails.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/ident_map_64.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index 5b720736a789..feb180cced28 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -93,6 +93,8 @@ static struct x86_mapping_info mapping_info;
  */
 static void add_identity_map(unsigned long start, unsigned long end)
 {
+	int ret;
+
 	/* Align boundary to 2M. */
 	start = round_down(start, PMD_SIZE);
 	end = round_up(end, PMD_SIZE);
@@ -100,8 +102,9 @@ static void add_identity_map(unsigned long start, unsigned long end)
 		return;
 
 	/* Build the mapping. */
-	kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
-				  start, end);
+	ret = kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt, start, end);
+	if (ret)
+		error("Error: kernel_ident_mapping_init() failed\n");
 }
 
 /* Locates and clears a region for a new top level page table. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 21/70] x86/boot/compressed/64: Add function to map a page unencrypted
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add set_page_decrypted(), which is needed to map the GHCB for SEV-ES
guests. The GHCB is used for communication with the hypervisor, so its
content must not be encrypted.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
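As an illustration of what the page-table manipulation below does, here
is a small user-space model of splitting one 2M mapping into 512 4K
entries and then clearing the encryption bit on a single entry. This is
only a sketch: split_large_mapping() and set_decrypted() are
hypothetical names, the table is a plain array, and FLAG_ENC is an
assumed C-bit position (the real position is reported by the hardware).
No TLB or cache flushing is modelled.

```c
#include <stdint.h>

#define PAGE_SIZE	4096ULL
#define PTRS_PER_PTE	512
#define PMD_SIZE	(PTRS_PER_PTE * PAGE_SIZE)	/* 2M */
#define PMD_MASK	(~(PMD_SIZE - 1))
#define FLAG_PSE	(1ULL << 7)	/* large-page bit */
#define FLAG_ENC	(1ULL << 47)	/* assumed C-bit, model only */

/*
 * Fill 'pte' with 4K entries covering the 2M region that contains
 * 'address', inheriting 'page_flags' minus the large-page bit.
 * Returns the index of the entry that maps 'address'.
 */
static unsigned int split_large_mapping(uint64_t pte[PTRS_PER_PTE],
					uint64_t address,
					uint64_t page_flags)
{
	uint64_t base  = address & PMD_MASK;
	uint64_t flags = page_flags & ~FLAG_PSE;
	unsigned int i;

	for (i = 0; i < PTRS_PER_PTE; i++)
		pte[i] = (base + i * PAGE_SIZE) | flags;

	return (unsigned int)((address / PAGE_SIZE) % PTRS_PER_PTE);
}

/* Clear the encryption bit on one 4K entry only */
static void set_decrypted(uint64_t *pte)
{
	*pte &= ~FLAG_ENC;
}
```

Only the entry covering the GHCB loses the encryption bit; the other
511 entries keep the attributes of the original large mapping.
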
 arch/x86/boot/compressed/ident_map_64.c | 125 ++++++++++++++++++++++++
 arch/x86/boot/compressed/misc.h         |   1 +
 2 files changed, 126 insertions(+)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index feb180cced28..04a5ff4bda66 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -26,6 +26,7 @@
 #include <asm/init.h>
 #include <asm/pgtable.h>
 #include <asm/trap_defs.h>
+#include <asm/cmpxchg.h>
 /* Use the static base for this part of the boot process */
 #undef __PAGE_OFFSET
 #define __PAGE_OFFSET __PAGE_OFFSET_BASE
@@ -157,6 +158,130 @@ void initialize_identity_maps(void)
 	write_cr3(top_level_pgt);
 }
 
+static pte_t *split_large_pmd(struct x86_mapping_info *info,
+			      pmd_t *pmdp, unsigned long __address)
+{
+	unsigned long page_flags;
+	unsigned long address;
+	pte_t *pte;
+	pmd_t pmd;
+	int i;
+
+	pte = (pte_t *)info->alloc_pgt_page(info->context);
+	if (!pte)
+		return NULL;
+
+	address     = __address & PMD_MASK;
+	/* No large page - clear PSE flag */
+	page_flags  = info->page_flag & ~_PAGE_PSE;
+
+	/* Populate the PTEs */
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		set_pte(&pte[i], __pte(address | page_flags));
+		address += PAGE_SIZE;
+	}
+
+	/*
+	 * Ideally we need to clear the large PMD first and do a TLB
+	 * flush before we write the new PMD. But the 2M range of the
+	 * PMD might contain the code we execute and/or the stack
+	 * we are on, so we can't do that. But that should be safe here
+	 * because we are going from large to small mappings and we are
+	 * also the only user of the page-table, so there is no chance
+	 * of a TLB multihit.
+	 */
+	pmd = __pmd((unsigned long)pte | info->kernpg_flag);
+	set_pmd(pmdp, pmd);
+	/* Flush TLB to establish the new PMD */
+	write_cr3(top_level_pgt);
+
+	return pte + pte_index(__address);
+}
+
+static void clflush_page(unsigned long address)
+{
+	unsigned int flush_size;
+	char *cl, *start, *end;
+
+	/*
+	 * Hardcode cl-size to 64 - CPUID can't be used here because that might
+	 * cause another #VC exception and the GHCB is not ready to use yet.
+	 */
+	flush_size = 64;
+	start      = (char *)(address & PAGE_MASK);
+	end        = start + PAGE_SIZE;
+
+	/*
+	 * First make sure there are no pending writes on the cache-lines to
+	 * flush.
+	 */
+	asm volatile("mfence" : : : "memory");
+
+	for (cl = start; cl != end; cl += flush_size)
+		clflush(cl);
+}
+
+static int __set_page_decrypted(struct x86_mapping_info *info,
+				unsigned long address)
+{
+	unsigned long scratch, *target;
+	pgd_t *pgdp = (pgd_t *)top_level_pgt;
+	p4d_t *p4dp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ptep, pte;
+
+	/*
+	 * First make sure there is a PMD mapping for 'address'.
+	 * It should already exist, but keep things generic.
+	 *
+	 * To map the page just read from it and fault it in if there is no
+	 * mapping yet. add_identity_map() can't be called here because that
+	 * would unconditionally map the address on PMD level, destroying any
+	 * PTE-level mappings that might already exist.  Also do something
+	 * useless with 'scratch' so the access won't be optimized away.
+	 */
+	target = (unsigned long *)address;
+	scratch = *target;
+	arch_cmpxchg(target, scratch, scratch);
+
+	/*
+	 * The page is mapped at least with PMD size - so skip checks and walk
+	 * directly to the PMD.
+	 */
+	p4dp = p4d_offset(pgdp, address);
+	pudp = pud_offset(p4dp, address);
+	pmdp = pmd_offset(pudp, address);
+
+	if (pmd_large(*pmdp))
+		ptep = split_large_pmd(info, pmdp, address);
+	else
+		ptep = pte_offset_kernel(pmdp, address);
+
+	if (!ptep)
+		return -ENOMEM;
+
+	/* Clear encryption flag and write new pte */
+	pte = pte_clear_flags(*ptep, _PAGE_ENC);
+	set_pte(ptep, pte);
+
+	/* Flush TLB to map the page unencrypted */
+	write_cr3(top_level_pgt);
+
+	/*
+	 * Changing the encryption attributes of a page requires flushing it
+	 * from the caches.
+	 */
+	clflush_page(address);
+
+	return 0;
+}
+
+int set_page_decrypted(unsigned long address)
+{
+	return __set_page_decrypted(&mapping_info, address);
+}
+
 static void pf_error(unsigned long error_code, unsigned long address,
 		     struct pt_regs *regs)
 {
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 0e3508c5c15c..42f68a858a35 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -98,6 +98,7 @@ static inline void choose_random_location(unsigned long input,
 #endif
 
 #ifdef CONFIG_X86_64
+extern int set_page_decrypted(unsigned long address);
 extern unsigned char _pgtable[];
 #endif
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 22/70] x86/boot/compressed/64: Setup GHCB Based VC Exception handler
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Install an exception handler for the #VC exception that uses a GHCB.
Also add the infrastructure for handling different exit-codes by
decoding the instruction that caused the exception, and add error
handling.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
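The version check done by sev_es_negotiate_protocol() below can be
sketched in user space as follows. The field layout follows the
GHCB_INFO(), GHCB_PROTO_MAX() and GHCB_PROTO_MIN() macros added by this
patch; the function name ghcb_version_supported() is hypothetical and
the MSR value is passed in as a plain integer instead of being read
after a VMGEXIT.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the sev-es.h changes in this patch */
#define GHCB_SEV_INFO	0x001ULL
#define GHCB_PROTO_OUR	0x0001ULL

/*
 * Decide whether the hypervisor's answer to GHCB_SEV_INFO_REQ is
 * usable: the response code must be GHCB_SEV_INFO and our protocol
 * version must lie within the [min, max] range the hypervisor offers.
 */
static bool ghcb_version_supported(uint64_t msr)
{
	uint64_t info = msr & 0xfff;		/* bits 11:0  */
	uint64_t min  = (msr >> 32) & 0xffff;	/* bits 47:32 */
	uint64_t max  = (msr >> 48) & 0xffff;	/* bits 63:48 */

	if (info != GHCB_SEV_INFO)
		return false;

	return min <= GHCB_PROTO_OUR && GHCB_PROTO_OUR <= max;
}
```
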
 arch/x86/Kconfig                           |   1 +
 arch/x86/boot/compressed/idt_64.c          |   4 +
 arch/x86/boot/compressed/idt_handlers_64.S |   1 +
 arch/x86/boot/compressed/misc.h            |   1 +
 arch/x86/boot/compressed/sev-es.c          |  94 ++++++++++++++
 arch/x86/include/asm/sev-es.h              |  33 +++++
 arch/x86/include/uapi/asm/svm.h            |   1 +
 arch/x86/kernel/sev-es-shared.c            | 142 +++++++++++++++++++++
 8 files changed, 277 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..c12347492589 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1526,6 +1526,7 @@ config AMD_MEM_ENCRYPT
 	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
 	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select INSTRUCTION_DECODER
 	---help---
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
index bdd20dfd1fd0..eebb2f857dac 100644
--- a/arch/x86/boot/compressed/idt_64.c
+++ b/arch/x86/boot/compressed/idt_64.c
@@ -45,5 +45,9 @@ void load_stage2_idt(void)
 
 	set_idt_entry(X86_TRAP_PF, boot_pf_handler);
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	set_idt_entry(X86_TRAP_VC, boot_stage2_vc_handler);
+#endif
+
 	load_boot_idt(&boot_idt_desc);
 }
diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
index 67ddafab2943..04edeb73d2cf 100644
--- a/arch/x86/boot/compressed/idt_handlers_64.S
+++ b/arch/x86/boot/compressed/idt_handlers_64.S
@@ -78,4 +78,5 @@ EXCEPTION_HANDLER	boot_pf_handler do_boot_page_fault error_code=1
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 EXCEPTION_HANDLER	boot_stage1_vc_handler vc_no_ghcb_handler error_code=1
+EXCEPTION_HANDLER	boot_stage2_vc_handler boot_vc_handler error_code=1
 #endif
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 42f68a858a35..567d71ab5ed9 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -143,5 +143,6 @@ extern struct desc_ptr boot_idt_desc;
 /* IDT Entry Points */
 void boot_pf_handler(void);
 void boot_stage1_vc_handler(void);
+void boot_stage2_vc_handler(void);
 
 #endif /* BOOT_COMPRESSED_MISC_H */
diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
index eeeb3553547c..193c970a3379 100644
--- a/arch/x86/boot/compressed/sev-es.c
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -8,12 +8,16 @@
 #include <linux/kernel.h>
 
 #include <asm/sev-es.h>
+#include <asm/trap_defs.h>
 #include <asm/msr-index.h>
 #include <asm/ptrace.h>
 #include <asm/svm.h>
 
 #include "misc.h"
 
+struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
+struct ghcb *boot_ghcb;
+
 static inline u64 sev_es_rd_ghcb_msr(void)
 {
 	unsigned long low, high;
@@ -35,8 +39,98 @@ static inline void sev_es_wr_ghcb_msr(u64 val)
 			"a"(low), "d" (high) : "memory");
 }
 
+static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
+{
+	char buffer[MAX_INSN_SIZE];
+	enum es_result ret;
+
+	memcpy(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
+
+	insn_init(&ctxt->insn, buffer, MAX_INSN_SIZE, 1);
+	insn_get_length(&ctxt->insn);
+
+	ret = ctxt->insn.immediate.got ? ES_OK : ES_DECODE_FAILED;
+
+	return ret;
+}
+
+static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
+				   void *dst, char *buf, size_t size)
+{
+	memcpy(dst, buf, size);
+
+	return ES_OK;
+}
+
+static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
+				  void *src, char *buf, size_t size)
+{
+	memcpy(buf, src, size);
+
+	return ES_OK;
+}
+
 #undef __init
+#undef __pa
 #define __init
+#define __pa(x)	((unsigned long)(x))
+
+#define __BOOT_COMPRESSED
+
+/* Basic instruction decoding support needed */
+#include "../../lib/inat.c"
+#include "../../lib/insn.c"
 
 /* Include code for early handlers */
 #include "../../kernel/sev-es-shared.c"
+
+static bool sev_es_setup_ghcb(void)
+{
+	if (!sev_es_negotiate_protocol())
+		sev_es_terminate(GHCB_SEV_ES_REASON_PROTOCOL_UNSUPPORTED);
+
+	if (set_page_decrypted((unsigned long)&boot_ghcb_page))
+		return false;
+
+	/* Page is now mapped decrypted, clear it */
+	memset(&boot_ghcb_page, 0, sizeof(boot_ghcb_page));
+
+	boot_ghcb = &boot_ghcb_page;
+
+	/* Initialize lookup tables for the instruction decoder */
+	inat_init_tables();
+
+	return true;
+}
+
+void boot_vc_handler(struct pt_regs *regs, unsigned long exit_code)
+{
+	struct es_em_ctxt ctxt;
+	enum es_result result;
+
+	if (!boot_ghcb && !sev_es_setup_ghcb())
+		sev_es_terminate(GHCB_SEV_ES_REASON_GENERAL_REQUEST);
+
+	vc_ghcb_invalidate(boot_ghcb);
+	result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+	if (result != ES_OK)
+		goto finish;
+
+	switch (exit_code) {
+	default:
+		result = ES_UNSUPPORTED;
+		break;
+	}
+
+finish:
+	if (result == ES_OK) {
+		vc_finish_insn(&ctxt);
+	} else if (result != ES_RETRY) {
+		/*
+		 * For now, just halt the machine. That makes debugging easier;
+		 * later we will just call sev_es_terminate() here.
+		 */
+		while (true)
+			asm volatile("hlt\n");
+	}
+}
diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
index f524b40aef07..512d3ccb9832 100644
--- a/arch/x86/include/asm/sev-es.h
+++ b/arch/x86/include/asm/sev-es.h
@@ -9,7 +9,14 @@
 #define __ASM_ENCRYPTED_STATE_H
 
 #include <linux/types.h>
+#include <asm/insn.h>
 
+#define GHCB_SEV_INFO		0x001UL
+#define GHCB_SEV_INFO_REQ	0x002UL
+#define		GHCB_INFO(v)		((v) & 0xfffUL)
+#define		GHCB_PROTO_MAX(v)	(((v) >> 48) & 0xffffUL)
+#define		GHCB_PROTO_MIN(v)	(((v) >> 32) & 0xffffUL)
+#define		GHCB_PROTO_OUR		0x0001UL
 #define GHCB_SEV_CPUID_REQ	0x004UL
 #define		GHCB_CPUID_REQ_EAX	0
 #define		GHCB_CPUID_REQ_EBX	1
@@ -21,10 +28,36 @@
 
 #define GHCB_SEV_CPUID_RESP	0x005UL
 #define GHCB_SEV_TERMINATE	0x100UL
+#define		GHCB_SEV_ES_REASON_GENERAL_REQUEST	0
+#define		GHCB_SEV_ES_REASON_PROTOCOL_UNSUPPORTED	1
 
 #define	GHCB_SEV_GHCB_RESP_CODE(v)	((v) & 0xfff)
 #define	VMGEXIT()			{ asm volatile("rep; vmmcall\n\r"); }
 
+enum es_result {
+	ES_OK,			/* All good */
+	ES_UNSUPPORTED,		/* Requested operation not supported */
+	ES_VMM_ERROR,		/* Unexpected state from the VMM */
+	ES_DECODE_FAILED,	/* Instruction decoding failed */
+	ES_EXCEPTION,		/* Instruction caused exception */
+	ES_RETRY,		/* Retry instruction emulation */
+};
+
+struct es_fault_info {
+	unsigned long vector;
+	unsigned long error_code;
+	unsigned long cr2;
+};
+
+struct pt_regs;
+
+/* ES instruction emulation context */
+struct es_em_ctxt {
+	struct pt_regs *regs;
+	struct insn insn;
+	struct es_fault_info fi;
+};
+
 static inline u64 lower_bits(u64 val, unsigned int bits)
 {
 	u64 mask = (1ULL << bits) - 1;
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 2e8a30f06c74..c68d1618c9b0 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -29,6 +29,7 @@
 #define SVM_EXIT_WRITE_DR6     0x036
 #define SVM_EXIT_WRITE_DR7     0x037
 #define SVM_EXIT_EXCP_BASE     0x040
+#define SVM_EXIT_LAST_EXCP     0x05f
 #define SVM_EXIT_INTR          0x060
 #define SVM_EXIT_NMI           0x061
 #define SVM_EXIT_SMI           0x062
diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
index e963b48d3e86..f0947ea3c601 100644
--- a/arch/x86/kernel/sev-es-shared.c
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -9,6 +9,106 @@
  * and is included directly into both code-bases.
  */
 
+static void sev_es_terminate(unsigned int reason)
+{
+	/* Request Guest Termination from Hypervisor */
+	sev_es_wr_ghcb_msr(GHCB_SEV_TERMINATE);
+	VMGEXIT();
+
+	while (true)
+		asm volatile("hlt\n" : : : "memory");
+}
+
+static bool sev_es_negotiate_protocol(void)
+{
+	u64 val;
+
+	/* Do the GHCB protocol version negotiation */
+	sev_es_wr_ghcb_msr(GHCB_SEV_INFO_REQ);
+	VMGEXIT();
+	val = sev_es_rd_ghcb_msr();
+
+	if (GHCB_INFO(val) != GHCB_SEV_INFO)
+		return false;
+
+	if (GHCB_PROTO_MAX(val) < GHCB_PROTO_OUR ||
+	    GHCB_PROTO_MIN(val) > GHCB_PROTO_OUR)
+		return false;
+
+	return true;
+}
+
+static void vc_ghcb_invalidate(struct ghcb *ghcb)
+{
+	memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
+}
+
+static bool vc_decoding_needed(unsigned long exit_code)
+{
+	/* Exceptions don't require decoding the instruction */
+	return !(exit_code >= SVM_EXIT_EXCP_BASE &&
+		 exit_code <= SVM_EXIT_LAST_EXCP);
+}
+
+static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
+				      struct pt_regs *regs,
+				      unsigned long exit_code)
+{
+	enum es_result ret = ES_OK;
+
+	memset(ctxt, 0, sizeof(*ctxt));
+	ctxt->regs = regs;
+
+	if (vc_decoding_needed(exit_code))
+		ret = vc_decode_insn(ctxt);
+
+	return ret;
+}
+
+static void vc_finish_insn(struct es_em_ctxt *ctxt)
+{
+	ctxt->regs->ip += ctxt->insn.length;
+}
+
+static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
+					  struct es_em_ctxt *ctxt,
+					  u64 exit_code, u64 exit_info_1,
+					  u64 exit_info_2)
+{
+	enum es_result ret;
+
+	ghcb_set_sw_exit_code(ghcb, exit_code);
+	ghcb_set_sw_exit_info_1(ghcb, exit_info_1);
+	ghcb_set_sw_exit_info_2(ghcb, exit_info_2);
+
+	sev_es_wr_ghcb_msr(__pa(ghcb));
+	VMGEXIT();
+
+	if ((ghcb->save.sw_exit_info_1 & 0xffffffff) == 1) {
+		u64 info = ghcb->save.sw_exit_info_2;
+		unsigned long v;
+
+		info = ghcb->save.sw_exit_info_2;
+		v = info & SVM_EVTINJ_VEC_MASK;
+
+		/* Check if exception information from hypervisor is sane. */
+		if ((info & SVM_EVTINJ_VALID) &&
+		    ((v == X86_TRAP_GP) || (v == X86_TRAP_UD)) &&
+		    ((info & SVM_EVTINJ_TYPE_MASK) == SVM_EVTINJ_TYPE_EXEPT)) {
+			ctxt->fi.vector = v;
+			if (info & SVM_EVTINJ_VALID_ERR)
+				ctxt->fi.error_code = info >> 32;
+			ret = ES_EXCEPTION;
+		} else {
+			ret = ES_VMM_ERROR;
+		}
+	} else {
+		ret = ES_OK;
+	}
+
+	return ret;
+}
+
 /*
  * Boot VC Handler - This is the first VC handler during boot, there is no GHCB
  * page yet, so it only supports the MSR based communication with the
@@ -63,3 +163,45 @@ void __init vc_no_ghcb_handler(struct pt_regs *regs, unsigned long exit_code)
 	while (true)
 		asm volatile("hlt\n");
 }
+
+static enum es_result vc_insn_string_read(struct es_em_ctxt *ctxt,
+					  void *src, char *buf,
+					  unsigned int data_size,
+					  unsigned int count,
+					  bool backwards)
+{
+	int i, b = backwards ? -1 : 1;
+	enum es_result ret = ES_OK;
+
+	for (i = 0; i < count; i++) {
+		void *s = src + (i * data_size * b);
+		char *d = buf + (i * data_size);
+
+		ret = vc_read_mem(ctxt, s, d, data_size);
+		if (ret != ES_OK)
+			break;
+	}
+
+	return ret;
+}
+
+static enum es_result vc_insn_string_write(struct es_em_ctxt *ctxt,
+					   void *dst, char *buf,
+					   unsigned int data_size,
+					   unsigned int count,
+					   bool backwards)
+{
+	int i, s = backwards ? -1 : 1;
+	enum es_result ret = ES_OK;
+
+	for (i = 0; i < count; i++) {
+		void *d = dst + (i * data_size * s);
+		char *b = buf + (i * data_size);
+
+		ret = vc_write_mem(ctxt, d, b, data_size);
+		if (ret != ES_OK)
+			break;
+	}
+
+	return ret;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 22/70] x86/boot/compressed/64: Setup GHCB Based VC Exception handler
@ 2020-03-19  9:13   ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: Juergen Gross, Tom Lendacky, Thomas Hellstrom, Joerg Roedel,
	Kees Cook, kvm, Peter Zijlstra, Joerg Roedel, Dave Hansen,
	linux-kernel, virtualization, Andy Lutomirski, hpa, Dan Williams,
	Jiri Slaby

From: Joerg Roedel <jroedel@suse.de>

Install an exception handler for the #VC exception that uses a GHCB. Also
add the infrastructure for handling different exit-codes: decoding the
instruction that caused the exception, and error handling.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
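
The control flow described above can be sketched in user space roughly
as follows. Exit codes in the exception range skip instruction decoding,
and the handler result decides whether the instruction pointer advances.
The enum vc_action and both function names are hypothetical; the two
SVM_EXIT_* values and enum es_result are taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SVM_EXIT_EXCP_BASE	0x040
#define SVM_EXIT_LAST_EXCP	0x05f

enum es_result {
	ES_OK, ES_UNSUPPORTED, ES_VMM_ERROR,
	ES_DECODE_FAILED, ES_EXCEPTION, ES_RETRY,
};

/* Hypothetical: what the handler does once emulation returned. */
enum vc_action { VC_SKIP_INSN, VC_RERUN_INSN, VC_HALT };

/* Intercepted exceptions carry no instruction that needs decoding. */
static bool decoding_needed(uint64_t exit_code)
{
	return !(exit_code >= SVM_EXIT_EXCP_BASE &&
		 exit_code <= SVM_EXIT_LAST_EXCP);
}

/* Mirrors the 'finish:' label of boot_vc_handler(). */
static enum vc_action finish(enum es_result result, uint64_t *ip,
			     unsigned int insn_len)
{
	assert(ip);

	switch (result) {
	case ES_OK:
		*ip += insn_len;	/* step over the emulated insn */
		return VC_SKIP_INSN;
	case ES_RETRY:
		return VC_RERUN_INSN;	/* same insn raises #VC again */
	default:
		return VC_HALT;
	}
}
```

ES_RETRY deliberately leaves the instruction pointer alone, so a
partially emulated instruction faults again and continues where it
stopped.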
 arch/x86/Kconfig                           |   1 +
 arch/x86/boot/compressed/idt_64.c          |   4 +
 arch/x86/boot/compressed/idt_handlers_64.S |   1 +
 arch/x86/boot/compressed/misc.h            |   1 +
 arch/x86/boot/compressed/sev-es.c          |  94 ++++++++++++++
 arch/x86/include/asm/sev-es.h              |  33 +++++
 arch/x86/include/uapi/asm/svm.h            |   1 +
 arch/x86/kernel/sev-es-shared.c            | 142 +++++++++++++++++++++
 8 files changed, 277 insertions(+)
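
For readers unfamiliar with the version negotiation added in
sev_es_negotiate_protocol(), this is a small stand-alone sketch of the
check on the value read back from the GHCB MSR. The macros are copied
from the asm/sev-es.h hunk of this patch; proto_supported() is a
hypothetical helper name, and the MSR access itself is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GHCB_SEV_INFO		0x001ULL
#define GHCB_INFO(v)		((v) & 0xfffULL)
#define GHCB_PROTO_MAX(v)	(((v) >> 48) & 0xffffULL)
#define GHCB_PROTO_MIN(v)	(((v) >> 32) & 0xffffULL)

/*
 * Returns true when the hypervisor answered with an SEV info response
 * and our protocol version lies inside its [min, max] range.
 */
static bool proto_supported(uint64_t msr_val, uint64_t ours)
{
	assert(ours <= 0xffff);

	if (GHCB_INFO(msr_val) != GHCB_SEV_INFO)
		return false;

	return GHCB_PROTO_MIN(msr_val) <= ours &&
	       ours <= GHCB_PROTO_MAX(msr_val);
}
```

A response advertising versions 1 to 2 is accepted for a guest speaking
version 1; a response advertising 2 to 3 is not.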

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..c12347492589 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1526,6 +1526,7 @@ config AMD_MEM_ENCRYPT
 	select DYNAMIC_PHYSICAL_MASK
 	select ARCH_USE_MEMREMAP_PROT
 	select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+	select INSTRUCTION_DECODER
 	---help---
 	  Say yes to enable support for the encryption of system memory.
 	  This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
index bdd20dfd1fd0..eebb2f857dac 100644
--- a/arch/x86/boot/compressed/idt_64.c
+++ b/arch/x86/boot/compressed/idt_64.c
@@ -45,5 +45,9 @@ void load_stage2_idt(void)
 
 	set_idt_entry(X86_TRAP_PF, boot_pf_handler);
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	set_idt_entry(X86_TRAP_VC, boot_stage2_vc_handler);
+#endif
+
 	load_boot_idt(&boot_idt_desc);
 }
diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
index 67ddafab2943..04edeb73d2cf 100644
--- a/arch/x86/boot/compressed/idt_handlers_64.S
+++ b/arch/x86/boot/compressed/idt_handlers_64.S
@@ -78,4 +78,5 @@ EXCEPTION_HANDLER	boot_pf_handler do_boot_page_fault error_code=1
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 EXCEPTION_HANDLER	boot_stage1_vc_handler vc_no_ghcb_handler error_code=1
+EXCEPTION_HANDLER	boot_stage2_vc_handler boot_vc_handler error_code=1
 #endif
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 42f68a858a35..567d71ab5ed9 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -143,5 +143,6 @@ extern struct desc_ptr boot_idt_desc;
 /* IDT Entry Points */
 void boot_pf_handler(void);
 void boot_stage1_vc_handler(void);
+void boot_stage2_vc_handler(void);
 
 #endif /* BOOT_COMPRESSED_MISC_H */
diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
index eeeb3553547c..193c970a3379 100644
--- a/arch/x86/boot/compressed/sev-es.c
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -8,12 +8,16 @@
 #include <linux/kernel.h>
 
 #include <asm/sev-es.h>
+#include <asm/trap_defs.h>
 #include <asm/msr-index.h>
 #include <asm/ptrace.h>
 #include <asm/svm.h>
 
 #include "misc.h"
 
+struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
+struct ghcb *boot_ghcb;
+
 static inline u64 sev_es_rd_ghcb_msr(void)
 {
 	unsigned long low, high;
@@ -35,8 +39,98 @@ static inline void sev_es_wr_ghcb_msr(u64 val)
 			"a"(low), "d" (high) : "memory");
 }
 
+static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
+{
+	char buffer[MAX_INSN_SIZE];
+	enum es_result ret;
+
+	memcpy(buffer, (unsigned char *)ctxt->regs->ip, MAX_INSN_SIZE);
+
+	insn_init(&ctxt->insn, buffer, MAX_INSN_SIZE, 1);
+	insn_get_length(&ctxt->insn);
+
+	ret = ctxt->insn.immediate.got ? ES_OK : ES_DECODE_FAILED;
+
+	return ret;
+}
+
+static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
+				   void *dst, char *buf, size_t size)
+{
+	memcpy(dst, buf, size);
+
+	return ES_OK;
+}
+
+static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
+				  void *src, char *buf, size_t size)
+{
+	memcpy(buf, src, size);
+
+	return ES_OK;
+}
+
 #undef __init
+#undef __pa
 #define __init
+#define __pa(x)	((unsigned long)(x))
+
+#define __BOOT_COMPRESSED
+
+/* Basic instruction decoding support needed */
+#include "../../lib/inat.c"
+#include "../../lib/insn.c"
 
 /* Include code for early handlers */
 #include "../../kernel/sev-es-shared.c"
+
+static bool sev_es_setup_ghcb(void)
+{
+	if (!sev_es_negotiate_protocol())
+		sev_es_terminate(GHCB_SEV_ES_REASON_PROTOCOL_UNSUPPORTED);
+
+	if (set_page_decrypted((unsigned long)&boot_ghcb_page))
+		return false;
+
+	/* Page is now mapped decrypted, clear it */
+	memset(&boot_ghcb_page, 0, sizeof(boot_ghcb_page));
+
+	boot_ghcb = &boot_ghcb_page;
+
+	/* Initialize lookup tables for the instruction decoder */
+	inat_init_tables();
+
+	return true;
+}
+
+void boot_vc_handler(struct pt_regs *regs, unsigned long exit_code)
+{
+	struct es_em_ctxt ctxt;
+	enum es_result result;
+
+	if (!boot_ghcb && !sev_es_setup_ghcb())
+		sev_es_terminate(GHCB_SEV_ES_REASON_GENERAL_REQUEST);
+
+	vc_ghcb_invalidate(boot_ghcb);
+	result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+	if (result != ES_OK)
+		goto finish;
+
+	switch (exit_code) {
+	default:
+		result = ES_UNSUPPORTED;
+		break;
+	}
+
+finish:
+	if (result == ES_OK) {
+		vc_finish_insn(&ctxt);
+	} else if (result != ES_RETRY) {
+		/*
+		 * For now, just halt the machine. That makes debugging easier;
+		 * later we will just call sev_es_terminate() here.
+		 */
+		while (true)
+			asm volatile("hlt\n");
+	}
+}
diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
index f524b40aef07..512d3ccb9832 100644
--- a/arch/x86/include/asm/sev-es.h
+++ b/arch/x86/include/asm/sev-es.h
@@ -9,7 +9,14 @@
 #define __ASM_ENCRYPTED_STATE_H
 
 #include <linux/types.h>
+#include <asm/insn.h>
 
+#define GHCB_SEV_INFO		0x001UL
+#define GHCB_SEV_INFO_REQ	0x002UL
+#define		GHCB_INFO(v)		((v) & 0xfffUL)
+#define		GHCB_PROTO_MAX(v)	(((v) >> 48) & 0xffffUL)
+#define		GHCB_PROTO_MIN(v)	(((v) >> 32) & 0xffffUL)
+#define		GHCB_PROTO_OUR		0x0001UL
 #define GHCB_SEV_CPUID_REQ	0x004UL
 #define		GHCB_CPUID_REQ_EAX	0
 #define		GHCB_CPUID_REQ_EBX	1
@@ -21,10 +28,36 @@
 
 #define GHCB_SEV_CPUID_RESP	0x005UL
 #define GHCB_SEV_TERMINATE	0x100UL
+#define		GHCB_SEV_ES_REASON_GENERAL_REQUEST	0
+#define		GHCB_SEV_ES_REASON_PROTOCOL_UNSUPPORTED	1
 
 #define	GHCB_SEV_GHCB_RESP_CODE(v)	((v) & 0xfff)
 #define	VMGEXIT()			{ asm volatile("rep; vmmcall\n\r"); }
 
+enum es_result {
+	ES_OK,			/* All good */
+	ES_UNSUPPORTED,		/* Requested operation not supported */
+	ES_VMM_ERROR,		/* Unexpected state from the VMM */
+	ES_DECODE_FAILED,	/* Instruction decoding failed */
+	ES_EXCEPTION,		/* Instruction caused exception */
+	ES_RETRY,		/* Retry instruction emulation */
+};
+
+struct es_fault_info {
+	unsigned long vector;
+	unsigned long error_code;
+	unsigned long cr2;
+};
+
+struct pt_regs;
+
+/* ES instruction emulation context */
+struct es_em_ctxt {
+	struct pt_regs *regs;
+	struct insn insn;
+	struct es_fault_info fi;
+};
+
 static inline u64 lower_bits(u64 val, unsigned int bits)
 {
 	u64 mask = (1ULL << bits) - 1;
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 2e8a30f06c74..c68d1618c9b0 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -29,6 +29,7 @@
 #define SVM_EXIT_WRITE_DR6     0x036
 #define SVM_EXIT_WRITE_DR7     0x037
 #define SVM_EXIT_EXCP_BASE     0x040
+#define SVM_EXIT_LAST_EXCP     0x05f
 #define SVM_EXIT_INTR          0x060
 #define SVM_EXIT_NMI           0x061
 #define SVM_EXIT_SMI           0x062
diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
index e963b48d3e86..f0947ea3c601 100644
--- a/arch/x86/kernel/sev-es-shared.c
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -9,6 +9,106 @@
  * and is included directly into both code-bases.
  */
 
+static void sev_es_terminate(unsigned int reason)
+{
+	/* Request Guest Termination from Hypervisor */
+	sev_es_wr_ghcb_msr(GHCB_SEV_TERMINATE);
+	VMGEXIT();
+
+	while (true)
+		asm volatile("hlt\n" : : : "memory");
+}
+
+static bool sev_es_negotiate_protocol(void)
+{
+	u64 val;
+
+	/* Do the GHCB protocol version negotiation */
+	sev_es_wr_ghcb_msr(GHCB_SEV_INFO_REQ);
+	VMGEXIT();
+	val = sev_es_rd_ghcb_msr();
+
+	if (GHCB_INFO(val) != GHCB_SEV_INFO)
+		return false;
+
+	if (GHCB_PROTO_MAX(val) < GHCB_PROTO_OUR ||
+	    GHCB_PROTO_MIN(val) > GHCB_PROTO_OUR)
+		return false;
+
+	return true;
+}
+
+static void vc_ghcb_invalidate(struct ghcb *ghcb)
+{
+	memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
+}
+
+static bool vc_decoding_needed(unsigned long exit_code)
+{
+	/* Exceptions don't require decoding the instruction */
+	return !(exit_code >= SVM_EXIT_EXCP_BASE &&
+		 exit_code <= SVM_EXIT_LAST_EXCP);
+}
+
+static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
+				      struct pt_regs *regs,
+				      unsigned long exit_code)
+{
+	enum es_result ret = ES_OK;
+
+	memset(ctxt, 0, sizeof(*ctxt));
+	ctxt->regs = regs;
+
+	if (vc_decoding_needed(exit_code))
+		ret = vc_decode_insn(ctxt);
+
+	return ret;
+}
+
+static void vc_finish_insn(struct es_em_ctxt *ctxt)
+{
+	ctxt->regs->ip += ctxt->insn.length;
+}
+
+static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,
+					  struct es_em_ctxt *ctxt,
+					  u64 exit_code, u64 exit_info_1,
+					  u64 exit_info_2)
+{
+	enum es_result ret;
+
+	ghcb_set_sw_exit_code(ghcb, exit_code);
+	ghcb_set_sw_exit_info_1(ghcb, exit_info_1);
+	ghcb_set_sw_exit_info_2(ghcb, exit_info_2);
+
+	sev_es_wr_ghcb_msr(__pa(ghcb));
+	VMGEXIT();
+
+	if ((ghcb->save.sw_exit_info_1 & 0xffffffff) == 1) {
+		u64 info = ghcb->save.sw_exit_info_2;
+		unsigned long v;
+
+		info = ghcb->save.sw_exit_info_2;
+		v = info & SVM_EVTINJ_VEC_MASK;
+
+		/* Check if exception information from hypervisor is sane. */
+		if ((info & SVM_EVTINJ_VALID) &&
+		    ((v == X86_TRAP_GP) || (v == X86_TRAP_UD)) &&
+		    ((info & SVM_EVTINJ_TYPE_MASK) == SVM_EVTINJ_TYPE_EXEPT)) {
+			ctxt->fi.vector = v;
+			if (info & SVM_EVTINJ_VALID_ERR)
+				ctxt->fi.error_code = info >> 32;
+			ret = ES_EXCEPTION;
+		} else {
+			ret = ES_VMM_ERROR;
+		}
+	} else {
+		ret = ES_OK;
+	}
+
+	return ret;
+}
+
 /*
  * Boot VC Handler - This is the first VC handler during boot, there is no GHCB
  * page yet, so it only supports the MSR based communication with the
@@ -63,3 +163,45 @@ void __init vc_no_ghcb_handler(struct pt_regs *regs, unsigned long exit_code)
 	while (true)
 		asm volatile("hlt\n");
 }
+
+static enum es_result vc_insn_string_read(struct es_em_ctxt *ctxt,
+					  void *src, char *buf,
+					  unsigned int data_size,
+					  unsigned int count,
+					  bool backwards)
+{
+	int i, b = backwards ? -1 : 1;
+	enum es_result ret = ES_OK;
+
+	for (i = 0; i < count; i++) {
+		void *s = src + ((long)i * data_size * b);
+		char *d = buf + (i * data_size);
+
+		ret = vc_read_mem(ctxt, s, d, data_size);
+		if (ret != ES_OK)
+			break;
+	}
+
+	return ret;
+}
+
+static enum es_result vc_insn_string_write(struct es_em_ctxt *ctxt,
+					   void *dst, char *buf,
+					   unsigned int data_size,
+					   unsigned int count,
+					   bool backwards)
+{
+	int i, s = backwards ? -1 : 1;
+	enum es_result ret = ES_OK;
+
+	for (i = 0; i < count; i++) {
+		void *d = dst + ((long)i * data_size * s);
+		char *b = buf + (i * data_size);
+
+		ret = vc_write_mem(ctxt, d, b, data_size);
+		if (ret != ES_OK)
+			break;
+	}
+
+	return ret;
+}
-- 
2.17.1
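
The sanity check that sev_es_ghcb_hv_call() applies to sw_exit_info_2 can
be exercised in isolation with the sketch below. The EVTINJ_* constants
are my reading of the SVM_EVTINJ_* layout in asm/svm.h and the trap
numbers are the architectural #UD and #GP vectors; treat all of them as
assumptions to verify. struct fault_info and decode_hv_exception() are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match SVM_EVTINJ_* in asm/svm.h; verify before reuse. */
#define EVTINJ_VEC_MASK		0xffULL
#define EVTINJ_TYPE_MASK	(7ULL << 8)
#define EVTINJ_TYPE_EXEPT	(3ULL << 8)
#define EVTINJ_VALID_ERR	(1ULL << 11)
#define EVTINJ_VALID		(1ULL << 31)

#define TRAP_UD			6
#define TRAP_GP			13

struct fault_info {
	uint64_t vector;
	uint64_t error_code;
};

/*
 * Accept only a valid exception event with vector #GP or #UD. Anything
 * else from the hypervisor is treated as an error (ES_VMM_ERROR in the
 * patch) and reported here by returning false.
 */
static bool decode_hv_exception(uint64_t info, struct fault_info *fi)
{
	uint64_t v = info & EVTINJ_VEC_MASK;

	assert(fi);

	if (!(info & EVTINJ_VALID))
		return false;
	if (v != TRAP_GP && v != TRAP_UD)
		return false;
	if ((info & EVTINJ_TYPE_MASK) != EVTINJ_TYPE_EXEPT)
		return false;

	fi->vector = v;
	fi->error_code = (info & EVTINJ_VALID_ERR) ? info >> 32 : 0;

	return true;
}
```

Restricting the accepted vectors keeps a misbehaving hypervisor from
injecting arbitrary exceptions through this path.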

^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 23/70] x86/sev-es: Add support for handling IOIO exceptions
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Add support for decoding and handling #VC exceptions for IOIO events.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapted code to #VC handling framework ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
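
To illustrate the exit-info encoding that vc_ioio_exitinfo() builds, here
is a condensed user-space sketch. The IOIO_* bit definitions are copied
from this patch. The function ioio_exitinfo() and its flat parameter list
are hypothetical simplifications: the real code takes the port from DX or
from the immediate operand of the decoded instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n)		(1ULL << (n))

#define IOIO_TYPE_STR	BIT(2)
#define IOIO_TYPE_IN	1ULL
#define IOIO_REP	BIT(3)
#define IOIO_DATA_8	BIT(4)
#define IOIO_DATA_16	BIT(5)
#define IOIO_DATA_32	BIT(6)
#define IOIO_ADDR_16	BIT(7)
#define IOIO_ADDR_32	BIT(8)
#define IOIO_ADDR_64	BIT(9)
#define IOIO_SEG_DS	(3ULL << 10)	/* IOIO_SEG_ES is 0 */

static bool ioio_exitinfo(uint8_t opcode, uint16_t port, int opnd_bytes,
			  int addr_bytes, bool rep, uint64_t *out)
{
	uint64_t info = 0;

	assert(out);

	switch (opcode) {
	case 0x6c: case 0x6d:			/* INS, ES segment */
		info |= IOIO_TYPE_IN | IOIO_TYPE_STR;
		break;
	case 0x6e: case 0x6f:			/* OUTS, DS segment */
		info |= IOIO_TYPE_STR | IOIO_SEG_DS;
		break;
	case 0xe4: case 0xe5:			/* IN imm8 */
	case 0xec: case 0xed:			/* IN DX */
		info |= IOIO_TYPE_IN;
		break;
	case 0xe6: case 0xe7:			/* OUT imm8 */
	case 0xee: case 0xef:			/* OUT DX */
		break;
	default:
		return false;
	}

	info |= (uint64_t)port << 16;

	/* Even opcodes are the byte forms, odd ones use the operand size. */
	if (!(opcode & 1))
		info |= IOIO_DATA_8;
	else
		info |= (opnd_bytes == 2) ? IOIO_DATA_16 : IOIO_DATA_32;

	switch (addr_bytes) {
	case 2: info |= IOIO_ADDR_16; break;
	case 4: info |= IOIO_ADDR_32; break;
	case 8: info |= IOIO_ADDR_64; break;
	}

	if (rep)
		info |= IOIO_REP;

	*out = info;
	return true;
}
```

For a byte-sized IN from port 0x3f8 with 64-bit addressing this yields
0x03f80211: the port in bits 16 and up, ADDR_64, DATA_8 and TYPE_IN.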
 arch/x86/boot/compressed/sev-es.c |  32 +++++
 arch/x86/kernel/sev-es-shared.c   | 202 ++++++++++++++++++++++++++++++
 2 files changed, 234 insertions(+)

diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
index 193c970a3379..ae5fbd371fd9 100644
--- a/arch/x86/boot/compressed/sev-es.c
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -18,6 +18,35 @@
 struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
 struct ghcb *boot_ghcb;
 
+/*
+ * Copy a version of this function here - insn-eval.c can't be used in
+ * pre-decompression code.
+ */
+static bool insn_rep_prefix(struct insn *insn)
+{
+	int i;
+
+	insn_get_prefixes(insn);
+
+	for (i = 0; i < insn->prefixes.nbytes; i++) {
+		insn_byte_t p = insn->prefixes.bytes[i];
+
+		if (p == 0xf2 || p == 0xf3)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
+ * doesn't use segments.
+ */
+static unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx)
+{
+	return 0UL;
+}
+
 static inline u64 sev_es_rd_ghcb_msr(void)
 {
 	unsigned long low, high;
@@ -117,6 +146,9 @@ void boot_vc_handler(struct pt_regs *regs, unsigned long exit_code)
 		goto finish;
 
 	switch (exit_code) {
+	case SVM_EXIT_IOIO:
+		result = vc_handle_ioio(boot_ghcb, &ctxt);
+		break;
 	default:
 		result = ES_UNSUPPORTED;
 		break;
diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
index f0947ea3c601..46fc5318d1d7 100644
--- a/arch/x86/kernel/sev-es-shared.c
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -205,3 +205,205 @@ static enum es_result vc_insn_string_write(struct es_em_ctxt *ctxt,
 
 	return ret;
 }
+
+#define IOIO_TYPE_STR  BIT(2)
+#define IOIO_TYPE_IN   1
+#define IOIO_TYPE_INS  (IOIO_TYPE_IN | IOIO_TYPE_STR)
+#define IOIO_TYPE_OUT  0
+#define IOIO_TYPE_OUTS (IOIO_TYPE_OUT | IOIO_TYPE_STR)
+
+#define IOIO_REP       BIT(3)
+
+#define IOIO_ADDR_64   BIT(9)
+#define IOIO_ADDR_32   BIT(8)
+#define IOIO_ADDR_16   BIT(7)
+
+#define IOIO_DATA_32   BIT(6)
+#define IOIO_DATA_16   BIT(5)
+#define IOIO_DATA_8    BIT(4)
+
+#define IOIO_SEG_ES    (0 << 10)
+#define IOIO_SEG_DS    (3 << 10)
+
+static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 *exitinfo)
+{
+	struct insn *insn = &ctxt->insn;
+	*exitinfo = 0;
+
+	switch (insn->opcode.bytes[0]) {
+	/* INS opcodes */
+	case 0x6c:
+	case 0x6d:
+		*exitinfo |= IOIO_TYPE_INS;
+		*exitinfo |= IOIO_SEG_ES;
+		*exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
+		break;
+
+	/* OUTS opcodes */
+	case 0x6e:
+	case 0x6f:
+		*exitinfo |= IOIO_TYPE_OUTS;
+		*exitinfo |= IOIO_SEG_DS;
+		*exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
+		break;
+
+	/* IN immediate opcodes */
+	case 0xe4:
+	case 0xe5:
+		*exitinfo |= IOIO_TYPE_IN;
+		*exitinfo |= insn->immediate.value << 16;
+		break;
+
+	/* OUT immediate opcodes */
+	case 0xe6:
+	case 0xe7:
+		*exitinfo |= IOIO_TYPE_OUT;
+		*exitinfo |= insn->immediate.value << 16;
+		break;
+
+	/* IN register opcodes */
+	case 0xec:
+	case 0xed:
+		*exitinfo |= IOIO_TYPE_IN;
+		*exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
+		break;
+
+	/* OUT register opcodes */
+	case 0xee:
+	case 0xef:
+		*exitinfo |= IOIO_TYPE_OUT;
+		*exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
+		break;
+
+	default:
+		return ES_DECODE_FAILED;
+	}
+
+	switch (insn->opcode.bytes[0]) {
+	case 0x6c:
+	case 0x6e:
+	case 0xe4:
+	case 0xe6:
+	case 0xec:
+	case 0xee:
+		/* Single byte opcodes */
+		*exitinfo |= IOIO_DATA_8;
+		break;
+	default:
+		/* Length determined by instruction parsing */
+		*exitinfo |= (insn->opnd_bytes == 2) ? IOIO_DATA_16
+						     : IOIO_DATA_32;
+	}
+	switch (insn->addr_bytes) {
+	case 2:
+		*exitinfo |= IOIO_ADDR_16;
+		break;
+	case 4:
+		*exitinfo |= IOIO_ADDR_32;
+		break;
+	case 8:
+		*exitinfo |= IOIO_ADDR_64;
+		break;
+	}
+
+	if (insn_rep_prefix(insn))
+		*exitinfo |= IOIO_REP;
+
+	return ES_OK;
+}
+
+static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+	struct pt_regs *regs = ctxt->regs;
+	u64 exit_info_1, exit_info_2;
+	enum es_result ret;
+
+	ret = vc_ioio_exitinfo(ctxt, &exit_info_1);
+	if (ret != ES_OK)
+		return ret;
+
+	if (exit_info_1 & IOIO_TYPE_STR) {
+		bool df = ((regs->flags & X86_EFLAGS_DF) == X86_EFLAGS_DF);
+		unsigned int io_bytes, exit_bytes;
+		unsigned int ghcb_count, op_count;
+		unsigned long es_base;
+		u64 sw_scratch;
+
+		/*
+		 * For the string variants with rep prefix the amount of in/out
+		 * operations per #VC exception is limited so that the kernel
+		 * has a chance to take interrupts and re-schedule while the
+		 * instruction is emulated.
+		 */
+		io_bytes   = (exit_info_1 >> 4) & 0x7;
+		ghcb_count = sizeof(ghcb->shared_buffer) / io_bytes;
+
+		op_count    = (exit_info_1 & IOIO_REP) ? regs->cx : 1;
+		exit_info_2 = min(op_count, ghcb_count);
+		exit_bytes  = exit_info_2 * io_bytes;
+
+		es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
+
+		if (!(exit_info_1 & IOIO_TYPE_IN)) {
+			ret = vc_insn_string_read(ctxt,
+					       (void *)(es_base + regs->si),
+					       ghcb->shared_buffer, io_bytes,
+					       exit_info_2, df);
+			if (ret)
+				return ret;
+		}
+
+		sw_scratch = __pa(ghcb) + offsetof(struct ghcb, shared_buffer);
+		ghcb_set_sw_scratch(ghcb, sw_scratch);
+		ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO,
+				   exit_info_1, exit_info_2);
+		if (ret != ES_OK)
+			return ret;
+
+		/* Everything went well, write back results */
+		if (exit_info_1 & IOIO_TYPE_IN) {
+			ret = vc_insn_string_write(ctxt,
+						(void *)(es_base + regs->di),
+						ghcb->shared_buffer, io_bytes,
+						exit_info_2, df);
+			if (ret)
+				return ret;
+
+			if (df)
+				regs->di -= exit_bytes;
+			else
+				regs->di += exit_bytes;
+		} else {
+			if (df)
+				regs->si -= exit_bytes;
+			else
+				regs->si += exit_bytes;
+		}
+
+		if (exit_info_1 & IOIO_REP)
+			regs->cx -= exit_info_2;
+
+		ret = regs->cx ? ES_RETRY : ES_OK;
+
+	} else {
+		int bits = (exit_info_1 & 0x70) >> 1;
+		u64 rax = 0;
+
+		if (!(exit_info_1 & IOIO_TYPE_IN))
+			rax = lower_bits(regs->ax, bits);
+
+		ghcb_set_rax(ghcb, rax);
+
+		ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO, exit_info_1, 0);
+		if (ret != ES_OK)
+			return ret;
+
+		if (exit_info_1 & IOIO_TYPE_IN) {
+			if (!ghcb_is_valid_rax(ghcb))
+				return ES_VMM_ERROR;
+			regs->ax = lower_bits(ghcb->save.rax, bits);
+		}
+	}
+
+	return ret;
+}
-- 
2.17.1
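
The chunking arithmetic in the string branch of vc_handle_ioio() is easy
to get wrong, so here is a small stand-alone sketch of it. It assumes the
direction flag is handled as a boolean, takes the bounce-buffer size as a
parameter instead of using sizeof(ghcb->shared_buffer), and uses the
hypothetical names struct rep_regs, ioio_bytes(), ioio_chunk() and
ioio_advance().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the registers the emulation updates. */
struct rep_regs {
	uint64_t cx;	/* remaining iterations of the REP prefix */
	uint64_t ptr;	/* rSI for OUTS, rDI for INS */
};

/* The DATA_8/16/32 bits sit at bits 4-6, so they read as 1, 2 or 4. */
static unsigned int ioio_bytes(uint64_t exit_info_1)
{
	return (exit_info_1 >> 4) & 0x7;
}

/* Number of operations that fit into one trip to the hypervisor. */
static unsigned int ioio_chunk(unsigned int op_count, unsigned int io_bytes,
			       size_t buf_size)
{
	unsigned int max_ops;

	assert(io_bytes == 1 || io_bytes == 2 || io_bytes == 4);

	max_ops = (unsigned int)(buf_size / io_bytes);

	return op_count < max_ops ? op_count : max_ops;
}

/* Advance the pointer and counter; returns true if a retry is needed. */
static bool ioio_advance(struct rep_regs *r, unsigned int done,
			 unsigned int io_bytes, bool backwards)
{
	uint64_t bytes = (uint64_t)done * io_bytes;

	assert(r);

	if (backwards)
		r->ptr -= bytes;
	else
		r->ptr += bytes;

	r->cx -= done;

	return r->cx != 0;
}
```

With an assumed 2032-byte buffer, a REP OUTSW of 5000 words is split into
chunks of at most 1016 words, and each chunk ends in ES_RETRY until the
counter reaches zero.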


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 24/70] x86/fpu: Move xgetbv()/xsetbv() into separate header
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The xgetbv() function is needed in pre-decompression boot code, but
asm/fpu/internal.h can't be included there directly. Doing so opens
the door to include-hell due to various include-magic in
boot/compressed/misc.h.

Avoid that by moving xgetbv()/xsetbv() to a separate header file and
including that header instead.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/fpu/internal.h | 29 +-------------------------
 arch/x86/include/asm/fpu/xcr.h      | 32 +++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 28 deletions(-)
 create mode 100644 arch/x86/include/asm/fpu/xcr.h

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 44c48e34d799..795fc049988e 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -19,6 +19,7 @@
 #include <asm/user.h>
 #include <asm/fpu/api.h>
 #include <asm/fpu/xstate.h>
+#include <asm/fpu/xcr.h>
 #include <asm/cpufeature.h>
 #include <asm/trace/fpu.h>
 
@@ -614,32 +615,4 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 	}
 	__write_pkru(pkru_val);
 }
-
-/*
- * MXCSR and XCR definitions:
- */
-
-extern unsigned int mxcsr_feature_mask;
-
-#define XCR_XFEATURE_ENABLED_MASK	0x00000000
-
-static inline u64 xgetbv(u32 index)
-{
-	u32 eax, edx;
-
-	asm volatile(".byte 0x0f,0x01,0xd0" /* xgetbv */
-		     : "=a" (eax), "=d" (edx)
-		     : "c" (index));
-	return eax + ((u64)edx << 32);
-}
-
-static inline void xsetbv(u32 index, u64 value)
-{
-	u32 eax = value;
-	u32 edx = value >> 32;
-
-	asm volatile(".byte 0x0f,0x01,0xd1" /* xsetbv */
-		     : : "a" (eax), "d" (edx), "c" (index));
-}
-
 #endif /* _ASM_X86_FPU_INTERNAL_H */
diff --git a/arch/x86/include/asm/fpu/xcr.h b/arch/x86/include/asm/fpu/xcr.h
new file mode 100644
index 000000000000..91ee45712737
--- /dev/null
+++ b/arch/x86/include/asm/fpu/xcr.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_FPU_XCR_H
+#define _ASM_X86_FPU_XCR_H
+
+/*
+ * MXCSR and XCR definitions:
+ */
+
+extern unsigned int mxcsr_feature_mask;
+
+#define XCR_XFEATURE_ENABLED_MASK	0x00000000
+
+static inline u64 xgetbv(u32 index)
+{
+	u32 eax, edx;
+
+	asm volatile(".byte 0x0f,0x01,0xd0" /* xgetbv */
+		     : "=a" (eax), "=d" (edx)
+		     : "c" (index));
+	return eax + ((u64)edx << 32);
+}
+
+static inline void xsetbv(u32 index, u64 value)
+{
+	u32 eax = value;
+	u32 edx = value >> 32;
+
+	asm volatile(".byte 0x0f,0x01,0xd1" /* xsetbv */
+		     : : "a" (eax), "d" (edx), "c" (index));
+}
+
+#endif /* _ASM_X86_FPU_XCR_H */
-- 
2.17.1
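
The moved helpers only combine and split the EDX:EAX register pair around
the raw instruction bytes. A portable sketch of that arithmetic, without
the inline assembly and with the hypothetical names xcr_join() and
xcr_split(), might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Combine the two 32-bit halves the way xgetbv() builds its result. */
static inline uint64_t xcr_join(uint32_t eax, uint32_t edx)
{
	return eax + ((uint64_t)edx << 32);
}

/* Split a 64-bit value into the halves xsetbv() loads into EAX/EDX. */
static inline void xcr_split(uint64_t value, uint32_t *eax, uint32_t *edx)
{
	assert(eax && edx);

	*eax = (uint32_t)value;
	*edx = (uint32_t)(value >> 32);
}
```

The cast to a 64-bit type before the shift matters: shifting a 32-bit
value by 32 would be undefined behaviour in C.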


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 24/70] x86/fpu: Move xgetbv()/xsetbv() into separate header
@ 2020-03-19  9:13   ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: Juergen Gross, Tom Lendacky, Thomas Hellstrom, Joerg Roedel,
	Kees Cook, kvm, Peter Zijlstra, Joerg Roedel, Dave Hansen,
	linux-kernel, virtualization, Andy Lutomirski, hpa, Dan Williams,
	Jiri Slaby

From: Joerg Roedel <jroedel@suse.de>

The xgetbv() function is needed in pre-decompression boot code, but
asm/fpu/internal.h can't be included there directly. Doing so opens
the door to include-hell due to various include-magic in
boot/compressed/misc.h.

Avoid that by moving xgetbv()/xsetbv() to a separate header file and
include this instead.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/fpu/internal.h | 29 +-------------------------
 arch/x86/include/asm/fpu/xcr.h      | 32 +++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 28 deletions(-)
 create mode 100644 arch/x86/include/asm/fpu/xcr.h

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 44c48e34d799..795fc049988e 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -19,6 +19,7 @@
 #include <asm/user.h>
 #include <asm/fpu/api.h>
 #include <asm/fpu/xstate.h>
+#include <asm/fpu/xcr.h>
 #include <asm/cpufeature.h>
 #include <asm/trace/fpu.h>
 
@@ -614,32 +615,4 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 	}
 	__write_pkru(pkru_val);
 }
-
-/*
- * MXCSR and XCR definitions:
- */
-
-extern unsigned int mxcsr_feature_mask;
-
-#define XCR_XFEATURE_ENABLED_MASK	0x00000000
-
-static inline u64 xgetbv(u32 index)
-{
-	u32 eax, edx;
-
-	asm volatile(".byte 0x0f,0x01,0xd0" /* xgetbv */
-		     : "=a" (eax), "=d" (edx)
-		     : "c" (index));
-	return eax + ((u64)edx << 32);
-}
-
-static inline void xsetbv(u32 index, u64 value)
-{
-	u32 eax = value;
-	u32 edx = value >> 32;
-
-	asm volatile(".byte 0x0f,0x01,0xd1" /* xsetbv */
-		     : : "a" (eax), "d" (edx), "c" (index));
-}
-
 #endif /* _ASM_X86_FPU_INTERNAL_H */
diff --git a/arch/x86/include/asm/fpu/xcr.h b/arch/x86/include/asm/fpu/xcr.h
new file mode 100644
index 000000000000..91ee45712737
--- /dev/null
+++ b/arch/x86/include/asm/fpu/xcr.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_FPU_XCR_H
+#define _ASM_X86_FPU_XCR_H
+
+/*
+ * MXCSR and XCR definitions:
+ */
+
+extern unsigned int mxcsr_feature_mask;
+
+#define XCR_XFEATURE_ENABLED_MASK	0x00000000
+
+static inline u64 xgetbv(u32 index)
+{
+	u32 eax, edx;
+
+	asm volatile(".byte 0x0f,0x01,0xd0" /* xgetbv */
+		     : "=a" (eax), "=d" (edx)
+		     : "c" (index));
+	return eax + ((u64)edx << 32);
+}
+
+static inline void xsetbv(u32 index, u64 value)
+{
+	u32 eax = value;
+	u32 edx = value >> 32;
+
+	asm volatile(".byte 0x0f,0x01,0xd1" /* xsetbv */
+		     : : "a" (eax), "d" (edx), "c" (index));
+}
+
+#endif /* _ASM_X86_FPU_XCR_H */
-- 
2.17.1
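
For readers following along: XGETBV returns the selected extended control
register in EDX:EAX, and XSETBV takes its value the same way. The sketch below
shows only that 64-bit join/split arithmetic in portable C. The helper names
xcr_join() and xcr_split() are hypothetical and are not part of the patch; the
real helpers execute the instructions via inline assembly.

```c
#include <stdint.h>

/* Hypothetical helper: join the EDX:EAX pair as returned by XGETBV. */
static inline uint64_t xcr_join(uint32_t eax, uint32_t edx)
{
	return (uint64_t)eax | ((uint64_t)edx << 32);
}

/* Hypothetical helper: split a 64-bit value into EAX/EDX for XSETBV. */
static inline void xcr_split(uint64_t value, uint32_t *eax, uint32_t *edx)
{
	*eax = (uint32_t)value;		/* low 32 bits  */
	*edx = (uint32_t)(value >> 32);	/* high 32 bits */
}
```

Since the low half occupies bits 0-31 and the shifted high half occupies bits
32-63, OR and the addition used in the patch give the same result.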

^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 25/70] x86/sev-es: Add CPUID handling to #VC handler
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Handle #VC exceptions caused by CPUID instructions. These happen in
early boot code when the KASLR code checks for RDTSC.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling framework ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/sev-es.c |  4 ++++
 arch/x86/kernel/sev-es-shared.c   | 35 +++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
index ae5fbd371fd9..40eaf24db641 100644
--- a/arch/x86/boot/compressed/sev-es.c
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -10,6 +10,7 @@
 #include <asm/sev-es.h>
 #include <asm/trap_defs.h>
 #include <asm/msr-index.h>
+#include <asm/fpu/xcr.h>
 #include <asm/ptrace.h>
 #include <asm/svm.h>
 
@@ -149,6 +150,9 @@ void boot_vc_handler(struct pt_regs *regs, unsigned long exit_code)
 	case SVM_EXIT_IOIO:
 		result = vc_handle_ioio(boot_ghcb, &ctxt);
 		break;
+	case SVM_EXIT_CPUID:
+		result = vc_handle_cpuid(boot_ghcb, &ctxt);
+		break;
 	default:
 		result = ES_UNSUPPORTED;
 		break;
diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
index 46fc5318d1d7..a632b8f041ec 100644
--- a/arch/x86/kernel/sev-es-shared.c
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -407,3 +407,38 @@ static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
 
 	return ret;
 }
+
+static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
+				      struct es_em_ctxt *ctxt)
+{
+	struct pt_regs *regs = ctxt->regs;
+	u32 cr4 = native_read_cr4();
+	enum es_result ret;
+
+	ghcb_set_rax(ghcb, regs->ax);
+	ghcb_set_rcx(ghcb, regs->cx);
+
+	if (cr4 & X86_CR4_OSXSAVE)
+		/* Safe to read xcr0 */
+		ghcb_set_xcr0(ghcb, xgetbv(XCR_XFEATURE_ENABLED_MASK));
+	else
+		/* xgetbv will cause #GP - use reset value for xcr0 */
+		ghcb_set_xcr0(ghcb, 1);
+
+	ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_CPUID, 0, 0);
+	if (ret != ES_OK)
+		return ret;
+
+	if (!(ghcb_is_valid_rax(ghcb) &&
+	      ghcb_is_valid_rbx(ghcb) &&
+	      ghcb_is_valid_rcx(ghcb) &&
+	      ghcb_is_valid_rdx(ghcb)))
+		return ES_VMM_ERROR;
+
+	regs->ax = ghcb->save.rax;
+	regs->bx = ghcb->save.rbx;
+	regs->cx = ghcb->save.rcx;
+	regs->dx = ghcb->save.rdx;
+
+	return ES_OK;
+}
-- 
2.17.1
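
The XCR0 selection in vc_handle_cpuid() can be modelled outside the kernel as
a small pure function. This is only a sketch under stated assumptions:
xcr0_for_ghcb() and fake_xcr0() are hypothetical names, the callback stands in
for xgetbv(XCR_XFEATURE_ENABLED_MASK), and the constants restate the
architectural values (CR4.OSXSAVE is bit 18, and XCR0 resets to 1, i.e. only
x87 state enabled).

```c
#include <stdint.h>

#define CR4_OSXSAVE	(1UL << 18)	/* CR4 bit 18: XGETBV/XSETBV enabled */
#define XCR0_RESET	1ULL		/* reset value of XCR0 (x87 only)    */

/*
 * Hypothetical model of the XCR0 choice made before the hypervisor call.
 * 'read_xcr0' stands in for xgetbv(XCR_XFEATURE_ENABLED_MASK).
 */
static uint64_t xcr0_for_ghcb(unsigned long cr4, uint64_t (*read_xcr0)(void))
{
	if (cr4 & CR4_OSXSAVE)
		return read_xcr0();	/* safe: XGETBV will not fault */

	return XCR0_RESET;		/* XGETBV would raise an exception */
}

/* Stand-in reader used only for illustration. */
static uint64_t fake_xcr0(void)
{
	return 0x7;			/* x87 | SSE | AVX */
}
```

With CR4.OSXSAVE clear the callback is never invoked, which mirrors why the
handler avoids executing XGETBV in that case.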


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 26/70] x86/idt: Move IDT to data segment
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

With SEV-ES, exception handling is needed very early, even before the
kernel has cleared the bss segment. In order to prevent clearing the
currently used IDT, move the IDT to the data segment.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/idt.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 87ef69a72c52..a8fc01ea602a 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -165,8 +165,12 @@ static const __initconst struct idt_data dbg_idts[] = {
 };
 #endif
 
-/* Must be page-aligned because the real IDT is used in a fixmap. */
-gate_desc idt_table[IDT_ENTRIES] __page_aligned_bss;
+/*
+ * Must be page-aligned because the real IDT is used in a fixmap.
+ * Also needs to be in the .data segment, because the idt_table is
+ * needed before the kernel clears the .bss segment.
+ */
+gate_desc idt_table[IDT_ENTRIES] __page_aligned_data;
 
 struct desc_ptr idt_descr __ro_after_init = {
 	.size		= (IDT_ENTRIES * 2 * sizeof(unsigned long)) - 1,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 27/70] x86/idt: Split idt_data setup out of set_intr_gate()
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The code to set up idt_data is needed for early exception handling, but
set_intr_gate() can't be used that early because it has pv-ops in its
code path, which don't work that early.

Split out the idt_data initialization part from set_intr_gate() so
that it can be used separately.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/idt.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a8fc01ea602a..c752027abc9e 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -231,18 +231,24 @@ idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sy
 	}
 }
 
+static void init_idt_data(struct idt_data *data, unsigned int n,
+			  const void *addr)
+{
+	BUG_ON(n > 0xFF);
+
+	memset(data, 0, sizeof(*data));
+	data->vector	= n;
+	data->addr	= addr;
+	data->segment	= __KERNEL_CS;
+	data->bits.type	= GATE_INTERRUPT;
+	data->bits.p	= 1;
+}
+
 static void set_intr_gate(unsigned int n, const void *addr)
 {
 	struct idt_data data;
 
-	BUG_ON(n > 0xFF);
-
-	memset(&data, 0, sizeof(data));
-	data.vector	= n;
-	data.addr	= addr;
-	data.segment	= __KERNEL_CS;
-	data.bits.type	= GATE_INTERRUPT;
-	data.bits.p	= 1;
+	init_idt_data(&data, n, addr);
 
 	idt_setup_from_table(idt_table, &data, 1, false);
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 28/70] x86/idt: Move two functions from k/idt.c to i/a/desc.h
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Move these two functions from kernel/idt.c to include/asm/desc.h:

	* init_idt_data()
	* idt_init_desc()

These functions are needed to set up IDT entries very early and need to
be called from head64.c. To be usable this early, these functions need to
be compiled without instrumentation and the stack-protector feature.
These features need to be kept enabled for kernel/idt.c, so head64.c
must use its own versions.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/desc.h      | 27 +++++++++++++++++++++++++
 arch/x86/include/asm/desc_defs.h |  7 +++++++
 arch/x86/kernel/idt.c            | 34 --------------------------------
 3 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 68a99d2a5f33..80bf63c08007 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -389,6 +389,33 @@ static inline void set_desc_limit(struct desc_struct *desc, unsigned long limit)
 void update_intr_gate(unsigned int n, const void *addr);
 void alloc_intr_gate(unsigned int n, const void *addr);
 
+static inline void init_idt_data(struct idt_data *data, unsigned int n,
+				 const void *addr)
+{
+	BUG_ON(n > 0xFF);
+
+	memset(data, 0, sizeof(*data));
+	data->vector	= n;
+	data->addr	= addr;
+	data->segment	= __KERNEL_CS;
+	data->bits.type	= GATE_INTERRUPT;
+	data->bits.p	= 1;
+}
+
+static inline void idt_init_desc(gate_desc *gate, const struct idt_data *d)
+{
+	unsigned long addr = (unsigned long) d->addr;
+
+	gate->offset_low	= (u16) addr;
+	gate->segment		= (u16) d->segment;
+	gate->bits		= d->bits;
+	gate->offset_middle	= (u16) (addr >> 16);
+#ifdef CONFIG_X86_64
+	gate->offset_high	= (u32) (addr >> 32);
+	gate->reserved		= 0;
+#endif
+}
+
 extern unsigned long system_vectors[];
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/desc_defs.h b/arch/x86/include/asm/desc_defs.h
index 5621fb3f2d1a..f7e7099af595 100644
--- a/arch/x86/include/asm/desc_defs.h
+++ b/arch/x86/include/asm/desc_defs.h
@@ -74,6 +74,13 @@ struct idt_bits {
 			p	: 1;
 } __attribute__((packed));
 
+struct idt_data {
+	unsigned int	vector;
+	unsigned int	segment;
+	struct idt_bits	bits;
+	const void	*addr;
+};
+
 struct gate_struct {
 	u16		offset_low;
 	u16		segment;
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index c752027abc9e..4a2c7791c697 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -9,13 +9,6 @@
 #include <asm/desc.h>
 #include <asm/hw_irq.h>
 
-struct idt_data {
-	unsigned int	vector;
-	unsigned int	segment;
-	struct idt_bits	bits;
-	const void	*addr;
-};
-
 #define DPL0		0x0
 #define DPL3		0x3
 
@@ -204,20 +197,6 @@ const struct desc_ptr debug_idt_descr = {
 };
 #endif
 
-static inline void idt_init_desc(gate_desc *gate, const struct idt_data *d)
-{
-	unsigned long addr = (unsigned long) d->addr;
-
-	gate->offset_low	= (u16) addr;
-	gate->segment		= (u16) d->segment;
-	gate->bits		= d->bits;
-	gate->offset_middle	= (u16) (addr >> 16);
-#ifdef CONFIG_X86_64
-	gate->offset_high	= (u32) (addr >> 32);
-	gate->reserved		= 0;
-#endif
-}
-
 static void
 idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sys)
 {
@@ -231,19 +210,6 @@ idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sy
 	}
 }
 
-static void init_idt_data(struct idt_data *data, unsigned int n,
-			  const void *addr)
-{
-	BUG_ON(n > 0xFF);
-
-	memset(data, 0, sizeof(*data));
-	data->vector	= n;
-	data->addr	= addr;
-	data->segment	= __KERNEL_CS;
-	data->bits.type	= GATE_INTERRUPT;
-	data->bits.p	= 1;
-}
-
 static void set_intr_gate(unsigned int n, const void *addr)
 {
 	struct idt_data data;
-- 
2.17.1
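
To illustrate what idt_init_desc() does with the handler address on x86-64,
here is a minimal userspace sketch. The struct gate64 type is a simplified
stand-in for gate_desc (the idt_bits bitfield is collapsed into a plain
16-bit field), and gate_set_offset()/gate_get_offset() are hypothetical names
used only for this example.

```c
#include <stdint.h>

/* Simplified stand-in for the 64-bit gate_desc; field names follow the patch. */
struct gate64 {
	uint16_t offset_low;	/* handler address bits 0-15  */
	uint16_t segment;	/* code segment selector      */
	uint16_t bits;		/* ist/type/dpl/p, collapsed  */
	uint16_t offset_middle;	/* handler address bits 16-31 */
	uint32_t offset_high;	/* handler address bits 32-63 */
	uint32_t reserved;
};

/* Scatter a 64-bit handler address over the three offset fields. */
static void gate_set_offset(struct gate64 *g, uint64_t addr)
{
	g->offset_low    = (uint16_t)addr;
	g->offset_middle = (uint16_t)(addr >> 16);
	g->offset_high   = (uint32_t)(addr >> 32);
	g->reserved      = 0;
}

/* Reassemble the handler address, as the CPU does on exception delivery. */
static uint64_t gate_get_offset(const struct gate64 *g)
{
	return (uint64_t)g->offset_low |
	       ((uint64_t)g->offset_middle << 16) |
	       ((uint64_t)g->offset_high << 32);
}
```

The layout is 16 bytes per entry, which matches the
IDT_ENTRIES * 2 * sizeof(unsigned long) size used for idt_descr on 64-bit.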


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 29/70] x86/head/64: Install boot GDT
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (27 preceding siblings ...)
  2020-03-19  9:13   ` Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13   ` Joerg Roedel
                   ` (40 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Handling exceptions during boot requires a working GDT. The kernel GDT
is not yet ready for use, so install a temporary boot GDT.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/head_64.S | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 4bbc770af632..5219a70b3fb4 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,6 +72,26 @@ SYM_CODE_START_NOALIGN(startup_64)
 	/* Set up the stack for verify_cpu(), similar to initial_stack below */
 	leaq	(__end_init_task - SIZEOF_PTREGS)(%rip), %rsp
 
+	/* Setup boot GDT descriptor and load boot GDT */
+	leaq	boot_gdt(%rip), %rax
+	movq	%rax, boot_gdt_base(%rip)
+	lgdt	boot_gdt_descr(%rip)
+
+	/* New GDT is live - reload data segment registers */
+	movl	$__KERNEL_DS, %eax
+	movl	%eax, %ds
+	movl	%eax, %ss
+	movl	%eax, %es
+
+	/* Now switch to __KERNEL_CS so IRET works reliably */
+	pushq	$__KERNEL_CS
+	leaq	.Lon_kernel_cs(%rip), %rax
+	pushq	%rax
+	lretq
+
+.Lon_kernel_cs:
+	UNWIND_HINT_EMPTY
+
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
@@ -480,6 +500,18 @@ SYM_DATA_LOCAL(early_gdt_descr_base,	.quad INIT_PER_CPU_VAR(gdt_page))
 SYM_DATA(phys_base, .quad 0x0)
 EXPORT_SYMBOL(phys_base)
 
+/* Boot GDT used when kernel addresses are not mapped yet */
+SYM_DATA_LOCAL(boot_gdt_descr,		.word boot_gdt_end - boot_gdt)
+SYM_DATA_LOCAL(boot_gdt_base,		.quad 0)
+SYM_DATA_START(boot_gdt)
+	.quad	0
+	.quad   0x00cf9a000000ffff      /* __KERNEL32_CS */
+	.quad   0x00af9a000000ffff      /* __KERNEL_CS */
+	.quad   0x00cf92000000ffff      /* __KERNEL_DS */
+	.quad   0x0080890000000000      /* TS descriptor */
+	.quad   0x0000000000000000      /* TS continued */
+SYM_DATA_END_LABEL(boot_gdt, SYM_L_LOCAL, boot_gdt_end)
+
 #include "../../x86/xen/xen-head.S"
 
 	__PAGE_ALIGNED_BSS
-- 
2.17.1
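
The raw quadwords in boot_gdt are easier to review once decoded. The sketch
below pulls the relevant fields out of a legacy segment descriptor using the
architectural bit positions (limit in bits 0-15 and 48-51, access byte in
bits 40-47, L/D/G flags in bits 53-55). The desc_*() helper names are
hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Access byte: present bit, DPL, and segment type (bits 40-47). */
static unsigned int desc_access(uint64_t d)
{
	return (unsigned int)((d >> 40) & 0xff);
}

/* L flag (bit 53): 64-bit code segment. */
static unsigned int desc_long(uint64_t d)
{
	return (unsigned int)((d >> 53) & 1);
}

/* D/B flag (bit 54): default operand size is 32-bit. */
static unsigned int desc_db(uint64_t d)
{
	return (unsigned int)((d >> 54) & 1);
}

/* G flag (bit 55): limit is counted in 4 KiB units. */
static unsigned int desc_gran(uint64_t d)
{
	return (unsigned int)((d >> 55) & 1);
}

/* 20-bit limit: bits 0-15 plus bits 48-51. */
static uint32_t desc_limit(uint64_t d)
{
	return (uint32_t)((d & 0xffff) | ((d >> 32) & 0xf0000));
}
```

Decoded this way, the __KERNEL32_CS and __KERNEL_CS entries share the access
byte 0x9a (present, ring 0, readable code) and differ only in the flags:
D=1, L=0 for the 32-bit segment and D=0, L=1 for the 64-bit one. The
__KERNEL_DS entry uses access byte 0x92 (present, ring 0, writable data).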


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 30/70] x86/head/64: Reload GDT after switch to virtual addresses
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Reload the GDT after switching to virtual addresses to make sure it will
not go away when the lower mappings are removed.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/head_64.S | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 5219a70b3fb4..ebb7d512c9d3 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -163,6 +163,11 @@ SYM_CODE_START(secondary_startup_64)
 1:
 	UNWIND_HINT_EMPTY
 
+	/* Setup boot GDT descriptor and load boot GDT */
+	leaq	boot_gdt(%rip), %rax
+	movq	%rax, boot_gdt_base(%rip)
+	lgdt	boot_gdt_descr(%rip)
+
 	/* Check if nx is implemented */
 	movl	$0x80000001, %eax
 	cpuid
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 31/70] x86/head/64: Load segment registers earlier
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Make sure segments are properly set up before setting up an IDT and
doing anything that might cause a #VC exception. This is later needed
for early exception handling.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/head_64.S | 52 +++++++++++++++++++--------------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ebb7d512c9d3..1be178be1566 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -168,6 +168,32 @@ SYM_CODE_START(secondary_startup_64)
 	movq	%rax, boot_gdt_base(%rip)
 	lgdt	boot_gdt_descr(%rip)
 
+	/* set up data segments */
+	xorl %eax,%eax
+	movl %eax,%ds
+	movl %eax,%ss
+	movl %eax,%es
+
+	/*
+	 * We don't really need to load %fs or %gs, but load them anyway
+	 * to kill any stale realmode selectors.  This allows execution
+	 * under VT hardware.
+	 */
+	movl %eax,%fs
+	movl %eax,%gs
+
+	/* Set up %gs.
+	 *
+	 * The base of %gs always points to fixed_percpu_data. If the
+	 * stack protector canary is enabled, it is located at %gs:40.
+	 * Note that, on SMP, the boot cpu uses init data section until
+	 * the per cpu areas are set up.
+	 */
+	movl	$MSR_GS_BASE,%ecx
+	movl	initial_gs(%rip),%eax
+	movl	initial_gs+4(%rip),%edx
+	wrmsr
+
 	/* Check if nx is implemented */
 	movl	$0x80000001, %eax
 	cpuid
@@ -203,32 +229,6 @@ SYM_CODE_START(secondary_startup_64)
 	 */
 	lgdt	early_gdt_descr(%rip)
 
-	/* set up data segments */
-	xorl %eax,%eax
-	movl %eax,%ds
-	movl %eax,%ss
-	movl %eax,%es
-
-	/*
-	 * We don't really need to load %fs or %gs, but load them anyway
-	 * to kill any stale realmode selectors.  This allows execution
-	 * under VT hardware.
-	 */
-	movl %eax,%fs
-	movl %eax,%gs
-
-	/* Set up %gs.
-	 *
-	 * The base of %gs always points to fixed_percpu_data. If the
-	 * stack protector canary is enabled, it is located at %gs:40.
-	 * Note that, on SMP, the boot cpu uses init data section until
-	 * the per cpu areas are set up.
-	 */
-	movl	$MSR_GS_BASE,%ecx
-	movl	initial_gs(%rip),%eax
-	movl	initial_gs+4(%rip),%edx
-	wrmsr
-
 	/* rsi is pointer to real mode structure with interesting info.
 	   pass it to C */
 	movq	%rsi, %rdi
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 32/70] x86/head/64: Switch to initial stack earlier
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Make sure there is a stack once the kernel runs from virtual addresses.
At this stage any secondary CPU which boots will have lost its stack
because the kernel switched to a new page-table which does not map the
real-mode stack anymore.

This is also needed for handling early #VC exceptions caused by
instructions like CPUID.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/head_64.S | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 1be178be1566..b8ba72f31be9 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -194,6 +194,12 @@ SYM_CODE_START(secondary_startup_64)
 	movl	initial_gs+4(%rip),%edx
 	wrmsr
 
+	/*
+	 * Setup a boot time stack - Any secondary CPU will have lost its stack
+	 * by now because the cr3-switch above unmaps the real-mode stack
+	 */
+	movq initial_stack(%rip), %rsp
+
 	/* Check if nx is implemented */
 	movl	$0x80000001, %eax
 	cpuid
@@ -214,9 +220,6 @@ SYM_CODE_START(secondary_startup_64)
 	/* Make changes effective */
 	movq	%rax, %cr0
 
-	/* Setup a boot time stack */
-	movq initial_stack(%rip), %rsp
-
 	/* zero EFLAGS after setting rsp */
 	pushq $0
 	popfq
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 33/70] x86/head/64: Build k/head64.c with -fno-stack-protector
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The code inserted by the stack protector does not work in the early
boot environment because it uses the GS segment, at least with memory
encryption enabled. Make sure the early code is compiled without this
feature enabled.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/Makefile | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9b294c13809a..9b0ebcf4b9f3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -36,6 +36,10 @@ ifdef CONFIG_FRAME_POINTER
 OBJECT_FILES_NON_STANDARD_ftrace_$(BITS).o		:= y
 endif
 
+# make sure head64.c is built without stack protector
+nostackp := $(call cc-option, -fno-stack-protector)
+CFLAGS_head64.o		:= $(nostackp)
+
 # If instrumentation of this dir is enabled, boot hangs during first second.
 # Probably could be more selective here, but note that files related to irqs,
 # boot, dumpstack/stacktrace, etc are either non-interesting or can lead to
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 34/70] x86/head/64: Load IDT earlier
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (32 preceding siblings ...)
  2020-03-19  9:13   ` Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13   ` Joerg Roedel
                   ` (35 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Load the IDT right after switching to virtual addresses in head_64.S
so that the kernel can handle #VC exceptions.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
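Note (not part of the patch): a minimal user-space sketch of how a 64-bit
handler address ends up split across the fields of an x86-64 IDT gate
descriptor, which is roughly what the per-vector loop in
early_idt_setup_early_handler() has to produce. The names gate_sketch,
make_gate() and gate_offset() are hypothetical and only illustrate the
layout; the kernel uses gate_desc, init_idt_data() and idt_init_desc()
instead. The interrupt-gate type value is assumed here, not taken from
the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of a 16-byte x86-64 gate descriptor. */
struct gate_sketch {
	uint16_t offset_low;	/* handler address bits 0-15 */
	uint16_t segment;	/* code segment selector */
	uint8_t  ist;		/* bits 0-2: IST index, rest zero */
	uint8_t  type_attr;	/* type (0-3), DPL (5-6), present (7) */
	uint16_t offset_middle;	/* handler address bits 16-31 */
	uint32_t offset_high;	/* handler address bits 32-63 */
	uint32_t reserved;
};

static_assert(sizeof(struct gate_sketch) == 16,
	      "a 64-bit gate descriptor is 16 bytes");

#define GATE_INTERRUPT	0x0E	/* assumed: 64-bit interrupt gate type */
#define GATE_PRESENT	0x80

/* Build a present, DPL-0 interrupt gate without an IST stack. */
static struct gate_sketch make_gate(uint64_t handler, uint16_t cs)
{
	struct gate_sketch g = { 0 };

	g.offset_low    = (uint16_t)(handler & 0xffff);
	g.segment       = cs;
	g.type_attr     = GATE_PRESENT | GATE_INTERRUPT;
	g.offset_middle = (uint16_t)((handler >> 16) & 0xffff);
	g.offset_high   = (uint32_t)(handler >> 32);

	return g;
}

/* Reassemble the handler address from the three offset fields. */
static uint64_t gate_offset(const struct gate_sketch *g)
{
	return (uint64_t)g->offset_low |
	       ((uint64_t)g->offset_middle << 16) |
	       ((uint64_t)g->offset_high << 32);
}
```

In the patch the table itself is written through a fixed-up pointer
because the boot CPU still runs from its load address at that point; the
sketch leaves that part out.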
 arch/x86/kernel/head64.c  | 15 +++++++++++++++
 arch/x86/kernel/head_64.S | 17 +++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 206a4b6144c2..0ecdf28291fc 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -489,3 +489,18 @@ void __init x86_64_start_reservations(char *real_mode_data)
 
 	start_kernel();
 }
+
+void __head early_idt_setup_early_handler(unsigned long physaddr)
+{
+	gate_desc *idt = fixup_pointer(idt_table, physaddr);
+	int i;
+
+	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
+		struct idt_data data;
+		gate_desc desc;
+
+		init_idt_data(&data, i, early_idt_handler_array[i]);
+		idt_init_desc(&desc, &data);
+		native_write_idt_entry(idt, i, &desc);
+	}
+}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index b8ba72f31be9..8465290a1eb3 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -104,6 +104,20 @@ SYM_CODE_START_NOALIGN(startup_64)
 	leaq	_text(%rip), %rdi
 	pushq	%rsi
 	call	__startup_64
+	/* Save return value */
+	pushq	%rax
+
+	/*
+	 * Load IDT with early handlers - needed for SEV-ES
+	 * Do this here because this must only happen on the boot CPU
+	 * and the code below is shared with secondary CPU bringup.
+	 */
+	leaq	_text(%rip), %rdi
+	call	early_idt_setup_early_handler
+
+	/* Restore __startup_64 return value*/
+	popq	%rax
+	/* Restore pointer to real_mode_data */
 	popq	%rsi
 
 	/* Form the CR3 value being sure to include the CR3 modifier */
@@ -200,6 +214,9 @@ SYM_CODE_START(secondary_startup_64)
 	 */
 	movq initial_stack(%rip), %rsp
 
+	/* Load IDT */
+	lidt	idt_descr(%rip)
+
 	/* Check if nx is implemented */
 	movl	$0x80000001, %eax
 	cpuid
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 35/70] x86/head/64: Move early exception dispatch to C code
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Move the assembly coded dispatch between page-faults and all other
exceptions to C code to make it easier to maintain and extend.

Also change the return-type of early_make_pgtable() to bool and make it
static.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
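Note (not part of the patch): a small user-space sketch of the dispatch
order that the new C function implements. A page fault is first offered
to the early page-table builder, and only if that fails, or for any
other vector, does the generic fixup path run. All *_stub and *_sketch
names are hypothetical stand-ins; the real code uses
early_make_pgtable(), native_read_cr2() and early_fixup_exception().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TRAP_PF	14	/* page-fault vector (X86_TRAP_PF in the kernel) */

/* Hypothetical stand-in for struct pt_regs. */
struct regs_stub {
	unsigned long ip;
};

/* Counts how often the generic fixup path was taken. */
static int fixups_run;

/* Stub: pretend that addresses below 0x1000 cannot be mapped. */
static bool make_pgtable_stub(unsigned long address)
{
	return address >= 0x1000;
}

/* Stub for the generic exception fixup path. */
static void fixup_exception_stub(struct regs_stub *regs, int trapnr)
{
	(void)regs;
	(void)trapnr;
	fixups_run++;
}

/* The fault address is passed in because user space has no CR2. */
static void early_exception_sketch(struct regs_stub *regs, int trapnr,
				   unsigned long fault_addr)
{
	assert(regs != NULL);

	/* Page fault that could be resolved by mapping the address */
	if (trapnr == TRAP_PF && make_pgtable_stub(fault_addr))
		return;

	/* Everything else goes to the generic fixup path */
	fixup_exception_stub(regs, trapnr);
}
```

Returning bool from the page-table helper is what lets both conditions
share one if statement, which is the reason for the return-type change
mentioned above.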
 arch/x86/include/asm/pgtable.h |  2 +-
 arch/x86/include/asm/setup.h   |  1 -
 arch/x86/kernel/head64.c       | 19 +++++++++++++++----
 arch/x86/kernel/head_64.S      | 11 +----------
 4 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7e118660bbd9..9327eeeb1de1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -27,7 +27,7 @@
 #include <asm/fpu/api.h>
 
 extern pgd_t early_top_pgt[PTRS_PER_PGD];
-int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
+bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
 
 void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index ed8ec011a9fd..d8a39d45f182 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -48,7 +48,6 @@ extern void reserve_standard_io_resources(void);
 extern void i386_reserve_resources(void);
 extern unsigned long __startup_64(unsigned long physaddr, struct boot_params *bp);
 extern unsigned long __startup_secondary_64(void);
-extern int early_make_pgtable(unsigned long address);
 
 #ifdef CONFIG_X86_INTEL_MID
 extern void x86_intel_mid_early_setup(void);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 0ecdf28291fc..8ccca109750d 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -36,6 +36,8 @@
 #include <asm/microcode.h>
 #include <asm/kasan.h>
 #include <asm/fixmap.h>
+#include <asm/extable.h>
+#include <asm/trap_defs.h>
 
 /*
  * Manage page tables very early on.
@@ -297,7 +299,7 @@ static void __init reset_early_page_tables(void)
 }
 
 /* Create a new PMD entry */
-int __init __early_make_pgtable(unsigned long address, pmdval_t pmd)
+bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd)
 {
 	unsigned long physaddr = address - __PAGE_OFFSET;
 	pgdval_t pgd, *pgd_p;
@@ -307,7 +309,7 @@ int __init __early_make_pgtable(unsigned long address, pmdval_t pmd)
 
 	/* Invalid address or early pgt is done ?  */
 	if (physaddr >= MAXMEM || read_cr3_pa() != __pa_nodebug(early_top_pgt))
-		return -1;
+		return false;
 
 again:
 	pgd_p = &early_top_pgt[pgd_index(address)].pgd;
@@ -364,10 +366,10 @@ int __init __early_make_pgtable(unsigned long address, pmdval_t pmd)
 	}
 	pmd_p[pmd_index(address)] = pmd;
 
-	return 0;
+	return true;
 }
 
-int __init early_make_pgtable(unsigned long address)
+static bool __init early_make_pgtable(unsigned long address)
 {
 	unsigned long physaddr = address - __PAGE_OFFSET;
 	pmdval_t pmd;
@@ -377,6 +379,15 @@ int __init early_make_pgtable(unsigned long address)
 	return __early_make_pgtable(address, pmd);
 }
 
+void __init early_exception(struct pt_regs *regs, int trapnr)
+{
+	if (trapnr == X86_TRAP_PF &&
+	    early_make_pgtable(native_read_cr2()))
+		return;
+
+	early_fixup_exception(regs, trapnr);
+}
+
 /* Don't add a printk in there. printk relies on the PDA which is not initialized 
    yet. */
 static void __init clear_bss(void)
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 8465290a1eb3..bc0622a72d6d 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -363,18 +363,9 @@ SYM_CODE_START_LOCAL(early_idt_handler_common)
 	pushq %r15				/* pt_regs->r15 */
 	UNWIND_HINT_REGS
 
-	cmpq $14,%rsi		/* Page fault? */
-	jnz 10f
-	GET_CR2_INTO(%rdi)	/* can clobber %rax if pv */
-	call early_make_pgtable
-	andl %eax,%eax
-	jz 20f			/* All good */
-
-10:
 	movq %rsp,%rdi		/* RDI = pt_regs; RSI is already trapnr */
-	call early_fixup_exception
+	call early_exception
 
-20:
 	decl early_recursion_flag(%rip)
 	jmp restore_regs_and_return_to_kernel
 SYM_CODE_END(early_idt_handler_common)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 36/70] x86/sev-es: Add SEV-ES Feature Detection
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add the sev_es_active function for checking whether SEV-ES is enabled.
Also cache the value of MSR_AMD64_SEV at boot to speed up the feature
checking in the running code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
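Note (not part of the patch): a minimal user-space sketch of the caching
pattern, assuming the bit positions this patch adds to msr-index.h
(bit 0 for SEV, bit 1 for SEV-ES). cache_sev_status() is a hypothetical
helper that stands in for the assignment in sme_enable(); it takes the
MSR value as an argument because user space cannot read the MSR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined by this patch in msr-index.h. */
#define SEV_ENABLED_BIT		0
#define SEV_ES_ENABLED_BIT	1
#define SEV_ENABLED		(UINT64_C(1) << SEV_ENABLED_BIT)
#define SEV_ES_ENABLED		(UINT64_C(1) << SEV_ES_ENABLED_BIT)

static_assert(SEV_ES_ENABLED == 2, "SEV-ES is bit 1 of the status word");

/* Cached copy of the status MSR, written once during early boot. */
static uint64_t sev_status;

/*
 * Hypothetical helper: store the MSR value only when the SEV bit is
 * set, mirroring the early return in sme_enable().
 */
static void cache_sev_status(uint64_t msr)
{
	if (!(msr & SEV_ENABLED))
		return;

	sev_status = msr;
}

/* Later feature checks only test bits of the cached value. */
static bool sev_active(void)
{
	return !!(sev_status & SEV_ENABLED);
}

static bool sev_es_active(void)
{
	return !!(sev_status & SEV_ES_ENABLED);
}
```

Because the value is only stored when the SEV bit is set, a status word
with just the SEV-ES bit set leaves the cache at zero and both checks
return false.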
 arch/x86/include/asm/mem_encrypt.h |  3 +++
 arch/x86/include/asm/msr-index.h   |  2 ++
 arch/x86/mm/mem_encrypt.c          | 11 ++++++++++-
 arch/x86/mm/mem_encrypt_identity.c |  3 +++
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index 848ce43b9040..6f61bb93366a 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -19,6 +19,7 @@
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 
 extern u64 sme_me_mask;
+extern u64 sev_status;
 extern bool sev_enabled;
 
 void sme_encrypt_execute(unsigned long encrypted_kernel_vaddr,
@@ -49,6 +50,7 @@ void __init mem_encrypt_free_decrypted_mem(void);
 
 bool sme_active(void);
 bool sev_active(void);
+bool sev_es_active(void);
 
 #define __bss_decrypted __attribute__((__section__(".bss..decrypted")))
 
@@ -71,6 +73,7 @@ static inline void __init sme_enable(struct boot_params *bp) { }
 
 static inline bool sme_active(void) { return false; }
 static inline bool sev_active(void) { return false; }
+static inline bool sev_es_active(void) { return false; }
 
 static inline int __init
 early_set_memory_decrypted(unsigned long vaddr, unsigned long size) { return 0; }
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 9eb279927fc2..e69743be869b 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -435,7 +435,9 @@
 #define MSR_AMD64_SEV_ES_GHCB		0xc0010130
 #define MSR_AMD64_SEV			0xc0010131
 #define MSR_AMD64_SEV_ENABLED_BIT	0
+#define MSR_AMD64_SEV_ES_ENABLED_BIT	1
 #define MSR_AMD64_SEV_ENABLED		BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
+#define MSR_AMD64_SEV_ES_ENABLED	BIT_ULL(MSR_AMD64_SEV_ES_ENABLED_BIT)
 
 #define MSR_AMD64_VIRT_SPEC_CTRL	0xc001011f
 
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index a03614bd3e1a..a35fcba24866 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -38,7 +38,9 @@
  * section is later cleared.
  */
 u64 sme_me_mask __section(.data) = 0;
+u64 sev_status __section(.data) = 0;
 EXPORT_SYMBOL(sme_me_mask);
+EXPORT_SYMBOL(sev_status);
 DEFINE_STATIC_KEY_FALSE(sev_enable_key);
 EXPORT_SYMBOL_GPL(sev_enable_key);
 
@@ -347,9 +349,16 @@ bool sme_active(void)
 
 bool sev_active(void)
 {
-	return sme_me_mask && sev_enabled;
+	return !!(sev_status & MSR_AMD64_SEV_ENABLED);
 }
 
+bool sev_es_active(void)
+{
+	return !!(sev_status & MSR_AMD64_SEV_ES_ENABLED);
+}
+EXPORT_SYMBOL_GPL(sev_es_active);
+
+
 /* Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED */
 bool force_dma_unencrypted(struct device *dev)
 {
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index e2b0e2ac07bb..68d75379e06a 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -540,6 +540,9 @@ void __init sme_enable(struct boot_params *bp)
 		if (!(msr & MSR_AMD64_SEV_ENABLED))
 			return;
 
+		/* Save SEV_STATUS to avoid reading MSR again */
+		sev_status = msr;
+
 		/* SEV state cannot be controlled by a command line option */
 		sme_me_mask = me_mask;
 		sev_enabled = true;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 37/70] x86/sev-es: Compile early handler code into kernel image
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Set up sev-es.c and include the code shared with the pre-decompression
stage so that it is also built into the image of the running kernel.
Temporarily add __maybe_unused annotations to avoid build warnings until
the functions get used.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/Makefile        |   1 +
 arch/x86/kernel/sev-es-shared.c |  21 +++--
 arch/x86/kernel/sev-es.c        | 162 ++++++++++++++++++++++++++++++++
 3 files changed, 174 insertions(+), 10 deletions(-)
 create mode 100644 arch/x86/kernel/sev-es.c

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9b0ebcf4b9f3..28b4a2ebba25 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -147,6 +147,7 @@ obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
 obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
 obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
 
+obj-$(CONFIG_AMD_MEM_ENCRYPT)		+= sev-es.o
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
index a632b8f041ec..7a6e4db669f0 100644
--- a/arch/x86/kernel/sev-es-shared.c
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -9,7 +9,7 @@
  * and is included directly into both code-bases.
  */
 
-static void sev_es_terminate(unsigned int reason)
+static void __maybe_unused sev_es_terminate(unsigned int reason)
 {
 	/* Request Guest Termination from Hypvervisor */
 	sev_es_wr_ghcb_msr(GHCB_SEV_TERMINATE);
@@ -19,7 +19,7 @@ static void sev_es_terminate(unsigned int reason)
 		asm volatile("hlt\n" : : : "memory");
 }
 
-static bool sev_es_negotiate_protocol(void)
+static bool __maybe_unused sev_es_negotiate_protocol(void)
 {
 	u64 val;
 
@@ -38,7 +38,7 @@ static bool sev_es_negotiate_protocol(void)
 	return true;
 }
 
-static void vc_ghcb_invalidate(struct ghcb *ghcb)
+static void __maybe_unused vc_ghcb_invalidate(struct ghcb *ghcb)
 {
 	memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
 }
@@ -50,9 +50,9 @@ static bool vc_decoding_needed(unsigned long exit_code)
 		 exit_code <= SVM_EXIT_LAST_EXCP);
 }
 
-static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
-				      struct pt_regs *regs,
-				      unsigned long exit_code)
+static enum es_result __maybe_unused vc_init_em_ctxt(struct es_em_ctxt *ctxt,
+						     struct pt_regs *regs,
+						     unsigned long exit_code)
 {
 	enum es_result ret = ES_OK;
 
@@ -65,7 +65,7 @@ static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
 	return ret;
 }
 
-static void vc_finish_insn(struct es_em_ctxt *ctxt)
+static void __maybe_unused vc_finish_insn(struct es_em_ctxt *ctxt)
 {
 	ctxt->regs->ip += ctxt->insn.length;
 }
@@ -312,7 +312,8 @@ static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 *exitinfo)
 	return ES_OK;
 }
 
-static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+static enum es_result __maybe_unused
+vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
 {
 	struct pt_regs *regs = ctxt->regs;
 	u64 exit_info_1, exit_info_2;
@@ -408,8 +409,8 @@ static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
 	return ret;
 }
 
-static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
-				      struct es_em_ctxt *ctxt)
+static enum es_result __maybe_unused vc_handle_cpuid(struct ghcb *ghcb,
+						     struct es_em_ctxt *ctxt)
 {
 	struct pt_regs *regs = ctxt->regs;
 	u32 cr4 = native_read_cr4();
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
new file mode 100644
index 000000000000..27fdef6b3700
--- /dev/null
+++ b/arch/x86/kernel/sev-es.c
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Memory Encryption Support
+ *
+ * Copyright (C) 2019 SUSE
+ *
+ * Author: Joerg Roedel <jroedel@suse.de>
+ */
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+#include <asm/trap_defs.h>
+#include <asm/sev-es.h>
+#include <asm/insn-eval.h>
+#include <asm/fpu/internal.h>
+#include <asm/processor.h>
+#include <asm/svm.h>
+
+static inline u64 sev_es_rd_ghcb_msr(void)
+{
+	return native_read_msr(MSR_AMD64_SEV_ES_GHCB);
+}
+
+static inline void sev_es_wr_ghcb_msr(u64 val)
+{
+	u32 low, high;
+
+	low  = (u32)(val);
+	high = (u32)(val >> 32);
+
+	native_write_msr(MSR_AMD64_SEV_ES_GHCB, low, high);
+}
+
+static int vc_fetch_insn_kernel(struct es_em_ctxt *ctxt,
+				unsigned char *buffer)
+{
+	return probe_kernel_read(buffer, (unsigned char *)ctxt->regs->ip,
+				 MAX_INSN_SIZE);
+}
+
+static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
+{
+	char buffer[MAX_INSN_SIZE];
+	enum es_result ret;
+	int res;
+
+	res = vc_fetch_insn_kernel(ctxt, buffer);
+	if (unlikely(res == -EFAULT)) {
+		ctxt->fi.vector     = X86_TRAP_PF;
+		ctxt->fi.error_code = 0;
+		ctxt->fi.cr2        = ctxt->regs->ip;
+		return ES_EXCEPTION;
+	}
+
+	insn_init(&ctxt->insn, buffer, MAX_INSN_SIZE - res, 1);
+	insn_get_length(&ctxt->insn);
+
+	ret = ctxt->insn.immediate.got ? ES_OK : ES_DECODE_FAILED;
+
+	return ret;
+}
+
+static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
+				   char *dst, char *buf, size_t size)
+{
+	unsigned long error_code = X86_PF_PROT | X86_PF_WRITE;
+	unsigned char *target = dst;
+	u64 d8;
+	u32 d4;
+	u16 d2;
+	u8  d1;
+
+	switch (size) {
+	case 1:
+		memcpy(&d1, buf, 1);
+		if (put_user(d1, target))
+			goto fault;
+		break;
+	case 2:
+		memcpy(&d2, buf, 2);
+		if (put_user(d2, target))
+			goto fault;
+		break;
+	case 4:
+		memcpy(&d4, buf, 4);
+		if (put_user(d4, target))
+			goto fault;
+		break;
+	case 8:
+		memcpy(&d8, buf, 8);
+		if (put_user(d8, target))
+			goto fault;
+		break;
+	default:
+		WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+		return ES_UNSUPPORTED;
+	}
+
+	return ES_OK;
+
+fault:
+	if (user_mode(ctxt->regs))
+		error_code |= X86_PF_USER;
+
+	ctxt->fi.vector = X86_TRAP_PF;
+	ctxt->fi.error_code = error_code;
+	ctxt->fi.cr2 = (unsigned long)dst;
+
+	return ES_EXCEPTION;
+}
+
+static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
+				  char *src, char *buf, size_t size)
+{
+	unsigned long error_code = X86_PF_PROT;
+	u64 d8;
+	u32 d4;
+	u16 d2;
+	u8  d1;
+
+	switch (size) {
+	case 1:
+		if (get_user(d1, src))
+			goto fault;
+		memcpy(buf, &d1, 1);
+		break;
+	case 2:
+		if (get_user(d2, src))
+			goto fault;
+		memcpy(buf, &d2, 2);
+		break;
+	case 4:
+		if (get_user(d4, src))
+			goto fault;
+		memcpy(buf, &d4, 4);
+		break;
+	case 8:
+		if (get_user(d8, src))
+			goto fault;
+		memcpy(buf, &d8, 8);
+		break;
+	default:
+		WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
+		return ES_UNSUPPORTED;
+	}
+
+	return ES_OK;
+
+fault:
+	if (user_mode(ctxt->regs))
+		error_code |= X86_PF_USER;
+
+	ctxt->fi.vector = X86_TRAP_PF;
+	ctxt->fi.error_code = error_code;
+	ctxt->fi.cr2 = (unsigned long)src;
+
+	return ES_EXCEPTION;
+}
+
+/* Include code shared with pre-decompression boot stage */
+#include "sev-es-shared.c"
-- 
2.17.1
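
When an access fails, vc_write_mem() and vc_read_mem() above do not fault
themselves. They record a synthetic page fault in ctxt->fi and return
ES_EXCEPTION. Below is a minimal userspace sketch of that bookkeeping.
The names sketch_write_mem, fake_put and struct fault_info are
hypothetical stand-ins (fake_put replaces put_user() and simply fails on
a NULL target). Only the error-code bit values and vector 14 for #PF are
architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Architectural x86 page-fault error-code bits (P, W/R, U/S). */
#define PF_PROT		(1UL << 0)
#define PF_WRITE	(1UL << 1)
#define PF_USER		(1UL << 2)

#define TRAP_PF		14	/* #PF vector number */

enum result { R_OK, R_UNSUPPORTED, R_EXCEPTION };

/* Hypothetical mirror of the fault information kept in ctxt->fi. */
struct fault_info {
	unsigned long vector;
	unsigned long error_code;
	unsigned long cr2;
};

/* Hypothetical stand-in for put_user(): fails when the target is NULL. */
static int fake_put(void *dst, const void *src, size_t size)
{
	if (!dst)
		return -1;
	memcpy(dst, src, size);
	return 0;
}

static enum result sketch_write_mem(struct fault_info *fi, bool user_mode,
				    void *dst, const void *buf, size_t size)
{
	unsigned long error_code = PF_PROT | PF_WRITE;

	assert(fi != NULL);

	/* Only naturally sized accesses are emulated. */
	switch (size) {
	case 1:
	case 2:
	case 4:
	case 8:
		if (fake_put(dst, buf, size))
			goto fault;
		return R_OK;
	default:
		return R_UNSUPPORTED;
	}

fault:
	/* Describe the fault the way hardware would have reported it. */
	if (user_mode)
		error_code |= PF_USER;

	fi->vector     = TRAP_PF;
	fi->error_code = error_code;
	fi->cr2        = (unsigned long)(uintptr_t)dst;

	return R_EXCEPTION;
}
```

A read-side sketch would look the same without PF_WRITE in the initial
error code, matching the difference between the two helpers in the patch.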


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 38/70] x86/sev-es: Setup early #VC handler
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Set up an early handler for #VC exceptions. There is no GHCB mapped
yet, so just re-use the vc_no_ghcb_handler. It can only handle CPUID
exit-codes, but that should be enough to get the kernel through
verify_cpu() and __startup_64() until it runs on virtual addresses.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/desc.h      |  1 +
 arch/x86/include/asm/processor.h |  1 +
 arch/x86/include/asm/sev-es.h    |  2 ++
 arch/x86/kernel/head64.c         | 17 +++++++++++++++
 arch/x86/kernel/head_64.S        | 36 ++++++++++++++++++++++++++++++++
 arch/x86/kernel/idt.c            | 10 +++++++++
 6 files changed, 67 insertions(+)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 80bf63c08007..30e2a0e863b6 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -388,6 +388,7 @@ static inline void set_desc_limit(struct desc_struct *desc, unsigned long limit)
 
 void update_intr_gate(unsigned int n, const void *addr);
 void alloc_intr_gate(unsigned int n, const void *addr);
+void set_early_idt_handler(gate_desc *idt, int n, void *handler);
 
 static inline void init_idt_data(struct idt_data *data, unsigned int n,
 				 const void *addr)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 09705ccc393c..4622427d01d4 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -768,6 +768,7 @@ extern int sysenter_setup(void);
 
 /* Defined in head.S */
 extern struct desc_ptr		early_gdt_descr;
+extern struct desc_ptr		early_idt_descr;
 
 extern void switch_to_new_gdt(int);
 extern void load_direct_gdt(int);
diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
index 512d3ccb9832..caa29f75ce41 100644
--- a/arch/x86/include/asm/sev-es.h
+++ b/arch/x86/include/asm/sev-es.h
@@ -75,4 +75,6 @@ static inline u64 copy_lower_bits(u64 out, u64 in, unsigned int bits)
 	return out;
 }
 
+extern void early_vc_handler(void);
+
 #endif
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 8ccca109750d..b8613fc0a364 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -38,6 +38,7 @@
 #include <asm/fixmap.h>
 #include <asm/extable.h>
 #include <asm/trap_defs.h>
+#include <asm/sev-es.h>
 
 /*
  * Manage page tables very early on.
@@ -515,3 +516,19 @@ void __head early_idt_setup_early_handler(unsigned long physaddr)
 		native_write_idt_entry(idt, i, &desc);
 	}
 }
+
+void __head early_idt_setup(unsigned long physbase)
+{
+	gate_desc *idt = fixup_pointer(idt_table, physbase);
+	void __maybe_unused *handler;
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	/* VMM Communication Exception */
+	handler = fixup_pointer(early_vc_handler, physbase);
+	set_early_idt_handler(idt, X86_TRAP_VC, handler);
+#endif
+
+	/* Initialize IDT descriptor and load IDT */
+	early_idt_descr.address = (unsigned long)idt;
+	native_load_idt(&early_idt_descr);
+}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index bc0622a72d6d..b3acecdabd34 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -92,6 +92,12 @@ SYM_CODE_START_NOALIGN(startup_64)
 .Lon_kernel_cs:
 	UNWIND_HINT_EMPTY
 
+	/* Setup IDT - Needed for SEV-ES */
+	leaq	_text(%rip), %rdi
+	pushq	%rsi
+	call	early_idt_setup
+	popq	%rsi
+
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
@@ -370,6 +376,33 @@ SYM_CODE_START_LOCAL(early_idt_handler_common)
 	jmp restore_regs_and_return_to_kernel
 SYM_CODE_END(early_idt_handler_common)
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+/*
+ * VC Exception handler used during very early boot. The
+ * early_idt_handler_array can't be used because it returns via the
+ * paravirtualized INTERRUPT_RETURN and pv-ops don't work that early.
+ */
+SYM_CODE_START_NOALIGN(early_vc_handler)
+	UNWIND_HINT_IRET_REGS offset=8
+
+	/* Build pt_regs */
+	PUSH_AND_CLEAR_REGS
+
+	/* Call C handler */
+	movq    %rsp, %rdi
+	movq	ORIG_RAX(%rsp), %rsi
+	call    vc_no_ghcb_handler
+
+	/* Unwind pt_regs */
+	POP_REGS
+
+	/* Remove Error Code */
+	addq    $8, %rsp
+
+	/* Pure iret required here - don't use INTERRUPT_RETURN */
+	iretq
+SYM_CODE_END(early_vc_handler)
+#endif
 
 #define SYM_DATA_START_PAGE_ALIGNED(name)			\
 	SYM_START(name, SYM_L_GLOBAL, .balign PAGE_SIZE)
@@ -511,6 +544,9 @@ SYM_DATA_END(level1_fixmap_pgt)
 SYM_DATA(early_gdt_descr,		.word GDT_ENTRIES*8-1)
 SYM_DATA_LOCAL(early_gdt_descr_base,	.quad INIT_PER_CPU_VAR(gdt_page))
 
+SYM_DATA(early_idt_descr,		.word NUM_EXCEPTION_VECTORS * 16)
+SYM_DATA_LOCAL(early_idt_descr_base,	.quad 0)
+
 	.align 16
 /* This must match the first entry in level2_kernel_pgt */
 SYM_DATA(phys_base, .quad 0x0)
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 4a2c7791c697..135d208a2d38 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -341,3 +341,13 @@ void alloc_intr_gate(unsigned int n, const void *addr)
 	if (!test_and_set_bit(n, system_vectors))
 		set_intr_gate(n, addr);
 }
+
+void set_early_idt_handler(gate_desc *idt, int n, void *handler)
+{
+	struct idt_data data;
+	gate_desc desc;
+
+	init_idt_data(&data, n, handler);
+	idt_init_desc(&desc, &data);
+	native_write_idt_entry(idt, n, &desc);
+}
-- 
2.17.1
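
As background for set_early_idt_handler() above, here is a minimal
userspace sketch of how a handler address is split across the fields of
a 16-byte long-mode interrupt gate. struct gate64 and the sketch_*
helpers are hypothetical names and not the kernel's gate_desc API. The
field layout and the 0x8e attribute byte (present, DPL 0, interrupt
gate) follow the x86-64 architecture. Architecturally, the limit field
of an IDT descriptor holds the offset of the last valid byte, so the
sketch computes the table size minus one. The early_idt_descr in this
patch uses NUM_EXCEPTION_VECTORS * 16 instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit mode interrupt gate: 16 bytes, handler offset split in three. */
struct gate64 {
	uint16_t offset_low;	/* handler bits 0-15 */
	uint16_t segment;	/* code segment selector */
	uint8_t  ist;		/* bits 0-2: IST index, rest zero */
	uint8_t  type_attr;	/* type, DPL and present bit */
	uint16_t offset_middle;	/* handler bits 16-31 */
	uint32_t offset_high;	/* handler bits 32-63 */
	uint32_t reserved;
};

_Static_assert(sizeof(struct gate64) == 16, "long-mode gates are 16 bytes");

/* Build a present, DPL-0 interrupt gate for the given handler address. */
static struct gate64 sketch_pack_gate(uint64_t handler, uint16_t segment)
{
	struct gate64 g = {
		.offset_low    = (uint16_t)(handler & 0xffff),
		.segment       = segment,
		.ist           = 0,
		.type_attr     = 0x8e,
		.offset_middle = (uint16_t)((handler >> 16) & 0xffff),
		.offset_high   = (uint32_t)(handler >> 32),
		.reserved      = 0,
	};

	return g;
}

/* Reassemble the handler address from the three offset fields. */
static uint64_t sketch_gate_offset(const struct gate64 *g)
{
	assert(g != NULL);

	return (uint64_t)g->offset_low |
	       ((uint64_t)g->offset_middle << 16) |
	       ((uint64_t)g->offset_high << 32);
}

/* Limit value for a table of nr_gates entries: last valid byte offset. */
static uint16_t sketch_idt_limit(unsigned int nr_gates)
{
	return (uint16_t)(nr_gates * sizeof(struct gate64) - 1);
}
```

Because the address is stored in the gate itself, early_idt_setup() has
to pass an already fixed-up pointer while the kernel still runs on
physical addresses, which is what the fixup_pointer() call in the patch
provides.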


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 39/70] x86/sev-es: Setup GHCB based boot #VC handler
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (37 preceding siblings ...)
  2020-03-19  9:13   ` Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler Joerg Roedel
                   ` (30 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add the infrastructure to handle #VC exceptions when the kernel runs
on virtual addresses and has a GHCB mapped. This handler will be used
until the runtime #VC handler takes over.
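
The control flow of the boot handler can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions: all
names are hypothetical, the protocol negotiation is reduced to a boolean
argument, and the outcome of each result code is reduced to an enum
instead of actually advancing the instruction pointer, forwarding an
exception, or halting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the result codes handled by the boot handler. */
enum sketch_result {
	S_OK, S_UNSUPPORTED, S_VMM_ERROR, S_DECODE_FAILED, S_EXCEPTION, S_RETRY,
};

/* What the handler does next for a given result. */
enum sketch_action { ACT_ADVANCE_IP, ACT_FORWARD, ACT_RETRY, ACT_HALT };

/* Stand-in for the page-sized boot GHCB and the pointer publishing it. */
struct sketch_ghcb { unsigned char bytes[4096]; };

static struct sketch_ghcb boot_page;
static struct sketch_ghcb *boot_ptr;	/* NULL until the first exception */

/* One-time setup: negotiate, clear the page, then publish the pointer. */
static bool sketch_setup(bool protocol_ok)
{
	if (!protocol_ok)
		return false;

	memset(&boot_page, 0, sizeof(boot_page));
	boot_ptr = &boot_page;

	return true;
}

/* Map a handling result to the follow-up action. */
static enum sketch_action sketch_dispatch(enum sketch_result result)
{
	switch (result) {
	case S_OK:
		return ACT_ADVANCE_IP;	/* skip the emulated instruction */
	case S_EXCEPTION:
		return ACT_FORWARD;	/* hand the fault to the early handler */
	case S_RETRY:
		return ACT_RETRY;	/* re-execute the instruction */
	case S_UNSUPPORTED:
	case S_VMM_ERROR:
	case S_DECODE_FAILED:
		return ACT_HALT;	/* print a message and stop */
	}

	assert(!"unknown result");
	return ACT_HALT;
}
```

The pointer is published only after the page has been cleared, so a
NULL check on it is enough to tell whether setup has already run.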

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/segment.h  |   2 +-
 arch/x86/include/asm/sev-es.h   |   1 +
 arch/x86/kernel/head64.c        |   6 ++
 arch/x86/kernel/sev-es-shared.c |  14 ++--
 arch/x86/kernel/sev-es.c        | 116 ++++++++++++++++++++++++++++++++
 arch/x86/mm/extable.c           |   1 +
 6 files changed, 132 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index 6669164abadc..5b648066504c 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -230,7 +230,7 @@
 #define NUM_EXCEPTION_VECTORS		32
 
 /* Bitmask of exception vectors which push an error code on the stack: */
-#define EXCEPTION_ERRCODE_MASK		0x00027d00
+#define EXCEPTION_ERRCODE_MASK		0x20027d00
 
 #define GDT_SIZE			(GDT_ENTRIES*8)
 #define GDT_ENTRY_TLS_ENTRIES		3
diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
index caa29f75ce41..122b3e71a788 100644
--- a/arch/x86/include/asm/sev-es.h
+++ b/arch/x86/include/asm/sev-es.h
@@ -76,5 +76,6 @@ static inline u64 copy_lower_bits(u64 out, u64 in, unsigned int bits)
 }
 
 extern void early_vc_handler(void);
+extern bool boot_vc_exception(struct pt_regs *regs);
 
 #endif
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index b8613fc0a364..850c435b9fa5 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -386,6 +386,12 @@ void __init early_exception(struct pt_regs *regs, int trapnr)
 	    early_make_pgtable(native_read_cr2()))
 		return;
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	if (trapnr == X86_TRAP_VC &&
+	    boot_vc_exception(regs))
+		return;
+#endif
+
 	early_fixup_exception(regs, trapnr);
 }
 
diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
index 7a6e4db669f0..b178a2db61a7 100644
--- a/arch/x86/kernel/sev-es-shared.c
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -9,7 +9,7 @@
  * and is included directly into both code-bases.
  */
 
-static void __maybe_unused sev_es_terminate(unsigned int reason)
+static void sev_es_terminate(unsigned int reason)
 {
 	/* Request Guest Termination from Hypvervisor */
 	sev_es_wr_ghcb_msr(GHCB_SEV_TERMINATE);
@@ -19,7 +19,7 @@ static void __maybe_unused sev_es_terminate(unsigned int reason)
 		asm volatile("hlt\n" : : : "memory");
 }
 
-static bool __maybe_unused sev_es_negotiate_protocol(void)
+static bool sev_es_negotiate_protocol(void)
 {
 	u64 val;
 
@@ -38,7 +38,7 @@ static bool __maybe_unused sev_es_negotiate_protocol(void)
 	return true;
 }
 
-static void __maybe_unused vc_ghcb_invalidate(struct ghcb *ghcb)
+static void vc_ghcb_invalidate(struct ghcb *ghcb)
 {
 	memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
 }
@@ -50,9 +50,9 @@ static bool vc_decoding_needed(unsigned long exit_code)
 		 exit_code <= SVM_EXIT_LAST_EXCP);
 }
 
-static enum es_result __maybe_unused vc_init_em_ctxt(struct es_em_ctxt *ctxt,
-						     struct pt_regs *regs,
-						     unsigned long exit_code)
+static enum es_result vc_init_em_ctxt(struct es_em_ctxt *ctxt,
+				      struct pt_regs *regs,
+				      unsigned long exit_code)
 {
 	enum es_result ret = ES_OK;
 
@@ -65,7 +65,7 @@ static enum es_result __maybe_unused vc_init_em_ctxt(struct es_em_ctxt *ctxt,
 	return ret;
 }
 
-static void __maybe_unused vc_finish_insn(struct es_em_ctxt *ctxt)
+static void vc_finish_insn(struct es_em_ctxt *ctxt)
 {
 	ctxt->regs->ip += ctxt->insn.length;
 }
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 27fdef6b3700..c17980e8db78 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -7,7 +7,9 @@
  * Author: Joerg Roedel <jroedel@suse.de>
  */
 
+#include <linux/sched/debug.h>	/* For show_regs() */
 #include <linux/kernel.h>
+#include <linux/printk.h>
 #include <linux/mm.h>
 
 #include <asm/trap_defs.h>
@@ -15,8 +17,21 @@
 #include <asm/insn-eval.h>
 #include <asm/fpu/internal.h>
 #include <asm/processor.h>
+#include <asm/trap_defs.h>
 #include <asm/svm.h>
 
+/* For early boot hypervisor communication in SEV-ES enabled guests */
+struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
+
+/*
+ * Needs to be in the .data section because we need it NULL before bss is
+ * cleared
+ */
+struct ghcb __initdata *boot_ghcb;
+
+/* Needed in vc_early_vc_forward_exception */
+extern void early_exception(struct pt_regs *regs, int trapnr);
+
 static inline u64 sev_es_rd_ghcb_msr(void)
 {
 	return native_read_msr(MSR_AMD64_SEV_ES_GHCB);
@@ -160,3 +175,104 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 
 /* Include code shared with pre-decompression boot stage */
 #include "sev-es-shared.c"
+
+/*
+ * This function runs on the first #VC exception after the kernel
+ * switched to virtual addresses.
+ */
+static bool __init sev_es_setup_ghcb(void)
+{
+	/* First make sure the hypervisor talks a supported protocol. */
+	if (!sev_es_negotiate_protocol())
+		return false;
+	/*
+	 * Clear the boot_ghcb. The first exception comes in before the bss
+	 * section is cleared.
+	 */
+	memset(&boot_ghcb_page, 0, PAGE_SIZE);
+
+	/* Alright - Make the boot-ghcb public */
+	boot_ghcb = &boot_ghcb_page;
+
+	return true;
+}
+
+static void __init vc_early_vc_forward_exception(struct es_em_ctxt *ctxt)
+{
+	int trapnr = ctxt->fi.vector;
+
+	if (trapnr == X86_TRAP_PF)
+		native_write_cr2(ctxt->fi.cr2);
+
+	ctxt->regs->orig_ax = ctxt->fi.error_code;
+	early_exception(ctxt->regs, trapnr);
+}
+
+static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
+		struct ghcb *ghcb,
+		unsigned long exit_code)
+{
+	enum es_result result;
+
+	switch (exit_code) {
+	default:
+		/*
+		 * Unexpected #VC exception
+		 */
+		result = ES_UNSUPPORTED;
+	}
+
+	return result;
+}
+
+bool __init boot_vc_exception(struct pt_regs *regs)
+{
+	unsigned long exit_code = regs->orig_ax;
+	struct es_em_ctxt ctxt;
+	enum es_result result;
+
+	/* Do initial setup or terminate the guest */
+	if (unlikely(boot_ghcb == NULL && !sev_es_setup_ghcb()))
+		sev_es_terminate(GHCB_SEV_ES_REASON_GENERAL_REQUEST);
+
+	vc_ghcb_invalidate(boot_ghcb);
+	result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+
+	if (result == ES_OK)
+		result = vc_handle_exitcode(&ctxt, boot_ghcb, exit_code);
+
+	/* Done - now check the result */
+	switch (result) {
+	case ES_OK:
+		vc_finish_insn(&ctxt);
+		break;
+	case ES_UNSUPPORTED:
+		early_printk("PANIC: Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
+				exit_code, regs->ip);
+		goto fail;
+	case ES_VMM_ERROR:
+		early_printk("PANIC: Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
+				exit_code, regs->ip);
+		goto fail;
+	case ES_DECODE_FAILED:
+		early_printk("PANIC: Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
+				exit_code, regs->ip);
+		goto fail;
+	case ES_EXCEPTION:
+		vc_early_vc_forward_exception(&ctxt);
+		break;
+	case ES_RETRY:
+		/* Nothing to do */
+		break;
+	default:
+		BUG();
+	}
+
+	return true;
+
+fail:
+	show_regs(regs);
+
+	while (true)
+		halt();
+}
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 30bb0bd3b1b8..cd440a9cf422 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -5,6 +5,7 @@
 #include <xen/xen.h>
 
 #include <asm/fpu/internal.h>
+#include <asm/sev-es.h>
 #include <asm/traps.h>
 #include <asm/kdebug.h>
 
-- 
2.17.1
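
The EXCEPTION_ERRCODE_MASK change above sets exactly one more bit, the
one for vector 29 (#VC), so the early IDT stubs treat #VC as an
exception that pushes an error code. A small sketch that checks this
follows. The macro and function names are hypothetical. The two mask
values are taken from the hunk and the vector numbers are architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OLD_ERRCODE_MASK	0x00027d00u	/* value before this patch */
#define NEW_ERRCODE_MASK	0x20027d00u	/* value after this patch */

#define TRAP_PF			14		/* #PF vector */
#define TRAP_VC			29		/* #VC vector */

/* True if the given exception vector pushes an error code per the mask. */
static bool pushes_error_code(uint32_t mask, unsigned int vector)
{
	assert(vector < 32);

	return (mask >> vector) & 1u;
}
```

For example, pushes_error_code(OLD_ERRCODE_MASK, TRAP_VC) is false while
pushes_error_code(NEW_ERRCODE_MASK, TRAP_VC) is true, and the two masks
differ in no other bit.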


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (38 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 39/70] x86/sev-es: Setup GHCB based boot " Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-04-14 19:03     ` Mike Stunes
  2020-03-19  9:13   ` Joerg Roedel
                   ` (29 subsequent siblings)
  69 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

The runtime handler needs a GHCB per CPU. Set them up and map them
unencrypted.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/mem_encrypt.h |  2 ++
 arch/x86/kernel/sev-es.c           | 28 +++++++++++++++++++++++++++-
 arch/x86/kernel/traps.c            |  3 +++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index 6f61bb93366a..8b69b389688f 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -48,6 +48,7 @@ int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
 void __init mem_encrypt_init(void);
 void __init mem_encrypt_free_decrypted_mem(void);
 
+void __init sev_es_init_ghcbs(void);
 bool sme_active(void);
 bool sev_active(void);
 bool sev_es_active(void);
@@ -71,6 +72,7 @@ static inline void __init sme_early_init(void) { }
 static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
 static inline void __init sme_enable(struct boot_params *bp) { }
 
+static inline void sev_es_init_ghcbs(void) { }
 static inline bool sme_active(void) { return false; }
 static inline bool sev_active(void) { return false; }
 static inline bool sev_es_active(void) { return false; }
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index c17980e8db78..4bf5286310a0 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -8,8 +8,11 @@
  */
 
 #include <linux/sched/debug.h>	/* For show_regs() */
-#include <linux/kernel.h>
+#include <linux/percpu-defs.h>
+#include <linux/mem_encrypt.h>
 #include <linux/printk.h>
+#include <linux/set_memory.h>
+#include <linux/kernel.h>
 #include <linux/mm.h>
 
 #include <asm/trap_defs.h>
@@ -29,6 +32,9 @@ struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
  */
 struct ghcb __initdata *boot_ghcb;
 
+/* Runtime GHCB pointers */
+static struct ghcb __percpu *ghcb_page;
+
 /* Needed in vc_early_vc_forward_exception */
 extern void early_exception(struct pt_regs *regs, int trapnr);
 
@@ -197,6 +203,26 @@ static bool __init sev_es_setup_ghcb(void)
 	return true;
 }
 
+void __init sev_es_init_ghcbs(void)
+{
+	int cpu;
+
+	if (!sev_es_active())
+		return;
+
+	/* Allocate GHCB pages */
+	ghcb_page = __alloc_percpu(sizeof(struct ghcb), PAGE_SIZE);
+
+	/* Initialize per-cpu GHCB pages */
+	for_each_possible_cpu(cpu) {
+		struct ghcb *ghcb = (struct ghcb *)per_cpu_ptr(ghcb_page, cpu);
+
+		set_memory_decrypted((unsigned long)ghcb,
+				     sizeof(*ghcb) >> PAGE_SHIFT);
+		memset(ghcb, 0, sizeof(*ghcb));
+	}
+}
+
 static void __init vc_early_vc_forward_exception(struct es_em_ctxt *ctxt)
 {
 	int trapnr = ctxt->fi.vector;
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6ef00eb6fbb9..09bebda9b053 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -918,6 +918,9 @@ void __init trap_init(void)
 	/* Init cpu_entry_area before IST entries are set up */
 	setup_cpu_entry_areas();
 
+	/* Init GHCB memory pages when running as an SEV-ES guest */
+	sev_es_init_ghcbs();
+
 	idt_setup_traps();
 
 	/*
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread
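
For readers following along, here is a minimal user-space sketch of the
setup step in the patch above: one page-sized, page-aligned GHCB slot per
CPU, zeroed, with the page count derived from the structure size. It is an
illustration only, not the kernel code. The sketch_* names, the fixed CPU
count and the static array are assumptions made here, and the call to
set_memory_decrypted() is represented by a comment because it has no
user-space equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  4096
#define SKETCH_NR_CPUS    4	/* hypothetical CPU count */

/* Hypothetical stand-in for struct ghcb: exactly one page in size. */
struct sketch_ghcb {
	unsigned char bytes[SKETCH_PAGE_SIZE];
};

/* One page-aligned slot per CPU, modelling the per-cpu allocation. */
static struct sketch_ghcb sketch_ghcb_pages[SKETCH_NR_CPUS]
	__attribute__((aligned(SKETCH_PAGE_SIZE)));

/* Page count that would be passed to the decrypt call for one slot. */
static size_t sketch_ghcb_num_pages(void)
{
	return sizeof(struct sketch_ghcb) >> SKETCH_PAGE_SHIFT;
}

/* Zero every CPU's slot; returns the total number of pages touched. */
static size_t sketch_init_ghcbs(void)
{
	size_t pages = 0;
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		struct sketch_ghcb *ghcb = &sketch_ghcb_pages[cpu];

		/* The kernel would call set_memory_decrypted() here. */
		memset(ghcb, 0, sizeof(*ghcb));
		pages += sketch_ghcb_num_pages();
	}

	return pages;
}
```

The shift only yields a non-zero page count because the structure is a
whole number of pages, which is why the sketch makes its stand-in exactly
one page large.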

* [PATCH 41/70] x86/sev-es: Add Runtime #VC Exception Handler
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
@ 2020-03-19  9:13   ` Joerg Roedel
  2020-03-19  9:12   ` Joerg Roedel
                     ` (68 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Add the handler for #VC exceptions invoked at runtime.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_64.S    |  4 ++
 arch/x86/include/asm/traps.h |  7 ++++
 arch/x86/kernel/idt.c        |  4 +-
 arch/x86/kernel/sev-es.c     | 77 +++++++++++++++++++++++++++++++++++-
 4 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f2bb91e87877..729876d368c5 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1210,6 +1210,10 @@ idtentry async_page_fault	do_async_page_fault	has_error_code=1	read_cr2=1
 idtentry machine_check		do_mce			has_error_code=0	paranoid=1
 #endif
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+idtentry vmm_communication     do_vmm_communication    has_error_code=1
+#endif
+
 /*
  * Save all registers in pt_regs, and switch gs if needed.
  * Use slow, but surefire "are we in kernel?" check.
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 2aa786484bb1..1be25c065698 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -35,6 +35,9 @@ asmlinkage void alignment_check(void);
 #ifdef CONFIG_X86_MCE
 asmlinkage void machine_check(void);
 #endif /* CONFIG_X86_MCE */
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+asmlinkage void vmm_communication(void);
+#endif
 asmlinkage void simd_coprocessor_error(void);
 
 #if defined(CONFIG_X86_64) && defined(CONFIG_XEN_PV)
@@ -93,6 +96,10 @@ dotraplinkage void do_alignment_check(struct pt_regs *regs, long error_code);
 dotraplinkage void do_machine_check(struct pt_regs *regs, long error_code);
 #endif
 dotraplinkage void do_simd_coprocessor_error(struct pt_regs *regs, long error_code);
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+dotraplinkage void do_vmm_communication(struct pt_regs *regs,
+					unsigned long exit_code);
+#endif
 #ifdef CONFIG_X86_32
 dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code);
 #endif
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 135d208a2d38..25fa8ba70993 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -88,8 +88,10 @@ static const __initconst struct idt_data def_idts[] = {
 #ifdef CONFIG_X86_MCE
 	INTG(X86_TRAP_MC,		&machine_check),
 #endif
-
 	SYSG(X86_TRAP_OF,		overflow),
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	INTG(X86_TRAP_VC,               vmm_communication),
+#endif
 #if defined(CONFIG_IA32_EMULATION)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_compat),
 #elif defined(CONFIG_X86_32)
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 4bf5286310a0..97241d2f0f70 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -20,7 +20,7 @@
 #include <asm/insn-eval.h>
 #include <asm/fpu/internal.h>
 #include <asm/processor.h>
-#include <asm/trap_defs.h>
+#include <asm/traps.h>
 #include <asm/svm.h>
 
 /* For early boot hypervisor communication in SEV-ES enabled guests */
@@ -251,6 +251,81 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	return result;
 }
 
+static void vc_forward_exception(struct es_em_ctxt *ctxt)
+{
+	long error_code = ctxt->fi.error_code;
+	int trapnr = ctxt->fi.vector;
+
+	ctxt->regs->orig_ax = ctxt->fi.error_code;
+
+	switch (trapnr) {
+	case X86_TRAP_GP:
+		do_general_protection(ctxt->regs, error_code);
+		break;
+	case X86_TRAP_UD:
+		do_invalid_op(ctxt->regs, 0);
+		break;
+	default:
+		BUG();
+	}
+}
+
+dotraplinkage void do_vmm_communication(struct pt_regs *regs, unsigned long exit_code)
+{
+	struct es_em_ctxt ctxt;
+	enum es_result result;
+	struct ghcb *ghcb;
+
+	/*
+	 * This is invoked through an interrupt gate, so IRQs are disabled. The
+	 * code below might walk page-tables for user or kernel addresses, so
+	 * keep the IRQs disabled to protect us against concurrent TLB flushes.
+	 */
+
+	ghcb = (struct ghcb *)this_cpu_ptr(ghcb_page);
+
+	vc_ghcb_invalidate(ghcb);
+	result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+
+	if (result == ES_OK)
+		result = vc_handle_exitcode(&ctxt, ghcb, exit_code);
+
+	/* Done - now check the result */
+	switch (result) {
+	case ES_OK:
+		vc_finish_insn(&ctxt);
+		break;
+	case ES_UNSUPPORTED:
+		pr_emerg("Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
+			 exit_code, regs->ip);
+		goto fail;
+	case ES_VMM_ERROR:
+		pr_emerg("PANIC: Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
+			 exit_code, regs->ip);
+		goto fail;
+	case ES_DECODE_FAILED:
+		pr_emerg("PANIC: Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
+			 exit_code, regs->ip);
+		goto fail;
+	case ES_EXCEPTION:
+		vc_forward_exception(&ctxt);
+		break;
+	case ES_RETRY:
+		/* Nothing to do */
+		break;
+	default:
+		BUG();
+	}
+
+	return;
+
+fail:
+	show_regs(regs);
+
+	while (true)
+		halt();
+}
+
 bool __init boot_vc_exception(struct pt_regs *regs)
 {
 	unsigned long exit_code = regs->orig_ax;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 42/70] x86/sev-es: Support nested #VC exceptions
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (40 preceding siblings ...)
  2020-03-19  9:13   ` Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19 15:46     ` Andy Lutomirski
  2020-03-19  9:13 ` [PATCH 43/70] x86/sev-es: Wire up existing #VC exit-code handlers Joerg Roedel
                   ` (27 subsequent siblings)
  69 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Handle #VC exceptions that are raised while the GHCB is in use. This can
occur when an NMI hits inside the #VC exception handler and the NMI
handler itself causes a #VC exception. Save the contents of the GHCB
when nesting is detected and restore them when the nested user is done
with the GHCB.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 63 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 97241d2f0f70..3b7bbc8d841e 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -32,9 +32,57 @@ struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
  */
 struct ghcb __initdata *boot_ghcb;
 
+struct ghcb_state {
+	struct ghcb *ghcb;
+};
+
 /* Runtime GHCB pointers */
 static struct ghcb __percpu *ghcb_page;
 
+/*
+ * Mark the per-cpu GHCB as in-use to detect nested #VC exceptions.
+ * There is no need for it to be atomic, because nothing is written to the GHCB
+ * between the read and the write of ghcb_active. So it is safe to use it when a
+ * nested #VC exception happens before the write.
+ */
+static DEFINE_PER_CPU(bool, ghcb_active);
+
+static struct ghcb *sev_es_get_ghcb(struct ghcb_state *state)
+{
+	struct ghcb *ghcb = (struct ghcb *)this_cpu_ptr(ghcb_page);
+	bool *active = this_cpu_ptr(&ghcb_active);
+
+	if (unlikely(*active)) {
+		/* GHCB is already in use - save its contents */
+
+		state->ghcb = kzalloc(sizeof(struct ghcb), GFP_ATOMIC);
+		if (!state->ghcb)
+			return NULL;
+
+		*state->ghcb = *ghcb;
+	} else {
+		state->ghcb = NULL;
+		*active = true;
+	}
+
+	return ghcb;
+}
+
+static void sev_es_put_ghcb(struct ghcb_state *state)
+{
+	bool *active = this_cpu_ptr(&ghcb_active);
+	struct ghcb *ghcb = (struct ghcb *)this_cpu_ptr(ghcb_page);
+
+	if (state->ghcb) {
+		/* Restore saved state and free backup memory */
+		*ghcb = *state->ghcb;
+		kfree(state->ghcb);
+		state->ghcb = NULL;
+	} else {
+		*active = false;
+	}
+}
+
 /* Needed in vc_early_vc_forward_exception */
 extern void early_exception(struct pt_regs *regs, int trapnr);
 
@@ -272,6 +320,7 @@ static void vc_forward_exception(struct es_em_ctxt *ctxt)
 
 dotraplinkage void do_vmm_communication(struct pt_regs *regs, unsigned long exit_code)
 {
+	struct ghcb_state state;
 	struct es_em_ctxt ctxt;
 	enum es_result result;
 	struct ghcb *ghcb;
@@ -282,14 +331,20 @@ dotraplinkage void do_vmm_communication(struct pt_regs *regs, unsigned long exit
 	 * keep the IRQs disabled to protect us against concurrent TLB flushes.
 	 */
 
-	ghcb = (struct ghcb *)this_cpu_ptr(ghcb_page);
-
-	vc_ghcb_invalidate(ghcb);
-	result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+	ghcb = sev_es_get_ghcb(&state);
+	if (!ghcb) {
+		/* This can only fail on an allocation error, so just retry */
+		result = ES_RETRY;
+	} else {
+		vc_ghcb_invalidate(ghcb);
+		result = vc_init_em_ctxt(&ctxt, regs, exit_code);
+	}
 
 	if (result == ES_OK)
 		result = vc_handle_exitcode(&ctxt, ghcb, exit_code);
 
+	sev_es_put_ghcb(&state);
+
 	/* Done - now check the result */
 	switch (result) {
 	case ES_OK:
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread
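
The save/restore scheme described in the patch above can be sketched in
plain user-space C as follows. This is a rough model under stated
assumptions, not the kernel implementation: a single global page and flag
stand in for one CPU's per-cpu GHCB and ghcb_active, malloc() stands in for
kzalloc(), and all sketch_* names are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared GHCB page. */
struct sketch_ghcb {
	unsigned char data[64];
};

/* Per-invocation state: backup copy if this user nested, else NULL. */
struct sketch_ghcb_state {
	struct sketch_ghcb *backup;
};

static struct sketch_ghcb sketch_page;	/* models one CPU's GHCB */
static bool sketch_active;		/* models that CPU's ghcb_active */

static struct sketch_ghcb *sketch_get_ghcb(struct sketch_ghcb_state *state)
{
	if (sketch_active) {
		/* Nested use: preserve what the outer user has written. */
		state->backup = malloc(sizeof(*state->backup));
		if (!state->backup)
			return NULL;
		*state->backup = sketch_page;
	} else {
		state->backup = NULL;
		sketch_active = true;
	}

	return &sketch_page;
}

static void sketch_put_ghcb(struct sketch_ghcb_state *state)
{
	if (state->backup) {
		/* Nested user is done: hand the outer contents back. */
		sketch_page = *state->backup;
		free(state->backup);
		state->backup = NULL;
	} else {
		sketch_active = false;
	}
}

/* Outer user writes 0xAA, a nested user overwrites it, then both finish. */
static unsigned char sketch_nested_roundtrip(void)
{
	struct sketch_ghcb_state outer, inner;
	struct sketch_ghcb *ghcb, *nested;
	unsigned char seen;

	ghcb = sketch_get_ghcb(&outer);
	assert(ghcb != NULL);
	ghcb->data[0] = 0xAA;

	nested = sketch_get_ghcb(&inner);
	assert(nested != NULL);
	nested->data[0] = 0x55;
	sketch_put_ghcb(&inner);

	seen = ghcb->data[0];	/* value the outer user sees after nesting */
	sketch_put_ghcb(&outer);

	return seen;
}
```

In this model the outer user finds its own byte again once the nested user
has called the put function, and the in-use flag is only cleared by the
outermost put.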

* [PATCH 43/70] x86/sev-es: Wire up existing #VC exit-code handlers
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (41 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 42/70] x86/sev-es: Support nested #VC exceptions Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 44/70] x86/sev-es: Handle instruction fetches from user-space Joerg Roedel
                   ` (26 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Re-use the handlers for CPUID- and IOIO-caused #VC exceptions in the
early boot handler.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es-shared.c | 7 +++----
 arch/x86/kernel/sev-es.c        | 6 ++++++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
index b178a2db61a7..a632b8f041ec 100644
--- a/arch/x86/kernel/sev-es-shared.c
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -312,8 +312,7 @@ static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 *exitinfo)
 	return ES_OK;
 }
 
-static enum es_result __maybe_unused
-vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
 {
 	struct pt_regs *regs = ctxt->regs;
 	u64 exit_info_1, exit_info_2;
@@ -409,8 +408,8 @@ vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
 	return ret;
 }
 
-static enum es_result __maybe_unused vc_handle_cpuid(struct ghcb *ghcb,
-						     struct es_em_ctxt *ctxt)
+static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
+				      struct es_em_ctxt *ctxt)
 {
 	struct pt_regs *regs = ctxt->regs;
 	u32 cr4 = native_read_cr4();
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 3b7bbc8d841e..226ab0c57a09 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -289,6 +289,12 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	enum es_result result;
 
 	switch (exit_code) {
+	case SVM_EXIT_CPUID:
+		result = vc_handle_cpuid(ghcb, ctxt);
+		break;
+	case SVM_EXIT_IOIO:
+		result = vc_handle_ioio(ghcb, ctxt);
+		break;
 	default:
 		/*
 		 * Unexpected #VC exception
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 44/70] x86/sev-es: Handle instruction fetches from user-space
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (42 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 43/70] x86/sev-es: Wire up existing #VC exit-code handlers Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 45/70] x86/sev-es: Harden runtime #VC handler for exceptions " Joerg Roedel
                   ` (25 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

When a #VC exception is triggered by user-space, the instruction decoder
needs to read the instruction bytes from user addresses. Enhance
vc_decode_insn() to safely fetch kernel and user instructions.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 226ab0c57a09..79a71d14a1fc 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -114,17 +114,30 @@ static enum es_result vc_decode_insn(struct es_em_ctxt *ctxt)
 	enum es_result ret;
 	int res;
 
-	res = vc_fetch_insn_kernel(ctxt, buffer);
-	if (unlikely(res == -EFAULT)) {
-		ctxt->fi.vector     = X86_TRAP_PF;
-		ctxt->fi.error_code = 0;
-		ctxt->fi.cr2        = ctxt->regs->ip;
-		return ES_EXCEPTION;
+	if (!user_mode(ctxt->regs)) {
+		res = vc_fetch_insn_kernel(ctxt, buffer);
+		if (unlikely(res == -EFAULT)) {
+			ctxt->fi.vector     = X86_TRAP_PF;
+			ctxt->fi.error_code = 0;
+			ctxt->fi.cr2        = ctxt->regs->ip;
+			return ES_EXCEPTION;
+		}
+
+		insn_init(&ctxt->insn, buffer, MAX_INSN_SIZE - res, 1);
+		insn_get_length(&ctxt->insn);
+	} else {
+		res = insn_fetch_from_user(ctxt->regs, buffer);
+		if (res == 0) {
+			ctxt->fi.vector     = X86_TRAP_PF;
+			ctxt->fi.cr2        = ctxt->regs->ip;
+			ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER;
+			return ES_EXCEPTION;
+		}
+
+		if (!insn_decode(ctxt->regs, &ctxt->insn, buffer, res))
+			return ES_DECODE_FAILED;
 	}
 
-	insn_init(&ctxt->insn, buffer, MAX_INSN_SIZE - res, 1);
-	insn_get_length(&ctxt->insn);
-
 	ret = ctxt->insn.immediate.got ? ES_OK : ES_DECODE_FAILED;
 
 	return ret;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 45/70] x86/sev-es: Harden runtime #VC handler for exceptions from user-space
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (43 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 44/70] x86/sev-es: Handle instruction fetches from user-space Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 46/70] x86/sev-es: Filter exceptions not supported " Joerg Roedel
                   ` (24 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Send SIGBUS to the user-space process that caused the #VC exception
instead of killing the machine. Also ratelimit the error messages so
that user-space can't flood the kernel log.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 79a71d14a1fc..71eee7b3667d 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -370,16 +370,16 @@ dotraplinkage void do_vmm_communication(struct pt_regs *regs, unsigned long exit
 		vc_finish_insn(&ctxt);
 		break;
 	case ES_UNSUPPORTED:
-		pr_emerg("Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
-			 exit_code, regs->ip);
+		pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n",
+				   exit_code, regs->ip);
 		goto fail;
 	case ES_VMM_ERROR:
-		pr_emerg("PANIC: Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
-			 exit_code, regs->ip);
+		pr_err_ratelimited("Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n",
+				   exit_code, regs->ip);
 		goto fail;
 	case ES_DECODE_FAILED:
-		pr_emerg("PANIC: Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
-			 exit_code, regs->ip);
+		pr_err_ratelimited("Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n",
+				   exit_code, regs->ip);
 		goto fail;
 	case ES_EXCEPTION:
 		vc_forward_exception(&ctxt);
@@ -394,10 +394,24 @@ dotraplinkage void do_vmm_communication(struct pt_regs *regs, unsigned long exit
 	return;
 
 fail:
-	show_regs(regs);
+	if (user_mode(regs)) {
+		/*
+		 * Do not kill the machine if user-space triggered the
+		 * exception. Send SIGBUS instead and let user-space deal with
+		 * it.
+		 */
+		force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0);
+	} else {
+		/* Show some debug info */
+		show_regs(regs);
 
-	while (true)
-		halt();
+		/* Ask the hypervisor to terminate the guest */
+		sev_es_terminate(GHCB_SEV_ES_REASON_GENERAL_REQUEST);
+
+		/* If that fails and we get here - just halt the machine */
+		while (true)
+			halt();
+	}
 }
 
 bool __init boot_vc_exception(struct pt_regs *regs)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 46/70] x86/sev-es: Filter exceptions not supported from user-space
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (44 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 45/70] x86/sev-es: Harden runtime #VC handler for exceptions " Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 47/70] x86/sev-es: Handle MMIO events Joerg Roedel
                   ` (23 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Currently the only #VC exceptions supported from user-space are those
caused by CPUID and by intercepted exceptions. Filter the others out
early.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 71eee7b3667d..6d45a2499460 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -318,6 +318,26 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	return result;
 }
 
+static enum es_result vc_context_filter(struct pt_regs *regs, long exit_code)
+{
+	enum es_result r = ES_OK;
+
+	if (user_mode(regs)) {
+		switch (exit_code) {
+		/* List of #VC exit-codes we support in user-space */
+		case SVM_EXIT_EXCP_BASE ... SVM_EXIT_LAST_EXCP:
+		case SVM_EXIT_CPUID:
+			r = ES_OK;
+			break;
+		default:
+			r = ES_UNSUPPORTED;
+			break;
+		}
+	}
+
+	return r;
+}
+
 static void vc_forward_exception(struct es_em_ctxt *ctxt)
 {
 	long error_code = ctxt->fi.error_code;
@@ -359,6 +379,10 @@ dotraplinkage void do_vmm_communication(struct pt_regs *regs, unsigned long exit
 		result = vc_init_em_ctxt(&ctxt, regs, exit_code);
 	}
 
+	/* Check if the exception is supported in the context we came from. */
+	if (result == ES_OK)
+		result = vc_context_filter(regs, exit_code);
+
 	if (result == ES_OK)
 		result = vc_handle_exitcode(&ctxt, ghcb, exit_code);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread
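
As an illustration of the filter in the patch above, here is a small
stand-alone C sketch of the same decision. It is a model, not the kernel
function: the sketch_* names are invented here, user_mode(regs) is replaced
by a plain bool argument, and the exit-code constants are copied from
memory of arch/x86/include/uapi/asm/svm.h, so treat their values as
assumptions and check them against the header. Explicit comparisons are
used instead of the GNU case-range extension so that the sketch builds with
any C compiler.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the SVM_EXIT_* values in uapi/asm/svm.h. */
#define SKETCH_EXIT_EXCP_BASE	0x040
#define SKETCH_EXIT_LAST_EXCP	0x05f
#define SKETCH_EXIT_CPUID	0x072
#define SKETCH_EXIT_IOIO	0x07b

enum sketch_result {
	SKETCH_OK,
	SKETCH_UNSUPPORTED,
};

/*
 * Decide whether an exit-code may be handled in the context it came from.
 * Kernel context is never filtered; user context is limited to CPUID and
 * the intercepted-exception range.
 */
static enum sketch_result sketch_context_filter(bool from_user,
						unsigned long exit_code)
{
	if (!from_user)
		return SKETCH_OK;

	if (exit_code >= SKETCH_EXIT_EXCP_BASE &&
	    exit_code <= SKETCH_EXIT_LAST_EXCP)
		return SKETCH_OK;

	if (exit_code == SKETCH_EXIT_CPUID)
		return SKETCH_OK;

	return SKETCH_UNSUPPORTED;
}
```

With these assumptions, an IOIO exit-code is accepted when it comes from
kernel context and rejected when it comes from user context, while CPUID is
accepted in both.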

* [PATCH 47/70] x86/sev-es: Handle MMIO events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (45 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 46/70] x86/sev-es: Filter exceptions not supported " Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 48/70] x86/sev-es: Handle MMIO String Instructions Joerg Roedel
                   ` (22 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Add a handler for #VC exceptions caused by MMIO intercepts. These
intercepts arrive as nested page faults on pages with reserved
bits set.

TODO:
	- Add return values of helper functions
	- Check permissions on page-table walks
	- Check data segments

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to VC handling framework ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/boot/compressed/sev-es.c |   5 +
 arch/x86/include/uapi/asm/svm.h   |   5 +
 arch/x86/kernel/sev-es.c          | 188 ++++++++++++++++++++++++++++++
 3 files changed, 198 insertions(+)

diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
index 40eaf24db641..53c65fc09341 100644
--- a/arch/x86/boot/compressed/sev-es.c
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -100,6 +100,11 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 	return ES_OK;
 }
 
+static phys_addr_t vc_slow_virt_to_phys(struct ghcb *ghcb, long vaddr)
+{
+	return (phys_addr_t)vaddr;
+}
+
 #undef __init
 #undef __pa
 #define __init
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index c68d1618c9b0..8f36ae021a7f 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -81,6 +81,11 @@
 #define SVM_EXIT_AVIC_INCOMPLETE_IPI		0x401
 #define SVM_EXIT_AVIC_UNACCELERATED_ACCESS	0x402
 
+/* SEV-ES software-defined VMGEXIT events */
+#define SVM_VMGEXIT_MMIO_READ			0x80000001
+#define SVM_VMGEXIT_MMIO_WRITE			0x80000002
+#define SVM_VMGEXIT_UNSUPPORTED_EVENT		0x8000ffff
+
 #define SVM_EXIT_ERR           -1
 
 #define SVM_EXIT_REASONS \
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 6d45a2499460..6b0f82f81229 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -240,6 +240,25 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 	return ES_EXCEPTION;
 }
 
+static phys_addr_t vc_slow_virt_to_phys(struct ghcb *ghcb, long vaddr)
+{
+	unsigned long va = (unsigned long)vaddr;
+	unsigned int level;
+	phys_addr_t pa;
+	pgd_t *pgd;
+	pte_t *pte;
+
+	pgd = pgd_offset(current->active_mm, va);
+	pte = lookup_address_in_pgd(pgd, va, &level);
+	if (!pte)
+		return 0;
+
+	pa = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+	pa |= va & ~page_level_mask(level);
+
+	return pa;
+}
+
 /* Include code shared with pre-decompression boot stage */
 #include "sev-es-shared.c"
 
@@ -295,6 +314,172 @@ static void __init vc_early_vc_forward_exception(struct es_em_ctxt *ctxt)
 	early_exception(ctxt->regs, trapnr);
 }
 
+static long *vc_insn_get_reg(struct es_em_ctxt *ctxt)
+{
+	long *reg_array;
+	int offset;
+
+	reg_array = (long *)ctxt->regs;
+	offset    = insn_get_modrm_reg_off(&ctxt->insn, ctxt->regs);
+
+	if (offset < 0)
+		return NULL;
+
+	offset /= sizeof(long);
+
+	return reg_array + offset;
+}
+
+static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
+				 unsigned int bytes, bool read)
+{
+	u64 exit_code, exit_info_1, exit_info_2;
+	unsigned long ghcb_pa = __pa(ghcb);
+	void __user *ref;
+
+	ref = insn_get_addr_ref(&ctxt->insn, ctxt->regs);
+	if (ref == (void __user *)-1L)
+		return ES_UNSUPPORTED;
+
+	exit_code = read ? SVM_VMGEXIT_MMIO_READ : SVM_VMGEXIT_MMIO_WRITE;
+
+	exit_info_1 = vc_slow_virt_to_phys(ghcb, (long)ref);
+	exit_info_2 = bytes;    /* Can never be greater than 8 */
+
+	ghcb->save.sw_scratch = ghcb_pa + offsetof(struct ghcb, shared_buffer);
+
+	return sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, exit_info_1, exit_info_2);
+}
+
+static enum es_result vc_handle_mmio_twobyte_ops(struct ghcb *ghcb,
+						 struct es_em_ctxt *ctxt)
+{
+	struct insn *insn = &ctxt->insn;
+	unsigned int bytes = 0;
+	enum es_result ret;
+	int sign_byte;
+	long *reg_data;
+
+	switch (insn->opcode.bytes[1]) {
+		/* MMIO Read w/ zero-extension */
+	case 0xb6:
+		bytes = 1;
+		/* Fallthrough */
+	case 0xb7:
+		if (!bytes)
+			bytes = 2;
+
+		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+		if (ret)
+			break;
+
+		/* Zero extend based on operand size */
+		reg_data = vc_insn_get_reg(ctxt);
+		memset(reg_data, 0, insn->opnd_bytes);
+
+		memcpy(reg_data, ghcb->shared_buffer, bytes);
+		break;
+
+		/* MMIO Read w/ sign-extension */
+	case 0xbe:
+		bytes = 1;
+		/* Fallthrough */
+	case 0xbf:
+		if (!bytes)
+			bytes = 2;
+
+		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+		if (ret)
+			break;
+
+		/* Sign extend based on operand size */
+		reg_data = vc_insn_get_reg(ctxt);
+		if (bytes == 1) {
+			u8 *val = (u8 *)ghcb->shared_buffer;
+
+			sign_byte = (*val & 0x80) ? 0xff : 0x00;
+		} else {
+			u16 *val = (u16 *)ghcb->shared_buffer;
+
+			sign_byte = (*val & 0x8000) ? 0xff : 0x00;
+		}
+		memset(reg_data, sign_byte, insn->opnd_bytes);
+
+		memcpy(reg_data, ghcb->shared_buffer, bytes);
+		break;
+
+	default:
+		ret = ES_UNSUPPORTED;
+	}
+
+	return ret;
+}
+
+static enum es_result vc_handle_mmio(struct ghcb *ghcb,
+				     struct es_em_ctxt *ctxt)
+{
+	struct insn *insn = &ctxt->insn;
+	unsigned int bytes = 0;
+	enum es_result ret;
+	long *reg_data;
+
+	switch (insn->opcode.bytes[0]) {
+	/* MMIO Write */
+	case 0x88:
+		bytes = 1;
+		/* Fallthrough */
+	case 0x89:
+		if (!bytes)
+			bytes = insn->opnd_bytes;
+
+		reg_data = vc_insn_get_reg(ctxt);
+		memcpy(ghcb->shared_buffer, reg_data, bytes);
+
+		ret = vc_do_mmio(ghcb, ctxt, bytes, false);
+		break;
+
+	case 0xc6:
+		bytes = 1;
+		/* Fallthrough */
+	case 0xc7:
+		if (!bytes)
+			bytes = insn->opnd_bytes;
+
+		memcpy(ghcb->shared_buffer, insn->immediate1.bytes, bytes);
+
+		ret = vc_do_mmio(ghcb, ctxt, bytes, false);
+		break;
+
+		/* MMIO Read */
+	case 0x8a:
+		bytes = 1;
+		/* Fallthrough */
+	case 0x8b:
+		if (!bytes)
+			bytes = insn->opnd_bytes;
+
+		ret = vc_do_mmio(ghcb, ctxt, bytes, true);
+		if (ret)
+			break;
+
+		reg_data = vc_insn_get_reg(ctxt);
+		if (bytes == 4)
+			*reg_data = 0;  /* Zero-extend for 32-bit operation */
+
+		memcpy(reg_data, ghcb->shared_buffer, bytes);
+		break;
+
+		/* Two-Byte Opcodes */
+	case 0x0f:
+		ret = vc_handle_mmio_twobyte_ops(ghcb, ctxt);
+		break;
+	default:
+		ret = ES_UNSUPPORTED;
+	}
+
+	return ret;
+}
+
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 		struct ghcb *ghcb,
 		unsigned long exit_code)
@@ -308,6 +493,9 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_IOIO:
 		result = vc_handle_ioio(ghcb, ctxt);
 		break;
+	case SVM_EXIT_NPF:
+		result = vc_handle_mmio(ghcb, ctxt);
+		break;
 	default:
 		/*
 		 * Unexpected #VC exception
-- 
2.17.1



* [PATCH 48/70] x86/sev-es: Handle MMIO String Instructions
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (46 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 47/70] x86/sev-es: Handle MMIO events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 49/70] x86/sev-es: Handle MSR events Joerg Roedel
                   ` (21 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add handling for emulating the MOVS instruction on MMIO regions, as
used by the memcpy_toio() and memcpy_fromio() functions.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
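The per-iteration bookkeeping in vc_handle_mmio_movs() (bounce-buffer
copy, SI/DI stepping according to the direction flag, REP count
handling) can be sketched on its own. This is a simplified model, not
the kernel code: struct movs_regs, enum step_result and movs_step() are
hypothetical names, and a flat byte array stands in for guest memory, so
no nested #VC is involved here.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for pt_regs and enum es_result. */
struct movs_regs {
	uint64_t si, di, cx;
	bool df;	/* direction flag */
};

enum step_result { STEP_DONE, STEP_RETRY };

/*
 * Emulate one MOVS iteration on a flat byte array standing in for
 * memory. The move is split into a read into a bounce buffer and a
 * separate write, mirroring the split used by vc_handle_mmio_movs().
 * bytes must not exceed 8.
 */
static enum step_result movs_step(uint8_t *mem, struct movs_regs *r,
				  unsigned int bytes, bool rep)
{
	uint8_t buffer[8];
	int64_t off = r->df ? -(int64_t)bytes : (int64_t)bytes;

	memcpy(buffer, mem + r->si, bytes);	/* read half */
	memcpy(mem + r->di, buffer, bytes);	/* write half */

	r->si += off;
	r->di += off;

	if (rep)
		r->cx -= 1;

	/* Without REP, or once the count reaches zero, we are done. */
	return (!rep || r->cx == 0) ? STEP_DONE : STEP_RETRY;
}
```

STEP_RETRY corresponds to re-executing the instruction for the next
iteration, which is what ES_RETRY requests in the handler.
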
 arch/x86/kernel/sev-es.c | 78 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 6b0f82f81229..a040959e512d 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -415,6 +415,74 @@ static enum es_result vc_handle_mmio_twobyte_ops(struct ghcb *ghcb,
 	return ret;
 }
 
+/*
+ * The MOVS instruction has two memory operands, which raises the
+ * problem that it is not known whether the access to the source or the
+ * destination caused the #VC exception (and hence whether an MMIO read
+ * or write operation needs to be emulated).
+ *
+ * Instead of playing games with walking page-tables and trying to guess
+ * whether the source or destination is an MMIO range, this code splits
+ * the move into two operations, a read and a write with only one
+ * memory operand. This will cause a nested #VC exception on the MMIO
+ * address which can then be handled.
+ *
+ * This implementation has the benefit that it also supports MOVS where
+ * source _and_ destination are MMIO regions.
+ *
+ * It will slow MOVS on MMIO down a lot, but in SEV-ES guests it is a
+ * rare operation. If it turns out to be a performance problem the split
+ * operations can be moved to memcpy_fromio() and memcpy_toio().
+ */
+static enum es_result vc_handle_mmio_movs(struct es_em_ctxt *ctxt,
+					  unsigned int bytes)
+{
+	unsigned long ds_base, es_base;
+	unsigned char *src, *dst;
+	unsigned char buffer[8];
+	enum es_result ret;
+	bool rep;
+	int off;
+
+	ds_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_DS);
+	es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
+
+	if (ds_base == -1L || es_base == -1L) {
+		ctxt->fi.vector = X86_TRAP_GP;
+		ctxt->fi.error_code = 0;
+		return ES_EXCEPTION;
+	}
+
+	src = ds_base + (unsigned char *)ctxt->regs->si;
+	dst = es_base + (unsigned char *)ctxt->regs->di;
+
+	ret = vc_read_mem(ctxt, src, buffer, bytes);
+	if (ret != ES_OK)
+		return ret;
+
+	ret = vc_write_mem(ctxt, dst, buffer, bytes);
+	if (ret != ES_OK)
+		return ret;
+
+	if (ctxt->regs->flags & X86_EFLAGS_DF)
+		off = -bytes;
+	else
+		off =  bytes;
+
+	ctxt->regs->si += off;
+	ctxt->regs->di += off;
+
+	rep = insn_rep_prefix(&ctxt->insn);
+
+	if (rep)
+		ctxt->regs->cx -= 1;
+
+	if (!rep || ctxt->regs->cx == 0)
+		return ES_OK;
+	else
+		return ES_RETRY;
+}
+
 static enum es_result vc_handle_mmio(struct ghcb *ghcb,
 				     struct es_em_ctxt *ctxt)
 {
@@ -469,6 +537,16 @@ static enum es_result vc_handle_mmio(struct ghcb *ghcb,
 		memcpy(reg_data, ghcb->shared_buffer, bytes);
 		break;
 
+		/* MOVS instruction */
+	case 0xa4:
+		bytes = 1;
+		/* Fallthrough */
+	case 0xa5:
+		if (!bytes)
+			bytes = insn->opnd_bytes;
+
+		ret = vc_handle_mmio_movs(ctxt, bytes);
+		break;
 		/* Two-Byte Opcodes */
 	case 0x0f:
 		ret = vc_handle_mmio_twobyte_ops(ghcb, ctxt);
-- 
2.17.1



* [PATCH 49/70] x86/sev-es: Handle MSR events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (47 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 48/70] x86/sev-es: Handle MMIO String Instructions Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 50/70] x86/sev-es: Handle DR7 read/write events Joerg Roedel
                   ` (20 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Implement a handler for #VC exceptions caused by RDMSR/WRMSR
instructions.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling infrastructure ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
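For reference, the register convention the handler relies on can be
sketched as follows: WRMSR is encoded as 0F 30 and RDMSR as 0F 32, and
the 64-bit MSR value is carried in EDX:EAX. The helper names below are
hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Second opcode byte: WRMSR is 0F 30, RDMSR is 0F 32. */
static bool msr_insn_is_write(uint8_t opcode_byte1)
{
	return opcode_byte1 == 0x30;
}

/* Combine EDX:EAX into the 64-bit MSR value (upper halves ignored). */
static uint64_t msr_value_from_regs(uint64_t rax, uint64_t rdx)
{
	return ((rdx & 0xffffffffULL) << 32) | (rax & 0xffffffffULL);
}

/* Split a 64-bit MSR value back into the EAX and EDX parts. */
static void msr_value_to_regs(uint64_t val, uint64_t *rax, uint64_t *rdx)
{
	*rax = val & 0xffffffffULL;
	*rdx = val >> 32;
}
```
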
 arch/x86/kernel/sev-es.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index a040959e512d..163b8a7f98a4 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -262,6 +262,35 @@ static phys_addr_t vc_slow_virt_to_phys(struct ghcb *ghcb, long vaddr)
 /* Include code shared with pre-decompression boot stage */
 #include "sev-es-shared.c"
 
+static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+	struct pt_regs *regs = ctxt->regs;
+	enum es_result ret;
+	bool write;
+	u64 exit_info_1;
+
+	write = (ctxt->insn.opcode.bytes[1] == 0x30);
+
+	ghcb_set_rcx(ghcb, regs->cx);
+	if (write) {
+		ghcb_set_rax(ghcb, regs->ax);
+		ghcb_set_rdx(ghcb, regs->dx);
+		exit_info_1 = 1;
+	} else {
+		exit_info_1 = 0;
+	}
+
+	ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MSR, exit_info_1, 0);
+	if (ret != ES_OK)
+		return ret;
+	else if (!write) {
+		regs->ax = ghcb->save.rax;
+		regs->dx = ghcb->save.rdx;
+	}
+
+	return ret;
+}
+
 /*
  * This function runs on the first #VC exception after the kernel
  * switched to virtual addresses.
@@ -571,6 +600,9 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_IOIO:
 		result = vc_handle_ioio(ghcb, ctxt);
 		break;
+	case SVM_EXIT_MSR:
+		result = vc_handle_msr(ghcb, ctxt);
+		break;
 	case SVM_EXIT_NPF:
 		result = vc_handle_mmio(ghcb, ctxt);
 		break;
-- 
2.17.1



* [PATCH 50/70] x86/sev-es: Handle DR7 read/write events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (48 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 49/70] x86/sev-es: Handle MSR events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 51/70] x86/sev-es: Handle WBINVD Events Joerg Roedel
                   ` (19 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Add code to handle #VC exceptions on DR7 register reads and writes.
This is needed early because show_regs() reads DR7 to print it out.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: - Adapt to #VC handling framework
                   - Support early usage ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
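The value checks applied to a DR7 write can be looked at separately
from the GHCB call. The following is a minimal sketch under stated
assumptions: dr7_sanitize() and dr7_early_write_ok() are hypothetical
helpers that only restate the checks in vc_handle_dr7_write(), with a
false return from dr7_sanitize() standing for the #GP injection.

```c
#include <stdbool.h>
#include <stdint.h>

#define DR7_RESET_VALUE 0x400ULL

/*
 * Hypothetical helper: returns false where the handler would raise #GP
 * (upper 32 bits set). Otherwise clears bits 10-12 and 14-15 and then
 * sets bit 10 again, which is the always-one bit of DR7.
 */
static bool dr7_sanitize(uint64_t val, uint64_t *out)
{
	if (val >> 32)
		return false;

	*out = (val & 0xffff23ffULL) | (1ULL << 10);
	return true;
}

/* Early in boot only the reset value (after sanitizing) is accepted. */
static bool dr7_early_write_ok(uint64_t sanitized)
{
	return (sanitized & ~DR7_RESET_VALUE) == 0;
}
```
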
 arch/x86/kernel/sev-es.c | 87 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 83 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 163b8a7f98a4..7a9cdc660637 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -23,6 +23,8 @@
 #include <asm/traps.h>
 #include <asm/svm.h>
 
+#define DR7_RESET_VALUE        0x400
+
 /* For early boot hypervisor communication in SEV-ES enabled guests */
 struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
 
@@ -31,6 +33,7 @@ struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
  * cleared
  */
 struct ghcb __initdata *boot_ghcb;
+static DEFINE_PER_CPU(unsigned long, cached_dr7) = DR7_RESET_VALUE;
 
 struct ghcb_state {
 	struct ghcb *ghcb;
@@ -359,6 +362,21 @@ static long *vc_insn_get_reg(struct es_em_ctxt *ctxt)
 	return reg_array + offset;
 }
 
+static long *vc_insn_get_rm(struct es_em_ctxt *ctxt)
+{
+	long *reg_array;
+	int offset;
+
+	reg_array = (long *)ctxt->regs;
+	offset    = insn_get_modrm_rm_off(&ctxt->insn, ctxt->regs);
+
+	if (offset < 0)
+		return NULL;
+
+	offset /= sizeof(long);
+
+	return reg_array + offset;
+}
 static enum es_result vc_do_mmio(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
 				 unsigned int bytes, bool read)
 {
@@ -587,13 +605,74 @@ static enum es_result vc_handle_mmio(struct ghcb *ghcb,
 	return ret;
 }
 
+static enum es_result vc_handle_dr7_write(struct ghcb *ghcb,
+					  struct es_em_ctxt *ctxt,
+					  bool early)
+{
+	long val, *reg = vc_insn_get_rm(ctxt);
+	enum es_result ret;
+
+	if (!reg)
+		return ES_DECODE_FAILED;
+
+	val = *reg;
+
+	/* Upper 32 bits must be written as zeroes */
+	if (val >> 32) {
+		ctxt->fi.vector = X86_TRAP_GP;
+		ctxt->fi.error_code = 0;
+		return ES_EXCEPTION;
+	}
+
+	/* Clear out other reserved bits and set bit 10 */
+	val = (val & 0xffff23ffL) | BIT(10);
+
+	/* Early non-zero writes to DR7 are not supported */
+	if (early && (val & ~DR7_RESET_VALUE))
+		return ES_UNSUPPORTED;
+
+	/* Using a value of 0 for ExitInfo1 means RAX holds the value */
+	ghcb_set_rax(ghcb, val);
+	ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WRITE_DR7, 0, 0);
+	if (ret != ES_OK)
+		return ret;
+
+	this_cpu_write(cached_dr7, val);
+
+	return ES_OK;
+}
+
+static enum es_result vc_handle_dr7_read(struct ghcb *ghcb,
+					 struct es_em_ctxt *ctxt,
+					 bool early)
+{
+	long *reg = vc_insn_get_rm(ctxt);
+
+	if (!reg)
+		return ES_DECODE_FAILED;
+
+	if (early)
+		*reg = DR7_RESET_VALUE;
+	else
+		*reg = this_cpu_read(cached_dr7);
+
+	return ES_OK;
+}
+
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
-		struct ghcb *ghcb,
-		unsigned long exit_code)
+					 struct ghcb *ghcb,
+					 unsigned long exit_code,
+					 bool early)
 {
 	enum es_result result;
 
 	switch (exit_code) {
+	case SVM_EXIT_READ_DR7:
+		result = vc_handle_dr7_read(ghcb, ctxt, early);
+		break;
+	case SVM_EXIT_WRITE_DR7:
+		result = vc_handle_dr7_write(ghcb, ctxt, early);
+		break;
 	case SVM_EXIT_CPUID:
 		result = vc_handle_cpuid(ghcb, ctxt);
 		break;
@@ -682,7 +761,7 @@ dotraplinkage void do_vmm_communication(struct pt_regs *regs, unsigned long exit
 		result = vc_context_filter(regs, exit_code);
 
 	if (result == ES_OK)
-		result = vc_handle_exitcode(&ctxt, ghcb, exit_code);
+		result = vc_handle_exitcode(&ctxt, ghcb, exit_code, false);
 
 	sev_es_put_ghcb(&state);
 
@@ -750,7 +829,7 @@ bool __init boot_vc_exception(struct pt_regs *regs)
 	result = vc_init_em_ctxt(&ctxt, regs, exit_code);
 
 	if (result == ES_OK)
-		result = vc_handle_exitcode(&ctxt, boot_ghcb, exit_code);
+		result = vc_handle_exitcode(&ctxt, boot_ghcb, exit_code, true);
 
 	/* Done - now check the result */
 	switch (result) {
-- 
2.17.1



* [PATCH 51/70] x86/sev-es: Handle WBINVD Events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (49 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 50/70] x86/sev-es: Handle DR7 read/write events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 52/70] x86/sev-es: Handle RDTSC Events Joerg Roedel
                   ` (18 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Implement a handler for #VC exceptions caused by WBINVD instructions.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling framework ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 7a9cdc660637..5b024ee54307 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -659,6 +659,12 @@ static enum es_result vc_handle_dr7_read(struct ghcb *ghcb,
 	return ES_OK;
 }
 
+static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
+				       struct es_em_ctxt *ctxt)
+{
+	return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
+}
+
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 					 struct ghcb *ghcb,
 					 unsigned long exit_code,
@@ -682,6 +688,9 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_MSR:
 		result = vc_handle_msr(ghcb, ctxt);
 		break;
+	case SVM_EXIT_WBINVD:
+		result = vc_handle_wbinvd(ghcb, ctxt);
+		break;
 	case SVM_EXIT_NPF:
 		result = vc_handle_mmio(ghcb, ctxt);
 		break;
-- 
2.17.1



* [PATCH 52/70] x86/sev-es: Handle RDTSC Events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (50 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 51/70] x86/sev-es: Handle WBINVD Events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 53/70] x86/sev-es: Handle RDPMC Events Joerg Roedel
                   ` (17 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Implement a handler for #VC exceptions caused by RDTSC instructions.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling infrastructure ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 5b024ee54307..afbe574126f3 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -665,6 +665,23 @@ static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
 	return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
 }
 
+static enum es_result vc_handle_rdtsc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+	enum es_result ret;
+
+	ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_RDTSC, 0, 0);
+	if (ret != ES_OK)
+		return ret;
+
+	if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb)))
+		return ES_VMM_ERROR;
+
+	ctxt->regs->ax = ghcb->save.rax;
+	ctxt->regs->dx = ghcb->save.rdx;
+
+	return ES_OK;
+}
+
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 					 struct ghcb *ghcb,
 					 unsigned long exit_code,
@@ -679,6 +696,9 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_WRITE_DR7:
 		result = vc_handle_dr7_write(ghcb, ctxt, early);
 		break;
+	case SVM_EXIT_RDTSC:
+		result = vc_handle_rdtsc(ghcb, ctxt);
+		break;
 	case SVM_EXIT_CPUID:
 		result = vc_handle_cpuid(ghcb, ctxt);
 		break;
-- 
2.17.1



* [PATCH 53/70] x86/sev-es: Handle RDPMC Events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (51 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 52/70] x86/sev-es: Handle RDTSC Events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 54/70] x86/sev-es: Handle INVD Events Joerg Roedel
                   ` (16 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Implement a handler for #VC exceptions caused by RDPMC instructions.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling infrastructure ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index afbe574126f3..ec11088497a4 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -682,6 +682,25 @@ static enum es_result vc_handle_rdtsc(struct ghcb *ghcb, struct es_em_ctxt *ctxt
 	return ES_OK;
 }
 
+static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+{
+	enum es_result ret;
+
+	ghcb_set_rcx(ghcb, ctxt->regs->cx);
+
+	ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_RDPMC, 0, 0);
+	if (ret != ES_OK)
+		return ret;
+
+	if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb)))
+		return ES_VMM_ERROR;
+
+	ctxt->regs->ax = ghcb->save.rax;
+	ctxt->regs->dx = ghcb->save.rdx;
+
+	return ES_OK;
+}
+
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 					 struct ghcb *ghcb,
 					 unsigned long exit_code,
@@ -699,6 +718,9 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_RDTSC:
 		result = vc_handle_rdtsc(ghcb, ctxt);
 		break;
+	case SVM_EXIT_RDPMC:
+		result = vc_handle_rdpmc(ghcb, ctxt);
+		break;
 	case SVM_EXIT_CPUID:
 		result = vc_handle_cpuid(ghcb, ctxt);
 		break;
-- 
2.17.1



* [PATCH 54/70] x86/sev-es: Handle INVD Events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (52 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 53/70] x86/sev-es: Handle RDPMC Events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 55/70] x86/sev-es: Handle RDTSCP Events Joerg Roedel
                   ` (15 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Implement a handler for #VC exceptions caused by INVD instructions.
Since Linux should never use INVD, just mark it as unsupported.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling infrastructure ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index ec11088497a4..a4f136b2e149 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -721,6 +721,10 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_RDPMC:
 		result = vc_handle_rdpmc(ghcb, ctxt);
 		break;
+	case SVM_EXIT_INVD:
+		pr_err_ratelimited("#VC exception for INVD??? Seriously???\n");
+		result = ES_UNSUPPORTED;
+		break;
 	case SVM_EXIT_CPUID:
 		result = vc_handle_cpuid(ghcb, ctxt);
 		break;
-- 
2.17.1



* [PATCH 55/70] x86/sev-es: Handle RDTSCP Events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (53 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 54/70] x86/sev-es: Handle INVD Events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-04-24 21:03     ` Mike Stunes
  2020-03-19  9:13 ` [PATCH 56/70] x86/sev-es: Handle MONITOR/MONITORX Events Joerg Roedel
                   ` (14 subsequent siblings)
  69 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Extend the RDTSC handler to also handle RDTSCP events.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
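The validity check added for RDTSCP boils down to one predicate: RAX
and RDX must always be marked valid by the hypervisor, RCX only for
RDTSCP. A small sketch of that predicate, assuming a hypothetical
struct tsc_reply in place of the GHCB and its valid bitmap:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the GHCB fields and their valid bits. */
struct tsc_reply {
	uint64_t rax, rdx, rcx;
	bool rax_valid, rdx_valid, rcx_valid;
};

/* RCX (which carries TSC_AUX) is only required for RDTSCP. */
static bool tsc_reply_ok(const struct tsc_reply *r, bool rdtscp)
{
	return r->rax_valid && r->rdx_valid && (!rdtscp || r->rcx_valid);
}

/* The 64-bit TSC value is returned in EDX:EAX. */
static uint64_t tsc_from_reply(const struct tsc_reply *r)
{
	return ((r->rdx & 0xffffffffULL) << 32) | (r->rax & 0xffffffffULL);
}
```
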
 arch/x86/kernel/sev-es.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index a4f136b2e149..11947b648b43 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -665,19 +665,25 @@ static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
 	return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
 }
 
-static enum es_result vc_handle_rdtsc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
+static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
+				      struct es_em_ctxt *ctxt,
+				      unsigned long exit_code)
 {
+	bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
 	enum es_result ret;
 
-	ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_RDTSC, 0, 0);
+	ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
 	if (ret != ES_OK)
 		return ret;
 
-	if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb)))
+	if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb) &&
+	     (!rdtscp || ghcb_is_valid_rcx(ghcb))))
 		return ES_VMM_ERROR;
 
 	ctxt->regs->ax = ghcb->save.rax;
 	ctxt->regs->dx = ghcb->save.rdx;
+	if (rdtscp)
+		ctxt->regs->cx = ghcb->save.rcx;
 
 	return ES_OK;
 }
@@ -716,7 +722,8 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 		result = vc_handle_dr7_write(ghcb, ctxt, early);
 		break;
 	case SVM_EXIT_RDTSC:
-		result = vc_handle_rdtsc(ghcb, ctxt);
+	case SVM_EXIT_RDTSCP:
+		result = vc_handle_rdtsc(ghcb, ctxt, exit_code);
 		break;
 	case SVM_EXIT_RDPMC:
 		result = vc_handle_rdpmc(ghcb, ctxt);
-- 
2.17.1



* [PATCH 56/70] x86/sev-es: Handle MONITOR/MONITORX Events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (54 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 55/70] x86/sev-es: Handle RDTSCP Events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 57/70] x86/sev-es: Handle MWAIT/MWAITX Events Joerg Roedel
                   ` (13 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Implement a handler for #VC exceptions caused by MONITOR and MONITORX
instructions.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling infrastructure ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
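The monitor address is translated with vc_slow_virt_to_phys(), which
combines the PFN from the leaf page-table entry with the offset of the
virtual address inside the mapping. A minimal sketch of that arithmetic,
assuming x86-64 paging with 4K, 2M and 1G mappings; the names and the
level encoding (1, 2, 3) are assumptions made for this illustration and
no page-table walk is performed:

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12

/* Assumed level encoding: 1 = 4K, 2 = 2M, 3 = 1G mapping. */
static uint64_t model_level_size(unsigned int level)
{
	return 1ULL << (MODEL_PAGE_SHIFT + 9 * (level - 1));
}

/*
 * Physical address = base of the mapped page frame plus the offset of
 * va within a mapping of the given level.
 */
static uint64_t pa_from_pte(uint64_t pfn, uint64_t va, unsigned int level)
{
	uint64_t pa = pfn << MODEL_PAGE_SHIFT;

	pa |= va & (model_level_size(level) - 1);

	return pa;
}
```

For a large mapping the offset mask widens accordingly, which is why the
level returned by the page-table lookup has to be taken into account.
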
 arch/x86/kernel/sev-es.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 11947b648b43..6bd8bc9f2e66 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -707,6 +707,22 @@ static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt
 	return ES_OK;
 }
 
+static enum es_result vc_handle_monitor(struct ghcb *ghcb,
+					struct es_em_ctxt *ctxt)
+{
+	phys_addr_t monitor_pa;
+	pgd_t *pgd;
+
+	pgd = __va(read_cr3_pa());
+	monitor_pa = vc_slow_virt_to_phys(ghcb, ctxt->regs->ax);
+
+	ghcb_set_rax(ghcb, monitor_pa);
+	ghcb_set_rcx(ghcb, ctxt->regs->cx);
+	ghcb_set_rdx(ghcb, ctxt->regs->dx);
+
+	return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MONITOR, 0, 0);
+}
+
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 					 struct ghcb *ghcb,
 					 unsigned long exit_code,
@@ -744,6 +760,9 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_WBINVD:
 		result = vc_handle_wbinvd(ghcb, ctxt);
 		break;
+	case SVM_EXIT_MONITOR:
+		result = vc_handle_monitor(ghcb, ctxt);
+		break;
 	case SVM_EXIT_NPF:
 		result = vc_handle_mmio(ghcb, ctxt);
 		break;
-- 
2.17.1



* [PATCH 57/70] x86/sev-es: Handle MWAIT/MWAITX Events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (55 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 56/70] x86/sev-es: Handle MONITOR/MONITORX Events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 58/70] x86/sev-es: Handle VMMCALL Events Joerg Roedel
                   ` (12 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Implement a handler for #VC exceptions caused by MWAIT and MWAITX
instructions.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling infrastructure ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 6bd8bc9f2e66..0ccbcc02e6f2 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -723,6 +723,15 @@ static enum es_result vc_handle_monitor(struct ghcb *ghcb,
 	return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MONITOR, 0, 0);
 }
 
+static enum es_result vc_handle_mwait(struct ghcb *ghcb,
+				      struct es_em_ctxt *ctxt)
+{
+	ghcb_set_rax(ghcb, ctxt->regs->ax);
+	ghcb_set_rcx(ghcb, ctxt->regs->cx);
+
+	return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MWAIT, 0, 0);
+}
+
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 					 struct ghcb *ghcb,
 					 unsigned long exit_code,
@@ -763,6 +772,9 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_MONITOR:
 		result = vc_handle_monitor(ghcb, ctxt);
 		break;
+	case SVM_EXIT_MWAIT:
+		result = vc_handle_mwait(ghcb, ctxt);
+		break;
 	case SVM_EXIT_NPF:
 		result = vc_handle_mmio(ghcb, ctxt);
 		break;
-- 
2.17.1



* [PATCH 58/70] x86/sev-es: Handle VMMCALL Events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (56 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 57/70] x86/sev-es: Handle MWAIT/MWAITX Events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 59/70] x86/sev-es: Handle #AC Events Joerg Roedel
                   ` (11 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Implement a handler for #VC exceptions caused by VMMCALL instructions.
This patch is only a starting point; VMMCALL emulation under SEV-ES
needs further hypervisor-specific changes to provide additional state.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling infrastructure ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 0ccbcc02e6f2..efb392e6f483 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -732,6 +732,26 @@ static enum es_result vc_handle_mwait(struct ghcb *ghcb,
 	return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MWAIT, 0, 0);
 }
 
+static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
+					struct es_em_ctxt *ctxt)
+{
+	enum es_result ret;
+
+	ghcb_set_rax(ghcb, ctxt->regs->ax);
+	ghcb_set_cpl(ghcb, user_mode(ctxt->regs) ? 3 : 0);
+
+	ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_VMMCALL, 0, 0);
+	if (ret != ES_OK)
+		return ret;
+
+	if (!ghcb_is_valid_rax(ghcb))
+		return ES_VMM_ERROR;
+
+	ctxt->regs->ax = ghcb->save.rax;
+
+	return ES_OK;
+}
+
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 					 struct ghcb *ghcb,
 					 unsigned long exit_code,
@@ -766,6 +786,9 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_MSR:
 		result = vc_handle_msr(ghcb, ctxt);
 		break;
+	case SVM_EXIT_VMMCALL:
+		result = vc_handle_vmmcall(ghcb, ctxt);
+		break;
 	case SVM_EXIT_WBINVD:
 		result = vc_handle_wbinvd(ghcb, ctxt);
 		break;
-- 
2.17.1
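
The handler in the patch above follows a simple round trip: publish the guest registers in the GHCB, exit to the hypervisor, and accept a returned register only if the hypervisor marked it valid. The following is a minimal user-space sketch of that pattern, not kernel code. All `toy_*` names, the per-field `valid[]` array and the two stand-in hypervisor callbacks are assumptions made for illustration; the real `struct ghcb` layout and accessors differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: the real struct ghcb layout and accessors differ. */
enum toy_result { TOY_OK, TOY_VMM_ERROR };
enum toy_field { TOY_RAX, TOY_CPL, TOY_NFIELDS };

struct toy_ghcb {
	uint64_t rax;
	uint64_t cpl;
	bool valid[TOY_NFIELDS];	/* which fields carry meaningful data */
};

/* Store a value and mark the field valid in one step. */
static void toy_set(struct toy_ghcb *g, enum toy_field f, uint64_t v)
{
	if (f == TOY_RAX)
		g->rax = v;
	else
		g->cpl = v;
	g->valid[f] = true;
}

/* Hypothetical hypervisor that answers the call with RAX + 1. */
static void toy_hv_increment(struct toy_ghcb *g)
{
	uint64_t in = g->rax;

	memset(g->valid, 0, sizeof(g->valid));
	toy_set(g, TOY_RAX, in + 1);
}

/* Hypothetical misbehaving hypervisor that returns no RAX at all. */
static void toy_hv_silent(struct toy_ghcb *g)
{
	memset(g->valid, 0, sizeof(g->valid));
}

static enum toy_result toy_handle_vmmcall(struct toy_ghcb *g, uint64_t *ax,
					  bool user_mode,
					  void (*hv)(struct toy_ghcb *))
{
	assert(g && ax && hv);

	memset(g, 0, sizeof(*g));		/* start from an all-invalid GHCB */
	toy_set(g, TOY_RAX, *ax);
	toy_set(g, TOY_CPL, user_mode ? 3 : 0);

	hv(g);					/* stands in for the VMGEXIT */

	if (!g->valid[TOY_RAX])			/* never trust an unmarked field */
		return TOY_VMM_ERROR;

	*ax = g->rax;
	return TOY_OK;
}
```

In this sketch the caller's register is only overwritten after the validity check passes, which mirrors the order of the `ghcb_is_valid_rax()` test and the `ctxt->regs->ax` assignment in the patch.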


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 59/70] x86/sev-es: Handle #AC Events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (57 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 58/70] x86/sev-es: Handle VMMCALL Events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 60/70] x86/sev-es: Handle #DB Events Joerg Roedel
                   ` (10 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Implement a handler for #VC exceptions caused by #AC exceptions. The #AC
exception is just forwarded to do_alignment_check() and not pushed down
to the hypervisor, as requested by the SEV-ES GHCB Standardization
Specification.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index efb392e6f483..f22b361f6b60 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -766,6 +766,10 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_WRITE_DR7:
 		result = vc_handle_dr7_write(ghcb, ctxt, early);
 		break;
+	case SVM_EXIT_EXCP_BASE + X86_TRAP_AC:
+		do_alignment_check(ctxt->regs, 0);
+		result = ES_RETRY;
+		break;
 	case SVM_EXIT_RDTSC:
 	case SVM_EXIT_RDTSCP:
 		result = vc_handle_rdtsc(ghcb, ctxt, exit_code);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 60/70] x86/sev-es: Handle #DB Events
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (58 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 59/70] x86/sev-es: Handle #AC Events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 61/70] x86/paravirt: Allow hypervisor specific VMMCALL handling under SEV-ES Joerg Roedel
                   ` (9 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Handle #VC exceptions caused by #DB exceptions in the guest. Do not
forward them to the hypervisor and handle them with do_debug() instead.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/sev-es.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index f22b361f6b60..bc553aae31d2 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -752,6 +752,15 @@ static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
 	return ES_OK;
 }
 
+static enum es_result vc_handle_db_exception(struct ghcb *ghcb,
+					     struct es_em_ctxt *ctxt)
+{
+	do_debug(ctxt->regs, 0);
+
+	/* Exception event, do not advance RIP */
+	return ES_RETRY;
+}
+
 static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 					 struct ghcb *ghcb,
 					 unsigned long exit_code,
@@ -766,6 +775,9 @@ static enum es_result vc_handle_exitcode(struct es_em_ctxt *ctxt,
 	case SVM_EXIT_WRITE_DR7:
 		result = vc_handle_dr7_write(ghcb, ctxt, early);
 		break;
+	case SVM_EXIT_EXCP_BASE + X86_TRAP_DB:
+		result = vc_handle_db_exception(ghcb, ctxt);
+		break;
 	case SVM_EXIT_EXCP_BASE + X86_TRAP_AC:
 		do_alignment_check(ctxt->regs, 0);
 		result = ES_RETRY;
-- 
2.17.1
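
The #AC and #DB handlers above return ES_RETRY rather than ES_OK, and the code comment says the reason is that RIP must not be advanced for an exception event. The code that consumes the result is not part of this portion of the thread, so the sketch below rests on an assumption: that a successful emulation skips the intercepted instruction while a retry leaves the instruction pointer alone. The `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum toy_es_result { TOY_ES_OK, TOY_ES_RETRY };

struct toy_ctxt {
	uint64_t ip;		/* guest instruction pointer */
	unsigned int insn_len;	/* length of the intercepted instruction */
};

/*
 * Assumed post-handler step: skip the instruction only when it was
 * fully emulated. For a forwarded exception the IP stays put, so the
 * guest resumes wherever the exception handler left it.
 */
static uint64_t toy_finish(struct toy_ctxt *c, enum toy_es_result r)
{
	assert(c);

	if (r == TOY_ES_OK)
		c->ip += c->insn_len;

	return c->ip;
}
```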


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 61/70] x86/paravirt: Allow hypervisor specific VMMCALL handling under SEV-ES
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (59 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 60/70] x86/sev-es: Handle #DB Events Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-19  9:13 ` [PATCH 62/70] x86/kvm: Add KVM " Joerg Roedel
                   ` (8 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add two new paravirt callbacks to provide hypervisor specific processor
state in the GHCB and to copy state from the hypervisor back to the
processor.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/x86_init.h | 16 +++++++++++++++-
 arch/x86/kernel/sev-es.c        | 12 ++++++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 96d9cd208610..c4790ec279cc 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -4,8 +4,10 @@
 
 #include <asm/bootparam.h>
 
+struct ghcb;
 struct mpc_bus;
 struct mpc_cpu;
+struct pt_regs;
 struct mpc_table;
 struct cpuinfo_x86;
 
@@ -238,10 +240,22 @@ struct x86_legacy_features {
 /**
  * struct x86_hyper_runtime - x86 hypervisor specific runtime callbacks
  *
- * @pin_vcpu:		pin current vcpu to specified physical cpu (run rarely)
+ * @pin_vcpu:			pin current vcpu to specified physical
+ *				cpu (run rarely)
+ * @sev_es_hcall_prepare:	Load additional hypervisor-specific
+ *				state into the GHCB when doing a VMMCALL under
+ *				SEV-ES. Called from the #VC exception handler.
+ * @sev_es_hcall_finish:	Copies state from the GHCB back into the
+ *				processor (or pt_regs). Also runs checks on the
+ *				state returned from the hypervisor after a
+ *				VMMCALL under SEV-ES.  Needs to return 'false'
+ *				if the checks fail.  Called from the #VC
+ *				exception handler.
  */
 struct x86_hyper_runtime {
 	void (*pin_vcpu)(int cpu);
+	void (*sev_es_hcall_prepare)(struct ghcb *ghcb, struct pt_regs *regs);
+	bool (*sev_es_hcall_finish)(struct ghcb *ghcb, struct pt_regs *regs);
 };
 
 /**
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index bc553aae31d2..635e7fc90d01 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -740,6 +740,9 @@ static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
 	ghcb_set_rax(ghcb, ctxt->regs->ax);
 	ghcb_set_cpl(ghcb, user_mode(ctxt->regs) ? 3 : 0);
 
+	if (x86_platform.hyper.sev_es_hcall_prepare)
+		x86_platform.hyper.sev_es_hcall_prepare(ghcb, ctxt->regs);
+
 	ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_VMMCALL, 0, 0);
 	if (ret != ES_OK)
 		return ret;
@@ -749,6 +752,15 @@ static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
 
 	ctxt->regs->ax = ghcb->save.rax;
 
+	/*
+	 * Call sev_es_hcall_finish() after regs->ax is already set.
+	 * This allows the hypervisor handler to overwrite it again if
+	 * necessary.
+	 */
+	if (x86_platform.hyper.sev_es_hcall_finish &&
+	    !x86_platform.hyper.sev_es_hcall_finish(ghcb, ctxt->regs))
+		return ES_VMM_ERROR;
+
 	return ES_OK;
 }
 
-- 
2.17.1
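
The two callbacks added above are optional: a hypervisor that needs no extra state leaves them NULL, and a failing finish hook turns the whole call into an error. A minimal user-space sketch of that dispatch order follows. The `toy_*` types, the single `rbx_valid` flag and the line that stands in for the hypervisor are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct toy_regs { uint64_t ax, bx; };
struct toy_ghcb { uint64_t rax, rbx; bool rbx_valid; };

/* Both hooks may be NULL, like the x86_hyper_runtime members. */
struct toy_hyper_runtime {
	void (*hcall_prepare)(struct toy_ghcb *g, struct toy_regs *r);
	bool (*hcall_finish)(struct toy_ghcb *g, struct toy_regs *r);
};

/* Example hook: expose one extra register to the hypervisor. */
static void toy_prepare_rbx(struct toy_ghcb *g, struct toy_regs *r)
{
	g->rbx = r->bx;
	g->rbx_valid = true;
}

/* Example hook: copy the register back, but only if it is valid. */
static bool toy_finish_rbx(struct toy_ghcb *g, struct toy_regs *r)
{
	if (!g->rbx_valid)
		return false;
	r->bx = g->rbx;
	return true;
}

static bool toy_vmmcall(const struct toy_hyper_runtime *hyper,
			struct toy_ghcb *g, struct toy_regs *r)
{
	assert(hyper && g && r);

	memset(g, 0, sizeof(*g));
	g->rax = r->ax;			/* common state first */
	if (hyper->hcall_prepare)
		hyper->hcall_prepare(g, r);

	g->rax = 0;			/* stand-in for the hypervisor's reply */

	r->ax = g->rax;			/* set ax before the hook runs */
	if (hyper->hcall_finish && !hyper->hcall_finish(g, r))
		return false;

	return true;
}
```

Writing `ax` before calling the finish hook matches the comment in the patch: the hypervisor-specific code gets the last word and may overwrite the value.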


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 62/70] x86/kvm: Add KVM specific VMMCALL handling under SEV-ES
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (60 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 61/70] x86/paravirt: Allow hypervisor specific VMMCALL handling under SEV-ES Joerg Roedel
@ 2020-03-19  9:13 ` Joerg Roedel
  2020-03-20 21:23   ` David Rientjes
  2020-03-19  9:14 ` [PATCH 63/70] x86/vmware: Add VMware specific handling for VMMCALL " Joerg Roedel
                   ` (7 subsequent siblings)
  69 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:13 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Implement the callbacks to copy the processor state required by KVM to
the GHCB.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: - Split out of a larger patch
                   - Adapt to different callback functions ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/kvm.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 6efe0410fb72..0e3fc798d719 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -34,6 +34,8 @@
 #include <asm/hypervisor.h>
 #include <asm/tlb.h>
 #include <asm/cpuidle_haltpoll.h>
+#include <asm/ptrace.h>
+#include <asm/svm.h>
 
 static int kvmapf = 1;
 
@@ -729,13 +731,34 @@ static void __init kvm_init_platform(void)
 	x86_platform.apic_post_init = kvm_apic_init;
 }
 
+#if defined(CONFIG_AMD_MEM_ENCRYPT)
+static void kvm_sev_es_hcall_prepare(struct ghcb *ghcb, struct pt_regs *regs)
+{
+	/* RAX and CPL are already in the GHCB */
+	ghcb_set_rbx(ghcb, regs->bx);
+	ghcb_set_rcx(ghcb, regs->cx);
+	ghcb_set_rdx(ghcb, regs->dx);
+	ghcb_set_rsi(ghcb, regs->si);
+}
+
+static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
+{
+	/* No checking of the return state needed */
+	return true;
+}
+#endif
+
 const __initconst struct hypervisor_x86 x86_hyper_kvm = {
-	.name			= "KVM",
-	.detect			= kvm_detect,
-	.type			= X86_HYPER_KVM,
-	.init.guest_late_init	= kvm_guest_init,
-	.init.x2apic_available	= kvm_para_available,
-	.init.init_platform	= kvm_init_platform,
+	.name				= "KVM",
+	.detect				= kvm_detect,
+	.type				= X86_HYPER_KVM,
+	.init.guest_late_init		= kvm_guest_init,
+	.init.x2apic_available		= kvm_para_available,
+	.init.init_platform		= kvm_init_platform,
+#if defined(CONFIG_AMD_MEM_ENCRYPT)
+	.runtime.sev_es_hcall_prepare	= kvm_sev_es_hcall_prepare,
+	.runtime.sev_es_hcall_finish	= kvm_sev_es_hcall_finish,
+#endif
 };
 
 static __init int activate_jump_labels(void)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 63/70] x86/vmware: Add VMware specific handling for VMMCALL under SEV-ES
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (61 preceding siblings ...)
  2020-03-19  9:13 ` [PATCH 62/70] x86/kvm: Add KVM " Joerg Roedel
@ 2020-03-19  9:14 ` Joerg Roedel
  2020-03-19 10:18     ` Thomas Hellstrom
  2020-03-19  9:14 ` [PATCH 64/70] x86/realmode: Add SEV-ES specific trampoline entry point Joerg Roedel
                   ` (6 subsequent siblings)
  69 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:14 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel, Doug Covelli

From: Doug Covelli <dcovelli@vmware.com>

This change adds VMware specific handling for #VC faults caused by
VMMCALL instructions.

Signed-off-by: Doug Covelli <dcovelli@vmware.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: - Adapt to different paravirt interface ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/cpu/vmware.c | 50 ++++++++++++++++++++++++++++++++----
 1 file changed, 45 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 46d732696c1c..d8bc9106c4e8 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -31,6 +31,7 @@
 #include <asm/timer.h>
 #include <asm/apic.h>
 #include <asm/vmware.h>
+#include <asm/svm.h>
 
 #undef pr_fmt
 #define pr_fmt(fmt)	"vmware: " fmt
@@ -263,10 +264,49 @@ static bool __init vmware_legacy_x2apic_available(void)
 	       (eax & (1 << VMWARE_CMD_LEGACY_X2APIC)) != 0;
 }
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+static void vmware_sev_es_hcall_prepare(struct ghcb *ghcb,
+					struct pt_regs *regs)
+{
+	/* Copy VMWARE specific Hypercall parameters to the GHCB */
+	ghcb_set_rip(ghcb, regs->ip);
+	ghcb_set_rbx(ghcb, regs->bx);
+	ghcb_set_rcx(ghcb, regs->cx);
+	ghcb_set_rdx(ghcb, regs->dx);
+	ghcb_set_rsi(ghcb, regs->si);
+	ghcb_set_rdi(ghcb, regs->di);
+	ghcb_set_rbp(ghcb, regs->bp);
+}
+
+static bool vmware_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
+{
+	if (!(ghcb_is_valid_rbx(ghcb) &&
+	      ghcb_is_valid_rcx(ghcb) &&
+	      ghcb_is_valid_rdx(ghcb) &&
+	      ghcb_is_valid_rsi(ghcb) &&
+	      ghcb_is_valid_rdi(ghcb) &&
+	      ghcb_is_valid_rbp(ghcb)))
+		return false;
+
+	regs->bx = ghcb->save.rbx;
+	regs->cx = ghcb->save.rcx;
+	regs->dx = ghcb->save.rdx;
+	regs->si = ghcb->save.rsi;
+	regs->di = ghcb->save.rdi;
+	regs->bp = ghcb->save.rbp;
+
+	return true;
+}
+#endif
+
 const __initconst struct hypervisor_x86 x86_hyper_vmware = {
-	.name			= "VMware",
-	.detect			= vmware_platform,
-	.type			= X86_HYPER_VMWARE,
-	.init.init_platform	= vmware_platform_setup,
-	.init.x2apic_available	= vmware_legacy_x2apic_available,
+	.name				= "VMware",
+	.detect				= vmware_platform,
+	.type				= X86_HYPER_VMWARE,
+	.init.init_platform		= vmware_platform_setup,
+	.init.x2apic_available		= vmware_legacy_x2apic_available,
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	.runtime.sev_es_hcall_prepare	= vmware_sev_es_hcall_prepare,
+	.runtime.sev_es_hcall_finish	= vmware_sev_es_hcall_finish,
+#endif
 };
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 64/70] x86/realmode: Add SEV-ES specific trampoline entry point
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (62 preceding siblings ...)
  2020-03-19  9:14 ` [PATCH 63/70] x86/vmware: Add VMware specific handling for VMMCALL " Joerg Roedel
@ 2020-03-19  9:14 ` Joerg Roedel
  2020-03-19  9:14 ` [PATCH 65/70] x86/realmode: Setup AP jump table Joerg Roedel
                   ` (5 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:14 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The code at the trampoline entry point is executed in real-mode. In
real-mode #VC exceptions can't be handled, so anything that might cause
such an exception must be avoided.

In the standard trampoline entry code these are the WBINVD instruction
and the call to verify_cpu(), neither of which is needed when running
as an SEV-ES guest.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/realmode.h      |  3 +++
 arch/x86/realmode/rm/header.S        |  3 +++
 arch/x86/realmode/rm/trampoline_64.S | 20 ++++++++++++++++++++
 3 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index b35030eeec36..6590394af309 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -21,6 +21,9 @@ struct real_mode_header {
 	/* SMP trampoline */
 	u32	trampoline_start;
 	u32	trampoline_header;
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	u32	sev_es_trampoline_start;
+#endif
 #ifdef CONFIG_X86_64
 	u32	trampoline_pgd;
 #endif
diff --git a/arch/x86/realmode/rm/header.S b/arch/x86/realmode/rm/header.S
index af04512c02d9..8c1db5bf5d78 100644
--- a/arch/x86/realmode/rm/header.S
+++ b/arch/x86/realmode/rm/header.S
@@ -20,6 +20,9 @@ SYM_DATA_START(real_mode_header)
 	/* SMP trampoline */
 	.long	pa_trampoline_start
 	.long	pa_trampoline_header
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+	.long	pa_sev_es_trampoline_start
+#endif
 #ifdef CONFIG_X86_64
 	.long	pa_trampoline_pgd;
 #endif
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S
index 251758ed7443..84c5d1b33d10 100644
--- a/arch/x86/realmode/rm/trampoline_64.S
+++ b/arch/x86/realmode/rm/trampoline_64.S
@@ -56,6 +56,7 @@ SYM_CODE_START(trampoline_start)
 	testl   %eax, %eax		# Check for return code
 	jnz	no_longmode
 
+.Lswitch_to_protected:
 	/*
 	 * GDT tables in non default location kernel can be beyond 16MB and
 	 * lgdt will not be able to load the address as in real mode default
@@ -80,6 +81,25 @@ no_longmode:
 	jmp no_longmode
 SYM_CODE_END(trampoline_start)
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+/* SEV-ES supports non-zero IP for entry points - no alignment needed */
+SYM_CODE_START(sev_es_trampoline_start)
+	cli			# We should be safe anyway
+
+	LJMPW_RM(1f)
+1:
+	mov	%cs, %ax	# Code and data in the same place
+	mov	%ax, %ds
+	mov	%ax, %es
+	mov	%ax, %ss
+
+	# Setup stack
+	movl	$rm_stack_end, %esp
+
+	jmp	.Lswitch_to_protected
+SYM_CODE_END(sev_es_trampoline_start)
+#endif	/* CONFIG_AMD_MEM_ENCRYPT */
+
 #include "../kernel/verify_cpu.S"
 
 	.section ".text32","ax"
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 65/70] x86/realmode: Setup AP jump table
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (63 preceding siblings ...)
  2020-03-19  9:14 ` [PATCH 64/70] x86/realmode: Add SEV-ES specific trampoline entry point Joerg Roedel
@ 2020-03-19  9:14 ` Joerg Roedel
  2020-03-19  9:14 ` [PATCH 66/70] x86/head/64: Don't call verify_cpu() on starting APs Joerg Roedel
                   ` (4 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:14 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Tom Lendacky <thomas.lendacky@amd.com>

Set up the AP jump table to point to the SEV-ES trampoline code so that
the APs can boot.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: - Adapted to different code base
                   - Moved AP table setup from SIPI sending path to
		     real-mode setup code ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/sev-es.h   | 11 ++++++
 arch/x86/include/uapi/asm/svm.h |  3 ++
 arch/x86/kernel/sev-es.c        | 66 +++++++++++++++++++++++++++++++++
 arch/x86/realmode/init.c        |  6 +++
 4 files changed, 86 insertions(+)

diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
index 122b3e71a788..63acf50e6280 100644
--- a/arch/x86/include/asm/sev-es.h
+++ b/arch/x86/include/asm/sev-es.h
@@ -78,4 +78,15 @@ static inline u64 copy_lower_bits(u64 out, u64 in, unsigned int bits)
 extern void early_vc_handler(void);
 extern bool boot_vc_exception(struct pt_regs *regs);
 
+struct real_mode_header;
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+int sev_es_setup_ap_jump_table(struct real_mode_header *rmh);
+#else /* CONFIG_AMD_MEM_ENCRYPT */
+static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
+{
+	return 0;
+}
+#endif /* CONFIG_AMD_MEM_ENCRYPT */
+
 #endif
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 8f36ae021a7f..a19ce9681ec2 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -84,6 +84,9 @@
 /* SEV-ES software-defined VMGEXIT events */
 #define SVM_VMGEXIT_MMIO_READ			0x80000001
 #define SVM_VMGEXIT_MMIO_WRITE			0x80000002
+#define SVM_VMGEXIT_AP_JUMP_TABLE		0x80000005
+#define		SVM_VMGEXIT_SET_AP_JUMP_TABLE			0
+#define		SVM_VMGEXIT_GET_AP_JUMP_TABLE			1
 #define SVM_VMGEXIT_UNSUPPORTED_EVENT		0x8000ffff
 
 #define SVM_EXIT_ERR           -1
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 635e7fc90d01..f56bdaf12fbe 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -16,6 +16,7 @@
 #include <linux/mm.h>
 
 #include <asm/trap_defs.h>
+#include <asm/realmode.h>
 #include <asm/sev-es.h>
 #include <asm/insn-eval.h>
 #include <asm/fpu/internal.h>
@@ -89,6 +90,8 @@ static void sev_es_put_ghcb(struct ghcb_state *state)
 /* Needed in vc_early_vc_forward_exception */
 extern void early_exception(struct pt_regs *regs, int trapnr);
 
+static inline u64 sev_es_rd_ghcb_msr(void);
+
 static inline u64 sev_es_rd_ghcb_msr(void)
 {
 	return native_read_msr(MSR_AMD64_SEV_ES_GHCB);
@@ -265,6 +268,69 @@ static phys_addr_t vc_slow_virt_to_phys(struct ghcb *ghcb, long vaddr)
 /* Include code shared with pre-decompression boot stage */
 #include "sev-es-shared.c"
 
+static u64 sev_es_get_jump_table_addr(void)
+{
+	struct ghcb_state state;
+	unsigned long flags;
+	struct ghcb *ghcb;
+	u64 ret;
+
+	local_irq_save(flags);
+
+	ghcb = sev_es_get_ghcb(&state);
+
+	vc_ghcb_invalidate(ghcb);
+	ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_AP_JUMP_TABLE);
+	ghcb_set_sw_exit_info_1(ghcb, SVM_VMGEXIT_GET_AP_JUMP_TABLE);
+	ghcb_set_sw_exit_info_2(ghcb, 0);
+
+	sev_es_wr_ghcb_msr(__pa(ghcb));
+	VMGEXIT();
+
+	if (!ghcb_is_valid_sw_exit_info_1(ghcb) ||
+	    !ghcb_is_valid_sw_exit_info_2(ghcb))
+		ret = 0;
+	else
+		ret = ghcb->save.sw_exit_info_2;
+
+	sev_es_put_ghcb(&state);
+
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
+{
+	u16 startup_cs, startup_ip;
+	phys_addr_t jump_table_pa;
+	u64 jump_table_addr;
+	u16 *jump_table;
+
+	jump_table_addr = sev_es_get_jump_table_addr();
+
+	/* Check if AP Jump Table is non-zero and page-aligned */
+	if (!jump_table_addr || jump_table_addr & ~PAGE_MASK)
+		return 0;
+
+	jump_table_pa = jump_table_addr & PAGE_MASK;
+
+	startup_cs = (u16)(rmh->trampoline_start >> 4);
+	startup_ip = (u16)(rmh->sev_es_trampoline_start -
+			   rmh->trampoline_start);
+
+	jump_table = ioremap_encrypted(jump_table_pa, PAGE_SIZE);
+	if (!jump_table)
+		return -EIO;
+
+	jump_table[0] = startup_ip;
+	jump_table[1] = startup_cs;
+
+	iounmap(jump_table);
+
+	return 0;
+}
+
 static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
 {
 	struct pt_regs *regs = ctxt->regs;
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 262f83cad355..1c5cbfd102d5 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -9,6 +9,7 @@
 #include <asm/realmode.h>
 #include <asm/tlbflush.h>
 #include <asm/crash.h>
+#include <asm/sev-es.h>
 
 struct real_mode_header *real_mode_header;
 u32 *trampoline_cr4_features;
@@ -107,6 +108,11 @@ static void __init setup_real_mode(void)
 	if (sme_active())
 		trampoline_header->flags |= TH_FLAGS_SME_ACTIVE;
 
+	if (sev_es_active()) {
+		if (sev_es_setup_ap_jump_table(real_mode_header))
+			panic("Failed to update SEV-ES AP Jump Table");
+	}
+
 	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 	trampoline_pgd[0] = trampoline_pgd_entry.pgd;
 	trampoline_pgd[511] = init_top_pgt[511].pgd;
-- 
2.17.1
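
The jump table entry written above is a real-mode CS:IP pair: CS is the trampoline base shifted right by four, and IP is the offset of the SEV-ES entry point from that base. A small worked example of the arithmetic follows, together with the non-zero, page-aligned check on the table address. The `toy_*` names, the 4096-byte page size and the sample addresses are assumptions for illustration; the sketch also assumes the trampoline base is 16-byte aligned so that the segment value is exact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

struct toy_rm_header {
	uint32_t trampoline_start;		/* physical base of the trampoline */
	uint32_t sev_es_trampoline_start;	/* physical address of the entry */
};

struct toy_ap_entry {
	uint16_t ip;	/* first word of the jump table */
	uint16_t cs;	/* second word of the jump table */
};

/* A usable table address is non-zero and page-aligned. */
static bool toy_jump_table_usable(uint64_t addr)
{
	return addr != 0 && (addr & (TOY_PAGE_SIZE - 1)) == 0;
}

static struct toy_ap_entry toy_make_ap_entry(const struct toy_rm_header *h)
{
	struct toy_ap_entry e;

	/* A real-mode segment can only name a 16-byte aligned base. */
	assert(h && (h->trampoline_start & 0xf) == 0);

	e.cs = (uint16_t)(h->trampoline_start >> 4);
	e.ip = (uint16_t)(h->sev_es_trampoline_start - h->trampoline_start);
	return e;
}

/* Real-mode address translation: linear = (CS << 4) + IP. */
static uint32_t toy_linear(struct toy_ap_entry e)
{
	return ((uint32_t)e.cs << 4) + e.ip;
}
```

With a base of 0x9d000 and an entry point at 0x9d050, this yields CS = 0x9d00 and IP = 0x50, and translating the pair back gives 0x9d050 again. The non-zero IP is what the patch 64 comment refers to when it says that SEV-ES supports non-zero IP for entry points.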


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 66/70] x86/head/64: Don't call verify_cpu() on starting APs
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (64 preceding siblings ...)
  2020-03-19  9:14 ` [PATCH 65/70] x86/realmode: Setup AP jump table Joerg Roedel
@ 2020-03-19  9:14 ` Joerg Roedel
  2020-03-19  9:14 ` [PATCH 67/70] x86/head/64: Rename start_cpu0 Joerg Roedel
                   ` (3 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:14 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The APs are not ready to handle exceptions when verify_cpu() is called
in secondary_startup_64.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/realmode.h | 1 +
 arch/x86/kernel/head_64.S       | 1 +
 arch/x86/realmode/init.c        | 6 ++++++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/realmode.h b/arch/x86/include/asm/realmode.h
index 6590394af309..5c97807c38a4 100644
--- a/arch/x86/include/asm/realmode.h
+++ b/arch/x86/include/asm/realmode.h
@@ -69,6 +69,7 @@ extern unsigned char startup_32_smp[];
 extern unsigned char boot_gdt[];
 #else
 extern unsigned char secondary_startup_64[];
+extern unsigned char secondary_startup_64_no_verify[];
 #endif
 
 static inline size_t real_mode_size_needed(void)
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index b3acecdabd34..c935d6d07393 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -150,6 +150,7 @@ SYM_CODE_START(secondary_startup_64)
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
+SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
 	/*
 	 * Retrieve the modifier (SME encryption mask if SME is active) to be
 	 * added to the initial pgdir entry that will be programmed into CR3.
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 1c5cbfd102d5..030c38268069 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -109,6 +109,12 @@ static void __init setup_real_mode(void)
 		trampoline_header->flags |= TH_FLAGS_SME_ACTIVE;
 
 	if (sev_es_active()) {
+		/*
+		 * Skip the call to verify_cpu() in secondary_startup_64 as it
+		 * will cause #VC exceptions when the AP can't handle them yet.
+		 */
+		trampoline_header->start = (u64) secondary_startup_64_no_verify;
+
 		if (sev_es_setup_ap_jump_table(real_mode_header))
 			panic("Failed to update SEV-ES AP Jump Table");
 	}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 67/70] x86/head/64: Rename start_cpu0
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (65 preceding siblings ...)
  2020-03-19  9:14 ` [PATCH 66/70] x86/head/64: Don't call verify_cpu() on starting APs Joerg Roedel
@ 2020-03-19  9:14 ` Joerg Roedel
  2020-03-19  9:14 ` [PATCH 68/70] x86/sev-es: Support CPU offline/online Joerg Roedel
                   ` (2 subsequent siblings)
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:14 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

For SEV-ES this entry point will be used for restarting APs after they
have been offlined. Remove the '0' from the name to reflect that.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/cpu.h | 2 +-
 arch/x86/kernel/head_32.S  | 4 ++--
 arch/x86/kernel/head_64.S  | 6 +++---
 arch/x86/kernel/smpboot.c  | 4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index adc6cc86b062..00668daf8991 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -29,7 +29,7 @@ struct x86_cpu {
 #ifdef CONFIG_HOTPLUG_CPU
 extern int arch_register_cpu(int num);
 extern void arch_unregister_cpu(int);
-extern void start_cpu0(void);
+extern void start_cpu(void);
 #ifdef CONFIG_DEBUG_HOTPLUG_CPU0
 extern int _debug_hotplug_cpu(int cpu, int action);
 #endif
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 3923ab4630d7..1a280152bd10 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -180,12 +180,12 @@ SYM_CODE_END(startup_32)
  * up already except stack. We just set up stack here. Then call
  * start_secondary().
  */
-SYM_FUNC_START(start_cpu0)
+SYM_FUNC_START(start_cpu)
 	movl initial_stack, %ecx
 	movl %ecx, %esp
 	call *(initial_code)
 1:	jmp 1b
-SYM_FUNC_END(start_cpu0)
+SYM_FUNC_END(start_cpu)
 #endif
 
 /*
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index c935d6d07393..f2e793213fa7 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -299,15 +299,15 @@ SYM_CODE_END(secondary_startup_64)
 
 #ifdef CONFIG_HOTPLUG_CPU
 /*
- * Boot CPU0 entry point. It's called from play_dead(). Everything has been set
+ * CPU entry point. It's called from play_dead(). Everything has been set
  * up already except stack. We just set up stack here. Then call
  * start_secondary() via .Ljump_to_C_code.
  */
-SYM_CODE_START(start_cpu0)
+SYM_CODE_START(start_cpu)
 	UNWIND_HINT_EMPTY
 	movq	initial_stack(%rip), %rsp
 	jmp	.Ljump_to_C_code
-SYM_CODE_END(start_cpu0)
+SYM_CODE_END(start_cpu)
 #endif
 
 	/* Both SMP bootup and ACPI suspend change these variables */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 69881b2d446c..19aa18f1e307 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1717,7 +1717,7 @@ static inline void mwait_play_dead(void)
 		 * If NMI wants to wake up CPU0, start CPU0.
 		 */
 		if (wakeup_cpu0())
-			start_cpu0();
+			start_cpu();
 	}
 }
 
@@ -1732,7 +1732,7 @@ void hlt_play_dead(void)
 		 * If NMI wants to wake up CPU0, start CPU0.
 		 */
 		if (wakeup_cpu0())
-			start_cpu0();
+			start_cpu();
 	}
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 68/70] x86/sev-es: Support CPU offline/online
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (66 preceding siblings ...)
  2020-03-19  9:14 ` [PATCH 67/70] x86/head/64: Rename start_cpu0 Joerg Roedel
@ 2020-03-19  9:14 ` Joerg Roedel
  2020-03-19  9:14 ` [PATCH 69/70] x86/cpufeature: Add SEV_ES_GUEST CPU Feature Joerg Roedel
  2020-03-19  9:14 ` [PATCH 70/70] x86/sev-es: Add NMI state tracking Joerg Roedel
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:14 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Add a play_dead handler when running under SEV-ES. This is needed
because the hypervisor can't deliver an SIPI request to restart the AP.
Instead the kernel has to issue a VMGEXIT to halt the VCPU. When the
hypervisor would deliver an SIPI, it wakes up the VCPU instead.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/uapi/asm/svm.h |  1 +
 arch/x86/kernel/sev-es.c        | 58 +++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index a19ce9681ec2..20a05839dd9a 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -84,6 +84,7 @@
 /* SEV-ES software-defined VMGEXIT events */
 #define SVM_VMGEXIT_MMIO_READ			0x80000001
 #define SVM_VMGEXIT_MMIO_WRITE			0x80000002
+#define SVM_VMGEXIT_AP_HLT_LOOP			0x80000004
 #define SVM_VMGEXIT_AP_JUMP_TABLE		0x80000005
 #define		SVM_VMGEXIT_SET_AP_JUMP_TABLE			0
 #define		SVM_VMGEXIT_GET_AP_JUMP_TABLE			1
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index f56bdaf12fbe..3c22f256645e 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -23,6 +23,8 @@
 #include <asm/processor.h>
 #include <asm/traps.h>
 #include <asm/svm.h>
+#include <asm/smp.h>
+#include <asm/cpu.h>
 
 #define DR7_RESET_VALUE        0x400
 
@@ -381,6 +383,60 @@ static bool __init sev_es_setup_ghcb(void)
 	return true;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static void sev_es_ap_hlt_loop(void)
+{
+	struct ghcb_state state;
+	struct ghcb *ghcb;
+
+	ghcb = sev_es_get_ghcb(&state);
+
+	while (true) {
+		vc_ghcb_invalidate(ghcb);
+		ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_AP_HLT_LOOP);
+		ghcb_set_sw_exit_info_1(ghcb, 0);
+		ghcb_set_sw_exit_info_2(ghcb, 0);
+
+		sev_es_wr_ghcb_msr(__pa(ghcb));
+		VMGEXIT();
+
+		/* Wakeup signal? */
+		if (ghcb_is_valid_sw_exit_info_2(ghcb) &&
+		    ghcb->save.sw_exit_info_2 != 0)
+			break;
+	}
+
+	sev_es_put_ghcb(&state);
+}
+
+void sev_es_play_dead(void)
+{
+	play_dead_common();
+
+	/* IRQs now disabled */
+
+	sev_es_ap_hlt_loop();
+
+	/*
+	 * If we get here, the VCPU was woken up again. Jump to CPU
+	 * startup code to get it back online.
+	 */
+
+	start_cpu();
+}
+#else  /* CONFIG_HOTPLUG_CPU */
+#define sev_es_play_dead	native_play_dead
+#endif /* CONFIG_HOTPLUG_CPU */
+
+#ifdef CONFIG_SMP
+static void sev_es_setup_play_dead(void)
+{
+	smp_ops.play_dead = sev_es_play_dead;
+}
+#else
+static inline void sev_es_setup_play_dead(void) { }
+#endif
+
 void sev_es_init_ghcbs(void)
 {
 	int cpu;
@@ -399,6 +455,8 @@ void sev_es_init_ghcbs(void)
 				     sizeof(*ghcb) >> PAGE_SHIFT);
 		memset(ghcb, 0, sizeof(*ghcb));
 	}
+
+	sev_es_setup_play_dead();
 }
 
 static void __init vc_early_vc_forward_exception(struct es_em_ctxt *ctxt)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 69/70] x86/cpufeature: Add SEV_ES_GUEST CPU Feature
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (67 preceding siblings ...)
  2020-03-19  9:14 ` [PATCH 68/70] x86/sev-es: Support CPU offline/online Joerg Roedel
@ 2020-03-19  9:14 ` Joerg Roedel
  2020-03-19  9:14 ` [PATCH 70/70] x86/sev-es: Add NMI state tracking Joerg Roedel
  69 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:14 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

The feature bit will indicate whether the kernel runs as an SEV-ES
guest. This can be used to apply alternatives at boot for SEV-ES guests
and provides a way for user-space to detect whether it runs as an SEV-ES
guest.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/kernel/cpu/amd.c          | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2fee1a2cac2f..35df826ee3fc 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -235,6 +235,7 @@
 #define X86_FEATURE_VMCALL		( 8*32+18) /* "" Hypervisor supports the VMCALL instruction */
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* "" VMware prefers VMMCALL hypercall instruction */
 #define X86_FEATURE_SEV_ES		( 8*32+20) /* AMD Secure Encrypted Virtualization - Encrypted State */
+#define X86_FEATURE_SEV_ES_GUEST	( 8*32+21) /* SEV-ES Guest */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 523a6a76c6c1..8cdb190822de 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -485,7 +485,6 @@ static void early_init_amd_mc(struct cpuinfo_x86 *c)
 
 static void bsp_init_amd(struct cpuinfo_x86 *c)
 {
-
 #ifdef CONFIG_X86_64
 	if (c->x86 >= 0xf) {
 		unsigned long long tseg;
@@ -614,6 +613,11 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 		setup_clear_cpu_cap(X86_FEATURE_SEV);
 		setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
 	}
+
+	if (!rdmsrl_safe(MSR_AMD64_SEV, &msr)) {
+		if (msr & MSR_AMD64_SEV_ES_ENABLED)
+			set_cpu_cap(c, X86_FEATURE_SEV_ES_GUEST);
+	}
 }
 
 static void early_init_amd(struct cpuinfo_x86 *c)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH 70/70] x86/sev-es: Add NMI state tracking
  2020-03-19  9:12 [RFC PATCH 00/70 v2] x86: SEV-ES Guest Support Joerg Roedel
                   ` (68 preceding siblings ...)
  2020-03-19  9:14 ` [PATCH 69/70] x86/cpufeature: Add SEV_ES_GUEST CPU Feature Joerg Roedel
@ 2020-03-19  9:14 ` Joerg Roedel
  2020-03-19 15:35   ` Andy Lutomirski
  2020-03-19 16:53   ` [PATCH 70/70] x86/sev-es: Add NMI state tracking Mika Penttilä
  69 siblings, 2 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19  9:14 UTC (permalink / raw)
  To: x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Joerg Roedel

From: Joerg Roedel <jroedel@suse.de>

Keep NMI state in SEV-ES code so the kernel can re-enable NMIs for the
vCPU when it reaches IRET.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_64.S       | 48 +++++++++++++++++++++++++++++++++
 arch/x86/include/asm/sev-es.h   | 27 +++++++++++++++++++
 arch/x86/include/uapi/asm/svm.h |  1 +
 arch/x86/kernel/nmi.c           |  8 ++++++
 arch/x86/kernel/sev-es.c        | 31 ++++++++++++++++++++-
 5 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 729876d368c5..355470b36896 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -38,6 +38,7 @@
 #include <asm/export.h>
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
+#include <asm/sev-es.h>
 #include <linux/err.h>
 
 #include "calling.h"
@@ -629,6 +630,13 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 	ud2
 1:
 #endif
+
+	/*
+	 * This code path is used by the NMI handler, so check if NMIs
+	 * need to be re-enabled when running as an SEV-ES guest.
+	 */
+	SEV_ES_IRET_CHECK
+
 	POP_REGS pop_rdi=0
 
 	/*
@@ -1474,6 +1482,8 @@ SYM_CODE_START(nmi)
 	movq	$-1, %rsi
 	call	do_nmi
 
+	SEV_ES_NMI_COMPLETE
+
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
 	 * work, because we don't want to enable interrupts.
@@ -1599,6 +1609,7 @@ nested_nmi_out:
 	popq	%rdx
 
 	/* We are returning to kernel mode, so this cannot result in a fault. */
+	SEV_ES_NMI_COMPLETE
 	iretq
 
 first_nmi:
@@ -1687,6 +1698,12 @@ end_repeat_nmi:
 	movq	$-1, %rsi
 	call	do_nmi
 
+	/*
+	 * When running as an SEV-ES guest, jump to the SEV-ES NMI IRET
+	 * path.
+	 */
+	SEV_ES_NMI_COMPLETE
+
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
 
@@ -1715,6 +1732,9 @@ nmi_restore:
 	std
 	movq	$0, 5*8(%rsp)		/* clear "NMI executing" */
 
+nmi_return:
+	UNWIND_HINT_IRET_REGS
+
 	/*
 	 * iretq reads the "iret" frame and exits the NMI stack in a
 	 * single instruction.  We are returning to kernel mode, so this
@@ -1724,6 +1744,34 @@ nmi_restore:
 	iretq
 SYM_CODE_END(nmi)
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+SYM_CODE_START(sev_es_iret_user)
+	UNWIND_HINT_IRET_REGS offset=8
+	/*
+	 * The kernel jumps here directly from
+	 * swapgs_restore_regs_and_return_to_usermode. %rsp points already to
+	 * trampoline stack, but %cr3 is still from kernel. User-regs are live
+	 * except %rdi. Switch to user CR3, restore user %rdi and user gs_base
+	 * and single-step over IRET
+	 */
+	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
+	popq	%rdi
+	SWAPGS
+	/*
+	 * Enable single-stepping and execute IRET. When IRET is
+	 * finished the resulting #DB exception will cause a #VC
+	 * exception to be raised. The #VC exception handler will send a
+	 * NMI-complete message to the hypervisor to re-open the NMI
+	 * window.
+	 */
+sev_es_iret_kernel:
+	pushf
+	btsq $X86_EFLAGS_TF_BIT, (%rsp)
+	popf
+	iretq
+SYM_CODE_END(sev_es_iret_user)
+#endif
+
 #ifndef CONFIG_IA32_EMULATION
 /*
  * This handles SYSCALL from 32-bit code.  There is no way to program
diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
index 63acf50e6280..d866adb3e6d4 100644
--- a/arch/x86/include/asm/sev-es.h
+++ b/arch/x86/include/asm/sev-es.h
@@ -8,6 +8,8 @@
 #ifndef __ASM_ENCRYPTED_STATE_H
 #define __ASM_ENCRYPTED_STATE_H
 
+#ifndef __ASSEMBLY__
+
 #include <linux/types.h>
 #include <asm/insn.h>
 
@@ -82,11 +84,36 @@ struct real_mode_header;
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 int sev_es_setup_ap_jump_table(struct real_mode_header *rmh);
+void sev_es_nmi_enter(void);
 #else /* CONFIG_AMD_MEM_ENCRYPT */
 static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
 {
 	return 0;
 }
+static inline void sev_es_nmi_enter(void) { }
+#endif /* CONFIG_AMD_MEM_ENCRYPT*/
+
+#else /* !__ASSEMBLY__ */
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+#define SEV_ES_NMI_COMPLETE		\
+	ALTERNATIVE	"", "callq sev_es_nmi_complete", X86_FEATURE_SEV_ES_GUEST
+
+.macro	SEV_ES_IRET_CHECK
+	ALTERNATIVE	"jmp	.Lend_\@", "", X86_FEATURE_SEV_ES_GUEST
+	movq	PER_CPU_VAR(sev_es_in_nmi), %rdi
+	testq	%rdi, %rdi
+	jz	.Lend_\@
+	callq	sev_es_nmi_complete
+.Lend_\@:
+.endm
+
+#else  /* CONFIG_AMD_MEM_ENCRYPT */
+#define	SEV_ES_NMI_COMPLETE
+.macro	SEV_ES_IRET_CHECK
+.endm
 #endif /* CONFIG_AMD_MEM_ENCRYPT*/
 
+#endif /* __ASSEMBLY__ */
+
 #endif
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 20a05839dd9a..0f837339db66 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -84,6 +84,7 @@
 /* SEV-ES software-defined VMGEXIT events */
 #define SVM_VMGEXIT_MMIO_READ			0x80000001
 #define SVM_VMGEXIT_MMIO_WRITE			0x80000002
+#define SVM_VMGEXIT_NMI_COMPLETE		0x80000003
 #define SVM_VMGEXIT_AP_HLT_LOOP			0x80000004
 #define SVM_VMGEXIT_AP_JUMP_TABLE		0x80000005
 #define		SVM_VMGEXIT_SET_AP_JUMP_TABLE			0
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 54c21d6abd5a..7312a6d4d50f 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -37,6 +37,7 @@
 #include <asm/reboot.h>
 #include <asm/cache.h>
 #include <asm/nospec-branch.h>
+#include <asm/sev-es.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/nmi.h>
@@ -510,6 +511,13 @@ NOKPROBE_SYMBOL(is_debug_stack);
 dotraplinkage notrace void
 do_nmi(struct pt_regs *regs, long error_code)
 {
+	/*
+	 * For SEV-ES the kernel needs to track whether NMIs are blocked until
+	 * IRET is reached, even when the CPU is offline.
+	 */
+	if (sev_es_active())
+		sev_es_nmi_enter();
+
 	if (IS_ENABLED(CONFIG_SMP) && cpu_is_offline(smp_processor_id()))
 		return;
 
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 3c22f256645e..409a7a2aa630 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -37,6 +37,7 @@ struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
  */
 struct ghcb __initdata *boot_ghcb;
 static DEFINE_PER_CPU(unsigned long, cached_dr7) = DR7_RESET_VALUE;
+DEFINE_PER_CPU(bool, sev_es_in_nmi) = false;
 
 struct ghcb_state {
 	struct ghcb *ghcb;
@@ -270,6 +271,31 @@ static phys_addr_t vc_slow_virt_to_phys(struct ghcb *ghcb, long vaddr)
 /* Include code shared with pre-decompression boot stage */
 #include "sev-es-shared.c"
 
+void sev_es_nmi_enter(void)
+{
+	this_cpu_write(sev_es_in_nmi, true);
+}
+
+void sev_es_nmi_complete(void)
+{
+	struct ghcb_state state;
+	struct ghcb *ghcb;
+
+	ghcb = sev_es_get_ghcb(&state);
+
+	vc_ghcb_invalidate(ghcb);
+	ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_NMI_COMPLETE);
+	ghcb_set_sw_exit_info_1(ghcb, 0);
+	ghcb_set_sw_exit_info_2(ghcb, 0);
+
+	sev_es_wr_ghcb_msr(__pa(ghcb));
+	VMGEXIT();
+
+	sev_es_put_ghcb(&state);
+
+	this_cpu_write(sev_es_in_nmi, false);
+}
+
 static u64 sev_es_get_jump_table_addr(void)
 {
 	struct ghcb_state state;
@@ -891,7 +917,10 @@ static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
 static enum es_result vc_handle_db_exception(struct ghcb *ghcb,
 					     struct es_em_ctxt *ctxt)
 {
-	do_debug(ctxt->regs, 0);
+	if (this_cpu_read(sev_es_in_nmi))
+		sev_es_nmi_complete();
+	else
+		do_debug(ctxt->regs, 0);
 
 	/* Exception event, do not advance RIP */
 	return ES_RETRY;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 63/70] x86/vmware: Add VMware specific handling for VMMCALL under SEV-ES
  2020-03-19  9:14 ` [PATCH 63/70] x86/vmware: Add VMware specific handling for VMMCALL " Joerg Roedel
@ 2020-03-19 10:18     ` Thomas Hellstrom
  0 siblings, 0 replies; 243+ messages in thread
From: Thomas Hellstrom @ 2020-03-19 10:18 UTC (permalink / raw)
  To: joro, x86
  Cc: kvm, linux-kernel, peterz, keescook, virtualization, dave.hansen,
	jgross, Doug Covelli, dan.j.williams, hpa, luto, jroedel, jslaby,
	thomas.lendacky

On Thu, 2020-03-19 at 10:14 +0100, Joerg Roedel wrote:
> From: Doug Covelli <dcovelli@vmware.com>
> 
> This change adds VMware specific handling for #VC faults caused by
> VMMCALL instructions.
> 
> Signed-off-by: Doug Covelli <dcovelli@vmware.com>
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> [ jroedel@suse.de: - Adapt to different paravirt interface ]
> Co-developed-by: Joerg Roedel <jroedel@suse.de>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/kernel/cpu/vmware.c | 50 ++++++++++++++++++++++++++++++++
> ----
>  1 file changed, 45 insertions(+), 5 deletions(-)
> 

Acked-by: Thomas Hellstrom <thellstrom@vmware.com>


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking
  2020-03-19  9:14 ` [PATCH 70/70] x86/sev-es: Add NMI state tracking Joerg Roedel
@ 2020-03-19 15:35   ` Andy Lutomirski
  2020-03-19 16:07     ` Joerg Roedel
  2020-03-20 13:17     ` [RFC PATCH v2.1] x86/sev-es: Handle NMI State Joerg Roedel
  2020-03-19 16:53   ` [PATCH 70/70] x86/sev-es: Add NMI state tracking Mika Penttilä
  1 sibling, 2 replies; 243+ messages in thread
From: Andy Lutomirski @ 2020-03-19 15:35 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: X86 ML, H. Peter Anvin, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, LKML, kvm list,
	Linux Virtualization, Joerg Roedel

On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> From: Joerg Roedel <jroedel@suse.de>
>
> Keep NMI state in SEV-ES code so the kernel can re-enable NMIs for the
> vCPU when it reaches IRET.

IIRC I suggested just re-enabling NMI in C from do_nmi().  What was
wrong with that approach?

> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +SYM_CODE_START(sev_es_iret_user)
> +       UNWIND_HINT_IRET_REGS offset=8
> +       /*
> +        * The kernel jumps here directly from
> +        * swapgs_restore_regs_and_return_to_usermode. %rsp points already to
> +        * trampoline stack, but %cr3 is still from kernel. User-regs are live
> +        * except %rdi. Switch to user CR3, restore user %rdi and user gs_base
> +        * and single-step over IRET
> +        */
> +       SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
> +       popq    %rdi
> +       SWAPGS
> +       /*
> +        * Enable single-stepping and execute IRET. When IRET is
> +        * finished the resulting #DB exception will cause a #VC
> +        * exception to be raised. The #VC exception handler will send a
> +        * NMI-complete message to the hypervisor to re-open the NMI
> +        * window.

This is distressing to say the least.  The sequence of events is, roughly:

1. We're here with NMI masking in an unknown state because do_nmi()
and any nested faults could have done IRET, at least architecturally.
NMI could occur or it could not.  I suppose that, on SEV-ES, at least
on current CPUs, NMI is definitely masked.  What about on newer CPUs?
What if we migrate?

> +        */
> +sev_es_iret_kernel:
> +       pushf
> +       btsq $X86_EFLAGS_TF_BIT, (%rsp)
> +       popf

Now we have TF on, NMIs (architecturally) in unknown state.

> +       iretq

This causes us to pop the NMI frame off the stack.  Assuming the NMI
restart logic is invoked (which is maybe impossible?), we get #DB,
which presumably is actually delivered.  And we end up on the #DB
stack, which might already have been in use, so we have a potential
increase in nesting.  Also, #DB may be called from an unexpected
context.

Now somehow #DB is supposed to invoke #VC, which is supposed to do the
magic hypercall, and all of this is supposed to be safe?  Or is #DB
unconditionally redirected to #VC?  What happens if we had no stack
(e.g. we interrupted SYSCALL) or we were already in #VC to begin with?

I think there are two credible ways to approach this:

1. Just put the NMI unmask in do_nmi().  The kernel *already* knows
how to handle running do_nmi() with NMIs unmasked.  This is much, much
simpler than your code.

2. Have an entirely separate NMI path for the
SEV-ES-on-misdesigned-CPU case.  And have very clear documentation for
what prevents this code from being executed on future CPUs (Zen3?)
that have this issue fixed for real?

This hybrid code is no good.

--Andy

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 41/70] x86/sev-es: Add Runtime #VC Exception Handler
  2020-03-19  9:13   ` Joerg Roedel
  (?)
@ 2020-03-19 15:44   ` Andy Lutomirski
  2020-03-19 16:24     ` Joerg Roedel
  -1 siblings, 1 reply; 243+ messages in thread
From: Andy Lutomirski @ 2020-03-19 15:44 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: X86 ML, H. Peter Anvin, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, LKML, kvm list,
	Linux Virtualization, Joerg Roedel

On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> From: Tom Lendacky <thomas.lendacky@amd.com>
>
> Add the handler for #VC exceptions invoked at runtime.

If I read this correctly, this does not use IST.  If that's true, I
don't see how this can possibly work.  There are at least two nasty cases
that come to mind:

1. SYSCALL followed by NMI.  The NMI IRET hack gets to #VC and we
explode.  This is fixable by getting rid of the NMI EFLAGS.TF hack.

2. tools/testing/selftests/x86/mov_ss_trap_64.  User code does MOV
(addr), SS; SYSCALL, where addr has a data breakpoint.  We get #DB
promoted to #VC with no stack.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 42/70] x86/sev-es: Support nested #VC exceptions
  2020-03-19  9:13 ` [PATCH 42/70] x86/sev-es: Support nested #VC exceptions Joerg Roedel
@ 2020-03-19 15:46     ` Andy Lutomirski
  0 siblings, 0 replies; 243+ messages in thread
From: Andy Lutomirski @ 2020-03-19 15:46 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: X86 ML, H. Peter Anvin, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, LKML, kvm list,
	Linux Virtualization, Joerg Roedel

On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> From: Joerg Roedel <jroedel@suse.de>
>
> Handle #VC exceptions that happen while the GHCB is in use. This can
> happen when an NMI happens in the #VC exception handler and the NMI
> handler causes a #VC exception itself. Save the contents of the GHCB
> when nesting is detected and restore it when the GHCB is no longer
> used.
>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/kernel/sev-es.c | 63 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 97241d2f0f70..3b7bbc8d841e 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -32,9 +32,57 @@ struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>   */
>  struct ghcb __initdata *boot_ghcb;
>
> +struct ghcb_state {
> +       struct ghcb *ghcb;
> +};
> +
>  /* Runtime GHCB pointers */
>  static struct ghcb __percpu *ghcb_page;
>
> +/*
> + * Mark the per-cpu GHCB as in-use to detect nested #VC exceptions.
> + * There is no need for it to be atomic, because nothing is written to the GHCB
> + * between the read and the write of ghcb_active. So it is safe to use it when a
> + * nested #VC exception happens before the write.
> + */
> +static DEFINE_PER_CPU(bool, ghcb_active);
> +
> +static struct ghcb *sev_es_get_ghcb(struct ghcb_state *state)
> +{
> +       struct ghcb *ghcb = (struct ghcb *)this_cpu_ptr(ghcb_page);
> +       bool *active = this_cpu_ptr(&ghcb_active);
> +
> +       if (unlikely(*active)) {
> +               /* GHCB is already in use - save its contents */
> +
> +               state->ghcb = kzalloc(sizeof(struct ghcb), GFP_ATOMIC);
> +               if (!state->ghcb)
> +                       return NULL;

This can't possibly end well.  Maybe have a little percpu list of
GHCBs and make sure there are enough for any possible nesting?

Also, I admit confusion.  Isn't the GHCB required to be unencrypted?
How does that work with kzalloc()?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking
  2020-03-19 15:35   ` Andy Lutomirski
@ 2020-03-19 16:07     ` Joerg Roedel
  2020-03-19 18:40       ` Andy Lutomirski
  2020-03-20 13:17     ` [RFC PATCH v2.1] x86/sev-es: Handle NMI State Joerg Roedel
  1 sibling, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19 16:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, LKML, kvm list, Linux Virtualization,
	Joerg Roedel

Hi Andy,

On Thu, Mar 19, 2020 at 08:35:59AM -0700, Andy Lutomirski wrote:
> On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel <joro@8bytes.org> wrote:
> >
> > From: Joerg Roedel <jroedel@suse.de>
> >
> > Keep NMI state in SEV-ES code so the kernel can re-enable NMIs for the
> > vCPU when it reaches IRET.
> 
> IIRC I suggested just re-enabling NMI in C from do_nmi().  What was
> wrong with that approach?

If I understand the code correctly a nested NMI will just reset the
interrupted NMI handler to start executing again at 'restart_nmi'.
The interrupted NMI handler could be in the #VC handler, and it is not
safe to just jump back to the start of the NMI handler from somewhere
within the #VC handler.

So I decided to not allow NMI nesting for SEV-ES and only re-enable the
NMI window when the first NMI returns. This is not implemented in this
patch, but I will do that once Thomas' entry-code rewrite is upstream.
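
To make the state tracking from the patch easier to follow, here is a
minimal user-space model of it. This is only a sketch, not kernel code:
the per-cpu sev_es_in_nmi flag becomes a plain struct member, the VMGEXIT
that sends the NMI-complete message is replaced by a counter, and all
model_* names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-CPU model; none of these names exist in the kernel. */
struct nmi_model {
	bool in_nmi;            /* stand-in for the per-cpu sev_es_in_nmi flag */
	int  nmi_complete_msgs; /* NMI-complete messages "sent" to the hypervisor */
	int  debug_calls;       /* #DB events forwarded to the normal debug path */
};

/* Models sev_es_nmi_enter(): called at the top of do_nmi(). */
static void model_nmi_enter(struct nmi_model *m)
{
	m->in_nmi = true;
}

/* Models sev_es_nmi_complete(): the real code issues a VMGEXIT here. */
static void model_nmi_complete(struct nmi_model *m)
{
	m->nmi_complete_msgs++;
	m->in_nmi = false;
}

/* Models the #DB branch of the #VC handler (vc_handle_db_exception()). */
static void model_handle_db(struct nmi_model *m)
{
	if (m->in_nmi)
		model_nmi_complete(m);
	else
		m->debug_calls++;
}

/* Walks through one NMI surrounded by two ordinary #DB events. */
static struct nmi_model model_demo(void)
{
	struct nmi_model m = { false, 0, 0 };

	model_handle_db(&m);	/* ordinary #DB, no NMI in flight */
	model_nmi_enter(&m);	/* NMI arrives, flag is set */
	model_handle_db(&m);	/* #DB while in NMI: send NMI-complete */
	model_handle_db(&m);	/* flag is clear again: ordinary #DB */

	assert(m.nmi_complete_msgs == 1);
	return m;
}
```

The point of the model is only that the NMI-complete message is sent
once per tracked NMI and that every other #DB still reaches the normal
debug path. It says nothing about the stack or nesting questions raised
above.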

> This causes us to pop the NMI frame off the stack.  Assuming the NMI
> restart logic is invoked (which is maybe impossible?), we get #DB,
> which presumably is actually delivered.  And we end up on the #DB
> stack, which might already have been in use, so we have a potential
> increase in nesting.  Also, #DB may be called from an unexpected
> context.

An SEV-ES hypervisor is required to intercept #DB, which means that the
#DB exception actually ends up being a #VC exception. So it will not end
up on the #DB stack.

> Now somehow #DB is supposed to invoke #VC, which is supposed to do the
> magic hypercall, and all of this is supposed to be safe?  Or is #DB
> unconditionally redirected to #VC?  What happens if we had no stack
> (e.g. we interrupted SYSCALL) or we were already in #VC to begin with?

Yeah, as I said above, the #DB is redirected to #VC, as the hypervisor
has to intercept #DB.

The stack-problem is the one that prevents the Single-step-over-iret
approach right now, because the NMI can hit while in kernel mode and on
entry stack, which the generic entry code (besides NMI) does not handle.
Getting a #VC exception there (like after an IRET to that state) breaks
things.

Last, in this version of the patch-set the #VC handler became
nesting-safe. It detects whether the per-cpu GHCB is in use and
saves/restores its contents in this case.


> I think there are two credible ways to approach this:
> 
> 1. Just put the NMI unmask in do_nmi().  The kernel *already* knows
> how to handle running do_nmi() with NMIs unmasked.  This is much, much
> simpler than your code.

Right, and I thought about that, but the implication is that the
complexity is moved somewhere else, namely into the #VC handler, which
then has to be restartable.

> 2. Have an entirely separate NMI path for the
> SEV-ES-on-misdesigned-CPU case.  And have very clear documentation for
> what prevents this code from being executed on future CPUs (Zen3?)
> that have this issue fixed for real?

That sounds like a good alternative, I will investigate this approach.
The NMI handler should be much simpler as it doesn't need to allow NMI
nesting. The question is, does the C code down the NMI path depend on
the NMI handlers stack frame layout (e.g. the in-nmi flag)?

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 42/70] x86/sev-es: Support nested #VC exceptions
  2020-03-19 15:46     ` Andy Lutomirski
  (?)
@ 2020-03-19 16:12     ` Joerg Roedel
  -1 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19 16:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, LKML, kvm list, Linux Virtualization,
	Joerg Roedel

On Thu, Mar 19, 2020 at 08:46:36AM -0700, Andy Lutomirski wrote:
> This can't possibly end well.  Maybe have a little percpu list of
> GHCBs and make sure there are enough for any possible nesting?

Yeah, it is not entirely robust yet. Without NMI nesting the number of
possible #VC nesting levels should be limited. At least one backup GHCB
pre-allocated is probably a good idea.

> Also, I admit confusion.  Isn't the GHCB required to be unencrypted?
> How does that work with kzalloc()?

Yes, but the kzalloc'ed ghcb is just the backup space for the real GHCB,
which is mapped unencrypted. The contents of the unencrypted GHCB are
copied to the backup and restored on return, so that the interrupted #VC
handler finds the GHCB unmodified.
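
As a rough illustration of the save/restore scheme described above, here
is a user-space model in C. It is only a sketch under stated assumptions:
the struct and function names (ghcb_model, model_get_ghcb, model_put_ghcb)
are hypothetical and are not the ones used in the patch-set, and only a
single backup buffer (one nesting level) is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_GHCB_SIZE 4096

/* Stand-in for one GHCB page; the real layout is not modelled. */
struct ghcb_model {
	unsigned char data[MODEL_GHCB_SIZE];
};

/* Hypothetical per-CPU bookkeeping (all names are illustrative). */
struct cpu_ghcb_model {
	struct ghcb_model shared;	/* the page shared with the hypervisor */
	struct ghcb_model backup;	/* private backup, one nesting level */
	bool shared_in_use;
	bool backup_in_use;
};

/* Per-invocation state, kept on the stack of each handler. */
struct ghcb_state_model {
	bool restore_on_put;
};

struct ghcb_model *model_get_ghcb(struct cpu_ghcb_model *cpu,
				  struct ghcb_state_model *st)
{
	if (cpu->shared_in_use) {
		/* Nested use: preserve what the interrupted handler wrote. */
		assert(!cpu->backup_in_use);	/* only one level modelled */
		memcpy(&cpu->backup, &cpu->shared, sizeof(cpu->backup));
		cpu->backup_in_use = true;
		st->restore_on_put = true;
	} else {
		cpu->shared_in_use = true;
		st->restore_on_put = false;
	}

	/* Every user works on the same shared page. */
	return &cpu->shared;
}

void model_put_ghcb(struct cpu_ghcb_model *cpu, struct ghcb_state_model *st)
{
	if (st->restore_on_put) {
		/* Give the interrupted handler its contents back. */
		memcpy(&cpu->shared, &cpu->backup, sizeof(cpu->shared));
		cpu->backup_in_use = false;
	} else {
		cpu->shared_in_use = false;
	}
}
```

The assert() marks the weak spot raised in the review: a second level of
nesting would overwrite the single backup, which is why more than one
pre-allocated backup (or a small per-cpu list) may be needed.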

Regards,

	Joerg

* Re: [PATCH 41/70] x86/sev-es: Add Runtime #VC Exception Handler
  2020-03-19 15:44   ` Andy Lutomirski
@ 2020-03-19 16:24     ` Joerg Roedel
  2020-03-19 18:43       ` Andy Lutomirski
  0 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19 16:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, LKML, kvm list, Linux Virtualization,
	Joerg Roedel

On Thu, Mar 19, 2020 at 08:44:03AM -0700, Andy Lutomirski wrote:
> On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel <joro@8bytes.org> wrote:
> >
> > From: Tom Lendacky <thomas.lendacky@amd.com>
> >
> > Add the handler for #VC exceptions invoked at runtime.
> 
> If I read this correctly, this does not use IST.  If that's true, I
> don't see how this can possibly work.  There are at least two nasty cases
> that come to mind:
> 
> 1. SYSCALL followed by NMI.  The NMI IRET hack gets to #VC and we
> explode.  This is fixable by getting rid of the NMI EFLAGS.TF hack.

Not an issue in this patch-set, the confusion comes from the fact that I
left some parts of the single-step-over-iret code in the patch. But it
is not used. The NMI handling in this patch-set sends the NMI-complete
message before the IRET, when the kernel is still in a safe environment
(kernel stack, kernel cr3).

> 2. tools/testing/selftests/x86/mov_ss_trap_64.  User code does MOV
> (addr), SS; SYSCALL, where addr has a data breakpoint.  We get #DB
> promoted to #VC with no stack.

Also not an issue, as debugging is not supported at the moment in SEV-ES
guests (hardware has no way yet to save/restore the debug registers
across #VMEXITs). But this will change with future hardware. If you look
at the implementation for dr7 read/write events, you see that the dr7
value is cached and returned, but does not make it to the hardware dr7.
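
The dr7 behaviour described above can be pictured with a small
user-space model in C. This is a sketch, not the code from the patch-set:
the names dr7_model, model_dr7_write and model_dr7_read are hypothetical,
and the reset value of 0x400 is an assumption standing in for
DR7_RESET_VALUE.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for the kernel's DR7_RESET_VALUE. */
#define MODEL_DR7_RESET_VALUE 0x400UL

struct dr7_model {
	unsigned long cached_dr7;	/* what the guest kernel reads back */
	unsigned long hw_dr7;		/* what the hardware register holds */
};

void model_dr7_reset(struct dr7_model *m)
{
	assert(m != NULL);
	m->cached_dr7 = MODEL_DR7_RESET_VALUE;
	m->hw_dr7 = MODEL_DR7_RESET_VALUE;
}

/* A dr7 write event only updates the cache; hw_dr7 is left untouched. */
void model_dr7_write(struct dr7_model *m, unsigned long val)
{
	assert(m != NULL);
	m->cached_dr7 = val;
}

/* A dr7 read event is answered from the cache. */
unsigned long model_dr7_read(const struct dr7_model *m)
{
	assert(m != NULL);
	return m->cached_dr7;
}
```

In this model software sees its own writes reflected on reads, while no
breakpoint is ever armed, which matches the statement that debugging is
not supported in SEV-ES guests at the moment.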

I thought about using IST for the #VC handler, but the implications for
nesting #VC handlers made me decide against it. But for future hardware
that supports debugging inside SEV-ES guests it will be an issue. I'll
think about how to fix the problem, it probably has to be IST :(

Regards,

	Joerg

* Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking
  2020-03-19  9:14 ` [PATCH 70/70] x86/sev-es: Add NMI state tracking Joerg Roedel
  2020-03-19 15:35   ` Andy Lutomirski
@ 2020-03-19 16:53   ` Mika Penttilä
  2020-03-19 19:41     ` Joerg Roedel
  1 sibling, 1 reply; 243+ messages in thread
From: Mika Penttilä @ 2020-03-19 16:53 UTC (permalink / raw)
  To: Joerg Roedel, x86
  Cc: hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel


Hi!

On 19.3.2020 11.14, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> Keep NMI state in SEV-ES code so the kernel can re-enable NMIs for the
> vCPU when it reaches IRET.
>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>   arch/x86/entry/entry_64.S       | 48 +++++++++++++++++++++++++++++++++
>   arch/x86/include/asm/sev-es.h   | 27 +++++++++++++++++++
>   arch/x86/include/uapi/asm/svm.h |  1 +
>   arch/x86/kernel/nmi.c           |  8 ++++++
>   arch/x86/kernel/sev-es.c        | 31 ++++++++++++++++++++-
>   5 files changed, 114 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 729876d368c5..355470b36896 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -38,6 +38,7 @@
>   #include <asm/export.h>
>   #include <asm/frame.h>
>   #include <asm/nospec-branch.h>
> +#include <asm/sev-es.h>
>   #include <linux/err.h>
>   
>   #include "calling.h"
> @@ -629,6 +630,13 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
>   	ud2
>   1:
>   #endif
> +
> +	/*
> +	 * This code path is used by the NMI handler, so check if NMIs
> +	 * need to be re-enabled when running as an SEV-ES guest.
> +	 */
> +	SEV_ES_IRET_CHECK
> +
>   	POP_REGS pop_rdi=0
>   
>   	/*
> @@ -1474,6 +1482,8 @@ SYM_CODE_START(nmi)
>   	movq	$-1, %rsi
>   	call	do_nmi
>   
> +	SEV_ES_NMI_COMPLETE
> +
>   	/*
>   	 * Return back to user mode.  We must *not* do the normal exit
>   	 * work, because we don't want to enable interrupts.
> @@ -1599,6 +1609,7 @@ nested_nmi_out:
>   	popq	%rdx
>   
>   	/* We are returning to kernel mode, so this cannot result in a fault. */
> +	SEV_ES_NMI_COMPLETE
>   	iretq
>   
>   first_nmi:
> @@ -1687,6 +1698,12 @@ end_repeat_nmi:
>   	movq	$-1, %rsi
>   	call	do_nmi
>   
> +	/*
> +	 * When running as an SEV-ES guest, jump to the SEV-ES NMI IRET
> +	 * path.
> +	 */
> +	SEV_ES_NMI_COMPLETE
> +
>   	/* Always restore stashed CR3 value (see paranoid_entry) */
>   	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
>   
> @@ -1715,6 +1732,9 @@ nmi_restore:
>   	std
>   	movq	$0, 5*8(%rsp)		/* clear "NMI executing" */
>   
> +nmi_return:
> +	UNWIND_HINT_IRET_REGS
> +
>   	/*
>   	 * iretq reads the "iret" frame and exits the NMI stack in a
>   	 * single instruction.  We are returning to kernel mode, so this
> @@ -1724,6 +1744,34 @@ nmi_restore:
>   	iretq
>   SYM_CODE_END(nmi)
>   
> +#ifdef CONFIG_AMD_MEM_ENCRYPT

> +SYM_CODE_START(sev_es_iret_user)


What makes kernel jump here? Can't see this referenced from anywhere?


> +	UNWIND_HINT_IRET_REGS offset=8
> +	/*
> +	 * The kernel jumps here directly from
> +	 * swapgs_restore_regs_and_return_to_usermode. %rsp points already to
> +	 * trampoline stack, but %cr3 is still from kernel. User-regs are live
> +	 * except %rdi. Switch to user CR3, restore user %rdi and user gs_base
> +	 * and single-step over IRET
> +	 */
> +	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
> +	popq	%rdi
> +	SWAPGS
> +	/*
> +	 * Enable single-stepping and execute IRET. When IRET is
> +	 * finished the resulting #DB exception will cause a #VC
> +	 * exception to be raised. The #VC exception handler will send a
> +	 * NMI-complete message to the hypervisor to re-open the NMI
> +	 * window.
> +	 */
> +sev_es_iret_kernel:
> +	pushf
> +	btsq $X86_EFLAGS_TF_BIT, (%rsp)
> +	popf
> +	iretq
> +SYM_CODE_END(sev_es_iret_user)
> +#endif
> +
>   #ifndef CONFIG_IA32_EMULATION
>   /*
>    * This handles SYSCALL from 32-bit code.  There is no way to program
> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
> index 63acf50e6280..d866adb3e6d4 100644
> --- a/arch/x86/include/asm/sev-es.h
> +++ b/arch/x86/include/asm/sev-es.h
> @@ -8,6 +8,8 @@
>   #ifndef __ASM_ENCRYPTED_STATE_H
>   #define __ASM_ENCRYPTED_STATE_H
>   
> +#ifndef __ASSEMBLY__
> +
>   #include <linux/types.h>
>   #include <asm/insn.h>
>   
> @@ -82,11 +84,36 @@ struct real_mode_header;
>   
>   #ifdef CONFIG_AMD_MEM_ENCRYPT
>   int sev_es_setup_ap_jump_table(struct real_mode_header *rmh);
> +void sev_es_nmi_enter(void);
>   #else /* CONFIG_AMD_MEM_ENCRYPT */
>   static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
>   {
>   	return 0;
>   }
> +static inline void sev_es_nmi_enter(void) { }
> +#endif /* CONFIG_AMD_MEM_ENCRYPT*/
> +
> +#else /* !__ASSEMBLY__ */
> +
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +#define SEV_ES_NMI_COMPLETE		\
> +	ALTERNATIVE	"", "callq sev_es_nmi_complete", X86_FEATURE_SEV_ES_GUEST
> +
> +.macro	SEV_ES_IRET_CHECK
> +	ALTERNATIVE	"jmp	.Lend_\@", "", X86_FEATURE_SEV_ES_GUEST
> +	movq	PER_CPU_VAR(sev_es_in_nmi), %rdi
> +	testq	%rdi, %rdi
> +	jz	.Lend_\@
> +	callq	sev_es_nmi_complete
> +.Lend_\@:
> +.endm
> +
> +#else  /* CONFIG_AMD_MEM_ENCRYPT */
> +#define	SEV_ES_NMI_COMPLETE
> +.macro	SEV_ES_IRET_CHECK
> +.endm
>   #endif /* CONFIG_AMD_MEM_ENCRYPT*/
>   
> +#endif /* __ASSEMBLY__ */
> +
>   #endif
> diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
> index 20a05839dd9a..0f837339db66 100644
> --- a/arch/x86/include/uapi/asm/svm.h
> +++ b/arch/x86/include/uapi/asm/svm.h
> @@ -84,6 +84,7 @@
>   /* SEV-ES software-defined VMGEXIT events */
>   #define SVM_VMGEXIT_MMIO_READ			0x80000001
>   #define SVM_VMGEXIT_MMIO_WRITE			0x80000002
> +#define SVM_VMGEXIT_NMI_COMPLETE		0x80000003
>   #define SVM_VMGEXIT_AP_HLT_LOOP			0x80000004
>   #define SVM_VMGEXIT_AP_JUMP_TABLE		0x80000005
>   #define		SVM_VMGEXIT_SET_AP_JUMP_TABLE			0
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index 54c21d6abd5a..7312a6d4d50f 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -37,6 +37,7 @@
>   #include <asm/reboot.h>
>   #include <asm/cache.h>
>   #include <asm/nospec-branch.h>
> +#include <asm/sev-es.h>
>   
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/nmi.h>
> @@ -510,6 +511,13 @@ NOKPROBE_SYMBOL(is_debug_stack);
>   dotraplinkage notrace void
>   do_nmi(struct pt_regs *regs, long error_code)
>   {
> +	/*
> +	 * For SEV-ES the kernel needs to track whether NMIs are blocked until
> +	 * IRET is reached, even when the CPU is offline.
> +	 */
> +	if (sev_es_active())
> +		sev_es_nmi_enter();
> +
>   	if (IS_ENABLED(CONFIG_SMP) && cpu_is_offline(smp_processor_id()))
>   		return;
>   
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index 3c22f256645e..409a7a2aa630 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -37,6 +37,7 @@ struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>    */
>   struct ghcb __initdata *boot_ghcb;
>   static DEFINE_PER_CPU(unsigned long, cached_dr7) = DR7_RESET_VALUE;
> +DEFINE_PER_CPU(bool, sev_es_in_nmi) = false;
>   
>   struct ghcb_state {
>   	struct ghcb *ghcb;
> @@ -270,6 +271,31 @@ static phys_addr_t vc_slow_virt_to_phys(struct ghcb *ghcb, long vaddr)
>   /* Include code shared with pre-decompression boot stage */
>   #include "sev-es-shared.c"
>   
> +void sev_es_nmi_enter(void)
> +{
> +	this_cpu_write(sev_es_in_nmi, true);
> +}
> +
> +void sev_es_nmi_complete(void)
> +{
> +	struct ghcb_state state;
> +	struct ghcb *ghcb;
> +
> +	ghcb = sev_es_get_ghcb(&state);
> +
> +	vc_ghcb_invalidate(ghcb);
> +	ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_NMI_COMPLETE);
> +	ghcb_set_sw_exit_info_1(ghcb, 0);
> +	ghcb_set_sw_exit_info_2(ghcb, 0);
> +
> +	sev_es_wr_ghcb_msr(__pa(ghcb));
> +	VMGEXIT();
> +
> +	sev_es_put_ghcb(&state);
> +
> +	this_cpu_write(sev_es_in_nmi, false);
> +}
> +
>   static u64 sev_es_get_jump_table_addr(void)
>   {
>   	struct ghcb_state state;
> @@ -891,7 +917,10 @@ static enum es_result vc_handle_vmmcall(struct ghcb *ghcb,
>   static enum es_result vc_handle_db_exception(struct ghcb *ghcb,
>   					     struct es_em_ctxt *ctxt)
>   {
> -	do_debug(ctxt->regs, 0);
> +	if (this_cpu_read(sev_es_in_nmi))
> +		sev_es_nmi_complete();
> +	else
> +		do_debug(ctxt->regs, 0);
>   
>   	/* Exception event, do not advance RIP */
>   	return ES_RETRY;


* Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking
  2020-03-19 16:07     ` Joerg Roedel
@ 2020-03-19 18:40       ` Andy Lutomirski
  2020-03-19 19:26         ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: Andy Lutomirski @ 2020-03-19 18:40 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, X86 ML, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, LKML, kvm list,
	Linux Virtualization, Joerg Roedel

On Thu, Mar 19, 2020 at 9:07 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> Hi Andy,
>
> On Thu, Mar 19, 2020 at 08:35:59AM -0700, Andy Lutomirski wrote:
> > On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel <joro@8bytes.org> wrote:
> > >
> > > From: Joerg Roedel <jroedel@suse.de>
> > >
> > > Keep NMI state in SEV-ES code so the kernel can re-enable NMIs for the
> > > vCPU when it reaches IRET.
> >
> > IIRC I suggested just re-enabling NMI in C from do_nmi().  What was
> > wrong with that approach?
>
> If I understand the code correctly a nested NMI will just reset the
> interrupted NMI handler to start executing again at 'restart_nmi'.
> The interrupted NMI handler could be in the #VC handler, and it is not
> safe to just jump back to the start of the NMI handler from somewhere
> within the #VC handler.

Nope.  A nested NMI will reset the interrupted NMI's return frame to
cause it to run again when it's done.  I don't think this will have
any real interaction with #VC.  There's no longjmp() here.
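
A heavily simplified user-space model of that behaviour, written in C,
might look like the sketch below. It is an illustration only: the real
logic lives in assembly in entry_64.S and works on IRET frames on the
NMI stack, whereas here a plain "repeat" flag stands in for the
rewritten return frame. All names (nmi_model, model_nmi_entry,
model_do_nmi, inject_nested) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct nmi_model {
	bool executing;		/* models the "NMI executing" stack variable */
	bool repeat;		/* models the rewritten IRET return frame */
	int handler_runs;	/* how often the C handler body ran */
	int inject_nested;	/* NMIs to deliver while the handler runs */
};

void model_nmi_entry(struct nmi_model *m);

/* Stand-in for do_nmi(): NMIs are assumed to be unmasked in here. */
void model_do_nmi(struct nmi_model *m)
{
	assert(m->executing);
	m->handler_runs++;

	/* Simulate NMIs arriving while the handler is still running. */
	while (m->inject_nested > 0) {
		m->inject_nested--;
		model_nmi_entry(m);
	}
}

void model_nmi_entry(struct nmi_model *m)
{
	if (m->executing) {
		/*
		 * Nested NMI: do not run the handler now and do not jump
		 * anywhere. Only mark the outer NMI so that it runs the
		 * handler again once it is done.
		 */
		m->repeat = true;
		return;
	}

	m->executing = true;
	do {
		m->repeat = false;
		model_do_nmi(m);
	} while (m->repeat);
	m->executing = false;
}
```

Note that the interrupted handler always continues where it was
interrupted and finishes normally; the nested NMI only causes one more
pass afterwards, and several nested NMIs collapse into a single repeat.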

>
> So I decided to not allow NMI nesting for SEV-ES and only re-enable the
> NMI window when the first NMI returns. This is not implemented in this
> patch, but I will do that once Thomas' entry-code rewrite is upstream.
>

I certainly *like* preventing nesting, but I don't think we really
want a whole alternate NMI path just for a couple of messed-up AMD
generations.  And the TF trick is not so pretty either.

> > This causes us to pop the NMI frame off the stack.  Assuming the NMI
> > restart logic is invoked (which is maybe impossible?), we get #DB,
> > which presumably is actually delivered.  And we end up on the #DB
> > stack, which might already have been in use, so we have a potential
> > increase in nesting.  Also, #DB may be called from an unexpected
> > context.
>
> An SEV-ES hypervisor is required to intercept #DB, which means that the
> #DB exception actually ends up being a #VC exception. So it will not end
> up on the #DB stack.

With your patch set, #DB doesn't seem to end up on the #DB stack either.

>
> > I think there are two credible ways to approach this:
> >
> > 1. Just put the NMI unmask in do_nmi().  The kernel *already* knows
> > how to handle running do_nmi() with NMIs unmasked.  This is much, much
> > simpler than your code.
>
> Right, and I thought about that, but the implication is that the
> complexity is moved somewhere else, namely into the #VC handler, which
> then has to be restartable.

As above, I don't think there's an actual problem here.

>
> > 2. Have an entirely separate NMI path for the
> > SEV-ES-on-misdesigned-CPU case.  And have very clear documentation for
> > what prevents this code from being executed on future CPUs (Zen3?)
> > that have this issue fixed for real?
>
> That sounds like a good alternative, I will investigate this approach.
> The NMI handler should be much simpler as it doesn't need to allow NMI
> nesting. The question is, does the C code down the NMI path depend on
> > the NMI handler's stack frame layout (e.g. the in-nmi flag)?

Nope.  In particular, the 32-bit path doesn't have all this.

* Re: [PATCH 41/70] x86/sev-es: Add Runtime #VC Exception Handler
  2020-03-19 16:24     ` Joerg Roedel
@ 2020-03-19 18:43       ` Andy Lutomirski
  2020-03-19 19:38         ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: Andy Lutomirski @ 2020-03-19 18:43 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, X86 ML, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, LKML, kvm list,
	Linux Virtualization, Joerg Roedel

On Thu, Mar 19, 2020 at 9:24 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> On Thu, Mar 19, 2020 at 08:44:03AM -0700, Andy Lutomirski wrote:
> > On Thu, Mar 19, 2020 at 2:14 AM Joerg Roedel <joro@8bytes.org> wrote:
> > >
> > > From: Tom Lendacky <thomas.lendacky@amd.com>
> > >
> > > Add the handler for #VC exceptions invoked at runtime.
> >
> > If I read this correctly, this does not use IST.  If that's true, I
> > don't see how this can possibly work.  There are at least two nasty cases
> > that come to mind:
> >
> > 1. SYSCALL followed by NMI.  The NMI IRET hack gets to #VC and we
> > explode.  This is fixable by getting rid of the NMI EFLAGS.TF hack.
>
> Not an issue in this patch-set, the confusion comes from the fact that I
> left some parts of the single-step-over-iret code in the patch. But it
> is not used. The NMI handling in this patch-set sends the NMI-complete
> message before the IRET, when the kernel is still in a safe environment
> (kernel stack, kernel cr3).

Got it!

>
> > 2. tools/testing/selftests/x86/mov_ss_trap_64.  User code does MOV
> > (addr), SS; SYSCALL, where addr has a data breakpoint.  We get #DB
> > promoted to #VC with no stack.
>
> Also not an issue, as debugging is not supported at the moment in SEV-ES
> guests (hardware has no way yet to save/restore the debug registers
> across #VMEXITs). But this will change with future hardware. If you look
> at the implementation for dr7 read/write events, you see that the dr7
> value is cached and returned, but does not make it to the hardware dr7.

Eek.  This would probably benefit from some ptrace / perf logic to
prevent the kernel or userspace from thinking that debugging works.

I guess this means that #DB only happens due to TF or INT01.  I
suppose this is probably okay.

>
> I thought about using IST for the #VC handler, but the implications for
> nesting #VC handlers made me decide against it. But for future hardware
> that supports debugging inside SEV-ES guests it will be an issue. I'll
> think about how to fix the problem, it probably has to be IST :(

Or future generations could have enough hardware support for debugging
that #DB doesn't need to be intercepted or can be re-injected
correctly with the #DB vector.

>
> Regards,
>
>         Joerg

* Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking
  2020-03-19 18:40       ` Andy Lutomirski
@ 2020-03-19 19:26         ` Joerg Roedel
  2020-03-19 21:27           ` Andy Lutomirski
  0 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19 19:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, X86 ML, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, LKML, kvm list,
	Linux Virtualization

On Thu, Mar 19, 2020 at 11:40:39AM -0700, Andy Lutomirski wrote:
 
> Nope.  A nested NMI will reset the interrupted NMI's return frame to
> cause it to run again when it's done.  I don't think this will have
> any real interaction with #VC.  There's no longjmp() here.

Ahh, so I misunderstood that part, in this case your proposal of sending
the NMI-complete message right at the beginning of do_nmi() should work
just fine. I will test this and see how it works out.

> I certainly *like* preventing nesting, but I don't think we really
> want a whole alternate NMI path just for a couple of messed-up AMD
> generations.  And the TF trick is not so pretty either.

Indeed, if it could be avoided, it should.

> 
> > > This causes us to pop the NMI frame off the stack.  Assuming the NMI
> > > restart logic is invoked (which is maybe impossible?), we get #DB,
> > > which presumably is actually delivered.  And we end up on the #DB
> > > stack, which might already have been in use, so we have a potential
> > > increase in nesting.  Also, #DB may be called from an unexpected
> > > context.
> >
> > An SEV-ES hypervisor is required to intercept #DB, which means that the
> > #DB exception actually ends up being a #VC exception. So it will not end
> > up on the #DB stack.
> 
> With your patch set, #DB doesn't seem to end up on the #DB stack either.

Right, it does not use the #DB stack or shift-ist stuff. Maybe it
should. Is this needed for anything other than making entry code
debuggable by kgdb?

Regards,

	Joerg

* Re: [PATCH 41/70] x86/sev-es: Add Runtime #VC Exception Handler
  2020-03-19 18:43       ` Andy Lutomirski
@ 2020-03-19 19:38         ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19 19:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, X86 ML, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, LKML, kvm list,
	Linux Virtualization

On Thu, Mar 19, 2020 at 11:43:20AM -0700, Andy Lutomirski wrote:
> Or future generations could have enough hardware support for debugging
> that #DB doesn't need to be intercepted or can be re-injected
> correctly with the #DB vector.

Yeah, the problem is, the GHCB spec suggests the single-step-over-iret
way to re-enable the NMI window and requires intercepting #DB for it. So
the hypervisor probably still has to intercept it, even when debug
support is added some day. I need to think more about this.

Regards,

	Joerg

* Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking
  2020-03-19 16:53   ` [PATCH 70/70] x86/sev-es: Add NMI state tracking Mika Penttilä
@ 2020-03-19 19:41     ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-19 19:41 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Joerg Roedel, x86, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization

On Thu, Mar 19, 2020 at 06:53:29PM +0200, Mika Penttilä wrote:
> > +SYM_CODE_START(sev_es_iret_user)
> 
> 
> What makes kernel jump here? Can't see this referenced from anywhere?

Sorry, it is just a left-over from a previous version of this patch
(which implemented the single-step-over-iret). This label is not used
anymore. The jump to it was in
swapgs_restore_regs_and_return_to_usermode, after checking the
sev_es_in_nmi flag.

Regards,

	Joerg


* Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking
  2020-03-19 19:26         ` Joerg Roedel
@ 2020-03-19 21:27           ` Andy Lutomirski
  2020-03-20 19:48             ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: Andy Lutomirski @ 2020-03-19 21:27 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, X86 ML, H. Peter Anvin,
	Dave Hansen, Peter Zijlstra, Thomas Hellstrom, Jiri Slaby,
	Dan Williams, Tom Lendacky, Juergen Gross, Kees Cook, LKML,
	kvm list, Linux Virtualization

On Thu, Mar 19, 2020 at 12:26 PM Joerg Roedel <jroedel@suse.de> wrote:
>
> On Thu, Mar 19, 2020 at 11:40:39AM -0700, Andy Lutomirski wrote:
>
> > Nope.  A nested NMI will reset the interrupted NMI's return frame to
> > cause it to run again when it's done.  I don't think this will have
> > any real interaction with #VC.  There's no longjmp() here.
>
> Ahh, so I misunderstood that part, in this case your proposal of sending
> the NMI-complete message right at the beginning of do_nmi() should work
> just fine. I will test this and see how it works out.
>
> > I certainly *like* preventing nesting, but I don't think we really
> > want a whole alternate NMI path just for a couple of messed-up AMD
> > generations.  And the TF trick is not so pretty either.
>
> Indeed, if it could be avoided, it should.
>
> >
> > > > This causes us to pop the NMI frame off the stack.  Assuming the NMI
> > > > restart logic is invoked (which is maybe impossible?), we get #DB,
> > > > which presumably is actually delivered.  And we end up on the #DB
> > > > stack, which might already have been in use, so we have a potential
> > > > increase in nesting.  Also, #DB may be called from an unexpected
> > > > context.
> > >
> > > An SEV-ES hypervisor is required to intercept #DB, which means that the
> > > #DB exception actually ends up being a #VC exception. So it will not end
> > > up on the #DB stack.
> >
> > With your patch set, #DB doesn't seem to end up on the #DB stack either.
>
> Right, it does not use the #DB stack or shift-ist stuff. Maybe it
> should. Is this needed for anything other than making entry code
> debuggable by kgdb?

AIUI the shift-ist stuff is because we aren't very good about the way
that we handle tracing right now, and that can cause a limited degree
of recursion.  #DB uses IST for historical reasons that don't
necessarily make sense.  Right now, we need it for only one reason:
the MOV SS issue.  IIRC this isn't actually triggerable without
debugging enabled -- MOV SS with no breakpoint but TF on doesn't seem
to malfunction quite as badly.

--Andy

>
> Regards,
>
>         Joerg

* [RFC PATCH v2.1] x86/sev-es: Handle NMI State
  2020-03-19 15:35   ` Andy Lutomirski
  2020-03-19 16:07     ` Joerg Roedel
@ 2020-03-20 13:17     ` Joerg Roedel
  2020-03-20 14:42       ` Dave Hansen
  1 sibling, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-20 13:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, LKML, kvm list, Linux Virtualization,
	Joerg Roedel

On Thu, Mar 19, 2020 at 08:35:59AM -0700, Andy Lutomirski wrote:
> 1. Just put the NMI unmask in do_nmi().  The kernel *already* knows
> how to handle running do_nmi() with NMIs unmasked.  This is much, much
> simpler than your code.

Okay, attached is the updated patch which implements this approach. I
tested it in an SEV-ES guest with 'perf top' running for a little more
than 30 minutes and all looked good. I also removed the dead code from
the patch.


From ec3b021c5d9130fd66e00d823c4fabc675c4b49e Mon Sep 17 00:00:00 2001
From: Joerg Roedel <jroedel@suse.de>
Date: Tue, 28 Jan 2020 17:31:05 +0100
Subject: [PATCH] x86/sev-es: Handle NMI State

When running under SEV-ES the kernel has to tell the hypervisor when to
open the NMI window again after an NMI was injected. This is done with
an NMI-complete message to the hypervisor.

Add code to the kernel's NMI handler to send this message right at the
beginning of do_nmi(). This always allows nesting NMIs.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/sev-es.h   |  2 ++
 arch/x86/include/uapi/asm/svm.h |  1 +
 arch/x86/kernel/nmi.c           |  8 ++++++++
 arch/x86/kernel/sev-es.c        | 18 ++++++++++++++++++
 4 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
index 63acf50e6280..441ec1ba2cc7 100644
--- a/arch/x86/include/asm/sev-es.h
+++ b/arch/x86/include/asm/sev-es.h
@@ -82,11 +82,13 @@ struct real_mode_header;
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 int sev_es_setup_ap_jump_table(struct real_mode_header *rmh);
+void sev_es_nmi_complete(void);
 #else /* CONFIG_AMD_MEM_ENCRYPT */
 static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
 {
 	return 0;
 }
+static inline void sev_es_nmi_complete(void) { }
 #endif /* CONFIG_AMD_MEM_ENCRYPT*/
 
 #endif
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 20a05839dd9a..0f837339db66 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -84,6 +84,7 @@
 /* SEV-ES software-defined VMGEXIT events */
 #define SVM_VMGEXIT_MMIO_READ			0x80000001
 #define SVM_VMGEXIT_MMIO_WRITE			0x80000002
+#define SVM_VMGEXIT_NMI_COMPLETE		0x80000003
 #define SVM_VMGEXIT_AP_HLT_LOOP			0x80000004
 #define SVM_VMGEXIT_AP_JUMP_TABLE		0x80000005
 #define		SVM_VMGEXIT_SET_AP_JUMP_TABLE			0
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 54c21d6abd5a..fc872a7e0ed1 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -37,6 +37,7 @@
 #include <asm/reboot.h>
 #include <asm/cache.h>
 #include <asm/nospec-branch.h>
+#include <asm/sev-es.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/nmi.h>
@@ -510,6 +511,13 @@ NOKPROBE_SYMBOL(is_debug_stack);
 dotraplinkage notrace void
 do_nmi(struct pt_regs *regs, long error_code)
 {
+	/*
+	 * Re-enable NMIs right here when running as an SEV-ES guest. This might
+	 * cause nested NMIs, but those can be handled safely.
+	 */
+	if (sev_es_active())
+		sev_es_nmi_complete();
+
 	if (IS_ENABLED(CONFIG_SMP) && cpu_is_offline(smp_processor_id()))
 		return;
 
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 3c22f256645e..a7e2739771e7 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -270,6 +270,24 @@ static phys_addr_t vc_slow_virt_to_phys(struct ghcb *ghcb, long vaddr)
 /* Include code shared with pre-decompression boot stage */
 #include "sev-es-shared.c"
 
+void sev_es_nmi_complete(void)
+{
+	struct ghcb_state state;
+	struct ghcb *ghcb;
+
+	ghcb = sev_es_get_ghcb(&state);
+
+	vc_ghcb_invalidate(ghcb);
+	ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_NMI_COMPLETE);
+	ghcb_set_sw_exit_info_1(ghcb, 0);
+	ghcb_set_sw_exit_info_2(ghcb, 0);
+
+	sev_es_wr_ghcb_msr(__pa(ghcb));
+	VMGEXIT();
+
+	sev_es_put_ghcb(&state);
+}
+
 static u64 sev_es_get_jump_table_addr(void)
 {
 	struct ghcb_state state;
-- 
2.16.4


* Re: [RFC PATCH v2.1] x86/sev-es: Handle NMI State
  2020-03-20 13:17     ` [RFC PATCH v2.1] x86/sev-es: Handle NMI State Joerg Roedel
@ 2020-03-20 14:42       ` Dave Hansen
  2020-03-20 19:42         ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: Dave Hansen @ 2020-03-20 14:42 UTC (permalink / raw)
  To: Joerg Roedel, Andy Lutomirski
  Cc: X86 ML, H. Peter Anvin, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, LKML, kvm list, Linux Virtualization,
	Joerg Roedel

On 3/20/20 6:17 AM, Joerg Roedel wrote:
> On Thu, Mar 19, 2020 at 08:35:59AM -0700, Andy Lutomirski wrote:
>> 1. Just put the NMI unmask in do_nmi().  The kernel *already* knows
>> how to handle running do_nmi() with NMIs unmasked.  This is much, much
>> simpler than your code.
> Okay, attached is the updated patch which implements this approach. I
> tested it in an SEV-ES guest with 'perf top' running for a little more
> than 30 minutes and all looked good. I also removed the dead code from
> the patch.

FWIW, perf plus the x86 selftests run in a big loop was my best way of
stressing the NMI path when we mucked with it for PTI.  The selftests
make sure to hit some of the more rare entry/exit paths.

* Re: [RFC PATCH v2.1] x86/sev-es: Handle NMI State
  2020-03-20 14:42       ` Dave Hansen
@ 2020-03-20 19:42         ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-20 19:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, LKML, kvm list,
	Linux Virtualization, Joerg Roedel

On Fri, Mar 20, 2020 at 07:42:09AM -0700, Dave Hansen wrote:
> FWIW, perf plus the x86 selftests run in a big loop was my best way of
> stressing the NMI path when we mucked with it for PTI.  The selftests
> make sure to hit some of the more rare entry/exit paths.

Yeah, I ran the x86 selftests in an SEV-ES guest on top of these
patches, and that works. But doing this together with 'perf top' is also
on the list of tests to do.

Regards,

	Joerg

* Re: [PATCH 70/70] x86/sev-es: Add NMI state tracking
  2020-03-19 21:27           ` Andy Lutomirski
@ 2020-03-20 19:48             ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-20 19:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, X86 ML, H. Peter Anvin, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, LKML, kvm list,
	Linux Virtualization

On Thu, Mar 19, 2020 at 02:27:49PM -0700, Andy Lutomirski wrote:
> AIUI the shift-ist stuff is because we aren't very good about the way
> that we handle tracing right now, and that can cause a limited degree
> of recursion.  #DB uses IST for historical reasons that don't
> necessarily make sense.  Right now, we need it for only one reason:
> the MOV SS issue.  IIRC this isn't actually triggerable without
> debugging enabled -- MOV SS with no breakpoint but TF on doesn't seem
> to malfunction quite as badly.

I had a look at the shift_ist stuff today and it looks like a good
solution to the #VC nesting problem when it is turned into a #VC
handler. The devil is in the details, of course, as 3 or 4 stacks for
the #VC handler (per cpu) should only be allocated when actually running
in an SEV-ES guest. Let's see how this works out in practice.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 21/70] x86/boot/compressed/64: Add function to map a page unencrypted
  2020-03-19  9:13   ` Joerg Roedel
  (?)
@ 2020-03-20 20:53   ` David Rientjes
  2020-03-20 21:02     ` Dave Hansen
  -1 siblings, 1 reply; 243+ messages in thread
From: David Rientjes @ 2020-03-20 20:53 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, 19 Mar 2020, Joerg Roedel wrote:

> From: Joerg Roedel <jroedel@suse.de>
> 
> This function is needed to map the GHCB for SEV-ES guests. The GHCB is
> used for communication with the hypervisor, so its content must not be
> encrypted.
> 
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/boot/compressed/ident_map_64.c | 125 ++++++++++++++++++++++++
>  arch/x86/boot/compressed/misc.h         |   1 +
>  2 files changed, 126 insertions(+)
> 
> diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
> index feb180cced28..04a5ff4bda66 100644
> --- a/arch/x86/boot/compressed/ident_map_64.c
> +++ b/arch/x86/boot/compressed/ident_map_64.c
> @@ -26,6 +26,7 @@
>  #include <asm/init.h>
>  #include <asm/pgtable.h>
>  #include <asm/trap_defs.h>
> +#include <asm/cmpxchg.h>
>  /* Use the static base for this part of the boot process */
>  #undef __PAGE_OFFSET
>  #define __PAGE_OFFSET __PAGE_OFFSET_BASE
> @@ -157,6 +158,130 @@ void initialize_identity_maps(void)
>  	write_cr3(top_level_pgt);
>  }
>  
> +static pte_t *split_large_pmd(struct x86_mapping_info *info,
> +			      pmd_t *pmdp, unsigned long __address)
> +{
> +	unsigned long page_flags;
> +	unsigned long address;
> +	pte_t *pte;
> +	pmd_t pmd;
> +	int i;
> +
> +	pte = (pte_t *)info->alloc_pgt_page(info->context);
> +	if (!pte)
> +		return NULL;
> +
> +	address     = __address & PMD_MASK;
> +	/* No large page - clear PSE flag */
> +	page_flags  = info->page_flag & ~_PAGE_PSE;
> +
> +	/* Populate the PTEs */
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		set_pte(&pte[i], __pte(address | page_flags));
> +		address += PAGE_SIZE;
> +	}
> +
> +	/*
> +	 * Ideally we need to clear the large PMD first and do a TLB
> +	 * flush before we write the new PMD. But the 2M range of the
> +	 * PMD might contain the code we execute and/or the stack
> +	 * we are on, so we can't do that. But that should be safe here
> +	 * because we are going from large to small mappings and we are
> +	 * also the only user of the page-table, so there is no chance
> +	 * of a TLB multihit.
> +	 */
> +	pmd = __pmd((unsigned long)pte | info->kernpg_flag);
> +	set_pmd(pmdp, pmd);
> +	/* Flush TLB to establish the new PMD */
> +	write_cr3(top_level_pgt);
> +
> +	return pte + pte_index(__address);
> +}
> +
> +static void clflush_page(unsigned long address)
> +{
> +	unsigned int flush_size;
> +	char *cl, *start, *end;
> +
> +	/*
> +	 * Hardcode cl-size to 64 - CPUID can't be used here because that might
> +	 * cause another #VC exception and the GHCB is not ready to use yet.
> +	 */
> +	flush_size = 64;
> +	start      = (char *)(address & PAGE_MASK);
> +	end        = start + PAGE_SIZE;
> +
> +	/*
> +	 * First make sure there are no pending writes on the cache-lines to
> +	 * flush.
> +	 */
> +	asm volatile("mfence" : : : "memory");
> +
> +	for (cl = start; cl != end; cl += flush_size)
> +		clflush(cl);
> +}
> +
> +static int __set_page_decrypted(struct x86_mapping_info *info,
> +				unsigned long address)
> +{
> +	unsigned long scratch, *target;
> +	pgd_t *pgdp = (pgd_t *)top_level_pgt;
> +	p4d_t *p4dp;
> +	pud_t *pudp;
> +	pmd_t *pmdp;
> +	pte_t *ptep, pte;
> +
> +	/*
> +	 * First make sure there is a PMD mapping for 'address'.
> +	 * It should already exist, but keep things generic.
> +	 *
> +	 * To map the page just read from it and fault it in if there is no
> +	 * mapping yet. add_identity_map() can't be called here because that
> +	 * would unconditionally map the address on PMD level, destroying any
> +	 * PTE-level mappings that might already exist.  Also do something
> +	 * useless with 'scratch' so the access won't be optimized away.
> +	 */
> +	target = (unsigned long *)address;
> +	scratch = *target;
> +	arch_cmpxchg(target, scratch, scratch);
> +
> +	/*
> +	 * The page is mapped at least with PMD size - so skip checks and walk
> +	 * directly to the PMD.
> +	 */
> +	p4dp = p4d_offset(pgdp, address);
> +	pudp = pud_offset(p4dp, address);
> +	pmdp = pmd_offset(pudp, address);
> +
> +	if (pmd_large(*pmdp))
> +		ptep = split_large_pmd(info, pmdp, address);
> +	else
> +		ptep = pte_offset_kernel(pmdp, address);
> +
> +	if (!ptep)
> +		return -ENOMEM;
> +
> +	/* Clear encryption flag and write new pte */
> +	pte = pte_clear_flags(*ptep, _PAGE_ENC);
> +	set_pte(ptep, pte);
> +
> +	/* Flush TLB to map the page unencrypted */
> +	write_cr3(top_level_pgt);
> +

Is there a guarantee that this flushes the TLB if cr3 == top_level_pgt 
already, without an invlpg?

> +	/*
> +	 * Changing encryption attributes of a page requires to flush it from
> +	 * the caches.
> +	 */
> +	clflush_page(address);
> +
> +	return 0;
> +}
> +
> +int set_page_decrypted(unsigned long address)
> +{
> +	return __set_page_decrypted(&mapping_info, address);
> +}
> +
>  static void pf_error(unsigned long error_code, unsigned long address,
>  		     struct pt_regs *regs)
>  {
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 0e3508c5c15c..42f68a858a35 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -98,6 +98,7 @@ static inline void choose_random_location(unsigned long input,
>  #endif
>  
>  #ifdef CONFIG_X86_64
> +extern int set_page_decrypted(unsigned long address);
>  extern unsigned char _pgtable[];
>  #endif
>  

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 21/70] x86/boot/compressed/64: Add function to map a page unencrypted
  2020-03-20 20:53   ` David Rientjes
@ 2020-03-20 21:02     ` Dave Hansen
  2020-03-20 22:12       ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: Dave Hansen @ 2020-03-20 21:02 UTC (permalink / raw)
  To: David Rientjes, Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On 3/20/20 1:53 PM, David Rientjes wrote:
>> +
>> +	/* Clear encryption flag and write new pte */
>> +	pte = pte_clear_flags(*ptep, _PAGE_ENC);
>> +	set_pte(ptep, pte);
>> +
>> +	/* Flush TLB to map the page unencrypted */
>> +	write_cr3(top_level_pgt);
>> +
> Is there a guarantee that this flushes the TLB if cr3 == top_level_pgt 
> already, without an invlpg?

Ahh, good catch.

It *never* flushes global pages.  For a generic function like this, that
seems pretty dangerous because the PTEs it goes after could quite easily
be Global.  It's also not _obviously_ correct if PCIDs are in play
(which I don't think they are on AMD).

A flush_tlb_global() is probably more appropriate.  Better yet, is there
a reason not to use flush_tlb_kernel_range()?  I don't think it's
necessary to whack the entire TLB for one PTE set.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 23/70] x86/sev-es: Add support for handling IOIO exceptions
  2020-03-19  9:13   ` Joerg Roedel
  (?)
@ 2020-03-20 21:03   ` David Rientjes
  2020-03-20 22:24     ` Joerg Roedel
  -1 siblings, 1 reply; 243+ messages in thread
From: David Rientjes @ 2020-03-20 21:03 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, 19 Mar 2020, Joerg Roedel wrote:

> From: Tom Lendacky <thomas.lendacky@amd.com>
> 
> Add support for decoding and handling #VC exceptions for IOIO events.
> 
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> [ jroedel@suse.de: Adapted code to #VC handling framework ]
> Co-developed-by: Joerg Roedel <jroedel@suse.de>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/boot/compressed/sev-es.c |  32 +++++
>  arch/x86/kernel/sev-es-shared.c   | 202 ++++++++++++++++++++++++++++++
>  2 files changed, 234 insertions(+)
> 
> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
> index 193c970a3379..ae5fbd371fd9 100644
> --- a/arch/x86/boot/compressed/sev-es.c
> +++ b/arch/x86/boot/compressed/sev-es.c
> @@ -18,6 +18,35 @@
>  struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
>  struct ghcb *boot_ghcb;
>  
> +/*
> + * Copy a version of this function here - insn-eval.c can't be used in
> + * pre-decompression code.
> + */
> +static bool insn_rep_prefix(struct insn *insn)
> +{
> +	int i;
> +
> +	insn_get_prefixes(insn);
> +
> +	for (i = 0; i < insn->prefixes.nbytes; i++) {
> +		insn_byte_t p = insn->prefixes.bytes[i];
> +
> +		if (p == 0xf2 || p == 0xf3)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
> + * doesn't use segments.
> + */
> +static unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx)
> +{
> +	return 0UL;
> +}
> +
>  static inline u64 sev_es_rd_ghcb_msr(void)
>  {
>  	unsigned long low, high;
> @@ -117,6 +146,9 @@ void boot_vc_handler(struct pt_regs *regs, unsigned long exit_code)
>  		goto finish;
>  
>  	switch (exit_code) {
> +	case SVM_EXIT_IOIO:
> +		result = vc_handle_ioio(boot_ghcb, &ctxt);
> +		break;
>  	default:
>  		result = ES_UNSUPPORTED;
>  		break;
> diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
> index f0947ea3c601..46fc5318d1d7 100644
> --- a/arch/x86/kernel/sev-es-shared.c
> +++ b/arch/x86/kernel/sev-es-shared.c
> @@ -205,3 +205,205 @@ static enum es_result vc_insn_string_write(struct es_em_ctxt *ctxt,
>  
>  	return ret;
>  }
> +
> +#define IOIO_TYPE_STR  BIT(2)
> +#define IOIO_TYPE_IN   1
> +#define IOIO_TYPE_INS  (IOIO_TYPE_IN | IOIO_TYPE_STR)
> +#define IOIO_TYPE_OUT  0
> +#define IOIO_TYPE_OUTS (IOIO_TYPE_OUT | IOIO_TYPE_STR)
> +
> +#define IOIO_REP       BIT(3)
> +
> +#define IOIO_ADDR_64   BIT(9)
> +#define IOIO_ADDR_32   BIT(8)
> +#define IOIO_ADDR_16   BIT(7)
> +
> +#define IOIO_DATA_32   BIT(6)
> +#define IOIO_DATA_16   BIT(5)
> +#define IOIO_DATA_8    BIT(4)
> +
> +#define IOIO_SEG_ES    (0 << 10)
> +#define IOIO_SEG_DS    (3 << 10)
> +
> +static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 *exitinfo)
> +{
> +	struct insn *insn = &ctxt->insn;
> +	*exitinfo = 0;
> +
> +	switch (insn->opcode.bytes[0]) {
> +	/* INS opcodes */
> +	case 0x6c:
> +	case 0x6d:
> +		*exitinfo |= IOIO_TYPE_INS;
> +		*exitinfo |= IOIO_SEG_ES;
> +		*exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
> +		break;
> +
> +	/* OUTS opcodes */
> +	case 0x6e:
> +	case 0x6f:
> +		*exitinfo |= IOIO_TYPE_OUTS;
> +		*exitinfo |= IOIO_SEG_DS;
> +		*exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
> +		break;
> +
> +	/* IN immediate opcodes */
> +	case 0xe4:
> +	case 0xe5:
> +		*exitinfo |= IOIO_TYPE_IN;
> +		*exitinfo |= insn->immediate.value << 16;
> +		break;
> +
> +	/* OUT immediate opcodes */
> +	case 0xe6:
> +	case 0xe7:
> +		*exitinfo |= IOIO_TYPE_OUT;
> +		*exitinfo |= insn->immediate.value << 16;
> +		break;
> +
> +	/* IN register opcodes */
> +	case 0xec:
> +	case 0xed:
> +		*exitinfo |= IOIO_TYPE_IN;
> +		*exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
> +		break;
> +
> +	/* OUT register opcodes */
> +	case 0xee:
> +	case 0xef:
> +		*exitinfo |= IOIO_TYPE_OUT;
> +		*exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
> +		break;
> +
> +	default:
> +		return ES_DECODE_FAILED;
> +	}
> +
> +	switch (insn->opcode.bytes[0]) {
> +	case 0x6c:
> +	case 0x6e:
> +	case 0xe4:
> +	case 0xe6:
> +	case 0xec:
> +	case 0xee:
> +		/* Single byte opcodes */
> +		*exitinfo |= IOIO_DATA_8;
> +		break;
> +	default:
> +		/* Length determined by instruction parsing */
> +		*exitinfo |= (insn->opnd_bytes == 2) ? IOIO_DATA_16
> +						     : IOIO_DATA_32;
> +	}
> +	switch (insn->addr_bytes) {
> +	case 2:
> +		*exitinfo |= IOIO_ADDR_16;
> +		break;
> +	case 4:
> +		*exitinfo |= IOIO_ADDR_32;
> +		break;
> +	case 8:
> +		*exitinfo |= IOIO_ADDR_64;
> +		break;
> +	}
> +
> +	if (insn_rep_prefix(insn))
> +		*exitinfo |= IOIO_REP;
> +
> +	return ES_OK;
> +}
> +
> +static enum es_result vc_handle_ioio(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
> +{
> +	struct pt_regs *regs = ctxt->regs;
> +	u64 exit_info_1, exit_info_2;
> +	enum es_result ret;
> +
> +	ret = vc_ioio_exitinfo(ctxt, &exit_info_1);
> +	if (ret != ES_OK)
> +		return ret;
> +
> +	if (exit_info_1 & IOIO_TYPE_STR) {
> +		int df = (regs->flags & X86_EFLAGS_DF) ? -1 : 1;
> +		unsigned int io_bytes, exit_bytes;
> +		unsigned int ghcb_count, op_count;
> +		unsigned long es_base;
> +		u64 sw_scratch;
> +
> +		/*
> +		 * For the string variants with rep prefix the amount of in/out
> +		 * operations per #VC exception is limited so that the kernel
> +		 * has a chance to take interrupts and re-schedule while the
> +		 * instruction is emulated.
> +		 */
> +		io_bytes   = (exit_info_1 >> 4) & 0x7;
> +		ghcb_count = sizeof(ghcb->shared_buffer) / io_bytes;
> +
> +		op_count    = (exit_info_1 & IOIO_REP) ? regs->cx : 1;
> +		exit_info_2 = min(op_count, ghcb_count);
> +		exit_bytes  = exit_info_2 * io_bytes;
> +
> +		es_base = insn_get_seg_base(ctxt->regs, INAT_SEG_REG_ES);
> +
> +		if (!(exit_info_1 & IOIO_TYPE_IN)) {
> +			ret = vc_insn_string_read(ctxt,
> +					       (void *)(es_base + regs->si),
> +					       ghcb->shared_buffer, io_bytes,
> +					       exit_info_2, df);

The last argument to vc_insn_string_read() is "bool backwards" which in 
this case it appears will always be true?

> +			if (ret)
> +				return ret;
> +		}
> +
> +		sw_scratch = __pa(ghcb) + offsetof(struct ghcb, shared_buffer);
> +		ghcb_set_sw_scratch(ghcb, sw_scratch);
> +		ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO,
> +				   exit_info_1, exit_info_2);
> +		if (ret != ES_OK)
> +			return ret;
> +
> +		/* Everything went well, write back results */
> +		if (exit_info_1 & IOIO_TYPE_IN) {
> +			ret = vc_insn_string_write(ctxt,
> +						(void *)(es_base + regs->di),
> +						ghcb->shared_buffer, io_bytes,
> +						exit_info_2, df);
> +			if (ret)
> +				return ret;
> +
> +			if (df)
> +				regs->di -= exit_bytes;
> +			else
> +				regs->di += exit_bytes;
> +		} else {
> +			if (df)
> +				regs->si -= exit_bytes;
> +			else
> +				regs->si += exit_bytes;
> +		}
> +
> +		if (exit_info_1 & IOIO_REP)
> +			regs->cx -= exit_info_2;
> +
> +		ret = regs->cx ? ES_RETRY : ES_OK;
> +
> +	} else {
> +		int bits = (exit_info_1 & 0x70) >> 1;
> +		u64 rax = 0;
> +
> +		if (!(exit_info_1 & IOIO_TYPE_IN))
> +			rax = lower_bits(regs->ax, bits);
> +
> +		ghcb_set_rax(ghcb, rax);
> +
> +		ret = sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_IOIO, exit_info_1, 0);
> +		if (ret != ES_OK)
> +			return ret;
> +
> +		if (exit_info_1 & IOIO_TYPE_IN) {
> +			if (!ghcb_is_valid_rax(ghcb))
> +				return ES_VMM_ERROR;
> +			regs->ax = lower_bits(ghcb->save.rax, bits);
> +		}
> +	}
> +
> +	return ret;
> +}
> -- 
> 2.17.1
> 
> 

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 18/70] x86/boot/compressed/64: Add stage1 #VC handler
  2020-03-19  9:13   ` Joerg Roedel
  (?)
@ 2020-03-20 21:16   ` David Rientjes
  2020-03-20 22:19     ` Joerg Roedel
  -1 siblings, 1 reply; 243+ messages in thread
From: David Rientjes @ 2020-03-20 21:16 UTC (permalink / raw)
  To: Joerg Roedel, erdemaktas
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, 19 Mar 2020, Joerg Roedel wrote:

> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
> new file mode 100644
> index 000000000000..f524b40aef07
> --- /dev/null
> +++ b/arch/x86/include/asm/sev-es.h
> @@ -0,0 +1,45 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * AMD Encrypted Register State Support
> + *
> + * Author: Joerg Roedel <jroedel@suse.de>
> + */
> +
> +#ifndef __ASM_ENCRYPTED_STATE_H
> +#define __ASM_ENCRYPTED_STATE_H
> +
> +#include <linux/types.h>
> +
> +#define GHCB_SEV_CPUID_REQ	0x004UL
> +#define		GHCB_CPUID_REQ_EAX	0
> +#define		GHCB_CPUID_REQ_EBX	1
> +#define		GHCB_CPUID_REQ_ECX	2
> +#define		GHCB_CPUID_REQ_EDX	3
> +#define		GHCB_CPUID_REQ(fn, reg) (GHCB_SEV_CPUID_REQ | \
> +					(((unsigned long)reg & 3) << 30) | \
> +					(((unsigned long)fn) << 32))
> +
> +#define GHCB_SEV_CPUID_RESP	0x005UL
> +#define GHCB_SEV_TERMINATE	0x100UL
> +
> +#define	GHCB_SEV_GHCB_RESP_CODE(v)	((v) & 0xfff)
> +#define	VMGEXIT()			{ asm volatile("rep; vmmcall\n\r"); }

Since preemption and irqs should be disabled before updating the GHCB and 
its MSR and until the contents have been accessed following VMGEXIT, 
should there be checks in place to ensure that's always the case?

> +
> +static inline u64 lower_bits(u64 val, unsigned int bits)
> +{
> +	u64 mask = (1ULL << bits) - 1;
> +
> +	return (val & mask);
> +}
> +
> +static inline u64 copy_lower_bits(u64 out, u64 in, unsigned int bits)
> +{
> +	u64 mask = (1ULL << bits) - 1;
> +
> +	out &= ~mask;
> +	out |= lower_bits(in, bits);
> +
> +	return out;
> +}
> +
> +#endif
> diff --git a/arch/x86/include/asm/trap_defs.h b/arch/x86/include/asm/trap_defs.h
> index 488f82ac36da..af45d65f0458 100644
> --- a/arch/x86/include/asm/trap_defs.h
> +++ b/arch/x86/include/asm/trap_defs.h
> @@ -24,6 +24,7 @@ enum {
>  	X86_TRAP_AC,		/* 17, Alignment Check */
>  	X86_TRAP_MC,		/* 18, Machine Check */
>  	X86_TRAP_XF,		/* 19, SIMD Floating-Point Exception */
> +	X86_TRAP_VC = 29,	/* 29, VMM Communication Exception */
>  	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
>  };
>  
> diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
> new file mode 100644
> index 000000000000..e963b48d3e86
> --- /dev/null
> +++ b/arch/x86/kernel/sev-es-shared.c
> @@ -0,0 +1,65 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * AMD Encrypted Register State Support
> + *
> + * Author: Joerg Roedel <jroedel@suse.de>
> + *
> + * This file is not compiled stand-alone. It contains code shared
> + * between the pre-decompression boot code and the running Linux kernel
> + * and is included directly into both code-bases.
> + */
> +
> +/*
> + * Boot VC Handler - This is the first VC handler during boot, there is no GHCB
> + * page yet, so it only supports the MSR based communication with the
> + * hypervisor and only the CPUID exit-code.
> + */
> +void __init vc_no_ghcb_handler(struct pt_regs *regs, unsigned long exit_code)
> +{
> +	unsigned int fn = lower_bits(regs->ax, 32);
> +	unsigned long val;
> +
> +	/* Only CPUID is supported via MSR protocol */
> +	if (exit_code != SVM_EXIT_CPUID)
> +		goto fail;
> +
> +	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EAX));
> +	VMGEXIT();
> +	val = sev_es_rd_ghcb_msr();
> +	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
> +		goto fail;
> +	regs->ax = val >> 32;
> +
> +	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EBX));
> +	VMGEXIT();
> +	val = sev_es_rd_ghcb_msr();
> +	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
> +		goto fail;
> +	regs->bx = val >> 32;
> +
> +	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_ECX));
> +	VMGEXIT();
> +	val = sev_es_rd_ghcb_msr();
> +	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
> +		goto fail;
> +	regs->cx = val >> 32;
> +
> +	sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EDX));
> +	VMGEXIT();
> +	val = sev_es_rd_ghcb_msr();
> +	if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
> +		goto fail;
> +	regs->dx = val >> 32;
> +
> +	regs->ip += 2;
> +
> +	return;
> +
> +fail:
> +	sev_es_wr_ghcb_msr(GHCB_SEV_TERMINATE);
> +	VMGEXIT();
> +
> +	/* Shouldn't get here - if we do halt the machine */
> +	while (true)
> +		asm volatile("hlt\n");
> +}
> -- 
> 2.17.1
> 
> 

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 62/70] x86/kvm: Add KVM specific VMMCALL handling under SEV-ES
  2020-03-19  9:13 ` [PATCH 62/70] x86/kvm: Add KVM " Joerg Roedel
@ 2020-03-20 21:23   ` David Rientjes
  2020-03-20 22:21     ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: David Rientjes @ 2020-03-20 21:23 UTC (permalink / raw)
  To: Joerg Roedel, erdemaktas
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, 19 Mar 2020, Joerg Roedel wrote:

> From: Tom Lendacky <thomas.lendacky@amd.com>
> 
> Implement the callbacks to copy the processor state required by KVM to
> the GHCB.
> 
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> [ jroedel@suse.de: - Split out of a larger patch
>                    - Adapt to different callback functions ]
> Co-developed-by: Joerg Roedel <jroedel@suse.de>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/kernel/kvm.c | 35 +++++++++++++++++++++++++++++------
>  1 file changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 6efe0410fb72..0e3fc798d719 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -34,6 +34,8 @@
>  #include <asm/hypervisor.h>
>  #include <asm/tlb.h>
>  #include <asm/cpuidle_haltpoll.h>
> +#include <asm/ptrace.h>
> +#include <asm/svm.h>
>  
>  static int kvmapf = 1;
>  
> @@ -729,13 +731,34 @@ static void __init kvm_init_platform(void)
>  	x86_platform.apic_post_init = kvm_apic_init;
>  }
>  
> +#if defined(CONFIG_AMD_MEM_ENCRYPT)
> +static void kvm_sev_es_hcall_prepare(struct ghcb *ghcb, struct pt_regs *regs)
> +{
> +	/* RAX and CPL are already in the GHCB */
> +	ghcb_set_rbx(ghcb, regs->bx);
> +	ghcb_set_rcx(ghcb, regs->cx);
> +	ghcb_set_rdx(ghcb, regs->dx);
> +	ghcb_set_rsi(ghcb, regs->si);

Is it possible to check the hypercall from RAX and only copy the needed 
regs or is there a requirement that they must all be copied 
unconditionally?

> +}
> +
> +static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
> +{
> +	/* No checking of the return state needed */
> +	return true;
> +}
> +#endif
> +
>  const __initconst struct hypervisor_x86 x86_hyper_kvm = {
> -	.name			= "KVM",
> -	.detect			= kvm_detect,
> -	.type			= X86_HYPER_KVM,
> -	.init.guest_late_init	= kvm_guest_init,
> -	.init.x2apic_available	= kvm_para_available,
> -	.init.init_platform	= kvm_init_platform,
> +	.name				= "KVM",
> +	.detect				= kvm_detect,
> +	.type				= X86_HYPER_KVM,
> +	.init.guest_late_init		= kvm_guest_init,
> +	.init.x2apic_available		= kvm_para_available,
> +	.init.init_platform		= kvm_init_platform,
> +#if defined(CONFIG_AMD_MEM_ENCRYPT)
> +	.runtime.sev_es_hcall_prepare	= kvm_sev_es_hcall_prepare,
> +	.runtime.sev_es_hcall_finish	= kvm_sev_es_hcall_finish,
> +#endif
>  };
>  
>  static __init int activate_jump_labels(void)
> -- 
> 2.17.1
> 
> 

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 21/70] x86/boot/compressed/64: Add function to map a page unencrypted
  2020-03-20 21:02     ` Dave Hansen
@ 2020-03-20 22:12       ` Joerg Roedel
  2020-03-20 22:26         ` Dave Hansen
  0 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-03-20 22:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Rientjes, x86, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization, Joerg Roedel

On Fri, Mar 20, 2020 at 02:02:13PM -0700, Dave Hansen wrote:
> It *never* flushes global pages.  For a generic function like this, that
> seems pretty dangerous because the PTEs it goes after could quite easily
> be Global.  It's also not _obviously_ correct if PCIDs are in play
> (which I don't think they are on AMD).
> 
> A flush_tlb_global() is probably more appropriate.  Better yet, is there
> a reason not to use flush_tlb_kernel_range()?  I don't think it's
> necessary to whack the entire TLB for one PTE set.

This code runs before the actual kernel image is decompressed, so there
is no PCID and no global pages (I think CR4.PGE is still 0). So a
cr3-write is enough to flush the TLB. Also the TLB-flush helpers of the
running kernel are not available here.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 18/70] x86/boot/compressed/64: Add stage1 #VC handler
  2020-03-20 21:16   ` David Rientjes
@ 2020-03-20 22:19     ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-20 22:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: erdemaktas, x86, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization, Joerg Roedel

On Fri, Mar 20, 2020 at 02:16:39PM -0700, David Rientjes wrote:
> On Thu, 19 Mar 2020, Joerg Roedel wrote:
> > +#define	GHCB_SEV_GHCB_RESP_CODE(v)	((v) & 0xfff)
> > +#define	VMGEXIT()			{ asm volatile("rep; vmmcall\n\r"); }
> 
> Since preemption and irqs should be disabled before updating the GHCB and 
> its MSR and until the contents have been accessed following VMGEXIT, 
> should there be checks in place to ensure that's always the case?

Good point, some checking is certainly helpful. Currently it is the
case, because the GHCB is accessed and used only:

	1) At boot when only the boot CPU is running

	2) In the #VC handler, which does not enable interrupts

	3) In the NMI handler, which is also not preemptible

I can also add code to sev_es_get/put_ghcb to make sure these conditions
are met. All this does not prevent the preemption by NMIs, which could
cause another nested #VC exception, but that is handled separately.


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 62/70] x86/kvm: Add KVM specific VMMCALL handling under SEV-ES
  2020-03-20 21:23   ` David Rientjes
@ 2020-03-20 22:21     ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-20 22:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: erdemaktas, x86, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization, Joerg Roedel

On Fri, Mar 20, 2020 at 02:23:58PM -0700, David Rientjes wrote:
> On Thu, 19 Mar 2020, Joerg Roedel wrote:
> > +#if defined(CONFIG_AMD_MEM_ENCRYPT)
> > +static void kvm_sev_es_hcall_prepare(struct ghcb *ghcb, struct pt_regs *regs)
> > +{
> > +	/* RAX and CPL are already in the GHCB */
> > +	ghcb_set_rbx(ghcb, regs->bx);
> > +	ghcb_set_rcx(ghcb, regs->cx);
> > +	ghcb_set_rdx(ghcb, regs->dx);
> > +	ghcb_set_rsi(ghcb, regs->si);
> 
> Is it possible to check the hypercall from RAX and only copy the needed 
> regs or is there a requirement that they must all be copied 
> unconditionally?

No, there is no such requirement. This could be optimized with hypercall
specific knowledge as it is in the KVM code anyway.
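
A minimal user-space sketch of what such an optimization could look
like, not taken from the patch. The model_* types and helpers are
hypothetical stand-ins for struct ghcb, struct pt_regs and the
ghcb_set_*() accessors, and the hypercall numbers and argument counts in
the table are assumptions that should be checked against
include/uapi/linux/kvm_para.h before use. Unknown hypercalls fall back
to copying all four registers, which matches the current behaviour.

```c
#include <stdint.h>

/* Hypothetical stand-ins for struct pt_regs and struct ghcb */
struct model_regs {
	uint64_t bx, cx, dx, si;
};

struct model_ghcb {
	uint64_t rbx, rcx, rdx, rsi;
	unsigned int valid;		/* bitmap of registers marked valid */
};

#define MODEL_VALID_RBX	(1u << 0)
#define MODEL_VALID_RCX	(1u << 1)
#define MODEL_VALID_RDX	(1u << 2)
#define MODEL_VALID_RSI	(1u << 3)

/*
 * Map a hypercall number (taken from RAX) to the number of argument
 * registers it uses. The entries are illustrative assumptions; anything
 * not listed copies all four registers to stay on the safe side.
 */
static unsigned int model_hcall_nargs(uint64_t nr)
{
	switch (nr) {
	case 5:		/* assumed: KVM_HC_KICK_CPU, 2 arguments */
		return 2;
	case 10:	/* assumed: KVM_HC_SEND_IPI, 4 arguments */
		return 4;
	case 11:	/* assumed: KVM_HC_SCHED_YIELD, 1 argument */
		return 1;
	default:
		return 4;
	}
}

/* Copy only the first 'nargs' argument registers (RBX, RCX, RDX, RSI) */
static void model_hcall_prepare(struct model_ghcb *ghcb,
				const struct model_regs *regs,
				unsigned int nargs)
{
	ghcb->valid = 0;

	if (nargs > 0) {
		ghcb->rbx = regs->bx;
		ghcb->valid |= MODEL_VALID_RBX;
	}
	if (nargs > 1) {
		ghcb->rcx = regs->cx;
		ghcb->valid |= MODEL_VALID_RCX;
	}
	if (nargs > 2) {
		ghcb->rdx = regs->dx;
		ghcb->valid |= MODEL_VALID_RDX;
	}
	if (nargs > 3) {
		ghcb->rsi = regs->si;
		ghcb->valid |= MODEL_VALID_RSI;
	}
}
```

Whether this is worth it depends on how much the extra GHCB writes cost
compared to the VMGEXIT itself, which is probably not much.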

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 23/70] x86/sev-es: Add support for handling IOIO exceptions
  2020-03-20 21:03   ` David Rientjes
@ 2020-03-20 22:24     ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-20 22:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: Joerg Roedel, x86, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization

On Fri, Mar 20, 2020 at 02:03:17PM -0700, David Rientjes wrote:
> On Thu, 19 Mar 2020, Joerg Roedel wrote:
> > +	if (exit_info_1 & IOIO_TYPE_STR) {
> > +		int df = (regs->flags & X86_EFLAGS_DF) ? -1 : 1;
> >		[ ... ]
> > +		if (!(exit_info_1 & IOIO_TYPE_IN)) {
> > +			ret = vc_insn_string_read(ctxt,
> > +					       (void *)(es_base + regs->si),
> > +					       ghcb->shared_buffer, io_bytes,
> > +					       exit_info_2, df);
> 
> The last argument to vc_insn_string_read() is "bool backwards" which in 
> this case it appears will always be true?

Right, thanks, good catch, I'll fix this. Seems to be a leftover from a
previous version.
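
For reference, a small user-space sketch of the problem, not the actual
fix. model_takes_backwards() is a hypothetical stand-in for a helper
with a 'bool backwards' parameter, and MODEL_EFLAGS_DF mirrors the
architectural DF bit (bit 10). Because df is either -1 or 1, it converts
to true in both cases; the same applies to the 'if (df)' checks further
down in the function.

```c
#include <stdbool.h>

#define MODEL_EFLAGS_DF	(1UL << 10)	/* EFLAGS direction flag, bit 10 */

/* Hypothetical stand-in for a helper taking 'bool backwards' */
static bool model_takes_backwards(bool backwards)
{
	return backwards;
}

/* Mirrors the posted code: df is -1 or 1, so the bool is always true */
static bool backwards_from_step(unsigned long flags)
{
	int df = (flags & MODEL_EFLAGS_DF) ? -1 : 1;

	return model_takes_backwards(df);
}

/* One possible alternative: pass the flag state itself */
static bool backwards_from_flag(unsigned long flags)
{
	bool df = (flags & MODEL_EFLAGS_DF) != 0;

	return model_takes_backwards(df);
}
```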

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 21/70] x86/boot/compressed/64: Add function to map a page unencrypted
  2020-03-20 22:12       ` Joerg Roedel
@ 2020-03-20 22:26         ` Dave Hansen
  2020-03-21 15:40             ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: Dave Hansen @ 2020-03-20 22:26 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Rientjes, x86, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization, Joerg Roedel

On 3/20/20 3:12 PM, Joerg Roedel wrote:
> On Fri, Mar 20, 2020 at 02:02:13PM -0700, Dave Hansen wrote:
>> It *never* flushes global pages.  For a generic function like this, that
>> seems pretty dangerous because the PTEs it goes after could quite easily
>> be Global.  It's also not _obviously_ correct if PCIDs are in play
>> (which I don't think they are on AMD).
>>
>> A flush_tlb_global() is probably more appropriate.  Better yet, is there
>> a reason not to use flush_tlb_kernel_range()?  I don't think it's
>> necessary to whack the entire TLB for one PTE set.
> 
> This code runs before the actual kernel image is decompressed, so there
> is no PCID and no global pages (I think CR4.PGE is still 0). So a
> cr3-write is enough to flush the TLB. Also the TLB-flush helpers of the
> running kernel are not available here.

Geez, I always forget about the compressed code. :)  Good point about PCIDs.

In any case, I thought this all came through initialize_identity_maps(),
which does, for instance:

        mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sme_me_mask;

Where:

#define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)

That looks like it has the Global bit set.  Does that not apply here
somehow?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 21/70] x86/boot/compressed/64: Add function to map a page unencrypted
  2020-03-20 22:26         ` Dave Hansen
@ 2020-03-21 15:40             ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-21 15:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Rientjes, x86, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization, Joerg Roedel

On Fri, Mar 20, 2020 at 03:26:09PM -0700, Dave Hansen wrote:
> In any case, I thought this all came through initialize_identity_maps(),
> which does, for instance:
> 
>         mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sme_me_mask;
> 
> Where:
> 
> #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)
> 
> That looks like it has the Global bit set.  Does that not apply here
> somehow?

No: the value of %cr4 at boot is 0x00000020, so PGE is not set and
global pages are not enabled. They wouldn't make sense anyhow, as global
pages are only useful when there is more than one address space, which
is not the case this early in boot.
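
To make the bit arithmetic concrete, here is a minimal sketch that
decodes that value. The bit positions are the architectural ones (PAE
is bit 5, PGE is bit 7, PCIDE is bit 17); the helper names are made up
for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CR4_PAE	(UINT64_C(1) << 5)	/* Physical Address Extension */
#define DEMO_CR4_PGE	(UINT64_C(1) << 7)	/* Page Global Enable */
#define DEMO_CR4_PCIDE	(UINT64_C(1) << 17)	/* PCID Enable */

/* Hypothetical helpers: test single feature bits in a CR4 value. */
static bool cr4_pae_enabled(uint64_t cr4)
{
	return (cr4 & DEMO_CR4_PAE) != 0;
}

static bool cr4_global_pages_enabled(uint64_t cr4)
{
	return (cr4 & DEMO_CR4_PGE) != 0;
}

static bool cr4_pcid_enabled(uint64_t cr4)
{
	return (cr4 & DEMO_CR4_PCIDE) != 0;
}

/* Usage: the value quoted above has PAE set, PGE and PCIDE clear. */
static void demo_check_boot_cr4(void)
{
	assert(cr4_pae_enabled(0x20));
	assert(!cr4_global_pages_enabled(0x20));
	assert(!cr4_pcid_enabled(0x20));
}
```

With PGE clear the G bit in a PTE is ignored, so no TLB entry is
treated as global and a CR3 write flushes all of them.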

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH] KVM: SVM: Use __packed shorthand
  2020-03-19  9:12   ` Joerg Roedel
  (?)
@ 2020-03-23 13:23   ` Borislav Petkov
  2020-03-24 12:43     ` Joerg Roedel
  -1 siblings, 1 reply; 243+ messages in thread
From: Borislav Petkov @ 2020-03-23 13:23 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

I guess we can do that on top.
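
For readers unfamiliar with the shorthand used in the patch below:
this is a small sketch, assuming GCC or Clang, showing that the
attribute produces the same layout whether it follows the struct
keyword or the closing brace. The demo_* structs are invented for
illustration; in the kernel, __packed comes from the compiler
attribute headers rather than a local define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel's __packed macro (assumption). */
#ifndef __packed
#define __packed __attribute__((__packed__))
#endif

/* Long form: attribute right after the struct keyword. */
struct __attribute__((__packed__)) demo_long_form {
	uint8_t  tag;
	uint32_t value;
};

/* Shorthand: attribute after the closing brace. */
struct demo_short_form {
	uint8_t  tag;
	uint32_t value;
} __packed;

/* Both spellings remove the padding between tag and value. */
static void demo_check_packed_layout(void)
{
	assert(sizeof(struct demo_long_form) == 5);
	assert(sizeof(struct demo_short_form) == 5);
	assert(offsetof(struct demo_long_form, value) ==
	       offsetof(struct demo_short_form, value));
}
```

Identical size and member offsets are consistent with the "no
functional changes" claim.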

---
From: Borislav Petkov <bp@suse.de>
Date: Mon, 23 Mar 2020 14:20:08 +0100

... to make it more readable.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/svm.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index f36288c659b5..1ec813f02c58 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -151,14 +151,14 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define SVM_NESTED_CTL_NP_ENABLE	BIT(0)
 #define SVM_NESTED_CTL_SEV_ENABLE	BIT(1)
 
-struct __attribute__ ((__packed__)) vmcb_seg {
+struct vmcb_seg {
 	u16 selector;
 	u16 attrib;
 	u32 limit;
 	u64 base;
-};
+} __packed;
 
-struct __attribute__ ((__packed__)) vmcb_save_area {
+struct vmcb_save_area {
 	struct vmcb_seg es;
 	struct vmcb_seg cs;
 	struct vmcb_seg ss;
@@ -233,9 +233,9 @@ struct __attribute__ ((__packed__)) vmcb_save_area {
 	u8 valid_bitmap[16];
 	u64 x87_state_gpa;
 	u8 reserved_12[1016];
-};
+} __packed;
 
-struct __attribute__ ((__packed__)) ghcb {
+struct ghcb {
 	struct vmcb_save_area save;
 
 	u8 shared_buffer[2032];
@@ -243,12 +243,12 @@ struct __attribute__ ((__packed__)) ghcb {
 	u8 reserved_1[10];
 	u16 protocol_version;	/* negotiated SEV-ES/GHCB protocol version */
 	u32 ghcb_usage;
-};
+} __packed;
 
-struct __attribute__ ((__packed__)) vmcb {
+struct vmcb {
 	struct vmcb_control_area control;
 	struct vmcb_save_area save;
-};
+} __packed;
 
 #define SVM_CPUID_FUNC 0x8000000a
 
-- 
2.21.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH] KVM: SVM: Use __packed shorthand
  2020-03-23 13:23   ` [PATCH] KVM: SVM: Use __packed shorthand Borislav Petkov
@ 2020-03-24 12:43     ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-03-24 12:43 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Mon, Mar 23, 2020 at 02:23:15PM +0100, Borislav Petkov wrote:
> I guess we can do that on top.
> 
> ---
> From: Borislav Petkov <bp@suse.de>
> Date: Mon, 23 Mar 2020 14:20:08 +0100
> 
> ... to make it more readable.
> 
> No functional changes.
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  arch/x86/include/asm/svm.h | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)

Added it to the patch-set, thanks Boris.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 05/70] x86/insn: Make inat-tables.c suitable for pre-decompression code
  2020-03-19  9:13   ` Joerg Roedel
  (?)
@ 2020-03-25 15:39   ` Borislav Petkov
  2020-03-27  3:02       ` Masami Hiramatsu
  -1 siblings, 1 reply; 243+ messages in thread
From: Borislav Petkov @ 2020-03-25 15:39 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel, Masami Hiramatsu

+ Masami.

On Thu, Mar 19, 2020 at 10:13:02AM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> The inat-tables.c file has some arrays in it that contain pointers to
> other arrays. These pointers need to be relocated when the kernel
> image is moved to a different location.
> 
> The pre-decompression boot-code has no support for applying ELF
> relocations, so initialize these arrays at runtime in the
> pre-decompression code to make sure all pointers are correctly
> initialized.
> 
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/tools/gen-insn-attr-x86.awk       | 50 +++++++++++++++++++++-
>  tools/arch/x86/tools/gen-insn-attr-x86.awk | 50 +++++++++++++++++++++-
>  2 files changed, 98 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/tools/gen-insn-attr-x86.awk b/arch/x86/tools/gen-insn-attr-x86.awk
> index a42015b305f4..af38469afd14 100644
> --- a/arch/x86/tools/gen-insn-attr-x86.awk
> +++ b/arch/x86/tools/gen-insn-attr-x86.awk
> @@ -362,6 +362,9 @@ function convert_operands(count,opnd,       i,j,imm,mod)
>  END {
>  	if (awkchecked != "")
>  		exit 1
> +
> +	print "#ifndef __BOOT_COMPRESSED\n"
> +
>  	# print escape opcode map's array
>  	print "/* Escape opcode map array */"
>  	print "const insn_attr_t * const inat_escape_tables[INAT_ESC_MAX + 1]" \
> @@ -388,6 +391,51 @@ END {
>  		for (j = 0; j < max_lprefix; j++)
>  			if (atable[i,j])
>  				print "	["i"]["j"] = "atable[i,j]","
> -	print "};"
> +	print "};\n"
> +
> +	print "#else /* !__BOOT_COMPRESSED */\n"
> +
> +	print "/* Escape opcode map array */"
> +	print "static const insn_attr_t *inat_escape_tables[INAT_ESC_MAX + 1]" \
> +	      "[INAT_LSTPFX_MAX + 1];"
> +	print ""
> +
> +	print "/* Group opcode map array */"
> +	print "static const insn_attr_t *inat_group_tables[INAT_GRP_MAX + 1]"\
> +	      "[INAT_LSTPFX_MAX + 1];"
> +	print ""
> +
> +	print "/* AVX opcode map array */"
> +	print "static const insn_attr_t *inat_avx_tables[X86_VEX_M_MAX + 1]"\
> +	      "[INAT_LSTPFX_MAX + 1];"
> +	print ""
> +
> +	print "static void inat_init_tables(void)"
> +	print "{"
> +
> +	# print escape opcode map's array
> +	print "\t/* Print Escape opcode map array */"
> +	for (i = 0; i < geid; i++)
> +		for (j = 0; j < max_lprefix; j++)
> +			if (etable[i,j])
> +				print "\tinat_escape_tables["i"]["j"] = "etable[i,j]";"
> +	print ""
> +
> +	# print group opcode map's array
> +	print "\t/* Print Group opcode map array */"
> +	for (i = 0; i < ggid; i++)
> +		for (j = 0; j < max_lprefix; j++)
> +			if (gtable[i,j])
> +				print "\tinat_group_tables["i"]["j"] = "gtable[i,j]";"
> +	print ""
> +	# print AVX opcode map's array
> +	print "\t/* Print AVX opcode map array */"
> +	for (i = 0; i < gaid; i++)
> +		for (j = 0; j < max_lprefix; j++)
> +			if (atable[i,j])
> +				print "\tinat_avx_tables["i"]["j"] = "atable[i,j]";"
> +
> +	print "}"
> +	print "#endif"
>  }
>  
> diff --git a/tools/arch/x86/tools/gen-insn-attr-x86.awk b/tools/arch/x86/tools/gen-insn-attr-x86.awk
> index a42015b305f4..af38469afd14 100644
> --- a/tools/arch/x86/tools/gen-insn-attr-x86.awk
> +++ b/tools/arch/x86/tools/gen-insn-attr-x86.awk
> @@ -362,6 +362,9 @@ function convert_operands(count,opnd,       i,j,imm,mod)
>  END {
>  	if (awkchecked != "")
>  		exit 1
> +
> +	print "#ifndef __BOOT_COMPRESSED\n"
> +
>  	# print escape opcode map's array
>  	print "/* Escape opcode map array */"
>  	print "const insn_attr_t * const inat_escape_tables[INAT_ESC_MAX + 1]" \
> @@ -388,6 +391,51 @@ END {
>  		for (j = 0; j < max_lprefix; j++)
>  			if (atable[i,j])
>  				print "	["i"]["j"] = "atable[i,j]","
> -	print "};"
> +	print "};\n"
> +
> +	print "#else /* !__BOOT_COMPRESSED */\n"
> +
> +	print "/* Escape opcode map array */"
> +	print "static const insn_attr_t *inat_escape_tables[INAT_ESC_MAX + 1]" \
> +	      "[INAT_LSTPFX_MAX + 1];"
> +	print ""
> +
> +	print "/* Group opcode map array */"
> +	print "static const insn_attr_t *inat_group_tables[INAT_GRP_MAX + 1]"\
> +	      "[INAT_LSTPFX_MAX + 1];"
> +	print ""
> +
> +	print "/* AVX opcode map array */"
> +	print "static const insn_attr_t *inat_avx_tables[X86_VEX_M_MAX + 1]"\
> +	      "[INAT_LSTPFX_MAX + 1];"
> +	print ""
> +
> +	print "static void inat_init_tables(void)"
> +	print "{"
> +
> +	# print escape opcode map's array
> +	print "\t/* Print Escape opcode map array */"
> +	for (i = 0; i < geid; i++)
> +		for (j = 0; j < max_lprefix; j++)
> +			if (etable[i,j])
> +				print "\tinat_escape_tables["i"]["j"] = "etable[i,j]";"
> +	print ""
> +
> +	# print group opcode map's array
> +	print "\t/* Print Group opcode map array */"
> +	for (i = 0; i < ggid; i++)
> +		for (j = 0; j < max_lprefix; j++)
> +			if (gtable[i,j])
> +				print "\tinat_group_tables["i"]["j"] = "gtable[i,j]";"
> +	print ""
> +	# print AVX opcode map's array
> +	print "\t/* Print AVX opcode map array */"
> +	for (i = 0; i < gaid; i++)
> +		for (j = 0; j < max_lprefix; j++)
> +			if (atable[i,j])
> +				print "\tinat_avx_tables["i"]["j"] = "atable[i,j]";"
> +
> +	print "}"
> +	print "#endif"
>  }
>  
> -- 
> 2.17.1
> 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 06/70] x86/umip: Factor out instruction fetch
  2020-03-19  9:13 ` [PATCH 06/70] x86/umip: Factor out instruction fetch Joerg Roedel
@ 2020-03-26 17:21   ` Borislav Petkov
  0 siblings, 0 replies; 243+ messages in thread
From: Borislav Petkov @ 2020-03-26 17:21 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, Mar 19, 2020 at 10:13:03AM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Factor out the code to fetch the instruction from user-space to a helper
> function.

Add "No functional changes." here.

> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/include/asm/insn-eval.h |  2 ++
>  arch/x86/kernel/umip.c           | 26 +++++-----------------
>  arch/x86/lib/insn-eval.c         | 38 ++++++++++++++++++++++++++++++++
>  3 files changed, 46 insertions(+), 20 deletions(-)

...

> +int insn_fetch_from_user(struct pt_regs *regs,
> +			 unsigned char buf[MAX_INSN_SIZE])

No need for that linebreak - fits in 80 cols.

> +{
> +	unsigned long seg_base = 0;
> +	int not_copied;
> +
> +	/*
> +	 * If not in user-space long mode, a custom code segment could be in
> +	 * use. This is true in protected mode (if the process defined a local
> +	 * descriptor table), or virtual-8086 mode. In most of the cases
> +	 * seg_base will be zero as in USER_CS.
> +	 */
> +	if (!user_64bit_mode(regs))
> +		seg_base = insn_get_seg_base(regs, INAT_SEG_REG_CS);
> +
> +	if (seg_base == -1L)
> +		return 0;

This reads strangely: seg_base is changed only inside that if-test, so I
guess we could test it there too:

        if (!user_64bit_mode(regs)) {
                seg_base = insn_get_seg_base(regs, INAT_SEG_REG_CS);
                if (seg_base == -1L)
                        return 0;
        }

which is a small enough change to not require a separate patch.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 07/70] x86/umip: Factor out instruction decoding
  2020-03-19  9:13   ` Joerg Roedel
  (?)
@ 2020-03-26 17:24   ` Borislav Petkov
  -1 siblings, 0 replies; 243+ messages in thread
From: Borislav Petkov @ 2020-03-26 17:24 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, Mar 19, 2020 at 10:13:04AM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Factor out the code used to decode an instruction with the correct
> address and operand sizes to a helper function.

As for the previous one: "No functional changes."

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 05/70] x86/insn: Make inat-tables.c suitable for pre-decompression code
  2020-03-25 15:39   ` Borislav Petkov
@ 2020-03-27  3:02       ` Masami Hiramatsu
  0 siblings, 0 replies; 243+ messages in thread
From: Masami Hiramatsu @ 2020-03-27  3:02 UTC (permalink / raw)
  To: Borislav Petkov, Joerg Roedel
  Cc: Joerg Roedel, x86, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization, Joerg Roedel, Masami Hiramatsu

Hi,

On Wed, 25 Mar 2020 16:39:45 +0100
Borislav Petkov <bp@alien8.de> wrote:

> + Masami.
> 
> On Thu, Mar 19, 2020 at 10:13:02AM +0100, Joerg Roedel wrote:
> > From: Joerg Roedel <jroedel@suse.de>
> > 
> > The inat-tables.c file has some arrays in it that contain pointers to
> > other arrays. These pointers need to be relocated when the kernel
> > image is moved to a different location.
> > 
> > The pre-decompression boot-code has no support for applying ELF
> > relocations, so initialize these arrays at runtime in the
> > pre-decompression code to make sure all pointers are correctly
> > initialized.

I need to check the whole series, but as far as I can understand from
this patch, it seems that storing address values in statically
initialized pointers is not allowed here. That may break more things;
for example, _kprobe_blacklist records the NOKPROBE_SYMBOL() symbol
addresses at build time.

I have some comments here.
 
> > Signed-off-by: Joerg Roedel <jroedel@suse.de>
> > ---
> >  arch/x86/tools/gen-insn-attr-x86.awk       | 50 +++++++++++++++++++++-
> >  tools/arch/x86/tools/gen-insn-attr-x86.awk | 50 +++++++++++++++++++++-
> >  2 files changed, 98 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/tools/gen-insn-attr-x86.awk b/arch/x86/tools/gen-insn-attr-x86.awk
> > index a42015b305f4..af38469afd14 100644
> > --- a/arch/x86/tools/gen-insn-attr-x86.awk
> > +++ b/arch/x86/tools/gen-insn-attr-x86.awk
> > @@ -362,6 +362,9 @@ function convert_operands(count,opnd,       i,j,imm,mod)
> >  END {
> >  	if (awkchecked != "")
> >  		exit 1
> > +
> > +	print "#ifndef __BOOT_COMPRESSED\n"
> > +
> >  	# print escape opcode map's array
> >  	print "/* Escape opcode map array */"
> >  	print "const insn_attr_t * const inat_escape_tables[INAT_ESC_MAX + 1]" \
> > @@ -388,6 +391,51 @@ END {
> >  		for (j = 0; j < max_lprefix; j++)
> >  			if (atable[i,j])
> >  				print "	["i"]["j"] = "atable[i,j]","
> > -	print "};"
> > +	print "};\n"
> > +
> > +	print "#else /* !__BOOT_COMPRESSED */\n"

I think the definitions of inat_*_tables can be shared in both cases.
If __BOOT_COMPRESSED is set, we can define inat_init_tables() as an
initialization function, and if not, it will be just a dummy
"do {} while (0)".
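
One possible reading of the suggestion above, as a minimal sketch:
callers always invoke the init hook, which is a real function only
when __BOOT_COMPRESSED is defined. All demo_* names and the tiny
table are placeholders, not the generated inat-tables.c content, and
in this sketch the array definition still differs in const-ness
between the two cases.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int insn_attr_t;

#define DEMO_ESC_MAX	1
#define DEMO_LSTPFX_MAX	1

/* Stand-in for one generated attribute table. */
static const insn_attr_t demo_escape_table_1[4] = { 1, 2, 3, 4 };

#ifdef __BOOT_COMPRESSED
/* No ELF relocations applied here: fill the pointers at runtime. */
static const insn_attr_t *demo_escape_tables[DEMO_ESC_MAX + 1]
					    [DEMO_LSTPFX_MAX + 1];

static inline void demo_init_tables(void)
{
	demo_escape_tables[1][0] = demo_escape_table_1;
}
#else
/* Normal build: static initializer, init hook compiles to nothing. */
static const insn_attr_t * const demo_escape_tables[DEMO_ESC_MAX + 1]
						   [DEMO_LSTPFX_MAX + 1] = {
	[1][0] = demo_escape_table_1,
};

#define demo_init_tables() do { } while (0)
#endif

/* Callers look the same in both configurations. */
static const insn_attr_t *demo_lookup(int id, int lpfx)
{
	assert(id >= 0 && id <= DEMO_ESC_MAX);
	assert(lpfx >= 0 && lpfx <= DEMO_LSTPFX_MAX);

	demo_init_tables();
	return demo_escape_tables[id][lpfx];
}
```

Building the same file with and without -D__BOOT_COMPRESSED should
give identical lookup results.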

BTW, where is __BOOT_COMPRESSED defined?

> > +
> > +	print "/* Escape opcode map array */"
> > +	print "static const insn_attr_t *inat_escape_tables[INAT_ESC_MAX + 1]" \
> > +	      "[INAT_LSTPFX_MAX + 1];"
> > +	print ""
> > +
> > +	print "/* Group opcode map array */"
> > +	print "static const insn_attr_t *inat_group_tables[INAT_GRP_MAX + 1]"\
> > +	      "[INAT_LSTPFX_MAX + 1];"
> > +	print ""
> > +
> > +	print "/* AVX opcode map array */"
> > +	print "static const insn_attr_t *inat_avx_tables[X86_VEX_M_MAX + 1]"\
> > +	      "[INAT_LSTPFX_MAX + 1];"
> > +	print ""
> > +
> > +	print "static void inat_init_tables(void)"

This function should be "inline".
Also, I cannot see the call site of inat_init_tables() in this patch.

If possible, please include the call site together with the definition
(especially for a new init function) so that I can check the init call
timing too.

> > +	print "{"
> > +
> > +	# print escape opcode map's array
> > +	print "\t/* Print Escape opcode map array */"
> > +	for (i = 0; i < geid; i++)
> > +		for (j = 0; j < max_lprefix; j++)
> > +			if (etable[i,j])
> > +				print "\tinat_escape_tables["i"]["j"] = "etable[i,j]";"
> > +	print ""
> > +
> > +	# print group opcode map's array
> > +	print "\t/* Print Group opcode map array */"
> > +	for (i = 0; i < ggid; i++)
> > +		for (j = 0; j < max_lprefix; j++)
> > +			if (gtable[i,j])
> > +				print "\tinat_group_tables["i"]["j"] = "gtable[i,j]";"
> > +	print ""
> > +	# print AVX opcode map's array
> > +	print "\t/* Print AVX opcode map array */"
> > +	for (i = 0; i < gaid; i++)
> > +		for (j = 0; j < max_lprefix; j++)
> > +			if (atable[i,j])
> > +				print "\tinat_avx_tables["i"]["j"] = "atable[i,j]";"
> > +
> > +	print "}"
> > +	print "#endif"
> >  }

The code itself looks good to me.

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 09/70] x86/insn: Add insn_rep_prefix() helper
  2020-03-19  9:13   ` Joerg Roedel
  (?)
@ 2020-03-27  3:56   ` Masami Hiramatsu
  -1 siblings, 0 replies; 243+ messages in thread
From: Masami Hiramatsu @ 2020-03-27  3:56 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, 19 Mar 2020 10:13:06 +0100
Joerg Roedel <joro@8bytes.org> wrote:

> From: Joerg Roedel <jroedel@suse.de>
> 
> Add a function to check whether an instruction has a REP prefix.
> 
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/include/asm/insn-eval.h |  1 +
>  arch/x86/lib/insn-eval.c         | 24 ++++++++++++++++++++++++
>  2 files changed, 25 insertions(+)
> 
> diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
> index 1e343010129e..41dee0faae97 100644
> --- a/arch/x86/include/asm/insn-eval.h
> +++ b/arch/x86/include/asm/insn-eval.h
> @@ -15,6 +15,7 @@
>  #define INSN_CODE_SEG_OPND_SZ(params) (params & 0xf)
>  #define INSN_CODE_SEG_PARAMS(oper_sz, addr_sz) (oper_sz | (addr_sz << 4))
>  
> +bool insn_rep_prefix(struct insn *insn);

Can you make it "insn_has_rep_prefix()"?

Thank you,

>  void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
>  int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
>  int insn_get_modrm_reg_off(struct insn *insn, struct pt_regs *regs);
> diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
> index f18260a19960..5d98dff5a2d7 100644
> --- a/arch/x86/lib/insn-eval.c
> +++ b/arch/x86/lib/insn-eval.c
> @@ -53,6 +53,30 @@ static bool is_string_insn(struct insn *insn)
>  	}
>  }
>  
> +/**
> + * insn_rep_prefix() - Determine if instruction has a REP prefix
> + * @insn:	Instruction containing the prefix to inspect
> + *
> + * Returns:
> + *
> + * true if the instruction has a REP prefix, false if not.
> + */
> +bool insn_rep_prefix(struct insn *insn)
> +{
> +	int i;
> +
> +	insn_get_prefixes(insn);
> +
> +	for (i = 0; i < insn->prefixes.nbytes; i++) {
> +		insn_byte_t p = insn->prefixes.bytes[i];
> +
> +		if (p == 0xf2 || p == 0xf3)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  /**
>   * get_seg_reg_override_idx() - obtain segment register override index
>   * @insn:	Valid instruction with segment override prefixes
> -- 
> 2.17.1
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 08/70] x86/insn: Add insn_get_modrm_reg_off()
  2020-03-19  9:13   ` Joerg Roedel
  (?)
@ 2020-03-27  3:57   ` Masami Hiramatsu
  -1 siblings, 0 replies; 243+ messages in thread
From: Masami Hiramatsu @ 2020-03-27  3:57 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, 19 Mar 2020 10:13:05 +0100
Joerg Roedel <joro@8bytes.org> wrote:

> From: Joerg Roedel <jroedel@suse.de>
> 
> Add a function to the instruction decoder which returns the pt_regs
> offset of the register specified in the reg field of the modrm byte.
> 

This looks good to me.

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>

Thank you,
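
As a side note for readers following along, the register selection
added in the REG_TYPE_REG case of the quoted patch can be sketched
in isolation like this. The field layout (reg in ModRM bits 5:3,
REX.R in bit 2 of the REX prefix) is architectural; the demo_*
helpers are illustrative only and stand in for the kernel's
X86_MODRM_REG() and X86_REX_R() macros.

```c
#include <assert.h>
#include <stdint.h>

/* ModRM byte layout: mod[7:6] reg[5:3] rm[2:0]. */
static unsigned int demo_modrm_reg(uint8_t modrm)
{
	return (modrm >> 3) & 0x7;
}

/* REX prefix layout: 0100WRXB, so REX.R is bit 2. */
static unsigned int demo_rex_r(uint8_t rex)
{
	return (rex >> 2) & 0x1;
}

/*
 * Register number named by ModRM.reg, extended by REX.R.
 * Pass rex == 0 when the instruction has no REX prefix.
 */
static unsigned int demo_reg_number(uint8_t rex, uint8_t modrm)
{
	unsigned int regno = demo_modrm_reg(modrm);

	if (demo_rex_r(rex))
		regno += 8;

	return regno;
}

/* Usage: 4c 89 c8 is "mov %r9,%rax"; the reg field names r9. */
static void demo_check_reg_number(void)
{
	assert(demo_reg_number(0x4c, 0xc8) == 9);
}
```

The real function then maps that register number to an offset into
pt_regs, which this sketch leaves out.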

> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/include/asm/insn-eval.h |  1 +
>  arch/x86/lib/insn-eval.c         | 23 +++++++++++++++++++++++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
> index b4ff3e3316d1..1e343010129e 100644
> --- a/arch/x86/include/asm/insn-eval.h
> +++ b/arch/x86/include/asm/insn-eval.h
> @@ -17,6 +17,7 @@
>  
>  void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
>  int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
> +int insn_get_modrm_reg_off(struct insn *insn, struct pt_regs *regs);
>  unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx);
>  int insn_get_code_seg_params(struct pt_regs *regs);
>  int insn_fetch_from_user(struct pt_regs *regs,
> diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
> index 1949f5258f9e..f18260a19960 100644
> --- a/arch/x86/lib/insn-eval.c
> +++ b/arch/x86/lib/insn-eval.c
> @@ -20,6 +20,7 @@
>  
>  enum reg_type {
>  	REG_TYPE_RM = 0,
> +	REG_TYPE_REG,
>  	REG_TYPE_INDEX,
>  	REG_TYPE_BASE,
>  };
> @@ -441,6 +442,13 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
>  			regno += 8;
>  		break;
>  
> +	case REG_TYPE_REG:
> +		regno = X86_MODRM_REG(insn->modrm.value);
> +
> +		if (X86_REX_R(insn->rex_prefix.value))
> +			regno += 8;
> +		break;
> +
>  	case REG_TYPE_INDEX:
>  		regno = X86_SIB_INDEX(insn->sib.value);
>  		if (X86_REX_X(insn->rex_prefix.value))
> @@ -809,6 +817,21 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs)
>  	return get_reg_offset(insn, regs, REG_TYPE_RM);
>  }
>  
> +/**
> + * insn_get_modrm_reg_off() - Obtain register in reg part of the ModRM byte
> + * @insn:	Instruction containing the ModRM byte
> + * @regs:	Register values as seen when entering kernel mode
> + *
> + * Returns:
> + *
> + * The register indicated by the reg part of the ModRM byte. The
> + * register is obtained as an offset from the base of pt_regs.
> + */
> +int insn_get_modrm_reg_off(struct insn *insn, struct pt_regs *regs)
> +{
> +	return get_reg_offset(insn, regs, REG_TYPE_REG);
> +}
> +
>  /**
>   * get_seg_base_limit() - obtain base address and limit of a segment
>   * @insn:	Instruction. Must be valid.
> -- 
> 2.17.1
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* [tip: x86/boot] x86/boot/compressed: Fix debug_puthex() parameter type
  2020-03-19  9:13 ` [PATCH 10/70] x86/boot/compressed: Fix debug_puthex() parameter type Joerg Roedel
@ 2020-03-28 11:23   ` tip-bot2 for Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: tip-bot2 for Joerg Roedel @ 2020-03-28 11:23 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Joerg Roedel, Borislav Petkov, x86, LKML

The following commit has been merged into the x86/boot branch of tip:

Commit-ID:     c90beea22a2bece4b0bbb39789bf835504421594
Gitweb:        https://git.kernel.org/tip/c90beea22a2bece4b0bbb39789bf835504421594
Author:        Joerg Roedel <jroedel@suse.de>
AuthorDate:    Thu, 19 Mar 2020 10:13:07 +01:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Sat, 28 Mar 2020 12:14:26 +01:00

x86/boot/compressed: Fix debug_puthex() parameter type

In the CONFIG_X86_VERBOSE_BOOTUP=Y case, the debug_puthex() macro just
turns into __puthex(), which takes 'unsigned long' as parameter.

But in the CONFIG_X86_VERBOSE_BOOTUP=N case, it is a function which
takes 'unsigned char *', causing compile warnings when the function is
used. Fix the parameter type to get rid of the warnings.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200319091407.1481-11-joro@8bytes.org
---
 arch/x86/boot/compressed/misc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index c818139..726e264 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -59,7 +59,7 @@ void __puthex(unsigned long value);
 
 static inline void debug_putstr(const char *s)
 { }
-static inline void debug_puthex(const char *s)
+static inline void debug_puthex(unsigned long value)
 { }
 #define debug_putaddr(x) /* */
 

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 11/70] x86/boot/compressed/64: Disable red-zone usage
  2020-03-19  9:13   ` Joerg Roedel
  (?)
@ 2020-03-31 13:16   ` Borislav Petkov
  -1 siblings, 0 replies; 243+ messages in thread
From: Borislav Petkov @ 2020-03-31 13:16 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, Mar 19, 2020 at 10:13:08AM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> The x86-64 ABI defines a red-zone on the stack:
> 
>   The 128-byte area beyond the location pointed to by %rsp is
>   considered to be reserved and shall not be modified by signal or
>   interrupt handlers. 10 Therefore, functions may use this area for
			^^

That 10 is the footnote number from the pdf. :)

>   temporary data that is not needed across function calls. In
>   particular, leaf functions may use this area for their entire stack
>   frame, rather than adjusting the stack pointer in the prologue and
>   epilogue. This area is known as the red zone.
> 
> This is not compatible with exception handling, so disable it for the

I could use some blurb as to what the problem is, for future reference.

> pre-decompression boot code.
> 
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/boot/Makefile            | 2 +-
>  arch/x86/boot/compressed/Makefile | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
> index 012b82fc8617..8f55e4ce1ccc 100644
> --- a/arch/x86/boot/Makefile
> +++ b/arch/x86/boot/Makefile
> @@ -65,7 +65,7 @@ clean-files += cpustr.h
>  
>  # ---------------------------------------------------------------------------
>  
> -KBUILD_CFLAGS	:= $(REALMODE_CFLAGS) -D_SETUP
> +KBUILD_CFLAGS	:= $(REALMODE_CFLAGS) -D_SETUP -mno-red-zone
>  KBUILD_AFLAGS	:= $(KBUILD_CFLAGS) -D__ASSEMBLY__
>  KBUILD_CFLAGS	+= $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
>  GCOV_PROFILE := n
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index 26050ae0b27e..e186cc0b628d 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -30,7 +30,7 @@ KBUILD_CFLAGS := -m$(BITS) -O2
>  KBUILD_CFLAGS += -fno-strict-aliasing $(call cc-option, -fPIE, -fPIC)
>  KBUILD_CFLAGS += -DDISABLE_BRANCH_PROFILING
>  cflags-$(CONFIG_X86_32) := -march=i386
> -cflags-$(CONFIG_X86_64) := -mcmodel=small
> +cflags-$(CONFIG_X86_64) := -mcmodel=small -mno-red-zone
>  KBUILD_CFLAGS += $(cflags-y)
>  KBUILD_CFLAGS += -mno-mmx -mno-sse
>  KBUILD_CFLAGS += $(call cc-option,-ffreestanding)
> @@ -87,7 +87,7 @@ endif
>  
>  vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
>  
> -$(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar -mno-red-zone
> +$(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar
>  
>  vmlinux-objs-$(CONFIG_EFI_STUB) += $(obj)/eboot.o \
>  	$(objtree)/drivers/firmware/efi/libstub/lib.a

That last chunk is not needed anymore after

c2d0b470154c ("efi/libstub/x86: Incorporate eboot.c into libstub")

AFAICT.
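
On the blurb asked for above, a minimal sketch of the problem, not taken
from the patch: it is a toy model in plain C where the stack is an array
and an exception frame is pushed right below the current stack pointer.
All toy_* names are made up for illustration, and the model ignores the
16-byte stack alignment the CPU does before pushing the frame.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_STACK_WORDS  64
#define TOY_FRAME_WORDS  5			/* SS, RSP, RFLAGS, CS, RIP */
#define TOY_FRAME_MARKER 0xdeadbeefdeadbeefULL

/* Toy model: the stack grows down, sp indexes the current top-of-stack slot. */
struct toy_cpu {
	uint64_t stack[TOY_STACK_WORDS];
	unsigned int sp;
};

static void toy_cpu_init(struct toy_cpu *cpu)
{
	cpu->sp = TOY_STACK_WORDS / 2;
}

/* Exception delivered on the current stack: the frame lands right below sp. */
static void toy_deliver_exception(struct toy_cpu *cpu)
{
	assert(cpu->sp >= TOY_FRAME_WORDS);
	for (int i = 0; i < TOY_FRAME_WORDS; i++)
		cpu->stack[--cpu->sp] = TOY_FRAME_MARKER;
	cpu->sp += TOY_FRAME_WORDS;	/* handler returns, frame is popped */
}

/* Leaf function using the red zone: the temporary lives below sp. */
static uint64_t toy_leaf_red_zone(struct toy_cpu *cpu, uint64_t val, int fault)
{
	cpu->stack[cpu->sp - 1] = val;	/* sp is not adjusted */
	if (fault)
		toy_deliver_exception(cpu);
	return cpu->stack[cpu->sp - 1];
}

/* Same function as if built with -mno-red-zone: sp moves before the store. */
static uint64_t toy_leaf_no_red_zone(struct toy_cpu *cpu, uint64_t val, int fault)
{
	cpu->stack[--cpu->sp] = val;
	if (fault)
		toy_deliver_exception(cpu);
	return cpu->stack[cpu->sp++];
}
```

In this model the red-zone variant gets its temporary back only as long as
no exception hits in between; with an exception, the pushed frame
overwrites it. That is the situation the pre-decompression code is in once
it can take #PF or #VC on its own stack.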

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 14/70] x86/boot/compressed/64: Add page-fault handler
  2020-03-19  9:13   ` Joerg Roedel
  (?)
@ 2020-04-02 11:49   ` Borislav Petkov
  -1 siblings, 0 replies; 243+ messages in thread
From: Borislav Petkov @ 2020-04-02 11:49 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, Mar 19, 2020 at 10:13:11AM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Install a page-fault handler to add an identity mapping to addresses
> not yet mapped. Also do some checking whether the error code is sane.
> 
> This makes non SEV-ES machines use the exception handling
> infrastructure in the pre-decompression boot code too, making it less
> likely to break in the future.
> 
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/boot/compressed/ident_map_64.c    | 38 ++++++++++++++++++++++
>  arch/x86/boot/compressed/idt_64.c          |  2 ++
>  arch/x86/boot/compressed/idt_handlers_64.S |  2 ++
>  arch/x86/boot/compressed/misc.h            |  6 ++++
>  4 files changed, 48 insertions(+)
> 
> diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
> index 3a2115582920..0865d181b85d 100644
> --- a/arch/x86/boot/compressed/ident_map_64.c
> +++ b/arch/x86/boot/compressed/ident_map_64.c
> @@ -19,11 +19,13 @@
>  /* No PAGE_TABLE_ISOLATION support needed either: */
>  #undef CONFIG_PAGE_TABLE_ISOLATION
>  
> +#include "error.h"
>  #include "misc.h"
>  
>  /* These actually do the work of building the kernel identity maps. */
>  #include <asm/init.h>
>  #include <asm/pgtable.h>
> +#include <asm/trap_defs.h>
>  /* Use the static base for this part of the boot process */
>  #undef __PAGE_OFFSET
>  #define __PAGE_OFFSET __PAGE_OFFSET_BASE
> @@ -163,3 +165,39 @@ void finalize_identity_maps(void)
>  {
>  	write_cr3(top_level_pgt);
>  }
> +
> +static void pf_error(unsigned long error_code, unsigned long address,
> +		     struct pt_regs *regs)

AFAICT, that function is only called from the handler below, so just
merge its body into the call site instead...

> +{
> +	error_putstr("Unexpected page-fault:");
> +	error_putstr("\nError Code: ");
> +	error_puthex(error_code);
> +	error_putstr("\nCR2: 0x");
> +	error_puthex(address);
> +	error_putstr("\nRIP relative to _head: 0x");
> +	error_puthex(regs->ip - (unsigned long)_head);
> +	error_putstr("\n");
> +
> +	error("Stopping.\n");
> +}
> +
> +void do_boot_page_fault(struct pt_regs *regs)
> +{
> +	unsigned long address = native_read_cr2();
> +	unsigned long error_code = regs->orig_ax;
> +
> +	/*
> +	 * Check for unexpected error codes. Unexpected are:
> +	 *	- Faults on present pages
> +	 *	- User faults
> +	 *	- Reserved bits set
> +	 */
> +	if (error_code & (X86_PF_PROT | X86_PF_USER | X86_PF_RSVD))
> +		pf_error(error_code, address, regs);
> +
> +	/*
> +	 * Error code is sane - now identity map the 2M region around
> +	 * the faulting address.
> +	 */
> +	add_identity_map(address & PMD_MASK, PMD_SIZE);
> +}
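
For reference, a standalone sketch of the two checks in the handler above,
written as userspace C. The bit positions are the architectural page-fault
error code bits; the TOY_* names and the two helpers are made up for
illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 page-fault error code bits */
#define TOY_PF_PROT	(1ULL << 0)	/* fault on a present page */
#define TOY_PF_WRITE	(1ULL << 1)	/* write access */
#define TOY_PF_USER	(1ULL << 2)	/* access from user mode */
#define TOY_PF_RSVD	(1ULL << 3)	/* reserved bit set in a paging entry */

/* 2M region covered by one PMD-level mapping */
#define TOY_PMD_SIZE	(1ULL << 21)
#define TOY_PMD_MASK	(~(TOY_PMD_SIZE - 1))

/* Only not-present, kernel-mode faults without reserved bits are expected. */
static bool toy_pf_error_code_sane(uint64_t error_code)
{
	return !(error_code & (TOY_PF_PROT | TOY_PF_USER | TOY_PF_RSVD));
}

/* Start of the 2M region that would be identity-mapped for this address. */
static uint64_t toy_pf_region_start(uint64_t address)
{
	return address & TOY_PMD_MASK;
}
```

With these definitions a write fault on an unmapped address passes the
check and is rounded down to its 2M boundary, while any fault with the
present, user or reserved bit set is rejected.
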
> diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
> index 46ecea671b90..84ba57d9d436 100644
> --- a/arch/x86/boot/compressed/idt_64.c
> +++ b/arch/x86/boot/compressed/idt_64.c
> @@ -39,5 +39,7 @@ void load_stage2_idt(void)
>  {
>  	boot_idt_desc.address = (unsigned long)boot_idt;
>  
> +	set_idt_entry(X86_TRAP_PF, boot_pf_handler);
> +
>  	load_boot_idt(&boot_idt_desc);
>  }
> diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
> index 3d86ab35ef52..bfb3fc5aa144 100644
> --- a/arch/x86/boot/compressed/idt_handlers_64.S
> +++ b/arch/x86/boot/compressed/idt_handlers_64.S
> @@ -73,3 +73,5 @@ SYM_FUNC_END(\name)
>  
>  	.text
>  	.code64
> +
> +EXCEPTION_HANDLER	boot_pf_handler do_boot_page_fault error_code=1

			boot_page_fault do_boot_page_fault

i.e., name it like the PF handler proper, pls. Grepping for "page_fault"
would then find all of them.
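
As background for the set_idt_entry() call quoted earlier, a rough sketch
of how a handler address ends up in a 64-bit IDT gate. The field layout is
the architectural one; the struct and helper names are hypothetical and
this is not the implementation from the series.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_GATE_INTERRUPT 0x8E	/* present, DPL 0, 64-bit interrupt gate */

/* 16-byte gate descriptor as used in long mode */
struct toy_idt_gate {
	uint16_t offset_low;	/* handler address bits 0..15 */
	uint16_t selector;	/* code segment selector */
	uint8_t  ist;		/* interrupt stack table index, 0 = none */
	uint8_t  type_attr;	/* gate type, DPL and present bit */
	uint16_t offset_mid;	/* handler address bits 16..31 */
	uint32_t offset_high;	/* handler address bits 32..63 */
	uint32_t reserved;
};

/* Split the handler address across the three offset fields. */
static struct toy_idt_gate toy_make_gate(uint64_t handler, uint16_t selector)
{
	struct toy_idt_gate gate = {
		.offset_low  = (uint16_t)(handler & 0xffff),
		.selector    = selector,
		.ist         = 0,
		.type_attr   = TOY_GATE_INTERRUPT,
		.offset_mid  = (uint16_t)((handler >> 16) & 0xffff),
		.offset_high = (uint32_t)(handler >> 32),
		.reserved    = 0,
	};

	return gate;
}

/* Reassemble the handler address from a gate. */
static uint64_t toy_gate_handler(const struct toy_idt_gate *gate)
{
	return (uint64_t)gate->offset_low |
	       ((uint64_t)gate->offset_mid << 16) |
	       ((uint64_t)gate->offset_high << 32);
}
```

Since the gate stores an absolute address, the entries have to be written
again whenever the code they point to has moved.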

> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 3a030a878d53..eff4ed0b1cea 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -37,6 +37,9 @@
>  #define memptr unsigned
>  #endif
>  
> +/* boot/compressed/vmlinux start and end markers */
> +extern char _head[], _end[];
> +
>  /* misc.c */
>  extern memptr free_mem_ptr;
>  extern memptr free_mem_end_ptr;
> @@ -146,4 +149,7 @@ extern pteval_t __default_kernel_pte_mask;
>  extern gate_desc boot_idt[BOOT_IDT_ENTRIES];
>  extern struct desc_ptr boot_idt_desc;
>  
> +/* IDT Entry Points */
> +void boot_pf_handler(void);
> +
>  #endif /* BOOT_COMPRESSED_MISC_H */
> -- 
> 2.17.1
> 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 15/70] x86/boot/compressed/64: Always switch to own page-table
  2020-03-19  9:13 ` [PATCH 15/70] x86/boot/compressed/64: Always switch to own page-table Joerg Roedel
@ 2020-04-06 11:56   ` Borislav Petkov
  0 siblings, 0 replies; 243+ messages in thread
From: Borislav Petkov @ 2020-04-06 11:56 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, Mar 19, 2020 at 10:13:12AM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> When booted through startup_64 the kernel keeps running on the EFI
> page-table until the KASLR code sets up its own page-table. Without
> KASLR the pre-decompression boot code never switches off the EFI
> page-table. Change that by unconditionally switching to our own
> page-table once the kernel is relocated.
> 
> This makes sure we can make changes to the mapping when necessary, for

Pls use passive voice in your commit message: no "we" or "I", etc, and
describe your changes in imperative mood.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 18/70] x86/boot/compressed/64: Add stage1 #VC handler
  2020-03-19  9:13   ` Joerg Roedel
  (?)
  (?)
@ 2020-04-06 12:41   ` Borislav Petkov
  -1 siblings, 0 replies; 243+ messages in thread
From: Borislav Petkov @ 2020-04-06 12:41 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, Mar 19, 2020 at 10:13:15AM +0100, Joerg Roedel wrote:
> diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
> index bfb3fc5aa144..67ddafab2943 100644
> --- a/arch/x86/boot/compressed/idt_handlers_64.S
> +++ b/arch/x86/boot/compressed/idt_handlers_64.S
> @@ -75,3 +75,7 @@ SYM_FUNC_END(\name)
>  	.code64
>  
>  EXCEPTION_HANDLER	boot_pf_handler do_boot_page_fault error_code=1
> +
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +EXCEPTION_HANDLER	boot_stage1_vc_handler vc_no_ghcb_handler error_code=1

Like the others
			boot_stage1_vc	do_boot_stage1_vc ...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 12/70] x86/boot/compressed/64: Add IDT Infrastructure
  2020-03-19  9:13 ` [PATCH 12/70] x86/boot/compressed/64: Add IDT Infrastructure Joerg Roedel
@ 2020-04-07  2:21   ` Arvind Sankar
  2020-04-16 13:30       ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: Arvind Sankar @ 2020-04-07  2:21 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Thu, Mar 19, 2020 at 10:13:09AM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Add code needed to set up an IDT in the early pre-decompression
> boot-code. The IDT is loaded first in startup_64, which is after
> EfiExitBootServices() has been called, and later reloaded when the
> kernel image has been relocated to the end of the decompression area.
> 
> This allows setting up different IDT handlers before and after the
> relocation.
> 
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index 1f1f6c8139b3..d27a9ce1bcb0 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -465,6 +470,16 @@ SYM_FUNC_END_ALIAS(efi_stub_entry)
>  	.text
>  SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
>  
> +/*
> + * Reload GDT after relocation - The GDT at the non-relocated position
> + * might be overwritten soon by the in-place decompression, so reload
> + * GDT at the relocated address. The GDT is referenced by exception
> + * handling and needs to be set up correctly.
> + */
> +	leaq	gdt(%rip), %rax
> +	movq	%rax, gdt64+2(%rip)
> +	lgdt	gdt64(%rip)
> +
>  /*
>   * Clear BSS (stack is currently empty)
>   */

Note that this is now done in mainline as of commit c98a76eabbb6e, just
prior to jumping to .Lrelocated, so this can be dropped on the next
rebase.
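
For readers wondering about the "gdt64+2" in the quoted hunk: in 64-bit
mode lgdt takes a 10-byte operand, a 16-bit limit followed by a 64-bit
base, so the base sits at byte offset 2. A small sketch of that layout in
C, assuming gdt64 follows it; the struct and helper names are made up and
the packed attribute is the GCC/Clang spelling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Operand of lgdt in 64-bit mode: 2-byte limit, then 8-byte base. */
struct toy_gdt_ptr {
	uint16_t limit;		/* size of the GDT in bytes, minus one */
	uint64_t base;		/* linear address of the GDT */
} __attribute__((packed));

/* C equivalent of "leaq gdt(%rip), %rax; movq %rax, gdt64+2(%rip)". */
static void toy_gdt_ptr_set_base(struct toy_gdt_ptr *ptr, const void *gdt)
{
	ptr->base = (uint64_t)(uintptr_t)gdt;
}
```

Only the base changes when the image is relocated; the limit stays as it
was assembled.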

Thanks.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  2020-03-19  9:13 ` [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler Joerg Roedel
@ 2020-04-14 19:03     ` Mike Stunes
  0 siblings, 0 replies; 243+ messages in thread
From: Mike Stunes @ 2020-04-14 19:03 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

On Mar 19, 2020, at 2:13 AM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> From: Tom Lendacky <thomas.lendacky@amd.com>
> 
> The runtime handler needs a GHCB per CPU. Set them up and map them
> unencrypted.
> 
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
> arch/x86/include/asm/mem_encrypt.h |  2 ++
> arch/x86/kernel/sev-es.c           | 28 +++++++++++++++++++++++++++-
> arch/x86/kernel/traps.c            |  3 +++
> 3 files changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
> index c17980e8db78..4bf5286310a0 100644
> --- a/arch/x86/kernel/sev-es.c
> +++ b/arch/x86/kernel/sev-es.c
> @@ -197,6 +203,26 @@ static bool __init sev_es_setup_ghcb(void)
> 	return true;
> }
> 
> +void sev_es_init_ghcbs(void)
> +{
> +	int cpu;
> +
> +	if (!sev_es_active())
> +		return;
> +
> +	/* Allocate GHCB pages */
> +	ghcb_page = __alloc_percpu(sizeof(struct ghcb), PAGE_SIZE);
> +
> +	/* Initialize per-cpu GHCB pages */
> +	for_each_possible_cpu(cpu) {
> +		struct ghcb *ghcb = (struct ghcb *)per_cpu_ptr(ghcb_page, cpu);
> +
> +		set_memory_decrypted((unsigned long)ghcb,
> +				     sizeof(*ghcb) >> PAGE_SHIFT);
> +		memset(ghcb, 0, sizeof(*ghcb));
> +	}
> +}
> +

The return value of set_memory_decrypted() needs to be checked here. I
see it consistently return -ENOMEM, which I've traced back to
split_large_page() in arch/x86/mm/pat/set_memory.c.
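
A rough sketch of what checking the return value could look like, as
userspace C so it can be compiled on its own. toy_set_memory_decrypted()
is only a stand-in that returns 0 or a negative errno like the real
function; every toy_* name, the CPU count and the failure knob are
assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TOY_NR_CPUS	4
#define TOY_PAGE_SIZE	4096

struct toy_ghcb {
	unsigned char bytes[TOY_PAGE_SIZE];
};

static struct toy_ghcb toy_ghcbs[TOY_NR_CPUS];
static const void *toy_fail_page;	/* test knob: page the stand-in fails for */

/* Stand-in for set_memory_decrypted(): 0 on success, -ENOMEM on failure. */
static int toy_set_memory_decrypted(void *addr, int numpages)
{
	(void)numpages;
	return addr == toy_fail_page ? -ENOMEM : 0;
}

/* Set up one GHCB per CPU and stop at the first failure. */
static int toy_init_ghcbs(void)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		struct toy_ghcb *ghcb = &toy_ghcbs[cpu];
		int ret;

		ret = toy_set_memory_decrypted(ghcb,
					       (int)(sizeof(*ghcb) / TOY_PAGE_SIZE));
		if (ret)
			return ret;	/* caller decides how to bail out */

		memset(ghcb, 0, sizeof(*ghcb));
	}

	return 0;
}
```

The sketch only propagates the error. What the caller does with it
(BUG(), panic, terminating the guest) is a separate policy decision.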

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  2020-04-14 19:03     ` Mike Stunes
@ 2020-04-14 20:04       ` Tom Lendacky
  -1 siblings, 0 replies; 243+ messages in thread
From: Tom Lendacky @ 2020-04-14 20:04 UTC (permalink / raw)
  To: Mike Stunes, Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Juergen Gross,
	Kees Cook, linux-kernel, kvm, virtualization, Joerg Roedel

On 4/14/20 2:03 PM, Mike Stunes wrote:
> On Mar 19, 2020, at 2:13 AM, Joerg Roedel <joro@8bytes.org> wrote:
>>
>> From: Tom Lendacky <thomas.lendacky@amd.com>
>>
>> The runtime handler needs a GHCB per CPU. Set them up and map them
>> unencrypted.
>>
>> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
>> Signed-off-by: Joerg Roedel <jroedel@suse.de>
>> ---
>> arch/x86/include/asm/mem_encrypt.h |  2 ++
>> arch/x86/kernel/sev-es.c           | 28 +++++++++++++++++++++++++++-
>> arch/x86/kernel/traps.c            |  3 +++
>> 3 files changed, 32 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
>> index c17980e8db78..4bf5286310a0 100644
>> --- a/arch/x86/kernel/sev-es.c
>> +++ b/arch/x86/kernel/sev-es.c
>> @@ -197,6 +203,26 @@ static bool __init sev_es_setup_ghcb(void)
>> 	return true;
>> }
>>
>> +void sev_es_init_ghcbs(void)
>> +{
>> +	int cpu;
>> +
>> +	if (!sev_es_active())
>> +		return;
>> +
>> +	/* Allocate GHCB pages */
>> +	ghcb_page = __alloc_percpu(sizeof(struct ghcb), PAGE_SIZE);
>> +
>> +	/* Initialize per-cpu GHCB pages */
>> +	for_each_possible_cpu(cpu) {
>> +		struct ghcb *ghcb = (struct ghcb *)per_cpu_ptr(ghcb_page, cpu);
>> +
>> +		set_memory_decrypted((unsigned long)ghcb,
>> +				     sizeof(*ghcb) >> PAGE_SHIFT);
>> +		memset(ghcb, 0, sizeof(*ghcb));
>> +	}
>> +}
>> +
> 
> set_memory_decrypted needs to check the return value. I see it
> consistently return ENOMEM. I've traced that back to split_large_page
> in arch/x86/mm/pat/set_memory.c.

At that point the guest won't be able to communicate with the hypervisor
either. Maybe we should BUG() here to terminate further processing?

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  2020-04-14 20:04       ` Tom Lendacky
@ 2020-04-14 20:12         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2020-04-14 20:12 UTC (permalink / raw)
  To: Tom Lendacky, Mike Stunes, Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Juergen Gross,
	Kees Cook, linux-kernel, kvm, virtualization, Joerg Roedel

On 4/14/20 1:04 PM, Tom Lendacky wrote:
>> set_memory_decrypted needs to check the return value. I see it
>> consistently return ENOMEM. I've traced that back to split_large_page
>> in arch/x86/mm/pat/set_memory.c.
> 
> At that point the guest won't be able to communicate with the
> hypervisor, too. Maybe we should BUG() here to terminate further
> processing?

Escalating an -ENOMEM into a crashed kernel seems a bit extreme.
Granted, the guest may be in an unrecoverable state, but the host
doesn't need to be too.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  2020-04-14 20:12         ` Dave Hansen
@ 2020-04-14 20:16           ` Tom Lendacky
  -1 siblings, 0 replies; 243+ messages in thread
From: Tom Lendacky @ 2020-04-14 20:16 UTC (permalink / raw)
  To: Dave Hansen, Mike Stunes, Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Juergen Gross,
	Kees Cook, linux-kernel, kvm, virtualization, Joerg Roedel



On 4/14/20 3:12 PM, Dave Hansen wrote:
> On 4/14/20 1:04 PM, Tom Lendacky wrote:
>>> set_memory_decrypted needs to check the return value. I see it
>>> consistently return ENOMEM. I've traced that back to split_large_page
>>> in arch/x86/mm/pat/set_memory.c.
>>
>> At that point the guest won't be able to communicate with the
>> hypervisor, too. Maybe we should BUG() here to terminate further
>> processing?
> 
> Escalating an -ENOMEM into a crashed kernel seems a bit extreme.
> Granted, the guest may be in an unrecoverable state, but the host
> doesn't need to be too.
> 

The host wouldn't be. This only happens in a guest, so it would just
cause the guest kernel to panic early in boot.

Thanks,
Tom


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  2020-04-14 20:16           ` Tom Lendacky
@ 2020-04-14 20:18             ` Tom Lendacky
  -1 siblings, 0 replies; 243+ messages in thread
From: Tom Lendacky @ 2020-04-14 20:18 UTC (permalink / raw)
  To: Dave Hansen, Mike Stunes, Joerg Roedel
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Juergen Gross,
	Kees Cook, linux-kernel, kvm, virtualization, Joerg Roedel

On 4/14/20 3:16 PM, Tom Lendacky wrote:
> 
> 
> On 4/14/20 3:12 PM, Dave Hansen wrote:
>> On 4/14/20 1:04 PM, Tom Lendacky wrote:
>>>> set_memory_decrypted needs to check the return value. I see it
>>>> consistently return ENOMEM. I've traced that back to split_large_page
>>>> in arch/x86/mm/pat/set_memory.c.
>>>
>>> At that point the guest won't be able to communicate with the
>>> hypervisor, too. Maybe we should BUG() here to terminate further
>>> processing?
>>
>> Escalating an -ENOMEM into a crashed kernel seems a bit extreme.
>> Granted, the guest may be in an unrecoverable state, but the host
>> doesn't need to be too.
>>
> 
> The host wouldn't be. This only happens in a guest, so it would be just 
> causing the guest kernel to panic early in the boot.

And I should add that it would only impact an SEV-ES guest.

Thanks,
Tom

> 
> Thanks,
> Tom
> 

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  2020-04-14 19:03     ` Mike Stunes
@ 2020-04-15 15:53       ` Joerg Roedel
  -1 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-04-15 15:53 UTC (permalink / raw)
  To: Mike Stunes
  Cc: Joerg Roedel, x86, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization

Hi Mike,

On Tue, Apr 14, 2020 at 07:03:44PM +0000, Mike Stunes wrote:
> set_memory_decrypted needs to check the return value. I see it
> consistently return ENOMEM. I've traced that back to split_large_page
> in arch/x86/mm/pat/set_memory.c.

I agree that the return code needs to be checked. But I wonder why this
happens. The split_large_page() function returns -ENOMEM when
alloc_pages() fails. Do you boot the guest with minimal RAM assigned?

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  2020-04-14 20:04       ` Tom Lendacky
@ 2020-04-15 15:54         ` Joerg Roedel
  -1 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-04-15 15:54 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Mike Stunes, Joerg Roedel, x86, hpa, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Hellstrom, Jiri Slaby,
	Dan Williams, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization

On Tue, Apr 14, 2020 at 03:04:42PM -0500, Tom Lendacky wrote:
> At that point the guest won't be able to communicate with the hypervisor,
> too. Maybe we should BUG() here to terminate further processing?

We could still talk to the hypervisor, since the boot-GHCB in the
bss-decrypted section is still there. But there is nothing that could be
done here anyway besides terminating the guest.


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 12/70] x86/boot/compressed/64: Add IDT Infrastructure
  2020-04-07  2:21   ` Arvind Sankar
@ 2020-04-16 13:30       ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-04-16 13:30 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: x86, hpa, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Hellstrom, Jiri Slaby, Dan Williams, Tom Lendacky,
	Juergen Gross, Kees Cook, linux-kernel, kvm, virtualization,
	Joerg Roedel

Hi Arvind,

On Mon, Apr 06, 2020 at 10:21:27PM -0400, Arvind Sankar wrote:
> On Thu, Mar 19, 2020 at 10:13:09AM +0100, Joerg Roedel wrote:
> > From: Joerg Roedel <jroedel@suse.de>
> > +/*
> > + * Reload GDT after relocation - The GDT at the non-relocated position
> > + * might be overwritten soon by the in-place decompression, so reload
> > + * GDT at the relocated address. The GDT is referenced by exception
> > + * handling and needs to be set up correctly.
> > + */
> > +	leaq	gdt(%rip), %rax
> > +	movq	%rax, gdt64+2(%rip)
> > +	lgdt	gdt64(%rip)
> > +
> >  /*
> >   * Clear BSS (stack is currently empty)
> >   */
> 
> Note that this is now done in mainline as of commit c98a76eabbb6e, just
> prior to jumping to .Lrelocated, so this can be dropped on the next
> rebase.

Thanks for the heads-up, I removed this hunk.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 05/70] x86/insn: Make inat-tables.c suitable for pre-decompression code
  2020-03-27  3:02       ` Masami Hiramatsu
@ 2020-04-16 15:24         ` Joerg Roedel
  -1 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-04-16 15:24 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Borislav Petkov, Joerg Roedel, x86, hpa, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Hellstrom, Jiri Slaby,
	Dan Williams, Tom Lendacky, Juergen Gross, Kees Cook,
	linux-kernel, kvm, virtualization

Hi Masami,

On Fri, Mar 27, 2020 at 12:02:32PM +0900, Masami Hiramatsu wrote:
> On Wed, 25 Mar 2020 16:39:45 +0100
> Borislav Petkov <bp@alien8.de> wrote:
> 
> > + Masami.
> > 
> > On Thu, Mar 19, 2020 at 10:13:02AM +0100, Joerg Roedel wrote:
> > > From: Joerg Roedel <jroedel@suse.de>
> > > 
> > > The inat-tables.c file has some arrays in it that contain pointers to
> > > other arrays. These pointers need to be relocated when the kernel
> > > image is moved to a different location.
> > > 
> > > The pre-decompression boot-code has no support for applying ELF
> > > relocations, so initialize these arrays at runtime in the
> > > pre-decompression code to make sure all pointers are correctly
> > > initialized.
> 
> I need to check the whole series, but as far as I can understand from
> this patch, it seems that address values cannot be stored in static
> pointers. It may break more things; for example, _kprobe_blacklist
> records the NOKPROBE_SYMBOL() symbol addresses at build time.

The runtime-initialization function is only used in the
pre-decompression boot code (arch/x86/boot/compressed/) which is not
part of the running kernel image. At that stage of booting there is no
support for kprobe or tracing or any other neat features that might
break things here.


> > > +	print "#ifndef __BOOT_COMPRESSED\n"
> > > +
> > >  	# print escape opcode map's array
> > >  	print "/* Escape opcode map array */"
> > >  	print "const insn_attr_t * const inat_escape_tables[INAT_ESC_MAX + 1]" \
> > > @@ -388,6 +391,51 @@ END {
> > >  		for (j = 0; j < max_lprefix; j++)
> > >  			if (atable[i,j])
> > >  				print "	["i"]["j"] = "atable[i,j]","
> > > -	print "};"
> > > +	print "};\n"
> > > +
> > > +	print "#else /* !__BOOT_COMPRESSED */\n"
> 
> I think the definitions of inat_*_tables can be shared in both cases.
> If __BOOT_COMPRESSED is set, we can define inat_init_tables() as an
> initialization function, and if not, it will be just a dummy "do {} while (0)".

The inat_*_tables are all declared const, so this way it is not possible
to change them at runtime. For the running kernel image this is fine, as
there are ELF relocations which fix things up, but at the
pre-decompression boot stage there are no ELF relocations which can fix
the tables, so the pointers in there need to be initialized at runtime.
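
To illustrate the difference, here is a minimal user-space sketch. It is
not the generated inat-tables.c; demo_attr_t, demo_escape_tables,
demo_init_tables() and the DEMO_BOOT_COMPRESSED define are names made up
for this example. In the default build the pointer array is const and
filled by a static initializer, which depends on relocations once the
image moves. With the define set, the array is writable and gets filled
by code at runtime instead.

```c
#include <stddef.h>

/* Hypothetical stand-in for insn_attr_t; not the kernel type. */
typedef unsigned int demo_attr_t;

#define DEMO_ESC_MAX 2

/* Leaf tables hold plain data, so they contain no pointers to fix up. */
static const demo_attr_t demo_escape_table_1[4] = { 1, 2, 3, 4 };
static const demo_attr_t demo_escape_table_2[4] = { 5, 6, 7, 8 };

#ifndef DEMO_BOOT_COMPRESSED
/*
 * Default build: const array of pointers with a static initializer.
 * The stored addresses must be fixed up if the image is moved.
 */
static const demo_attr_t * const demo_escape_tables[DEMO_ESC_MAX + 1] = {
	[1] = demo_escape_table_1,
	[2] = demo_escape_table_2,
};

/* Nothing to do here, the initializer already filled the array. */
static inline void demo_init_tables(void) { }
#else
/*
 * Pre-decompression style build: the array is writable and starts out
 * zeroed; the addresses are stored by code when the function runs.
 */
static const demo_attr_t *demo_escape_tables[DEMO_ESC_MAX + 1];

static void demo_init_tables(void)
{
	demo_escape_tables[1] = demo_escape_table_1;
	demo_escape_tables[2] = demo_escape_table_2;
}
#endif

/* Look up an attribute; returns 0 for an unknown table or opcode. */
static demo_attr_t demo_lookup(int esc, unsigned int opcode)
{
	if (esc < 0 || esc > DEMO_ESC_MAX || opcode >= 4)
		return 0;
	if (!demo_escape_tables[esc])
		return 0;
	return demo_escape_tables[esc][opcode];
}
```

A caller would invoke demo_init_tables() once before the first
demo_lookup(); in the default build that call is an empty function.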

> BTW, where is the __BOOT_COMPRESSED defined?

It is defined in arch/x86/boot/compressed/sev-es.c by patch

	x86/boot/compressed/64: Setup GHCB Based VC Exception handler

which also includes parts of the instruction decoder into the
pre-decompression boot code and adds the only call-site for
inat_init_tables().

> > > +	print "static void inat_init_tables(void)"
> 
> This function should be "inline".
> And I cannot see the call-site of inat_init_tables() in this patch.

The call-site is added with the patch that includes the
instruction decoder into the pre-decompression code. If possible I'd
like to keep those things separate, as both patches are already pretty
big by themselves (and they do different things, in different parts of
the code).

> If possible, please include call-site with definition (especially
> new init function) so that I can check the init call timing too.

The function is called at the first #VC exception after a GHCB has been
set up. Call-path is: boot_vc_handler -> sev_es_setup_ghcb ->
inat_init_tables.

See

	https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/tree/arch/x86/boot/compressed/sev-es.c?h=sev-es-client-v5.6-rc6

for the full code there.

Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 05/70] x86/insn: Make inat-tables.c suitable for pre-decompression code
  2020-04-16 15:24         ` Joerg Roedel
  (?)
@ 2020-04-17 12:50         ` Masami Hiramatsu
  2020-04-17 13:39           ` Joerg Roedel
  -1 siblings, 1 reply; 243+ messages in thread
From: Masami Hiramatsu @ 2020-04-17 12:50 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Borislav Petkov, Joerg Roedel, x86, hpa, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Hellstrom, Jiri Slaby,
	Dan Williams, Tom Lendacky, Juergen Gross, Kees Cook,
	linux-kernel, kvm, virtualization

On Thu, 16 Apr 2020 17:24:06 +0200
Joerg Roedel <joro@8bytes.org> wrote:

> Hi Masami,
> 
> On Fri, Mar 27, 2020 at 12:02:32PM +0900, Masami Hiramatsu wrote:
> > On Wed, 25 Mar 2020 16:39:45 +0100
> > Borislav Petkov <bp@alien8.de> wrote:
> > 
> > > + Masami.
> > > 
> > > On Thu, Mar 19, 2020 at 10:13:02AM +0100, Joerg Roedel wrote:
> > > > From: Joerg Roedel <jroedel@suse.de>
> > > > 
> > > > The inat-tables.c file has some arrays in it that contain pointers to
> > > > other arrays. These pointers need to be relocated when the kernel
> > > > image is moved to a different location.
> > > > 
> > > > The pre-decompression boot-code has no support for applying ELF
> > > > relocations, so initialize these arrays at runtime in the
> > > > pre-decompression code to make sure all pointers are correctly
> > > > initialized.
> > 
> > I need to check the whole series, but as far as I can understand from
> > this patch, it seems that address values cannot be stored in static
> > pointers. It may break more things; for example, _kprobe_blacklist
> > records the NOKPROBE_SYMBOL() symbol addresses at build time.
> 
> The runtime-initialization function is only used in the
> pre-decompression boot code (arch/x86/boot/compressed/) which is not
> part of the running kernel image. At that stage of booting there is no
> support for kprobe or tracing or any other neat features that might
> break things here.

Ah, I got it. So you intended to port the instruction decoder to
pre-decompression boot code, correct?

> > > > +	print "#ifndef __BOOT_COMPRESSED\n"
> > > > +
> > > >  	# print escape opcode map's array
> > > >  	print "/* Escape opcode map array */"
> > > >  	print "const insn_attr_t * const inat_escape_tables[INAT_ESC_MAX + 1]" \
> > > > @@ -388,6 +391,51 @@ END {
> > > >  		for (j = 0; j < max_lprefix; j++)
> > > >  			if (atable[i,j])
> > > >  				print "	["i"]["j"] = "atable[i,j]","
> > > > -	print "};"
> > > > +	print "};\n"
> > > > +
> > > > +	print "#else /* !__BOOT_COMPRESSED */\n"
> > 
> > I think the definitions of inat_*_tables can be shared in both cases.
> > If __BOOT_COMPRESSED is set, we can define inat_init_tables() as an
> > initialization function, and if not, it will be just a dummy "do {} while (0)".
> 
> The inat_*_tables are all declared const, so this way it is not possible
> to change them at runtime.

Indeed.

> For the running kernel image this is fine, as
> there are ELF relocations which fix things up, but at the
> pre-decompression boot stage there are no ELF relocations which can fix
> the tables, so the pointers in there need to be initialized at runtime.

OK.

> 
> > BTW, where is the __BOOT_COMPRESSED defined?
> 
> It is defined in arch/x86/boot/compressed/sev-es.c by patch
> 
> 	x86/boot/compressed/64: Setup GHCB Based VC Exception handler
> 
> which also includes parts of the instruction decoder into the
> pre-decompression boot code and adds the only call-site for
> inat_init_tables().

Thanks, I understand it.

> 
> > > > +	print "static void inat_init_tables(void)"
> > 
> > This function should be "inline".
> > And I cannot see the call-site of inat_init_tables() in this patch.
> 
> The call-site is added with the patch that includes the
> instruction decoder into the pre-decompression code. If possible I'd
> like to keep those things separate, as both patches are already pretty
> big by themselves (and they do different things, in different parts of
> the code).

OK, if you send a v2, please CC me on both.

> 
> > If possible, please include call-site with definition (especially
> > new init function) so that I can check the init call timing too.
> 
> The function is called at the first #VC exception after a GHCB has been
> set up. Call-path is: boot_vc_handler -> sev_es_setup_ghcb ->
> inat_init_tables.

Sounds good to me.

Thank you,

> 
> See
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/tree/arch/x86/boot/compressed/sev-es.c?h=sev-es-client-v5.6-rc6
> 
> for the full code there.
> 
> Thanks,
> 
> 	Joerg


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCH 05/70] x86/insn: Make inat-tables.c suitable for pre-decompression code
  2020-04-17 12:50         ` Masami Hiramatsu
@ 2020-04-17 13:39           ` Joerg Roedel
  0 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-04-17 13:39 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Borislav Petkov, Joerg Roedel, x86, hpa, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Hellstrom, Jiri Slaby,
	Dan Williams, Tom Lendacky, Juergen Gross, Kees Cook,
	linux-kernel, kvm, virtualization

On Fri, Apr 17, 2020 at 09:50:00PM +0900, Masami Hiramatsu wrote:
> On Thu, 16 Apr 2020 17:24:06 +0200
> Joerg Roedel <joro@8bytes.org> wrote:

> Ah, I got it. So you intended to port the instruction decoder to
> pre-decompression boot code, correct?

Right, it is needed there to decode instructions which cause #VC
exceptions when running as an SEV-ES guest.

> > The call-site is added with the patch that includes the
> > instruction decoder into the pre-decompression code. If possible I'd
> > like to keep those things separate, as both patches are already pretty
> > big by themselves (and they do different things, in different parts of
> > the code).
> 
> OK, if you send a v2, please CC me on both.

Will do, I added you to the cc-list for future posts of this series.


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  2020-04-15 15:53       ` Joerg Roedel
@ 2020-04-23  1:33         ` Bo Gan
  -1 siblings, 0 replies; 243+ messages in thread
From: Bo Gan @ 2020-04-23  1:33 UTC (permalink / raw)
  To: Joerg Roedel, Mike Stunes
  Cc: Joerg Roedel, x86, hpa, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Hellstrom, Jiri Slaby, Dan Williams,
	Tom Lendacky, Juergen Gross, Kees Cook, linux-kernel, kvm,
	virtualization

On 4/15/20 8:53 AM, Joerg Roedel wrote:
> Hi Mike,
> 
> On Tue, Apr 14, 2020 at 07:03:44PM +0000, Mike Stunes wrote:
>> set_memory_decrypted needs to check the return value. I see it
>> consistently return ENOMEM. I've traced that back to split_large_page
>> in arch/x86/mm/pat/set_memory.c.
> 
> I agree that the return code needs to be checked. But I wonder why this
> happens. The split_large_page() function returns -ENOMEM when
> alloc_pages() fails. Do you boot the guest with minimal RAM assigned?
> 
> Regards,
> 
> 	Joerg
> 

I just want to add some context around this. The call path that led to
the failure is the following:

	__alloc_pages_slowpath
	__alloc_pages_nodemask
	alloc_pages_current
	alloc_pages
	split_large_page
	__change_page_attr
	__change_page_attr_set_clr
	__set_memory_enc_dec
	set_memory_decrypted
	sev_es_init_ghcbs
	trap_init   -> before mm_init (in init/main.c)
	start_kernel
	x86_64_start_reservations
	x86_64_start_kernel
	secondary_startup_64

At this time, mem_init hasn't been called yet (which would be called by 
mm_init). Thus, the free pages are still owned by memblock. It's in 
mem_init (x86/mm/init_64.c) that memblock_free_all gets called and free 
pages are released.

During testing, I've also noticed that debug_pagealloc=1 will make the 
issue disappear. That's because with debug_pagealloc=1, 
probe_page_size_mask in x86/mm/init.c will not allow large pages 
(2M/1G). Therefore, no split_large_page would happen. Similarly, if CPU 
doesn't have X86_FEATURE_PSE, there won't be large pages either.
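
The ordering problem can be reduced to a toy model in user-space C. This
is only a sketch of the reasoning above: struct demo_boot_state and the
demo_*() functions are hypothetical names, not kernel symbols, and the
real functions do far more than this.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model state; none of these are real kernel symbols. */
struct demo_boot_state {
	bool page_allocator_ready;  /* models "mem_init() has run" */
	bool large_pages_allowed;   /* models 2M/1G mappings being in use */
};

/* Models alloc_pages(): nothing to hand out before the allocator is up. */
static bool demo_alloc_page(const struct demo_boot_state *s)
{
	return s->page_allocator_ready;
}

/* Models split_large_page(): needs one page for the new page table. */
static int demo_split_large_page(const struct demo_boot_state *s)
{
	return demo_alloc_page(s) ? 0 : -ENOMEM;
}

/* Models set_memory_decrypted() on an address in the direct mapping. */
static int demo_set_memory_decrypted(const struct demo_boot_state *s)
{
	if (s->large_pages_allowed)
		return demo_split_large_page(s);

	return 0; /* already mapped with 4K pages, no split needed */
}
```

With page_allocator_ready false and large_pages_allowed true, the model
returns -ENOMEM, which matches the failure seen from trap_init. Clearing
large_pages_allowed (the debug_pagealloc=1 case) or setting
page_allocator_ready (calling after mem_init) makes it return 0.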

Any thoughts? Maybe split_large_page should get pages from memblock at 
early boot?

Bo

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Re: [PATCH 40/70] x86/sev-es: Setup per-cpu GHCBs for the runtime handler
  2020-04-23  1:33         ` Bo Gan
@ 2020-04-23 11:30           ` Joerg Roedel
  -1 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-04-23 11:30 UTC (permalink / raw)
  To: Bo Gan
  Cc: Mike Stunes, Joerg Roedel, x86, hpa, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Hellstrom, Jiri Slaby,
	Dan Williams, Tom Lendacky, Juergen Gross, Kees Cook,
	linux-kernel, kvm, virtualization

On Wed, Apr 22, 2020 at 06:33:13PM -0700, Bo Gan wrote:
> On 4/15/20 8:53 AM, Joerg Roedel wrote:
> > Hi Mike,
> > 
> > On Tue, Apr 14, 2020 at 07:03:44PM +0000, Mike Stunes wrote:
> > > set_memory_decrypted needs to check the return value. I see it
> > > consistently return ENOMEM. I've traced that back to split_large_page
> > > in arch/x86/mm/pat/set_memory.c.
> > 
> > I agree that the return code needs to be checked. But I wonder why this
> > happens. The split_large_page() function returns -ENOMEM when
> > alloc_pages() fails. Do you boot the guest with minimal RAM assigned?
> > 
> > Regards,
> > 
> > 	Joerg
> > 
> 
> I just want to add some context around this. The call path that led to the
> failure is the following:
> 
> 	__alloc_pages_slowpath
> 	__alloc_pages_nodemask
> 	alloc_pages_current
> 	alloc_pages
> 	split_large_page
> 	__change_page_attr
> 	__change_page_attr_set_clr
> 	__set_memory_enc_dec
> 	set_memory_decrypted
> 	sev_es_init_ghcbs
> 	trap_init   -> before mm_init (in init/main.c)
> 	start_kernel
> 	x86_64_start_reservations
> 	x86_64_start_kernel
> 	secondary_startup_64
> 
> At this time, mem_init hasn't been called yet (which would be called by
> mm_init). Thus, the free pages are still owned by memblock. It's in mem_init
> (x86/mm/init_64.c) that memblock_free_all gets called and free pages are
> released.
> 
> During testing, I've also noticed that debug_pagealloc=1 will make the issue
> disappear. That's because with debug_pagealloc=1, probe_page_size_mask in
> x86/mm/init.c will not allow large pages (2M/1G). Therefore, no
> split_large_page would happen. Similarly, if CPU doesn't have
> X86_FEATURE_PSE, there won't be large pages either.
> 
> Any thoughts? Maybe split_large_page should get pages from memblock at early
> boot?

Thanks for your analysis. I fixed it (verified by Mike) by using
early_set_memory_decrypted() instead of set_memory_decrypted(). I still
wonder why I didn't see that issue on my kernel. It has
DEBUG_PAGEALLOC=y set, but it is not enabled by default and I also
didn't pass the command-line parameter.

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-03-19  9:13 ` [PATCH 55/70] x86/sev-es: Handle RDTSCP Events Joerg Roedel
@ 2020-04-24 21:03     ` Mike Stunes
  0 siblings, 0 replies; 243+ messages in thread
From: Mike Stunes @ 2020-04-24 21:03 UTC (permalink / raw)
  To: joro
  Cc: dan.j.williams, dave.hansen, hpa, jgross, jroedel, jslaby,
	keescook, kvm, linux-kernel, luto, peterz, thellstrom,
	thomas.lendacky, virtualization, x86, Mike Stunes

Hi Joerg,

I needed to allow RDTSC(P) from userspace and in early boot in order to
get userspace started properly. Patch below.

---
SEV-ES guests will need to execute rdtsc and rdtscp from userspace and
during early boot. Move the rdtsc(p) #VC handler into common code and
extend the #VC handlers.

Signed-off-by: Mike Stunes <mstunes@vmware.com>
---
 arch/x86/boot/compressed/sev-es.c |  4 ++++
 arch/x86/kernel/sev-es-shared.c   | 23 +++++++++++++++++++++++
 arch/x86/kernel/sev-es.c          | 25 ++-----------------------
 3 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
index 53c65fc09341..1d0290cc46c1 100644
--- a/arch/x86/boot/compressed/sev-es.c
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -158,6 +158,10 @@ void boot_vc_handler(struct pt_regs *regs, unsigned long exit_code)
 	case SVM_EXIT_CPUID:
 		result = vc_handle_cpuid(boot_ghcb, &ctxt);
 		break;
+	case SVM_EXIT_RDTSC:
+	case SVM_EXIT_RDTSCP:
+		result = vc_handle_rdtsc(boot_ghcb, &ctxt, exit_code);
+		break;
 	default:
 		result = ES_UNSUPPORTED;
 		break;
diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
index a632b8f041ec..373ced468659 100644
--- a/arch/x86/kernel/sev-es-shared.c
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -442,3 +442,26 @@ static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
 
 	return ES_OK;
 }
+
+static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
+				      struct es_em_ctxt *ctxt,
+				      unsigned long exit_code)
+{
+	bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
+	enum es_result ret;
+
+	ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
+	if (ret != ES_OK)
+		return ret;
+
+	if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb) &&
+	     (!rdtscp || ghcb_is_valid_rcx(ghcb))))
+		return ES_VMM_ERROR;
+
+	ctxt->regs->ax = ghcb->save.rax;
+	ctxt->regs->dx = ghcb->save.rdx;
+	if (rdtscp)
+		ctxt->regs->cx = ghcb->save.rcx;
+
+	return ES_OK;
+}
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 409a7a2aa630..82199527d012 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -815,29 +815,6 @@ static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
 	return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
 }
 
-static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
-				      struct es_em_ctxt *ctxt,
-				      unsigned long exit_code)
-{
-	bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
-	enum es_result ret;
-
-	ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
-	if (ret != ES_OK)
-		return ret;
-
-	if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb) &&
-	     (!rdtscp || ghcb_is_valid_rcx(ghcb))))
-		return ES_VMM_ERROR;
-
-	ctxt->regs->ax = ghcb->save.rax;
-	ctxt->regs->dx = ghcb->save.rdx;
-	if (rdtscp)
-		ctxt->regs->cx = ghcb->save.rcx;
-
-	return ES_OK;
-}
-
 static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
 {
 	enum es_result ret;
@@ -1001,6 +978,8 @@ static enum es_result vc_context_filter(struct pt_regs *regs, long exit_code)
 		/* List of #VC exit-codes we support in user-space */
 		case SVM_EXIT_EXCP_BASE ... SVM_EXIT_LAST_EXCP:
 		case SVM_EXIT_CPUID:
+		case SVM_EXIT_RDTSC:
+		case SVM_EXIT_RDTSCP:
 			r = ES_OK;
 			break;
 		default:
-- 
2.26.1


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCH] Allow RDTSC and RDTSCP from userspace
@ 2020-04-24 21:03     ` Mike Stunes
  0 siblings, 0 replies; 243+ messages in thread
From: Mike Stunes @ 2020-04-24 21:03 UTC (permalink / raw)
  To: joro
  Cc: dan.j.williams, dave.hansen, hpa, jgross, jroedel, jslaby,
	keescook, kvm, linux-kernel, luto, peterz, thellstrom,
	thomas.lendacky, virtualization, x86, Mike Stunes

Hi Joerg,

I needed to allow RDTSC(P) from userspace and in early boot in order to
get userspace started properly. Patch below.

---
SEV-ES guests will need to execute rdtsc and rdtscp from userspace and
during early boot. Move the rdtsc(p) #VC handler into common code and
extend the #VC handlers.

Signed-off-by: Mike Stunes <mstunes@vmware.com>
---
 arch/x86/boot/compressed/sev-es.c |  4 ++++
 arch/x86/kernel/sev-es-shared.c   | 23 +++++++++++++++++++++++
 arch/x86/kernel/sev-es.c          | 25 ++-----------------------
 3 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
index 53c65fc09341..1d0290cc46c1 100644
--- a/arch/x86/boot/compressed/sev-es.c
+++ b/arch/x86/boot/compressed/sev-es.c
@@ -158,6 +158,10 @@ void boot_vc_handler(struct pt_regs *regs, unsigned long exit_code)
 	case SVM_EXIT_CPUID:
 		result = vc_handle_cpuid(boot_ghcb, &ctxt);
 		break;
+	case SVM_EXIT_RDTSC:
+	case SVM_EXIT_RDTSCP:
+		result = vc_handle_rdtsc(boot_ghcb, &ctxt, exit_code);
+		break;
 	default:
 		result = ES_UNSUPPORTED;
 		break;
diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
index a632b8f041ec..373ced468659 100644
--- a/arch/x86/kernel/sev-es-shared.c
+++ b/arch/x86/kernel/sev-es-shared.c
@@ -442,3 +442,26 @@ static enum es_result vc_handle_cpuid(struct ghcb *ghcb,
 
 	return ES_OK;
 }
+
+static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
+				      struct es_em_ctxt *ctxt,
+				      unsigned long exit_code)
+{
+	bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
+	enum es_result ret;
+
+	ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
+	if (ret != ES_OK)
+		return ret;
+
+	if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb) &&
+	     (!rdtscp || ghcb_is_valid_rcx(ghcb))))
+		return ES_VMM_ERROR;
+
+	ctxt->regs->ax = ghcb->save.rax;
+	ctxt->regs->dx = ghcb->save.rdx;
+	if (rdtscp)
+		ctxt->regs->cx = ghcb->save.rcx;
+
+	return ES_OK;
+}
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 409a7a2aa630..82199527d012 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -815,29 +815,6 @@ static enum es_result vc_handle_wbinvd(struct ghcb *ghcb,
 	return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0);
 }
 
-static enum es_result vc_handle_rdtsc(struct ghcb *ghcb,
-				      struct es_em_ctxt *ctxt,
-				      unsigned long exit_code)
-{
-	bool rdtscp = (exit_code == SVM_EXIT_RDTSCP);
-	enum es_result ret;
-
-	ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0);
-	if (ret != ES_OK)
-		return ret;
-
-	if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb) &&
-	     (!rdtscp || ghcb_is_valid_rcx(ghcb))))
-		return ES_VMM_ERROR;
-
-	ctxt->regs->ax = ghcb->save.rax;
-	ctxt->regs->dx = ghcb->save.rdx;
-	if (rdtscp)
-		ctxt->regs->cx = ghcb->save.rcx;
-
-	return ES_OK;
-}
-
 static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
 {
 	enum es_result ret;
@@ -1001,6 +978,8 @@ static enum es_result vc_context_filter(struct pt_regs *regs, long exit_code)
 		/* List of #VC exit-codes we support in user-space */
 		case SVM_EXIT_EXCP_BASE ... SVM_EXIT_LAST_EXCP:
 		case SVM_EXIT_CPUID:
+		case SVM_EXIT_RDTSC:
+		case SVM_EXIT_RDTSCP:
 			r = ES_OK;
 			break;
 		default:
-- 
2.26.1
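
The check in vc_handle_rdtsc() above can be modelled outside the kernel
roughly as follows. This is only a sketch: the mock_* types and the
VALID_* bits are hypothetical stand-ins for struct ghcb and the
ghcb_is_valid_*() helpers, not the real definitions. The point it shows
is that the result is copied back only when the hypervisor has marked
every register the instruction writes (RAX and RDX always, RCX only for
RDTSCP).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real struct ghcb. */
enum es_result { ES_OK, ES_VMM_ERROR };

#define VALID_RAX (1u << 0)
#define VALID_RCX (1u << 1)
#define VALID_RDX (1u << 2)

struct mock_ghcb {
	uint32_t valid;		/* registers the hypervisor claims to have filled in */
	uint64_t rax, rcx, rdx;
};

struct mock_regs {
	uint64_t ax, cx, dx;
};

/*
 * Copy the result back only if every register the instruction writes
 * was marked valid: RAX/RDX always, RCX additionally for RDTSCP.
 */
static enum es_result mock_handle_rdtsc(const struct mock_ghcb *ghcb,
					struct mock_regs *regs, bool rdtscp)
{
	uint32_t need = VALID_RAX | VALID_RDX;

	if (rdtscp)
		need |= VALID_RCX;

	/* Missing state from the hypervisor is treated as a VMM error. */
	if ((ghcb->valid & need) != need)
		return ES_VMM_ERROR;

	regs->ax = ghcb->rax;
	regs->dx = ghcb->rdx;
	if (rdtscp)
		regs->cx = ghcb->rcx;

	return ES_OK;
}
```

Returning an error instead of using partially filled state matches the
"crash rather than trust the hypervisor" stance taken elsewhere in the
thread.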

* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-24 21:03     ` Mike Stunes
  (?)
@ 2020-04-24 21:24     ` Dave Hansen
  2020-04-24 21:27       ` Tom Lendacky
  -1 siblings, 1 reply; 243+ messages in thread
From: Dave Hansen @ 2020-04-24 21:24 UTC (permalink / raw)
  To: Mike Stunes, joro
  Cc: dan.j.williams, dave.hansen, hpa, jgross, jroedel, jslaby,
	keescook, kvm, linux-kernel, luto, peterz, thellstrom,
	thomas.lendacky, virtualization, x86, Sean Christopherson

On 4/24/20 2:03 PM, Mike Stunes wrote:
> I needed to allow RDTSC(P) from userspace and in early boot in order to
> get userspace started properly. Patch below.
> 
> ---
> SEV-ES guests will need to execute rdtsc and rdtscp from userspace and
> during early boot. Move the rdtsc(p) #VC handler into common code and
> extend the #VC handlers.

Do SEV-ES guests _always_ #VC on rdtsc(p)?

* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-24 21:24     ` Dave Hansen
@ 2020-04-24 21:27       ` Tom Lendacky
  2020-04-24 22:53         ` Dave Hansen
  0 siblings, 1 reply; 243+ messages in thread
From: Tom Lendacky @ 2020-04-24 21:27 UTC (permalink / raw)
  To: Dave Hansen, Mike Stunes, joro
  Cc: dan.j.williams, dave.hansen, hpa, jgross, jroedel, jslaby,
	keescook, kvm, linux-kernel, luto, peterz, thellstrom,
	virtualization, x86, Sean Christopherson

On 4/24/20 4:24 PM, Dave Hansen wrote:
> On 4/24/20 2:03 PM, Mike Stunes wrote:
>> I needed to allow RDTSC(P) from userspace and in early boot in order to
>> get userspace started properly. Patch below.
>>
>> ---
>> SEV-ES guests will need to execute rdtsc and rdtscp from userspace and
>> during early boot. Move the rdtsc(p) #VC handler into common code and
>> extend the #VC handlers.
> 
> Do SEV-ES guests _always_ #VC on rdtsc(p)?

Only if the hypervisor is intercepting those instructions.

Thanks,
Tom

> 

* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-24 21:27       ` Tom Lendacky
@ 2020-04-24 22:53         ` Dave Hansen
  2020-04-25 12:49           ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: Dave Hansen @ 2020-04-24 22:53 UTC (permalink / raw)
  To: Tom Lendacky, Mike Stunes, joro
  Cc: dan.j.williams, dave.hansen, hpa, jgross, jroedel, jslaby,
	keescook, kvm, linux-kernel, luto, peterz, thellstrom,
	virtualization, x86, Sean Christopherson

On 4/24/20 2:27 PM, Tom Lendacky wrote:
> On 4/24/20 4:24 PM, Dave Hansen wrote:
>> On 4/24/20 2:03 PM, Mike Stunes wrote:
>>> I needed to allow RDTSC(P) from userspace and in early boot in order to
>>> get userspace started properly. Patch below.
>>>
>>> ---
>>> SEV-ES guests will need to execute rdtsc and rdtscp from userspace and
>>> during early boot. Move the rdtsc(p) #VC handler into common code and
>>> extend the #VC handlers.
>>
>> Do SEV-ES guests _always_ #VC on rdtsc(p)?
> 
> Only if the hypervisor is intercepting those instructions.

Ahh, so any instruction that can have an instruction intercept set
potentially needs to be able to tolerate a #VC?  Those instruction
intercepts are under the control of the (untrusted relative to the
guest) hypervisor, right?

From the main sev-es series:

+#ifdef CONFIG_AMD_MEM_ENCRYPT
+idtentry vmm_communication     do_vmm_communication    has_error_code=1
+#endif

Since this is set as non-paranoid, that both limits the instructions
that can be used in entry paths *and* limits the future architecture
from being able to add instructions that a current SEV-ES guest doesn't
know about.  Does SEV-ES have versioning so guests can tell if they
might be subject to new interrupt intercepts for which they are not
prepared?  I didn't see anything obvious in section 15.35 of the manual.

There's also a nugget in the manual that says:

> Similarly, the hypervisor should avoid setting intercept bits for
> events that would occur in the #VC handler (such as IRET).

That's a fun point because it means that the (untrusted) hypervisor can
cause endless faults.  I *guess* we have mitigation for this with our
stack guard pages, but it's still a bit nasty that the hypervisor can
arbitrarily land a guest in the double-fault handler.

It just all seems a bit weak for the hypervisor to be considered
untrusted.  But, it's _certainly_ a step in the right direction from SEV.

* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-24 21:03     ` Mike Stunes
  (?)
  (?)
@ 2020-04-25 12:28     ` Joerg Roedel
  -1 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-04-25 12:28 UTC (permalink / raw)
  To: Mike Stunes
  Cc: joro, dan.j.williams, dave.hansen, hpa, jgross, jslaby, keescook,
	kvm, linux-kernel, luto, peterz, thellstrom, thomas.lendacky,
	virtualization, x86

Hi Mike,

On Fri, Apr 24, 2020 at 02:03:16PM -0700, Mike Stunes wrote:
> I needed to allow RDTSC(P) from userspace and in early boot in order to
> get userspace started properly. Patch below.

Thanks, but this is not needed anymore. I removed the vc_context_filter
from the code. The emulation code is now capable of safely handling any
exception from user-space.

Regards,

	Joerg


* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-24 22:53         ` Dave Hansen
@ 2020-04-25 12:49           ` Joerg Roedel
  2020-04-25 18:15             ` Andy Lutomirski
  2020-04-27 18:47             ` [PATCH] Allow RDTSC and RDTSCP from userspace Dave Hansen
  0 siblings, 2 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-04-25 12:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Tom Lendacky, Mike Stunes, joro, dan.j.williams, dave.hansen,
	hpa, jgross, jslaby, keescook, kvm, linux-kernel, luto, peterz,
	thellstrom, virtualization, x86, Sean Christopherson

Hi Dave,

On Fri, Apr 24, 2020 at 03:53:09PM -0700, Dave Hansen wrote:
> Ahh, so any instruction that can have an instruction intercept set
> potentially needs to be able to tolerate a #VC?  Those instruction
> intercepts are under the control of the (untrusted relative to the
> guest) hypervisor, right?
> 
> From the main sev-es series:
> 
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +idtentry vmm_communication     do_vmm_communication    has_error_code=1
> +#endif

The next version of the patch-set (which I will hopefully have ready
next week) will have this changed. The #VC exception handler uses an IST
stack and is set to paranoid=1 and shift_ist. The IST stacks for the #VC
handler are only allocated when SEV-ES is active.

> That's a fun point because it means that the (untrusted) hypervisor can
> cause endless faults.  I *guess* we have mitigation for this with our
> stack guard pages, but it's still a bit nasty that the hypervisor can
> arbitrarily land a guest in the double-fault handler.
> 
> It just all seems a bit weak for the hypervisor to be considered
> untrusted.  But, it's _certainly_ a step in the right direction from SEV.

Yeah, a malicious hypervisor can do bad things to an SEV-ES VM, but it
can't easily steal its secrets from memory or registers. The #VC handler
does its best to just crash the VM if unexpected hypervisor behavior is
detected.


Regards,

	Joerg

* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-25 12:49           ` Joerg Roedel
@ 2020-04-25 18:15             ` Andy Lutomirski
  2020-04-25 19:10               ` Joerg Roedel
  2020-04-27 18:47             ` [PATCH] Allow RDTSC and RDTSCP from userspace Dave Hansen
  1 sibling, 1 reply; 243+ messages in thread
From: Andy Lutomirski @ 2020-04-25 18:15 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Dave Hansen, Tom Lendacky, Mike Stunes, Joerg Roedel,
	Dan Williams, Dave Hansen, H. Peter Anvin, Juergen Gross,
	Jiri Slaby, Kees Cook, kvm list, LKML, Andrew Lutomirski,
	Peter Zijlstra, Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson

On Sat, Apr 25, 2020 at 5:49 AM Joerg Roedel <jroedel@suse.de> wrote:
>
> Hi Dave,
>
> On Fri, Apr 24, 2020 at 03:53:09PM -0700, Dave Hansen wrote:
> > Ahh, so any instruction that can have an instruction intercept set
> > potentially needs to be able to tolerate a #VC?  Those instruction
> > intercepts are under the control of the (untrusted relative to the
> > guest) hypervisor, right?
> >
> > From the main sev-es series:
> >
> > +#ifdef CONFIG_AMD_MEM_ENCRYPT
> > +idtentry vmm_communication     do_vmm_communication    has_error_code=1
> > +#endif
>
> The next version of the patch-set (which I will hopefully have ready
> next week) will have this changed. The #VC exception handler uses an IST
> stack and is set to paranoid=1 and shift_ist. The IST stacks for the #VC
> handler are only allocated when SEV-ES is active.

shift_ist is gross.  What's it for?  If it's not needed, I'd rather
not use it, and I eventually want to get rid of it for #DB as well.

--Andy

* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-25 18:15             ` Andy Lutomirski
@ 2020-04-25 19:10               ` Joerg Roedel
  2020-04-25 19:47                 ` Andy Lutomirski
  0 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-04-25 19:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Dave Hansen, Tom Lendacky, Mike Stunes,
	Dan Williams, Dave Hansen, H. Peter Anvin, Juergen Gross,
	Jiri Slaby, Kees Cook, kvm list, LKML, Peter Zijlstra,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson

On Sat, Apr 25, 2020 at 11:15:35AM -0700, Andy Lutomirski wrote:
> shift_ist is gross.  What's it for?  If it's not needed, I'd rather
> not use it, and I eventually want to get rid of it for #DB as well.

The #VC handler needs to be able to nest, there is no way around that
for various reasons, the two most important ones are:

	1. The #VC -> NMI -> #VC case. #VCs can happen in the NMI
	   handler, for example (but not exclusively) for RDPMC.

	2. In case of an error the #VC handler needs to print out error
	   information by calling one of the printk wrappers. Those will
	   end up doing IO to some console/serial port/whatever which
	   will also cause #VC exceptions to emulate the access to the
	   output devices.

Using shift_ist is perfect for that, the only problem is the race
condition with the NMI handler, as shift_ist does not work well with
exceptions that can also trigger within the NMI handler. But I have
taken care of that for #VC.
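
As a rough illustration of why shifting the IST entry allows nesting,
here is a small user-space model. Everything in it (mock_tss,
MOCK_STKSZ, the helper names) is hypothetical and only mimics the idea:
while a handler runs, the IST slot in the TSS points one stack further
down, so a nested exception of the same type is delivered on fresh
memory instead of on top of the outer frame.

```c
#include <stdint.h>

#define MOCK_STKSZ 4096u	/* hypothetical size of one stack level */

/* Stand-in for the TSS field that holds the #VC IST stack pointer. */
struct mock_tss {
	uintptr_t vc_ist;
};

/* What the CPU does on delivery: start the handler at the current IST value. */
static uintptr_t mock_deliver_vc(const struct mock_tss *tss)
{
	return tss->vc_ist;
}

/* Done early in the handler: a nested #VC now gets the next stack down. */
static void mock_ist_shift(struct mock_tss *tss)
{
	tss->vc_ist -= MOCK_STKSZ;
}

/* Done before returning: restore the entry for the next top-level #VC. */
static void mock_ist_unshift(struct mock_tss *tss)
{
	tss->vc_ist += MOCK_STKSZ;
}
```

In this model the window between delivery and mock_ist_shift() is
exactly where the NMI race mentioned above lives: anything that raises
another #VC inside that window would be delivered on the same stack.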


Regards,

	Joerg


* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-25 19:10               ` Joerg Roedel
@ 2020-04-25 19:47                 ` Andy Lutomirski
  2020-04-25 20:23                   ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: Andy Lutomirski @ 2020-04-25 19:47 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Peter Zijlstra, Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson



> On Apr 25, 2020, at 12:10 PM, Joerg Roedel <joro@8bytes.org> wrote:
> 
> On Sat, Apr 25, 2020 at 11:15:35AM -0700, Andy Lutomirski wrote:
>> shift_ist is gross.  What's it for?  If it's not needed, I'd rather
>> not use it, and I eventually want to get rid of it for #DB as well.
> 
> The #VC handler needs to be able to nest, there is no way around that
> for various reasons, the two most important ones are:
> 
>    1. The #VC -> NMI -> #VC case. #VCs can happen in the NMI
>       handler, for example (but not exclusively) for RDPMC.
> 
>    2. In case of an error the #VC handler needs to print out error
>       information by calling one of the printk wrappers. Those will
>       end up doing IO to some console/serial port/whatever which
>       will also cause #VC exceptions to emulate the access to the
>       output devices.
> 
> Using shift_ist is perfect for that, the only problem is the race
> condition with the NMI handler, as shift_ist does not work well with
> exceptions that can also trigger within the NMI handler. But I have
> taken care of that for #VC.
> 

I assume the race you mean is:

#VC
Immediate NMI before IST gets shifted
#VC

Kaboom.

How are you dealing with this?  Ultimately, I think that NMI will need to turn off IST before engaging in any funny business. Let me ponder this a bit.

> 
> Regards,
> 
>    Joerg
> 

* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-25 19:47                 ` Andy Lutomirski
@ 2020-04-25 20:23                   ` Joerg Roedel
  2020-04-25 22:10                     ` Andy Lutomirski
  0 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-04-25 20:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Peter Zijlstra, Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson

On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
> I assume the race you mean is:
> 
> #VC
> Immediate NMI before IST gets shifted
> #VC
> 
> Kaboom.
> 
> How are you dealing with this?  Ultimately, I think that NMI will need
> to turn off IST before engaging in any funny business. Let me ponder
> this a bit.

Right, I dealt with that by unconditionally shifting/unshifting the #VC IST entry
in do_nmi() (thanks to Davin Kaplan for the idea). It might cause
one of the IST stacks to be unused during nesting, but that is fine. The
stack memory for #VC is only allocated when SEV-ES is active (in an
SEV-ES VM).
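
A toy model of that approach, assuming stacks are numbered as slots with
slot 0 on top and each shift moving the IST entry one slot down (the
function and parameter names are made up for illustration and do not
come from the patch-set):

```c
#include <stdbool.h>

/*
 * Returns the slot on which a #VC raised from inside the NMI handler
 * would be delivered. The interrupted (outer) #VC always runs on slot 0.
 * Hypothetical model only; this is not kernel code.
 */
static int mock_nested_vc_slot(bool outer_already_shifted, bool nmi_shifts)
{
	int ist_slot = 0;	/* slot the CPU would pick for the next #VC */

	/* The outer #VC has been delivered on slot 0 at this point. */
	if (outer_already_shifted)
		ist_slot++;	/* outer handler got as far as its own shift */

	/* The NMI arrives here. */
	if (nmi_shifts)
		ist_slot++;	/* unconditional shift on NMI entry */

	/* A #VC raised by the NMI handler is delivered on this slot. */
	return ist_slot;
}
```

With nmi_shifts false and the outer handler not yet shifted, the nested
#VC lands on slot 0 and overwrites the outer frame, which is the race.
With nmi_shifts true it lands on slot 1, or on slot 2 if the outer
handler had already shifted, which leaves slot 1 unused as described.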

Regards,

	Joerg

* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-25 20:23                   ` Joerg Roedel
@ 2020-04-25 22:10                     ` Andy Lutomirski
  2020-04-27 17:37                       ` Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace) Andy Lutomirski
  0 siblings, 1 reply; 243+ messages in thread
From: Andy Lutomirski @ 2020-04-25 22:10 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Peter Zijlstra, Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson

On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel <joro@8bytes.org> wrote:
>
> On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
> > I assume the race you mean is:
> >
> > #VC
> > Immediate NMI before IST gets shifted
> > #VC
> >
> > Kaboom.
> >
> > How are you dealing with this?  Ultimately, I think that NMI will need
> > to turn off IST before engaging in any funny business. Let me ponder
> > this a bit.
>
> Right, I dealt with that by unconditionally shifting/unshifting the #VC IST entry
> in do_nmi() (thanks to Davin Kaplan for the idea). It might cause
> one of the IST stacks to be unused during nesting, but that is fine. The
> stack memory for #VC is only allocated when SEV-ES is active (in an
> SEV-ES VM).

Blech.  It probably works, but still, yuck.  It's a bit sad that we
seem to be growing more and more poorly designed happens-anywhere
exception types at an alarming rate.  We seem to have #NM, #MC, #VC,
#HV, and #DB.  This doesn't really scale.

--Andy

* Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-04-25 22:10                     ` Andy Lutomirski
@ 2020-04-27 17:37                       ` Andy Lutomirski
  2020-04-27 18:15                         ` Andrew Cooper
                                           ` (3 more replies)
  0 siblings, 4 replies; 243+ messages in thread
From: Andy Lutomirski @ 2020-04-27 17:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Peter Zijlstra, Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Sat, Apr 25, 2020 at 3:10 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel <joro@8bytes.org> wrote:
> >
> > On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
> > > I assume the race you mean is:
> > >
> > > #VC
> > > Immediate NMI before IST gets shifted
> > > #VC
> > >
> > > Kaboom.
> > >
> > > How are you dealing with this?  Ultimately, I think that NMI will need
> > > to turn off IST before engaging in any funny business. Let me ponder
> > > this a bit.
> >
> > Right, I dealt with that by unconditionally shifting/unshifting the #VC IST entry
> > in do_nmi() (thanks to Davin Kaplan for the idea). It might cause
> > one of the IST stacks to be unused during nesting, but that is fine. The
> > stack memory for #VC is only allocated when SEV-ES is active (in an
> > SEV-ES VM).
>
> Blech.  It probably works, but still, yuck.  It's a bit sad that we
> seem to be growing more and more poorly designed happens-anywhere
> exception types at an alarming rate.  We seem to have #NM, #MC, #VC,
> #HV, and #DB.  This doesn't really scale.

I have a somewhat serious question: should we use IST for #VC at all?
As I understand it, Rome and Naples make it mandatory for hypervisors
to intercept #DB, which means that, due to the MOV SS mess, it's sort
of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
if we're running under a sufficiently sensible hypervisor, we don't
need IST for #VC.

So I think we have two choices:

1. Use IST for #VC and deal with all the mess that entails.

2. Say that SEV-ES client support on Rome and Naples is for
development only and do a quick boot-time check for whether #DB is
intercepted.  (Just set TF and see what vector we get.)  If #DB is
intercepted, print a very loud warning and refuse to boot unless some
special sev_es.insecure_development_mode or similar option is set.

#2 results in simpler and more robust entry code.  #1 is more secure.
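
To make option 2 concrete, the decision could look something like the
sketch below. It is purely hypothetical: the function, the enum and the
way the probe result is passed in are invented for illustration, and the
actual probe (set TF, record which vector arrives) is not shown. The
only facts used are that #DB is vector 1 and #VC is vector 29.

```c
#include <stdbool.h>

#define MOCK_VEC_DB	1	/* #DB, debug exception */
#define MOCK_VEC_VC	29	/* #VC, VMM communication exception */

enum mock_boot {
	MOCK_BOOT_OK,		/* #DB is not intercepted */
	MOCK_BOOT_WARN,		/* intercepted, but development mode was requested */
	MOCK_BOOT_REFUSE,	/* intercepted, refuse to continue */
};

/*
 * probe_vector is the vector observed after single-stepping one
 * instruction with TF set. A directly delivered #DB means the
 * hypervisor does not intercept it; anything else (in practice a #VC)
 * is treated as "intercepted".
 */
static enum mock_boot mock_check_db_intercept(int probe_vector,
					      bool insecure_dev_mode)
{
	if (probe_vector == MOCK_VEC_DB)
		return MOCK_BOOT_OK;

	return insecure_dev_mode ? MOCK_BOOT_WARN : MOCK_BOOT_REFUSE;
}
```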

So my question is: will anyone actually use SEV-ES in production on
Rome or Naples?  As I understand it, it's not really ready for prime
time on those chips.  And do we care if the combination of a malicious
hypervisor and malicious guest userspace on Milan can compromise the
guest kernel?  I don't think SEV-ES is really meant to resist a
concerted effort by the hypervisor to compromise the guest.

--Andy

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-04-27 17:37                       ` Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace) Andy Lutomirski
@ 2020-04-27 18:15                         ` Andrew Cooper
  2020-04-27 18:43                         ` Tom Lendacky
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 243+ messages in thread
From: Andrew Cooper @ 2020-04-27 18:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Peter Zijlstra, Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson

On 27/04/2020 18:37, Andy Lutomirski wrote:
> On Sat, Apr 25, 2020 at 3:10 PM Andy Lutomirski <luto@kernel.org> wrote:
>> On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel <joro@8bytes.org> wrote:
>>> On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
>>>> I assume the race you mean is:
>>>>
>>>> #VC
>>>> Immediate NMI before IST gets shifted
>>>> #VC
>>>>
>>>> Kaboom.
>>>>
>>>> How are you dealing with this?  Ultimately, I think that NMI will need
>>>> to turn off IST before engaging in any funny business. Let me ponder
>>>> this a bit.
>>> Right, I dealt with that by unconditionally shifting/unshifting the #VC IST entry
>>> in do_nmi() (thanks to Davin Kaplan for the idea). It might cause
>>> one of the IST stacks to be unused during nesting, but that is fine. The
>>> stack memory for #VC is only allocated when SEV-ES is active (in an
>>> SEV-ES VM).
>> Blech.  It probably works, but still, yuck.  It's a bit sad that we
>> seem to be growing more and more poorly designed happens-anywhere
>> exception types at an alarming rate.  We seem to have #NM, #MC, #VC,
>> #HV, and #DB.  This doesn't really scale.
> I have a somewhat serious question: should we use IST for #VC at all?
> As I understand it, Rome and Naples make it mandatory for hypervisors
> to intercept #DB, which means that, due to the MOV SS mess, it's sort
> of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
> if we're running under a sufficiently sensible hypervisor, we don't
> need IST for #VC.
>
> So I think we have two choices:
>
> 1. Use IST for #VC and deal with all the mess that entails.
>
> 2. Say that SEV-ES client support on Rome and Naples is for
> development only and do a quick boot-time check for whether #DB is
> intercepted.  (Just set TF and see what vector we get.)  If #DB is
> intercepted, print a very loud warning and refuse to boot unless some
> special sev_es.insecure_development_mode or similar option is set.
>
> #2 results in simpler and more robust entry code.  #1 is more secure.
>
> So my question is: will anyone actually use SEV-ES in production on
> Rome or Naples?  As I understand it, it's not really ready for prime
> time on those chips.  And do we care if the combination of a malicious
> hypervisor and malicious guest userspace on Milan can compromise the
> guest kernel?  I don't think SEV-ES is really meant to resist a
> concerted effort by the hypervisor to compromise the guest.

More specifically, it is mandatory for hypervisors to intercept #DB to
defend against CVE-2015-8104, unless they're willing to trust the guest
not to tickle that corner case.

This is believed fixed with SEV-SNP to allow the encrypted guest to use
debugging functionality without posing a DoS risk to the host.  In this
case, the hypervisor is expected not to intercept #DB.

If #DB is intercepted, and #VC doesn't use IST, malicious userspace can
cause problems with a movss-delayed breakpoint over SYSCALL.

The question is basically whether it is worth going to the effort of making
#VC IST and all the problems that entails, to cover one corner case in
earlier hardware.

Ultimately, this depends on whether anyone plans to put SEV-ES into
production on pre SEV-SNP hardware, and if developers using pre-SEV-SNP
hardware are happy with "don't run malicious userspace" or "don't run
malicious kernels and skip the #DB intercept" as a fair tradeoff to
avoid the #VC IST fun.

~Andrew

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-04-27 17:37                       ` Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace) Andy Lutomirski
  2020-04-27 18:15                         ` Andrew Cooper
@ 2020-04-27 18:43                         ` Tom Lendacky
  2020-04-28  7:55                         ` Joerg Roedel
  2020-06-23  9:45                         ` Joerg Roedel
  3 siblings, 0 replies; 243+ messages in thread
From: Tom Lendacky @ 2020-04-27 18:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Joerg Roedel, Dave Hansen, Mike Stunes,
	Dan Williams, Dave Hansen, H. Peter Anvin, Juergen Gross,
	Jiri Slaby, Kees Cook, kvm list, LKML, Peter Zijlstra,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper



On 4/27/20 12:37 PM, Andy Lutomirski wrote:
> On Sat, Apr 25, 2020 at 3:10 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel <joro@8bytes.org> wrote:
>>>
>>> On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
>>>> I assume the race you mean is:
>>>>
>>>> #VC
>>>> Immediate NMI before IST gets shifted
>>>> #VC
>>>>
>>>> Kaboom.
>>>>
>>>> How are you dealing with this?  Ultimately, I think that NMI will need
>>>> to turn off IST before engaging in any funny business. Let me ponder
>>>> this a bit.
>>>
>>> Right, I dealt with that by unconditionally shifting/unshifting the #VC IST entry
>>> in do_nmi() (thanks to Davin Kaplan for the idea). It might cause
>>> one of the IST stacks to be unused during nesting, but that is fine. The
>>> stack memory for #VC is only allocated when SEV-ES is active (in an
>>> SEV-ES VM).
>>
>> Blech.  It probably works, but still, yuck.  It's a bit sad that we
>> seem to be growing more and more poorly designed happens-anywhere
>> exception types at an alarming rate.  We seem to have #NM, #MC, #VC,
>> #HV, and #DB.  This doesn't really scale.
> 
> I have a somewhat serious question: should we use IST for #VC at all?
> As I understand it, Rome and Naples make it mandatory for hypervisors
> to intercept #DB, which means that, due to the MOV SS mess, it's sort
> of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
> if we're running under a sufficiently sensible hypervisor, we don't
> need IST for #VC.
> 
> So I think we have two choices:
> 
> 1. Use IST for #VC and deal with all the mess that entails.
> 
> 2. Say that SEV-ES client support on Rome and Naples is for
> development only and do a quick boot-time check for whether #DB is
> intercepted.  (Just set TF and see what vector we get.)  If #DB is
> intercepted, print a very loud warning and refuse to boot unless some
> special sev_es.insecure_development_mode or similar option is set.
> 
> #2 results in simpler and more robust entry code.  #1 is more secure.
> 
> So my question is: will anyone actually use SEV-ES in production on
> Rome or Naples?  As I understand it, it's not really ready for prime
> time on those chips.  And do we care if the combination of a malicious

Naples was limited in the number of encryption keys available for guests 
(15), but Rome increased that significantly (509). SEV-ES is ready on 
those chips - Rome more so with the increased key count given the 
requirement that SEV and SEV-ES guests have non-overlapping ASID ranges 
(which corresponds to key usage).
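
For illustration, the non-overlapping split can be pictured as below.
This is a sketch under assumptions: it takes the low ASIDs for SEV-ES
guests and the rest for SEV guests, with the split point
(min_sev_asid here, a hypothetical parameter name) chosen by platform
configuration. The numbers in any example call are made up apart from
the 509 total mentioned above.

```c
#include <stdbool.h>

struct mock_asid_range {
	unsigned int first, last;	/* inclusive bounds */
};

/*
 * Assumed layout: SEV-ES guests draw from [1, min_sev_asid - 1],
 * plain SEV guests from [min_sev_asid, max_asid]. Raising
 * min_sev_asid gives SEV-ES more keys at the expense of SEV.
 */
static struct mock_asid_range mock_asid_range_for(bool es_guest,
						  unsigned int min_sev_asid,
						  unsigned int max_asid)
{
	struct mock_asid_range r;

	if (es_guest) {
		r.first = 1;
		r.last = min_sev_asid - 1;
	} else {
		r.first = min_sev_asid;
		r.last = max_asid;
	}

	return r;
}
```

With only 15 ASIDs in total, any such split leaves very few for either
guest type, which is why the larger count on Rome matters.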

Thanks,
Tom

> hypervisor and malicious guest userspace on Milan can compromise the
> guest kernel?  I don't think SEV-ES is really meant to resist a
> concerted effort by the hypervisor to compromise the guest.
> 
> --Andy
> 

* Re: [PATCH] Allow RDTSC and RDTSCP from userspace
  2020-04-25 12:49           ` Joerg Roedel
  2020-04-25 18:15             ` Andy Lutomirski
@ 2020-04-27 18:47             ` Dave Hansen
  1 sibling, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2020-04-27 18:47 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Tom Lendacky, Mike Stunes, joro, dan.j.williams, dave.hansen,
	hpa, jgross, jslaby, keescook, kvm, linux-kernel, luto, peterz,
	thellstrom, virtualization, x86, Sean Christopherson

On 4/25/20 5:49 AM, Joerg Roedel wrote:
>> That's a fun point because it means that the (untrusted) hypervisor can
>> cause endless faults.  I *guess* we have mitigation for this with our
>> stack guard pages, but it's still a bit nasty that the hypervisor can
>> arbitrarily land a guest in the double-fault handler.
>>
>> It just all seems a bit weak for the hypervisor to be considered
>> untrusted.  But, it's _certainly_ a step in the right direction from SEV.
> Yeah, a malicious hypervisor can do bad things to an SEV-ES VM, but it
> can't easily steal its secrets from memory or registers. The #VC handler
> does its best to just crash the VM if unexpected hypervisor behavior is
> detected.

This is the kind of design information that would be very useful to
reviewers.  Will some of this information make it into the cover letter
eventually?  Or, Documentation/?

Also, for the security purists, an SEV-ES host is still trusted (in the
same TCB as the guest).  Truly guest-untrusted VMMs won't be available
until SEV-SNP, right?

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-04-27 17:37                       ` Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace) Andy Lutomirski
  2020-04-27 18:15                         ` Andrew Cooper
  2020-04-27 18:43                         ` Tom Lendacky
@ 2020-04-28  7:55                         ` Joerg Roedel
  2020-04-28 16:34                           ` Andrew Cooper
  2020-06-23 11:07                             ` Peter Zijlstra
  2020-06-23  9:45                         ` Joerg Roedel
  3 siblings, 2 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-04-28  7:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Dave Hansen, Tom Lendacky, Mike Stunes,
	Dan Williams, Dave Hansen, H. Peter Anvin, Juergen Gross,
	Jiri Slaby, Kees Cook, kvm list, LKML, Peter Zijlstra,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
> I have a somewhat serious question: should we use IST for #VC at all?
> As I understand it, Rome and Naples make it mandatory for hypervisors
> to intercept #DB, which means that, due to the MOV SS mess, it's sort
> of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
> if we're running under a sufficiently sensible hypervisor, we don't
> need IST for #VC.

The reason for #VC being IST is not only #DB, but also SEV-SNP. SNP adds
page ownership tracking between guest and host, so that the hypervisor
can't remap guest pages without the guest noticing.

If there is a violation of ownership, which can happen at any memory
access, there will be a #VC exception to notify the guest. And as this
can happen anywhere, for example on a carefully crafted stack page set
by userspace before doing SYSCALL, the only robust choice for #VC is to
use IST.

Regards,

	Joerg


* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-04-28  7:55                         ` Joerg Roedel
@ 2020-04-28 16:34                           ` Andrew Cooper
  2020-06-23 11:07                             ` Peter Zijlstra
  1 sibling, 0 replies; 243+ messages in thread
From: Andrew Cooper @ 2020-04-28 16:34 UTC (permalink / raw)
  To: Joerg Roedel, Andy Lutomirski
  Cc: Joerg Roedel, Dave Hansen, Tom Lendacky, Mike Stunes,
	Dan Williams, Dave Hansen, H. Peter Anvin, Juergen Gross,
	Jiri Slaby, Kees Cook, kvm list, LKML, Peter Zijlstra,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson

On 28/04/2020 08:55, Joerg Roedel wrote:
> On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
>> I have a somewhat serious question: should we use IST for #VC at all?
>> As I understand it, Rome and Naples make it mandatory for hypervisors
>> to intercept #DB, which means that, due to the MOV SS mess, it's sort
>> of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
>> if we're running under a sufficiently sensible hypervisor, we don't
>> need IST for #VC.
> The reason for #VC being IST is not only #DB, but also SEV-SNP. SNP adds
> page ownership tracking between guest and host, so that the hypervisor
> can't remap guest pages without the guest noticing.
>
> If there is a violation of ownership, which can happen at any memory
> access, there will be a #VC exception to notify the guest. And as this
> can happen anywhere, for example on a carefully crafted stack page set
> by userspace before doing SYSCALL, the only robust choice for #VC is to
> use IST.

The kernel won't ever touch the guest stack before restoring %rsp in the
syscall path, but the (minimum 2) memory accesses required to save the
user %rsp and load the kernel stack may be subject to #VC exceptions, as
are instruction fetches at the head of the SYSCALL path.

So yes - #VC needs IST.

Sorry for the noise.  (That said, it is unfortunate that the hypervisor
messing with the memory backing the guest #VC handler results in an
infinite loop, rather than an ability to cleanly terminate.)

~Andrew

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-04-27 17:37                       ` Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace) Andy Lutomirski
                                           ` (2 preceding siblings ...)
  2020-04-28  7:55                         ` Joerg Roedel
@ 2020-06-23  9:45                         ` Joerg Roedel
  2020-06-23 10:45                           ` Peter Zijlstra
  3 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-06-23  9:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Dave Hansen, Tom Lendacky, Mike Stunes,
	Dan Williams, Dave Hansen, H. Peter Anvin, Juergen Gross,
	Jiri Slaby, Kees Cook, kvm list, LKML, Peter Zijlstra,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

Hi Andy,

On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
> 1. Use IST for #VC and deal with all the mess that entails.

With the removal of IST shifting I wonder what you would suggest on how
to best implement an NMI-safe IST handler with nesting support.

My current plan is to implement an IST handler which switches itself off
the IST stack as soon as possible, freeing it for re-use.

The flow would be roughly like this upon entering the handler:

	build_pt_regs();

	RSP = pt_regs->sp;

	if (RSP in VC_IST_stack)
		error("unallowed nesting")

	if (RSP in current_kernel_stack)
		RSP = round_down_to_8(RSP)
	else
		RSP = current_top_of_stack() // non-ist kernel stack

	copy_pt_regs(pt_regs, RSP);
	switch_stack_to(RSP);
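
A minimal user-space sketch of the stack-selection step in the flow above,
written in C. It is not code from the patch-set: `struct stack_range`,
`on_stack()` and `vc_pick_stack()` are hypothetical names, stacks are modelled
as plain half-open address ranges, and building or copying pt_regs and the
actual stack switch are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a stack as the half-open address range [lo, hi). */
struct stack_range {
	uintptr_t lo;
	uintptr_t hi;	/* top of stack; a full-descending stack starts here */
};

static bool on_stack(const struct stack_range *s, uintptr_t sp)
{
	return sp >= s->lo && sp < s->hi;
}

/*
 * Choose the stack pointer the #VC handler should continue on.
 *
 * Returns false if the interrupted RSP already points into the #VC IST
 * stack (the "unallowed nesting" case); the caller is expected to treat
 * that as fatal. Otherwise stores the chosen value in *new_sp.
 */
bool vc_pick_stack(const struct stack_range *vc_ist,
		   const struct stack_range *task,
		   uintptr_t interrupted_sp, uintptr_t *new_sp)
{
	if (on_stack(vc_ist, interrupted_sp))
		return false;

	if (on_stack(task, interrupted_sp))
		/* Already on the task's kernel stack: continue below it. */
		*new_sp = interrupted_sp & ~(uintptr_t)7;
	else
		/* User mode or some other stack: start at the top. */
		*new_sp = task->hi;

	return true;
}
```

The nesting check in this sketch depends only on the interrupted RSP, so it
needs no state that would have to be updated after handler entry.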

To make this NMI safe, the NMI handler needs some logic too. Upon
entering NMI, it needs to check the return RSP, and if it is in the #VC
IST stack, it must do the above flow by itself and update the return RSP
and RIP. It needs to take into account the case when PT_REGS is not
fully populated on the return side.

Alternatively, the NMI handler could save/restore the contents of the #VC
IST stack or just switch to a special #VC-in-NMI IST stack.

All in all it could get complicated, and imho shift_ist would have been
simpler, but who am I anyway...

Or maybe you have a better idea how to implement this, so I'd like to
hear your opinion first before I spend too many days implementing
something.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23  9:45                         ` Joerg Roedel
@ 2020-06-23 10:45                           ` Peter Zijlstra
  2020-06-23 11:11                             ` Joerg Roedel
  0 siblings, 1 reply; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 10:45 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 11:45:19AM +0200, Joerg Roedel wrote:
> Hi Andy,
> 
> On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
> > 1. Use IST for #VC and deal with all the mess that entails.
> 
> With the removal of IST shifting I wonder what you would suggest on how
> to best implement an NMI-safe IST handler with nesting support.
> 
> My current plan is to implement an IST handler which switches itself off
> the IST stack as soon as possible, freeing it for re-use.
> 
> The flow would be roughly like this upon entering the handler:
> 
> 	build_pt_regs();
> 
> 	RSP = pt_regs->sp;
> 
> 	if (RSP in VC_IST_stack)
> 		error("unallowed nesting")
> 
> 	if (RSP in current_kernel_stack)
> 		RSP = round_down_to_8(RSP)
> 	else
> 		RSP = current_top_of_stack() // non-ist kernel stack
> 
> 	copy_pt_regs(pt_regs, RSP);
> 	switch_stack_to(RSP);
> 
> To make this NMI safe, the NMI handler needs some logic too. Upon
> entering NMI, it needs to check the return RSP, and if it is in the #VC
> IST stack, it must do the above flow by itself and update the return RSP
> and RIP. It needs to take into account the case when PT_REGS is not
> fully populated on the return side.
> 
> Alternatively, the NMI handler could save/restore the contents of the #VC
> IST stack or just switch to a special #VC-in-NMI IST stack.
> 
> All in all it could get complicated, and imho shift_ist would have been
> simpler, but who am I anyway...
> 
> Or maybe you have a better idea how to implement this, so I'd like to
> hear your opinion first before I spend too many days implementing
> something.

OK, excuse my ignorance, but I'm not seeing how that IST shifting
nonsense would've helped in the first place.

If I understand correctly the problem is:

	<#VC>
	  shift IST
	  <NMI>
	    ... does stuff
	    <#VC> # again, safe because the shift

But what happens if you get the NMI before your IST adjustment?

	<#VC>
	  <NMI>
	    ... does stuff
	    <#VC> # again, happily wrecks your earlier #VC
	  shift IST # whoopsy, too late
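
The two orderings above can be illustrated with a toy model in C. This is only
a sketch under simplifying assumptions, not kernel code: the IST stack is an
array of frame slots, `deliver_vc()` stands in for hardware exception delivery
through the IST entry, and `shift_ist()` for a software adjustment of that
entry done by the #VC handler itself. All names are hypothetical.

```c
#include <stdbool.h>

#define SLOTS 4

/* Toy CPU state: frame slots on the #VC IST stack plus the IST entry. */
struct cpu {
	int frames[SLOTS];	/* id of the exception frame in each slot, 0 = empty */
	int ist_index;		/* slot the next IST-delivered #VC writes to */
};

/* "Hardware" delivery: the frame always lands where the IST entry points. */
static int deliver_vc(struct cpu *c, int id)
{
	c->frames[c->ist_index] = id;
	return c->ist_index;
}

/* Software shift: move the IST entry so a nested #VC gets a fresh slot. */
static void shift_ist(struct cpu *c)
{
	c->ist_index++;
}

/*
 * Deliver an outer #VC, then a nested #VC raised from inside an NMI.
 * Returns true if the outer frame is still intact afterwards.
 */
bool outer_frame_survives(bool nmi_before_shift)
{
	struct cpu c = { .frames = { 0 }, .ist_index = 0 };
	int outer_slot = deliver_vc(&c, 1);

	if (!nmi_before_shift)
		shift_ist(&c);	/* handler adjusted the IST entry in time */

	deliver_vc(&c, 2);	/* NMI arrives here and itself causes a #VC */

	if (nmi_before_shift)
		shift_ist(&c);	/* too late, the outer frame is already gone */

	return c.frames[outer_slot] == 1;
}
```

`outer_frame_survives(false)` models the first diagram and
`outer_frame_survives(true)` the second, where the nested #VC lands on the
same slot as the outer one.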

Either way around we get to fix this up in NMI (and any other IST
exception that can happen while in #VC, hello #MC). And more complexity
there is the very last thing we need :-(

There's no way you can fix up the IDT without getting an NMI first.

This entire exception model is fundamentally buggered :-/

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-04-28  7:55                         ` Joerg Roedel
@ 2020-06-23 11:07                             ` Peter Zijlstra
  2020-06-23 11:07                             ` Peter Zijlstra
  1 sibling, 0 replies; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 11:07 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Apr 28, 2020 at 09:55:12AM +0200, Joerg Roedel wrote:
> On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
> > I have a somewhat serious question: should we use IST for #VC at all?
> > As I understand it, Rome and Naples make it mandatory for hypervisors
> > to intercept #DB, which means that, due to the MOV SS mess, it's sort
> > of mandatory to use IST for #VC.  But Milan fixes the #DB issue, so,
> > if we're running under a sufficiently sensible hypervisor, we don't
> > need IST for #VC.
> 
> The reason for #VC being IST is not only #DB, but also SEV-SNP. SNP adds
> page ownership tracking between guest and host, so that the hypervisor
> can't remap guest pages without the guest noticing.
> 
> If there is a violation of ownership, which can happen at any memory
> access, there will be a #VC exception to notify the guest. And as this
> can happen anywhere, for example on a carefully crafted stack page set
> by userspace before doing SYSCALL, the only robust choice for #VC is to
> use IST.

So what happens if this #VC triggers on the first access to the #VC
stack, because the malicious host has craftily mucked with only the #VC
IST stack page?

Or on the NMI IST stack, then we get #VC in NMI before the NMI can fix
you up.

AFAICT all of that is non-recoverable.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 10:45                           ` Peter Zijlstra
@ 2020-06-23 11:11                             ` Joerg Roedel
  2020-06-23 11:14                                 ` Peter Zijlstra
  0 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-06-23 11:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

Hi Peter,

On Tue, Jun 23, 2020 at 12:45:59PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 11:45:19AM +0200, Joerg Roedel wrote:
> > Or maybe you have a better idea how to implement this, so I'd like to
> > hear your opinion first before I spend too many days implementing
> > something.
> 
> OK, excuse my ignorance, but I'm not seeing how that IST shifting
> nonsense would've helped in the first place.
> 
> If I understand correctly the problem is:
> 
> 	<#VC>
> 	  shift IST
> 	  <NMI>
> 	    ... does stuff
> 	    <#VC> # again, safe because the shift
> 
> But what happens if you get the NMI before your IST adjustment?

The v3 patchset implements an unconditional shift of the #VC IST entry
in the NMI handler, before it can trigger a #VC exception.

> Either way around we get to fix this up in NMI (and any other IST
> exception that can happen while in #VC, hello #MC). And more complexity
> there is the very last thing we need :-(

Yes, in whatever way this gets implemented, it needs some fixup in the
NMI handler. But that can happen in C code, so it does not make the
assembly more complex, at least.

> There's no way you can fix up the IDT without getting an NMI first.

Not sure what you mean by this.

> This entire exception model is fundamentally buggered :-/

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 11:11                             ` Joerg Roedel
@ 2020-06-23 11:14                                 ` Peter Zijlstra
  0 siblings, 0 replies; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 11:14 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 01:11:07PM +0200, Joerg Roedel wrote:
> Hi Peter,
> 
> On Tue, Jun 23, 2020 at 12:45:59PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 11:45:19AM +0200, Joerg Roedel wrote:
> > > Or maybe you have a better idea how to implement this, so I'd like to
> > > hear your opinion first before I spend too many days implementing
> > > something.
> > 
> > OK, excuse my ignorance, but I'm not seeing how that IST shifting
> > nonsense would've helped in the first place.
> > 
> > If I understand correctly the problem is:
> > 
> > 	<#VC>
> > 	  shift IST
> > 	  <NMI>
> > 	    ... does stuff
> > 	    <#VC> # again, safe because the shift
> > 
> > But what happens if you get the NMI before your IST adjustment?
> 
> The v3 patchset implements an unconditional shift of the #VC IST entry
> in the NMI handler, before it can trigger a #VC exception.

Going by that other thread -- where you said that any memory access can
trigger a #VC, there just isn't such a guarantee.

> > Either way around we get to fix this up in NMI (and any other IST
> > exception that can happen while in #VC, hello #MC). And more complexity
> > there is the very last thing we need :-(
> 
> Yes, in whatever way this gets implemented, it needs some fixup in the
> NMI handler. But that can happen in C code, so it does not make the
> assembly more complex, at least.
> 
> > There's no way you can fix up the IDT without getting an NMI first.
> 
> Not sure what you mean by this.

I was talking about the case where #VC would try and fix up its own IST.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 11:07                             ` Peter Zijlstra
  (?)
@ 2020-06-23 11:30                             ` Joerg Roedel
  2020-06-23 11:48                                 ` Peter Zijlstra
  2020-06-23 11:51                               ` Andrew Cooper
  -1 siblings, 2 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-06-23 11:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 01:07:06PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 28, 2020 at 09:55:12AM +0200, Joerg Roedel wrote:

> So what happens if this #VC triggers on the first access to the #VC
> stack, because the malicious host has craftily mucked with only the #VC
> IST stack page?
> 
> Or on the NMI IST stack, then we get #VC in NMI before the NMI can fix
> you up.
> 
> AFAICT all of that is non-recoverable.

I am not 100% sure, but I think if the #VC stack page is not validated,
the #VC should be promoted to a #DF.

Note that this is an issue only with secure nested paging (SNP), which
is not enabled yet with this patch-set. When it gets enabled a stack
recursion check in the #VC handler is needed which panics the VM. That
also fixes the #VC-in-early-NMI problem.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 11:14                                 ` Peter Zijlstra
  (?)
@ 2020-06-23 11:43                                 ` Joerg Roedel
  2020-06-23 11:50                                     ` Peter Zijlstra
  -1 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-06-23 11:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 01:14:43PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 01:11:07PM +0200, Joerg Roedel wrote:

> > The v3 patchset implements an unconditional shift of the #VC IST entry
> > in the NMI handler, before it can trigger a #VC exception.
> 
> Going by that other thread -- where you said that any memory access can
> trigger a #VC, there just isn't such a guarantee.

As I wrote in the other mail, this can only happen when SNP gets enabled
(which is follow-on work to this) and is handled by a stack recursion
check in the #VC handler.

The reason I mentioned the #VC-anywhere case is to make it more clear
why #VC needs an IST handler.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 11:30                             ` Joerg Roedel
@ 2020-06-23 11:48                                 ` Peter Zijlstra
  2020-06-23 11:51                               ` Andrew Cooper
  1 sibling, 0 replies; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 11:48 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 01:30:07PM +0200, Joerg Roedel wrote:
> Note that this is an issue only with secure nested paging (SNP), which
> is not enabled yet with this patch-set. When it gets enabled a stack
> recursion check in the #VC handler is needed which panics the VM. That
> also fixes the #VC-in-early-NMI problem.

But you cannot do a recursion check in #VC, because the NMI can happen
on the first instruction of #VC, before we can increment our counter,
and then the #VC can happen on NMI because the IST stack is a goner, and
we're fscked again (or on a per-cpu variable we touch in our elaborate
NMI setup, etc..).

There is no way I can see SNP-#VC 'work'. The best I can come up with is
'mostly', but do you like your bridges/dikes/etc.. to be mostly ok? Or
do you want a guarantee they'll actually work?

I'll keep repeating this, x86_64 exceptions are a trainwreck, and IST in
particular is utter crap.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 11:43                                 ` Joerg Roedel
@ 2020-06-23 11:50                                     ` Peter Zijlstra
  0 siblings, 0 replies; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 11:50 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 01:43:24PM +0200, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 01:14:43PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 01:11:07PM +0200, Joerg Roedel wrote:
> 
> > > The v3 patchset implements an unconditional shift of the #VC IST entry
> > > in the NMI handler, before it can trigger a #VC exception.
> > 
> > Going by that other thread -- where you said that any memory access can
> > trigger a #VC, there just isn't such a guarantee.
> 
> As I wrote in the other mail, this can only happen when SNP gets enabled
> (which is follow-on work to this) and is handled by a stack recursion
> check in the #VC handler.
> 
> The reason I mentioned the #VC-anywhere case is to make it more clear
> why #VC needs an IST handler.

If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
you to only make it IST if/when you try and make SNP happen, not before.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 11:30                             ` Joerg Roedel
  2020-06-23 11:48                                 ` Peter Zijlstra
@ 2020-06-23 11:51                               ` Andrew Cooper
  2020-06-23 12:47                                   ` Peter Zijlstra
  2020-06-23 15:51                                 ` Borislav Petkov
  1 sibling, 2 replies; 243+ messages in thread
From: Andrew Cooper @ 2020-06-23 11:51 UTC (permalink / raw)
  To: Joerg Roedel, Peter Zijlstra
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson

On 23/06/2020 12:30, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 01:07:06PM +0200, Peter Zijlstra wrote:
>> On Tue, Apr 28, 2020 at 09:55:12AM +0200, Joerg Roedel wrote:
>> So what happens if this #VC triggers on the first access to the #VC
>> stack, because the malicious host has craftily mucked with only the #VC
>> IST stack page?
>>
>> Or on the NMI IST stack, then we get #VC in NMI before the NMI can fix
>> you up.
>>
>> AFAICT all of that is non-recoverable.
> I am not 100% sure, but I think if the #VC stack page is not validated,
> the #VC should be promoted to a #DF.
>
> Note that this is an issue only with secure nested paging (SNP), which
> is not enabled yet with this patch-set. When it gets enabled a stack
> recursion check in the #VC handler is needed which panics the VM. That
> also fixes the #VC-in-early-NMI problem.

There are cases which are definitely non-recoverable.

For both ES and SNP, a malicious hypervisor can mess with the guest
physmap to make the NMI, #VC and #DF stacks all alias.

For ES, this had better result in the #DF handler deciding that crashing
is the way out, whereas for SNP, this had better escalate to Shutdown.


What matters here is the security model in SNP.  The hypervisor is
relied upon for availability (because it could simply refuse to schedule
the VM), but market/business forces will cause it to do its best to keep
the VM running.  Therefore, the security model is simply(?) that the
hypervisor can't do anything to undermine the confidentiality or
integrity of the VM.

Crashing out hard if the hypervisor is misbehaving is acceptable.  In a
cloud, I as a customer would (threaten to?) take my credit card
elsewhere, while for enterprise, I'd shout at my virtualisation vendor
until a fix happened (also perhaps threaten to take my credit card
elsewhere).

Therefore, it is reasonable to build on the expectation that the
hypervisor won't be messing around with remapping stacks/etc.  Most
#VC's are synchronous with guest actions (they equate to actions which
would have caused a VMExit), so you can safely reason about when the
first #VC might occur, presuming no funny business with the frames
backing any memory operands touched.

~Andrew

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 11:48                                 ` Peter Zijlstra
  (?)
@ 2020-06-23 12:04                                 ` Joerg Roedel
  2020-06-23 12:52                                     ` Peter Zijlstra
  -1 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-06-23 12:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 01:48:18PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 01:30:07PM +0200, Joerg Roedel wrote:

> But you cannot do a recursion check in #VC, because the NMI can happen
> on the first instruction of #VC, before we can increment our counter,
> and then the #VC can happen on NMI because the IST stack is a goner, and
> we're fscked again (or on a per-cpu variable we touch in our elaborate
> NMI setup, etc..).

No, the recursion check is fine, because overwriting an already used IST
stack doesn't matter (as long as it can be detected) if we are going to
panic anyway. It doesn't matter because the kernel will not leave the
currently running handler anymore.

I agree there is no way to keep the system running if that happens, but
that is also not what is wanted. If stack recursion happens, something
malicious from the HV side is going on, and all the kernel needs to be
able to do is to safely and reliably detect the situation and panic the VM
to prevent any data corruption or loss or even leakage.

> I'll keep repeating this, x86_64 exceptions are a trainwreck, and IST in
> specific is utter crap.

I agree, but don't forget the most prominent underlying reason for IST:
The SYSCALL gap. If SYSCALL would switch stacks most of those issues
would not exist. IST would still be needed because there are no task
gates in x86-64, but still...

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 11:50                                     ` Peter Zijlstra
  (?)
@ 2020-06-23 12:12                                     ` Joerg Roedel
  2020-06-23 13:03                                         ` Peter Zijlstra
  -1 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-06-23 12:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
> you to only make it IST if/when you try and make SNP happen, not before.

It is not the only reason, when ES guests gain debug register support
then #VC also needs to be IST, because #DB can be promoted into #VC
then, and as #DB is IST for a reason, #VC needs to be too.

Besides that, I am not a fan of delegating problems I already see coming
to future-Joerg and future-Peter, but if at all possible deal with them
now and be safe later.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 11:51                               ` Andrew Cooper
@ 2020-06-23 12:47                                   ` Peter Zijlstra
  2020-06-23 15:51                                 ` Borislav Petkov
  1 sibling, 0 replies; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 12:47 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Joerg Roedel, Andy Lutomirski, Joerg Roedel, Dave Hansen,
	Tom Lendacky, Mike Stunes, Dan Williams, Dave Hansen,
	H. Peter Anvin, Juergen Gross, Jiri Slaby, Kees Cook, kvm list,
	LKML, Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson

On Tue, Jun 23, 2020 at 12:51:03PM +0100, Andrew Cooper wrote:

> There are cases which are definitely non-recoverable.
> 
> For both ES and SNP, a malicious hypervisor can mess with the guest
> physmap to make the NMI, #VC and #DF stacks all alias.
> 
> For ES, this had better result in the #DF handler deciding that crashing
> is the way out, whereas for SNP, this had better escalate to Shutdown.

> Crashing out hard if the hypervisor is misbehaving is acceptable.

Then I'm thinking the only sensible option is to crash hard for any SNP
#VC from kernel mode.

Sadly that doesn't help with #VC needing to be IST :-( IST is such a
frigging nightmare.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 12:04                                 ` Joerg Roedel
@ 2020-06-23 12:52                                     ` Peter Zijlstra
  0 siblings, 0 replies; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 12:52 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 02:04:33PM +0200, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 01:48:18PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 01:30:07PM +0200, Joerg Roedel wrote:
> 
> > But you cannot do a recursion check in #VC, because the NMI can happen
> > on the first instruction of #VC, before we can increment our counter,
> > and then the #VC can happen on NMI because the IST stack is a goner, and
> > we're fscked again (or on a per-cpu variable we touch in our elaborate
> > NMI setup, etc..).
> 
> No, the recursion check is fine, because overwriting an already used IST
> stack doesn't matter (as long as it can be detected) if we are going to
> panic anyway. It doesn't matter because the kernel will not leave the
> currently running handler anymore.

You only have that guarantee when any SNP #VC from kernel is an
automatic panic. But in that case, what's the point of having the
recursion count?

> > I'll keep repeating this, x86_64 exceptions are a trainwreck, and IST in
> > specific is utter crap.
> 
> I agree, but don't forget the most prominent underlying reason for IST:
> The SYSCALL gap. If SYSCALL would switch stacks most of those issues
> would not exist. IST would still be needed because there are no task
> gates in x86-64, but still...

We could all go back to int80 ;-) /me runs like heck

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 12:12                                     ` Joerg Roedel
@ 2020-06-23 13:03                                         ` Peter Zijlstra
  0 siblings, 0 replies; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 13:03 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
> > If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
> > you to only make it IST if/when you try and make SNP happen, not before.
> 
> It is not the only reason, when ES guests gain debug register support
> then #VC also needs to be IST, because #DB can be promoted into #VC
> then, and as #DB is IST for a reason, #VC needs to be too.

Didn't I read somewhere that that is only so for Rome/Naples but not for
the later chips (Milan) which have #DB pass-through?

> Besides that, I am not a fan of delegating problems I already see coming
> to future-Joerg and future-Peter, but if at all possible deal with them
> now and be safe later.

Well, we could just say no :-) At some point in the very near future
this house of cards is going to implode.

We're talking about the 3rd case where the only reason things 'work' is
because we'll have to panic():

 - #MC
 - #DB with BUS LOCK DEBUG EXCEPTION
 - #VC SNP

(and it ain't a happy accident they're all IST)

Did someone forget to pass the 'ISTs are *EVIL*' memo to the hardware
folks? How come we're getting more and more of them? (/me puts fingers
in ears and goes la-la-la-la in anticipation of Andrew mentioning CET)



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 12:52                                     ` Peter Zijlstra
@ 2020-06-23 13:40                                       ` Joerg Roedel
  -1 siblings, 0 replies; 243+ messages in thread
From: Joerg Roedel @ 2020-06-23 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 02:52:01PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 02:04:33PM +0200, Joerg Roedel wrote:
> > No, the recursion check is fine, because overwriting an already used IST
> > stack doesn't matter (as long as it can be detected) if we are going to
> > panic anyway. It doesn't matter because the kernel will not leave the
> > currently running handler anymore.
> 
> You only have that guarantee when any SNP #VC from kernel is an
> automatic panic. But in that case, what's the point of having the
> recursion count?

It is not a recursion count, it is a stack-recursion check. Basically
walk down the stack and look if your current stack is already in use.
Yes, this can be optimized, but that is what is needed.
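
The check described above can be sketched in plain C. This is only an
illustration under stated assumptions, not code from the patch-set:
struct frame, on_stack() and ist_stack_in_use() are hypothetical names,
and the chain of interrupted contexts is modelled as a linked list,
whereas real code would have to locate each saved frame itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one saved exception frame (not pt_regs). */
struct frame {
	uintptr_t sp;             /* stack pointer of the interrupted context */
	const struct frame *prev; /* frame saved one nesting level earlier */
};

/* True if addr lies inside the stack [base, base + size). */
static bool on_stack(uintptr_t addr, uintptr_t base, size_t size)
{
	return addr >= base && addr - base < size;
}

/*
 * Walk the chain of interrupted contexts. If any of them was running on
 * the stack the handler has just been switched to, that stack was still
 * in use and its old contents may already be overwritten.
 */
static bool ist_stack_in_use(const struct frame *f, uintptr_t base, size_t size)
{
	assert(size != 0);

	for (; f; f = f->prev)
		if (on_stack(f->sp, base, size))
			return true;
	return false;
}
```

For a nesting such as task -> #VC -> NMI -> #VC, the innermost frame
points at the NMI stack, so only the walk to the next level finds the
earlier #VC frame and reports the stack as already in use.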

IIRC the current prototype code for SNP just pre-validates all memory in
the VM and doesn't support moving pages around on the host. So any #VC
SNP exception would be fatal, yes.

In a scenario with on-demand validation of guest pages and support for
guest-assisted page-moving on the HV side it would be more complicated.
Basically all memory that is accessed during #VC exception handling must
stay validated at all times, including the IST stack.

So saying this, I don't understand why _all_ SNP #VC exceptions from
kernel space must be fatal?

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 12:47                                   ` Peter Zijlstra
@ 2020-06-23 13:57                                     ` Andrew Cooper
  -1 siblings, 0 replies; 243+ messages in thread
From: Andrew Cooper @ 2020-06-23 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joerg Roedel, Andy Lutomirski, Joerg Roedel, Dave Hansen,
	Tom Lendacky, Mike Stunes, Dan Williams, Dave Hansen,
	H. Peter Anvin, Juergen Gross, Jiri Slaby, Kees Cook, kvm list,
	LKML, Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson

On 23/06/2020 13:47, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 12:51:03PM +0100, Andrew Cooper wrote:
>
>> There are cases which are definitely non-recoverable.
>>
>> For both ES and SNP, a malicious hypervisor can mess with the guest
>> physmap to make the NMI, #VC and #DF stacks all alias.
>>
>> For ES, this had better result in the #DF handler deciding that crashing
>> is the way out, whereas for SNP, this had better escalate to Shutdown.
>> Crashing out hard if the hypervisor is misbehaving is acceptable.
> Then I'm thinking the only sensible option is to crash hard for any SNP
> #VC from kernel mode.
>
> Sadly that doesn't help with #VC needing to be IST :-( IST is such a
> frigging nightmare.

I presume you mean any #VC caused by RMP faults (i.e. something went
wrong with the memory owner/etc metadata) ?

If so, then yes.  Any failure here is a bug in the kernel or hypervisor
(and needs fixing) or a malicious hypervisor and the guest should
terminate for its own safety.

~Andrew

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 13:40                                       ` Joerg Roedel
@ 2020-06-23 13:59                                         ` Peter Zijlstra
  -1 siblings, 0 replies; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 13:59 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 03:40:03PM +0200, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 02:52:01PM +0200, Peter Zijlstra wrote:

> > You only have that guarantee when any SNP #VC from kernel is an
> > automatic panic. But in that case, what's the point of having the
> > recursion count?
> 
> It is not a recursion count, it is a stack-recursion check. Basically
> walk down the stack and look if your current stack is already in use.
> Yes, this can be optimized, but that is what is needed.
> 
> IIRC the current prototype code for SNP just pre-validates all memory in
> the VM and doesn't support moving pages around on the host. So any #VC
> SNP exception would be fatal, yes.
> 
> In a scenario with on-demand validation of guest pages and support for
> guest-assisted page-moving on the HV side it would be more complicated.
> Basically all memory that is accessed during #VC exception handling must
> stay validated at all times, including the IST stack.
> 
> So saying this, I don't understand why _all_ SNP #VC exceptions from
> kernel space must be fatal?

Ah, because I hadn't thought of the stack-recursion check.

So basically when your exception frame points to your own IST, you die.
That sounds like something we should have in generic IST code.
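
The single-level test stated above ("your exception frame points to your
own IST") can be illustrated in plain C. A minimal sketch, assuming
stacks are a power of two in size and aligned to that size; STACK_SIZE,
stack_base() and same_stack() are hypothetical names, and the second
argument must be an address that really lies on the stack the handler
is running on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: 4 KiB stacks, each aligned to its own size. */
#define STACK_SIZE 4096UL

static_assert((STACK_SIZE & (STACK_SIZE - 1)) == 0,
	      "the mask trick needs a power-of-two stack size");

/* Base address of the size-aligned stack that contains addr. */
static uintptr_t stack_base(uintptr_t addr)
{
	return addr & ~(uintptr_t)(STACK_SIZE - 1);
}

/*
 * interrupted_sp: stack pointer saved in the exception frame.
 * current_sp:     any address on the stack the handler now runs on.
 */
static bool same_stack(uintptr_t interrupted_sp, uintptr_t current_sp)
{
	return stack_base(interrupted_sp) == stack_base(current_sp);
}
```

Note that a stack pointer equal to the very top of a stack (nothing
pushed yet) masks to the neighbouring region, so a real check would
have to treat that boundary explicitly.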

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 13:03                                         ` Peter Zijlstra
  (?)
@ 2020-06-23 14:49                                         ` Joerg Roedel
  2020-06-23 15:16                                             ` Peter Zijlstra
  -1 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-06-23 14:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 03:03:22PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
> > On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
> > > If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
> > > you to only make it IST if/when you try and make SNP happen, not before.
> > 
> > It is not the only reason, when ES guests gain debug register support
> > then #VC also needs to be IST, because #DB can be promoted into #VC
> > then, and as #DB is IST for a reason, #VC needs to be too.
> 
> Didn't I read somewhere that that is only so for Rome/Naples but not for
> the later chips (Milan) which have #DB pass-through?

Probably, not sure which chips will get debug register virtualization
under SEV-ES. But even when it is supported, the HV can (and sometimes
will) intercept #DB, which then causes it to be promoted to #VC.

> We're talking about the 3rd case where the only reason things 'work' is
> because we'll have to panic():
> 
>  - #MC

Okay, #MC is special and can only be handled on a best-effort basis, as
#MC could happen anytime, also while already executing the #MC handler.

>  - #DB with BUS LOCK DEBUG EXCEPTION

If I understand the problem correctly, this can be solved by moving off
the IST stack to the current task stack in the #DB handler, like I plan
to do for #VC, no?

>  - #VC SNP

This has to panic for other reasons that can't be worked around. It
boils down to detecting that the HV is doing something fishy and bail
out to avoid further harm (like in the #MC handler).
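
The idea of moving off the IST stack to the task stack, mentioned above
for #DB and #VC, can be sketched in plain C as a frame copy. This is a
simplified illustration, not the planned implementation: struct regs is
a hypothetical cut-down frame, move_off_ist() is a made-up name, and the
sketch ignores the question of when the task stack is safe to use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down register frame; the real pt_regs has more fields. */
struct regs {
	uint64_t ip, sp, flags;
};

/*
 * Copy the frame that was pushed on the IST stack to the top of the
 * task stack and return its new location. Once the handler continues
 * on the copy, a nested exception can reuse the IST stack without
 * clobbering the frame that is still needed.
 */
static struct regs *move_off_ist(const struct regs *ist_frame,
				 void *task_stack_top)
{
	struct regs *dst;

	assert(ist_frame && task_stack_top);

	/* The frame occupies the topmost slot of the task stack. */
	dst = (struct regs *)task_stack_top - 1;
	memcpy(dst, ist_frame, sizeof(*dst));
	return dst;
}
```

The window between exception entry and the copy stays unprotected,
which is why a stack-recursion check is still needed on top of this.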


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 13:59                                         ` Peter Zijlstra
@ 2020-06-23 14:53                                           ` Peter Zijlstra
  -1 siblings, 0 replies; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 14:53 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 03:59:16PM +0200, Peter Zijlstra wrote:

> So basically when your exception frame points to your own IST, you die.
> That sounds like something we should have in generic IST code.

Something like this... #DF already dies and NMI is 'magic'

---
 arch/x86/entry/common.c         |  7 +++++++
 arch/x86/include/asm/idtentry.h | 12 +++++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index af0d57ed5e69..e38e4f34c90c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -742,6 +742,13 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
 	__nmi_exit();
 }
 
+noinstr void idtentry_validate_ist(struct pt_regs *regs)
+{
+	if ((regs->sp & ~(EXCEPTION_STKSZ-1)) ==
+	    (_RET_IP_ & ~(EXCEPTION_STKSZ-1)))
+		die("IST stack recursion", regs, 0);
+}
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 4e399f120ff8..974c1a4eacbb 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -19,6 +19,8 @@ void idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit);
 bool idtentry_enter_nmi(struct pt_regs *regs);
 void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
 
+void idtentry_validate_ist(struct pt_regs *regs);
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
@@ -322,7 +324,15 @@ static __always_inline void __##func(struct pt_regs *regs)
  * Maps to DEFINE_IDTENTRY_RAW
  */
 #define DEFINE_IDTENTRY_IST(func)					\
-	DEFINE_IDTENTRY_RAW(func)
+static __always_inline void __##func(struct pt_regs *regs);		\
+									\
+__visible noinstr void func(struct pt_regs *regs)			\
+{									\
+	idtentry_validate_ist(regs);					\
+	__##func(regs);							\
+}									\
+									\
+static __always_inline void __##func(struct pt_regs *regs)
 
 /**
  * DEFINE_IDTENTRY_NOIST - Emit code for NOIST entry points which

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 14:53                                           ` Peter Zijlstra
  (?)
@ 2020-06-23 14:59                                           ` Joerg Roedel
  2020-06-23 15:23                                               ` Peter Zijlstra
  -1 siblings, 1 reply; 243+ messages in thread
From: Joerg Roedel @ 2020-06-23 14:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 04:53:44PM +0200, Peter Zijlstra wrote:
> +noinstr void idtentry_validate_ist(struct pt_regs *regs)
> +{
> +	if ((regs->sp & ~(EXCEPTION_STKSZ-1)) ==
> +	    (_RET_IP_ & ~(EXCEPTION_STKSZ-1)))
> +		die("IST stack recursion", regs, 0);
> +}

Yes, this is a start, but it doesn't cover the case where the NMI stack
is in between, so I think you need to walk down regs->sp too. The
dumpstack code already has some logic for this.
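
A minimal user-space sketch of the "walk down regs->sp" idea, under stated
assumptions. `struct frame`, `STACK_SIZE`, `same_stack()` and
`ist_recursion()` are hypothetical names for illustration only, not kernel
APIs, and the sketch assumes the intent of the quoted check is to compare two
stack addresses. Each frame records the stack pointer of the context it
interrupted, so following the chain also finds a recursion that is hidden
behind an intermediate NMI stack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stack size for illustration; must be a power of two. */
#define STACK_SIZE 8192UL

/* Hypothetical stand-in for a saved register frame. */
struct frame {
	uintptr_t sp;             /* stack pointer of the interrupted context */
	const struct frame *prev; /* frame that context itself interrupted, or NULL */
};

/* Two addresses are on the same stack if they share the same aligned base. */
static bool same_stack(uintptr_t a, uintptr_t b)
{
	return (a & ~(STACK_SIZE - 1)) == (b & ~(STACK_SIZE - 1));
}

/*
 * Return true if any interrupted context in the chain was running on the
 * stack that cur_sp points into. Checking only the first frame misses the
 * case where another stack (e.g. the NMI stack) sits in between.
 */
static bool ist_recursion(uintptr_t cur_sp, const struct frame *regs)
{
	for (; regs; regs = regs->prev)
		if (same_stack(cur_sp, regs->sp))
			return true;
	return false;
}
```

With a chain of task -> #VC -> NMI -> #VC, the innermost frame points into the
NMI stack, so a single comparison sees no recursion, while the walk finds the
earlier #VC frame.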


	Joerg


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 14:49                                         ` Joerg Roedel
@ 2020-06-23 15:16                                             ` Peter Zijlstra
  0 siblings, 0 replies; 243+ messages in thread
From: Peter Zijlstra @ 2020-06-23 15:16 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson, Andrew Cooper

On Tue, Jun 23, 2020 at 04:49:40PM +0200, Joerg Roedel wrote:
> > We're talking about the 3rd case where the only reason things 'work' is
> > because we'll have to panic():
> > 
> >  - #MC
> 
> Okay, #MC is special and can only be handled on a best-effort basis, as
> #MC could happen anytime, also while already executing the #MC handler.

I think the hardware has a MCE-mask bit somewhere. Flaky though because
clearing it isn't 'atomic' with IRET, so there's a 'funny' window.

It also interacts really badly with the NMI handler. If we get an #MC
early in the NMI, where we hard-rely on the NMI-mask being set to set-up
the recursion stack, then the #MC IRET will clear the NMI-mask, and
we're toast.
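
A toy model of that window, as a sketch only: `struct cpu`, `deliver_nmi()`
and `iret()` are hypothetical names, and the single flag is a deliberately
simplified stand-in for the CPU's internal NMI-blocked state. It shows only
that an IRET from a nested exception unblocks NMIs before the NMI handler has
finished its own setup.

```c
#include <stdbool.h>

/* Toy CPU state: one flag for "further NMIs are blocked". */
struct cpu {
	bool nmi_blocked;
};

/* NMI delivery blocks further NMIs until the next IRET. */
static void deliver_nmi(struct cpu *c)
{
	c->nmi_blocked = true;
}

/* Any IRET unblocks NMIs, no matter which handler executes it. */
static void iret(struct cpu *c)
{
	c->nmi_blocked = false;
}

/*
 * Model an #MC that hits early in the NMI handler and returns with IRET.
 * Returns whether NMIs are still blocked once the NMI handler resumes.
 */
static bool nmi_blocked_after_nested_mc(void)
{
	struct cpu c = { .nmi_blocked = false };

	deliver_nmi(&c); /* NMI entry: handler relies on NMIs being blocked */
	iret(&c);        /* nested #MC handler returns via IRET */
	return c.nmi_blocked;
}
```

In this model the function returns false: the NMI handler continues with NMIs
unblocked, which is the situation described above.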

Andy has wild and crazy ideas, but I don't think we need more crazy
here.

#VC SNP has a similar problem vs NMI, that needs to die() irrespective
of the #VC IST recursion.

> >  - #DB with BUS LOCK DEBUG EXCEPTION
> 
> If I understand the problem correctly, this can be solved by moving off
> the IST stack to the current task stack in the #DB handler, like I plan
> to do for #VC, no?

Hmm, probably. Would take a bit of care, but should be doable.

> >  - #VC SNP
> 
> This has to panic for other reasons that can't be worked around. It
> boils down to detecting that the HV is doing something fishy and bail
> out to avoid further harm (like in the #MC handler).

Right, but it doesn't take away that IST-any-time vectors are
fundamentally screwy.

Both the MCE and NMI have masks that are, as per the above, differently
funny, but the other ISTs do not. Also, even if they had masks, the
interaction between them is still screwy.

#VC would've been so much better if it had a mask bit somewhere; then at
least we could've had the exception entry covered.
Another #VC with the mask set should probably result in #DF or Shutdown,
but that's all water under the bridge I suspect.




^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 13:03                                         ` Peter Zijlstra
  (?)
  (?)
@ 2020-06-23 15:22                                         ` Andrew Cooper
  2020-06-23 18:26                                           ` Andy Lutomirski
  -1 siblings, 1 reply; 243+ messages in thread
From: Andrew Cooper @ 2020-06-23 15:22 UTC (permalink / raw)
  To: Peter Zijlstra, Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Dave Hansen, Tom Lendacky,
	Mike Stunes, Dan Williams, Dave Hansen, H. Peter Anvin,
	Juergen Gross, Jiri Slaby, Kees Cook, kvm list, LKML,
	Thomas Hellstrom, Linux Virtualization, X86 ML,
	Sean Christopherson

On 23/06/2020 14:03, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
>> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
>>> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
>>> you to only make it IST if/when you try and make SNP happen, not before.
>> It is not the only reason, when ES guests gain debug register support
>> then #VC also needs to be IST, because #DB can be promoted into #VC
>> then, and as #DB is IST for a reason, #VC needs to be too.
> Didn't I read somewhere that that is only so for Rome/Naples but not for
> the later chips (Milan) which have #DB pass-through?

I don't know about hardware timelines, but some future part can now opt
in to having debug registers as part of the encrypted state, and swapped
by VMExit, which would make debug facilities generally usable, and
supposedly safe from the #DB infinite loop issues, at which point the
hypervisor need not intercept #DB for safety reasons.

It's worth noting that on current parts, the hypervisor can set up debug
facilities on behalf of the guest (or behind its back) as the DR state
is unencrypted, but that attempting to intercept #DB will redirect to
#VC inside the guest and cause fun. (Also spare a thought for 32bit
kernels which have to cope with userspace singlestepping the SYSENTER
path with every #DB turning into #VC.)

>> Besides that, I am not a fan of delegating problems I already see coming
>> to future-Joerg and future-Peter, but if at all possible deal with them
>> now and be safe later.
> Well, we could just say no :-) At some point in the very near future
> this house of cards is going to implode.

What currently exists is a picture of a house of cards in front of
something which has fallen down.

> Did someone forget to pass the 'ISTs are *EVIL*' memo to the hardware
> folks? How come we're getting more and more of them?

I have tried to get this point across.  Then again - it's far easier for
the software folk in the same company as the hardware folk to make this
point.

> (/me puts fingers
> in ears and goes la-la-la-la in anticipation of Andrew mentioning CET)

I wasn't going to bring it up, but seeing as you have - while there are
prohibitively-complicating issues preventing it from working on native,
I don't see any point even considering it for the mess which is #VC, or
the even bigger mess which is #HV.

~Andrew

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: Should SEV-ES #VC use IST? (Re: [PATCH] Allow RDTSC and RDTSCP from userspace)
  2020-06-23 14:59                                           `