KVM Archive on lore.kernel.org
* [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support
@ 2021-03-24 17:04 Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support Brijesh Singh
                   ` (29 more replies)
  0 siblings, 30 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

This part of the Secure Nested Paging (SEV-SNP) series focuses on the
changes required in a host OS for SEV-SNP support. The series builds upon
SEV-SNP Part-1 https://marc.info/?l=kvm&m=161660430125343&w=2 .

This series provides the basic building blocks to support booting SEV-SNP
VMs. It does not cover all the security enhancements introduced by SEV-SNP,
such as interrupt protection.

The CCP driver is enhanced to provide new APIs that use the SEV-SNP
specific commands defined in the SEV-SNP firmware specification. The KVM
driver uses those APIs to create and manage the SEV-SNP guests.

The GHCB specification version 2 introduces a new set of NAE events that
are used by the SEV-SNP guest to communicate with the hypervisor. The series
provides support to handle the following new NAE events:
- Register GHCB GPA
- Page State Change Request

The RMP check is enforced as soon as SEV-SNP is enabled. Not every memory
access requires an RMP check. In particular, the read accesses from the
hypervisor do not require RMP checks because the data confidentiality is
already protected via memory encryption. When hardware encounters an RMP
check failure, it raises a page-fault exception. If the RMP check failure
is due to a page-size mismatch, then the large page is split to resolve
the fault. See patches 4 and 7 for further details.

The series does not yet provide support for the following SEV-SNP specific
NAE events:

* Query Attestation 
* AP bring up
* Interrupt security

The series is based on kvm/master commit:
  87aa9ec939ec KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs

The complete source is available at
https://github.com/AMDESE/linux/tree/sev-snp-part-2-rfc1

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org

Additional resources
---------------------
SEV-SNP whitepaper
https://www.amd.com/system/files/TechDocs/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf
 
APM 2: https://www.amd.com/system/files/TechDocs/24593.pdf
(section 15.36)

GHCB spec v2:
  The draft specification is posted on AMD-SEV-SNP mailing list:
   https://lists.suse.com/mailman/private/amd-sev-snp/

  Copy of draft spec is also available at 
  https://github.com/AMDESE/AMDSEV/blob/sev-snp-devel/docs/56421-Guest_Hypervisor_Communication_Block_Standardization.pdf

GHCB spec v1:
SEV-SNP firmware specification:
 https://developer.amd.com/sev/

Brijesh Singh (30):
  x86: Add the host SEV-SNP initialization support
  x86/sev-snp: add RMP entry lookup helpers
  x86: add helper functions for RMPUPDATE and PSMASH instruction
  x86/mm: split the physmap when adding the page in RMP table
  x86: define RMP violation #PF error code
  x86/fault: dump the RMP entry on #PF
  mm: add support to split the large THP based on RMP violation
  crypto:ccp: define the SEV-SNP commands
  crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP
  crypto: ccp: shutdown SNP firmware on kexec
  crypto:ccp: provide APIs to issue SEV-SNP commands
  crypto ccp: handle the legacy SEV command when SNP is enabled
  KVM: SVM: add initial SEV-SNP support
  KVM: SVM: make AVIC backing, VMSA and VMCB memory allocation SNP safe
  KVM: SVM: define new SEV_FEATURES field in the VMCB Save State Area
  KVM: SVM: add KVM_SNP_INIT command
  KVM: SVM: add KVM_SEV_SNP_LAUNCH_START command
  KVM: SVM: add KVM_SEV_SNP_LAUNCH_UPDATE command
  KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  KVM: SVM: add KVM_SEV_SNP_LAUNCH_FINISH command
  KVM: X86: Add kvm_x86_ops to get the max page level for the TDP
  x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV
  KVM: X86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use
  KVM: X86: define new RMP check related #NPF error bits
  KVM: X86: update page-fault trace to log the 64-bit error code
  KVM: SVM: add support to handle GHCB GPA register VMGEXIT
  KVM: SVM: add support to handle MSR based Page State Change VMGEXIT
  KVM: SVM: add support to handle Page State Change VMGEXIT
  KVM: X86: export the kvm_zap_gfn_range() for the SNP use
  KVM: X86: Add support to handle the RMP nested page fault

 arch/x86/include/asm/kvm_host.h  |  14 +
 arch/x86/include/asm/msr-index.h |   6 +
 arch/x86/include/asm/sev-snp.h   |  68 +++
 arch/x86/include/asm/svm.h       |  12 +-
 arch/x86/include/asm/trap_pf.h   |   2 +
 arch/x86/kvm/lapic.c             |   5 +-
 arch/x86/kvm/mmu.h               |   5 +-
 arch/x86/kvm/mmu/mmu.c           |  76 ++-
 arch/x86/kvm/svm/sev.c           | 925 ++++++++++++++++++++++++++++++-
 arch/x86/kvm/svm/svm.c           |  28 +-
 arch/x86/kvm/svm/svm.h           |  49 ++
 arch/x86/kvm/trace.h             |   6 +-
 arch/x86/kvm/vmx/vmx.c           |   8 +
 arch/x86/mm/fault.c              | 157 ++++++
 arch/x86/mm/mem_encrypt.c        | 163 ++++++
 drivers/crypto/ccp/sev-dev.c     | 312 ++++++++++-
 drivers/crypto/ccp/sev-dev.h     |   3 +
 drivers/crypto/ccp/sp-pci.c      |  12 +
 include/linux/mm.h               |   6 +-
 include/linux/psp-sev.h          | 311 +++++++++++
 include/uapi/linux/kvm.h         |  42 ++
 include/uapi/linux/psp-sev.h     |  27 +
 mm/memory.c                      |  11 +
 23 files changed, 2224 insertions(+), 24 deletions(-)

-- 
2.17.1



* [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-25 14:58   ` Dave Hansen
  2021-04-14  7:27   ` Borislav Petkov
  2021-03-24 17:04 ` [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers Brijesh Singh
                   ` (28 subsequent siblings)
  29 siblings, 2 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

The memory integrity guarantees of SEV-SNP are enforced through a new
structure called the Reverse Map Table (RMP). The RMP is a single data
structure shared across the system that contains one entry for every 4K
page of DRAM that may be used by SEV-SNP VMs. The goal of RMP is to
track the owner of each page of memory. Pages of memory can be owned by
the hypervisor, owned by a specific VM or owned by the AMD-SP. See APM2
section 15.36.3 for more detail on RMP.

The RMP table is used to enforce access control to memory. The table itself
is not directly writable by the software. New CPU instructions (RMPUPDATE,
PVALIDATE, RMPADJUST) are used to manipulate the RMP entries.

Based on the platform configuration, the BIOS reserves the memory used
for the RMP table. The start and end address of the RMP table can be
queried by reading the RMP_BASE and RMP_END MSRs. If the RMP_BASE and
RMP_END are not set then we disable the SEV-SNP feature.

The SEV-SNP feature is enabled only after the RMP table is successfully
initialized.
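
As a rough illustration of the sizing implied above, the following is a
minimal userspace sketch (not kernel code) that derives the table size from
the RMP_BASE and RMP_END values and estimates how much DRAM the table can
describe. The 16-byte entry size and the 0x4000-byte offset of the first
entry are assumptions inferred from the rmptable_page_offset() macro in
patch 02, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumptions (inferred from rmptable_page_offset() in patch 02, not from
 * the hardware documentation): entries start 0x4000 bytes into the table
 * and each entry is 16 bytes, covering one 4K page.
 */
#define RMP_HDR_BYTES   0x4000ULL
#define RMP_ENTRY_BYTES 16ULL
#define PAGE_BYTES_4K   4096ULL

/* Size in bytes of the inclusive range [rmp_base, rmp_end]; 0 if unset. */
static inline uint64_t rmp_table_bytes(uint64_t rmp_base, uint64_t rmp_end)
{
	if (!rmp_base || !rmp_end || rmp_end < rmp_base)
		return 0;
	return rmp_end - rmp_base + 1;
}

/* Bytes of DRAM the table can describe, at one entry per 4K page. */
static inline uint64_t rmp_covered_bytes(uint64_t rmp_base, uint64_t rmp_end)
{
	uint64_t sz = rmp_table_bytes(rmp_base, rmp_end);

	if (sz <= RMP_HDR_BYTES)
		return 0;
	return (sz - RMP_HDR_BYTES) / RMP_ENTRY_BYTES * PAGE_BYTES_4K;
}
```

Under these assumptions, a table of 1 MB describes a little under 256 MB of
DRAM, i.e. the RMP costs roughly 1/256 of the memory it covers.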

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/msr-index.h |  6 +++
 arch/x86/include/asm/sev-snp.h   | 10 ++++
 arch/x86/mm/mem_encrypt.c        | 84 ++++++++++++++++++++++++++++++++
 3 files changed, 100 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b03694e116fe..1142d31eb06c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -481,6 +481,8 @@
 #define MSR_AMD64_SEV_ENABLED		BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
 #define MSR_AMD64_SEV_ES_ENABLED	BIT_ULL(MSR_AMD64_SEV_ES_ENABLED_BIT)
 #define MSR_AMD64_SEV_SNP_ENABLED	BIT_ULL(MSR_AMD64_SEV_SNP_ENABLED_BIT)
+#define MSR_AMD64_RMP_BASE		0xc0010132
+#define MSR_AMD64_RMP_END		0xc0010133
 
 #define MSR_AMD64_VIRT_SPEC_CTRL	0xc001011f
 
@@ -538,6 +540,10 @@
 #define MSR_K8_SYSCFG			0xc0010010
 #define MSR_K8_SYSCFG_MEM_ENCRYPT_BIT	23
 #define MSR_K8_SYSCFG_MEM_ENCRYPT	BIT_ULL(MSR_K8_SYSCFG_MEM_ENCRYPT_BIT)
+#define MSR_K8_SYSCFG_SNP_EN_BIT	24
+#define MSR_K8_SYSCFG_SNP_EN		BIT_ULL(MSR_K8_SYSCFG_SNP_EN_BIT)
+#define MSR_K8_SYSCFG_SNP_VMPL_EN_BIT	25
+#define MSR_K8_SYSCFG_SNP_VMPL_EN	BIT_ULL(MSR_K8_SYSCFG_SNP_VMPL_EN_BIT)
 #define MSR_K8_INT_PENDING_MSG		0xc0010055
 /* C1E active bits in int pending message */
 #define K8_INTP_C1E_ACTIVE_MASK		0x18000000
diff --git a/arch/x86/include/asm/sev-snp.h b/arch/x86/include/asm/sev-snp.h
index 59b57a5f6524..f7280d5c6158 100644
--- a/arch/x86/include/asm/sev-snp.h
+++ b/arch/x86/include/asm/sev-snp.h
@@ -68,6 +68,8 @@ struct __packed snp_page_state_change {
 #define RMP_X86_PG_LEVEL(level)	(((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
+#include <linux/jump_label.h>
+
 static inline int __pvalidate(unsigned long vaddr, int rmp_psize, int validate,
 			      unsigned long *rflags)
 {
@@ -93,6 +95,13 @@ void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
 int snp_set_memory_shared(unsigned long vaddr, unsigned int npages);
 int snp_set_memory_private(unsigned long vaddr, unsigned int npages);
 
+extern struct static_key_false snp_enable_key;
+static inline bool snp_key_active(void)
+{
+	return static_branch_unlikely(&snp_enable_key);
+}
+
+
 #else	/* !CONFIG_AMD_MEM_ENCRYPT */
 
 static inline int __pvalidate(unsigned long vaddr, int psize, int validate, unsigned long *eflags)
@@ -114,6 +123,7 @@ early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr, unsigned i
 }
 static inline int snp_set_memory_shared(unsigned long vaddr, unsigned int npages) { return 0; }
 static inline int snp_set_memory_private(unsigned long vaddr, unsigned int npages) { return 0; }
+static inline bool snp_key_active(void) { return false; }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 35af2f21b8f1..39461b9cb34e 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -30,6 +30,7 @@
 #include <asm/msr.h>
 #include <asm/cmdline.h>
 #include <asm/sev-snp.h>
+#include <linux/io.h>
 
 #include "mm_internal.h"
 
@@ -44,12 +45,16 @@ u64 sev_check_data __section(".data") = 0;
 EXPORT_SYMBOL(sme_me_mask);
 DEFINE_STATIC_KEY_FALSE(sev_enable_key);
 EXPORT_SYMBOL_GPL(sev_enable_key);
+DEFINE_STATIC_KEY_FALSE(snp_enable_key);
+EXPORT_SYMBOL_GPL(snp_enable_key);
 
 bool sev_enabled __section(".data");
 
 /* Buffer used for early in-place encryption by BSP, no locking needed */
 static char sme_early_buffer[PAGE_SIZE] __initdata __aligned(PAGE_SIZE);
 
+static unsigned long rmptable_start, rmptable_end;
+
 /*
  * When SNP is active, this routine changes the page state from private to shared before
  * copying the data from the source to destination and restore after the copy. This is required
@@ -528,3 +533,82 @@ void __init mem_encrypt_init(void)
 	print_mem_encrypt_feature_info();
 }
 
+static __init void snp_enable(void *arg)
+{
+	u64 val;
+
+	rdmsrl_safe(MSR_K8_SYSCFG, &val);
+
+	val |= MSR_K8_SYSCFG_SNP_EN;
+	val |= MSR_K8_SYSCFG_SNP_VMPL_EN;
+
+	wrmsrl(MSR_K8_SYSCFG, val);
+}
+
+static __init int rmptable_init(void)
+{
+	u64 rmp_base, rmp_end;
+	unsigned long sz;
+	void *start;
+	u64 val;
+
+	rdmsrl_safe(MSR_AMD64_RMP_BASE, &rmp_base);
+	rdmsrl_safe(MSR_AMD64_RMP_END, &rmp_end);
+
+	if (!rmp_base || !rmp_end) {
+		pr_info("SEV-SNP: Memory for the RMP table has not been reserved by BIOS\n");
+		return 1;
+	}
+
+	sz = rmp_end - rmp_base + 1;
+
+	start = memremap(rmp_base, sz, MEMREMAP_WB);
+	if (!start) {
+		pr_err("SEV-SNP: Failed to map RMP table 0x%llx-0x%llx\n", rmp_base, rmp_end);
+		return 1;
+	}
+
+	/*
+	 * Check if SEV-SNP is already enabled, this can happen if we are coming from kexec boot.
+	 * Do not initialize the RMP table when SEV-SNP is already enabled.
+	 */
+	rdmsrl_safe(MSR_K8_SYSCFG, &val);
+	if (val & MSR_K8_SYSCFG_SNP_EN)
+		goto skip_enable;
+
+	/* Initialize the RMP table to zero */
+	memset(start, 0, sz);
+
+	/* Flush the caches to ensure that data is written before we enable the SNP */
+	wbinvd_on_all_cpus();
+
+	/* Enable the SNP feature */
+	on_each_cpu(snp_enable, NULL, 1);
+
+skip_enable:
+	rmptable_start = (unsigned long)start;
+	rmptable_end = rmptable_start + sz;
+
+	pr_info("SEV-SNP: RMP table physical address 0x%016llx - 0x%016llx\n", rmp_base, rmp_end);
+
+	return 0;
+}
+
+static int __init mem_encrypt_snp_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
+		return 1;
+
+	if (rmptable_init()) {
+		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
+		return 1;
+	}
+
+	static_branch_enable(&snp_enable_key);
+
+	return 0;
+}
+/*
+ * SEV-SNP must be enabled across all CPUs, so make the initialization as a late initcall.
+ */
+late_initcall(mem_encrypt_snp_init);
-- 
2.17.1



* [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-04-15 16:57   ` Borislav Petkov
  2021-04-15 17:03   ` Borislav Petkov
  2021-03-24 17:04 ` [RFC Part2 PATCH 03/30] x86: add helper functions for RMPUPDATE and PSMASH instruction Brijesh Singh
                   ` (27 subsequent siblings)
  29 siblings, 2 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

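lookup_page_in_rmptable() can be used by the host to read the RMP
entry for a given page. It also returns, through the level argument, the
page size recorded in the 2MB-aligned RMP entry that covers the page. The
RMP entry format is documented in PPR section 2.1.5.2.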
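
To make the lookup arithmetic concrete, here is a minimal userspace sketch of
how an entry offset is derived and how the low 64 bits of an entry could be
decoded. It mirrors the rmptable_page_offset() macro and the bitfield order
of struct rmpentry in this patch. The helper names are hypothetical, and the
bit positions assume the usual x86 GCC layout in which bitfields are
allocated starting from the least significant bit.

```c
#include <assert.h>
#include <stdint.h>

#define RMP_HDR_BYTES   0x4000ULL /* offset of the first entry, per rmptable_page_offset() */
#define RMP_ENTRY_BYTES 16ULL     /* struct rmpentry holds two 64-bit words */
#define PFNS_PER_2M     512ULL

/* Byte offset of the entry for a 4K page frame; equals 0x4000 + (phys >> 8). */
static inline uint64_t rmp_entry_offset(uint64_t pfn)
{
	return RMP_HDR_BYTES + pfn * RMP_ENTRY_BYTES;
}

/* PFN of the 2M-aligned entry covering pfn (mirrors phys & PMD_MASK). */
static inline uint64_t rmp_large_entry_pfn(uint64_t pfn)
{
	return pfn & ~(PFNS_PER_2M - 1);
}

/* Decoded view of the low word; field widths follow struct rmpentry. */
struct rmp_low_fields {
	unsigned int assigned, pagesize, immutable, asid, vmsa, validated;
	uint64_t gpa;
};

static inline struct rmp_low_fields rmp_decode_low(uint64_t low)
{
	struct rmp_low_fields f;

	f.assigned  = (unsigned int)(low & 1);          /* bit 0 */
	f.pagesize  = (unsigned int)((low >> 1) & 1);   /* bit 1 */
	f.immutable = (unsigned int)((low >> 2) & 1);   /* bit 2 */
	f.gpa       = (low >> 12) & ((1ULL << 39) - 1); /* bits 12-50 */
	f.asid      = (unsigned int)((low >> 51) & 0x3ff); /* bits 51-60 */
	f.vmsa      = (unsigned int)((low >> 61) & 1);  /* bit 61 */
	f.validated = (unsigned int)((low >> 62) & 1);  /* bit 62 */
	return f;
}
```

Using explicit shifts and masks instead of C bitfields keeps the layout
independent of compiler bitfield ordering, which may be worth considering
for a structure whose format is defined by hardware.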

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/sev-snp.h | 31 +++++++++++++++++++++++++++++++
 arch/x86/mm/mem_encrypt.c      | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)

diff --git a/arch/x86/include/asm/sev-snp.h b/arch/x86/include/asm/sev-snp.h
index f7280d5c6158..2aa14b38c5ed 100644
--- a/arch/x86/include/asm/sev-snp.h
+++ b/arch/x86/include/asm/sev-snp.h
@@ -67,6 +67,35 @@ struct __packed snp_page_state_change {
 #define X86_RMP_PG_LEVEL(level)	(((level) == PG_LEVEL_4K) ? RMP_PG_SIZE_4K : RMP_PG_SIZE_2M)
 #define RMP_X86_PG_LEVEL(level)	(((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
 
+/* RMP table entry format (PPR section 2.1.5.2) */
+struct __packed rmpentry {
+	union {
+		struct {
+			uint64_t assigned:1;
+			uint64_t pagesize:1;
+			uint64_t immutable:1;
+			uint64_t rsvd1:9;
+			uint64_t gpa:39;
+			uint64_t asid:10;
+			uint64_t vmsa:1;
+			uint64_t validated:1;
+			uint64_t rsvd2:1;
+		} info;
+		uint64_t low;
+	};
+	uint64_t high;
+};
+
+typedef struct rmpentry rmpentry_t;
+
+#define rmpentry_assigned(x)	((x)->info.assigned)
+#define rmpentry_pagesize(x)	(RMP_X86_PG_LEVEL((x)->info.pagesize))
+#define rmpentry_vmsa(x)	((x)->info.vmsa)
+#define rmpentry_asid(x)	((x)->info.asid)
+#define rmpentry_validated(x)	((x)->info.validated)
+#define rmpentry_gpa(x)		((unsigned long)(x)->info.gpa)
+#define rmpentry_immutable(x)	((x)->info.immutable)
+
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 #include <linux/jump_label.h>
 
@@ -94,6 +123,7 @@ void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
 		unsigned int npages);
 int snp_set_memory_shared(unsigned long vaddr, unsigned int npages);
 int snp_set_memory_private(unsigned long vaddr, unsigned int npages);
+rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level);
 
 extern struct static_key_false snp_enable_key;
 static inline bool snp_key_active(void)
@@ -124,6 +154,7 @@ early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr, unsigned i
 static inline int snp_set_memory_shared(unsigned long vaddr, unsigned int npages) { return 0; }
 static inline int snp_set_memory_private(unsigned long vaddr, unsigned int npages) { return 0; }
 static inline bool snp_key_active(void) { return false; }
+static inline rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level) { return NULL; }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 39461b9cb34e..06394b6d56b2 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -34,6 +34,8 @@
 
 #include "mm_internal.h"
 
+#define rmptable_page_offset(x)	(0x4000 + (((unsigned long) x) >> 8))
+
 /*
  * Since SME related variables are set early in the boot process they must
  * reside in the .data section so as not to be zeroed out when the .bss
@@ -612,3 +614,33 @@ static int __init mem_encrypt_snp_init(void)
  * SEV-SNP must be enabled across all CPUs, so make the initialization as a late initcall.
  */
 late_initcall(mem_encrypt_snp_init);
+
+rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level)
+{
+	unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
+	rmpentry_t *entry, *large_entry;
+	unsigned long vaddr;
+
+	if (!static_branch_unlikely(&snp_enable_key))
+		return NULL;
+
+	vaddr = rmptable_start + rmptable_page_offset(phys);
+	if (WARN_ON(vaddr >= rmptable_end))
+		return NULL;
+
+	entry = (rmpentry_t *)vaddr;
+
+	/*
+	 * Check if this page is covered by the large RMP entry. This is needed to get
+	 * the page level used in the RMP entry.
+	 *
+	 * e.g. if the page is covered by the large RMP entry then page size is set in the
+	 *       base RMP entry.
+	 */
+	vaddr = rmptable_start + rmptable_page_offset(phys & PMD_MASK);
+	large_entry = (rmpentry_t *)vaddr;
+	*level = rmpentry_pagesize(large_entry);
+
+	return entry;
+}
+EXPORT_SYMBOL_GPL(lookup_page_in_rmptable);
-- 
2.17.1



* [RFC Part2 PATCH 03/30] x86: add helper functions for RMPUPDATE and PSMASH instruction
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-04-15 18:00   ` Borislav Petkov
  2021-03-24 17:04 ` [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table Brijesh Singh
                   ` (26 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

The RMPUPDATE instruction writes a new RMP entry in the RMP Table. The
hypervisor will use the instruction to add pages to the RMP table. See
APM3 for details on the instruction operations.

The PSMASH instruction expands a 2MB RMP entry into a corresponding set of
contiguous 4KB-Page RMP entries. The hypervisor will use this instruction
to adjust the RMP entry without invalidating the previous RMP entry.
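
Both helpers in this patch re-issue the instruction while it reports that
another processor is modifying the entry. The retry pattern can be sketched
in userspace as below. This is only an illustration: the instruction is
replaced by a hypothetical stand-in function (fake_rmp_op), and the only
values taken from the patch are the two FAIL_INUSE return codes.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as defined in sev-snp.h by this patch. */
#define RMPUPDATE_FAIL_INUSE 3
#define PSMASH_FAIL_INUSE    3

typedef int (*rmp_op_fn)(void *ctx);

/* Re-issue op until it returns something other than inuse_code. */
static inline int rmp_retry_while_inuse(rmp_op_fn op, void *ctx, int inuse_code)
{
	int ret;

	assert(op != NULL);
	do {
		ret = op(ctx);
	} while (ret == inuse_code);

	return ret;
}

/*
 * Hypothetical stand-in for PSMASH/RMPUPDATE: reports "in use" for the
 * number of calls stored in *ctx, then reports success (0).
 */
static inline int fake_rmp_op(void *ctx)
{
	int *busy = ctx;

	if (*busy > 0) {
		(*busy)--;
		return PSMASH_FAIL_INUSE;
	}
	return 0;
}
```

Passing the in-use code as a parameter lets each caller compare against the
constant that belongs to its own instruction. Note that the loop is
unbounded, as in the patch, so it relies on the contending processor
eventually finishing its update.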

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/sev-snp.h | 27 ++++++++++++++++++++++
 arch/x86/mm/mem_encrypt.c      | 41 ++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/arch/x86/include/asm/sev-snp.h b/arch/x86/include/asm/sev-snp.h
index 2aa14b38c5ed..199d88a38c76 100644
--- a/arch/x86/include/asm/sev-snp.h
+++ b/arch/x86/include/asm/sev-snp.h
@@ -96,6 +96,29 @@ typedef struct rmpentry rmpentry_t;
 #define rmpentry_gpa(x)		((unsigned long)(x)->info.gpa)
 #define rmpentry_immutable(x)	((x)->info.immutable)
 
+
+/* Return code of RMPUPDATE */
+#define RMPUPDATE_SUCCESS		0
+#define RMPUPDATE_FAIL_INPUT		1
+#define RMPUPDATE_FAIL_PERMISSION	2
+#define RMPUPDATE_FAIL_INUSE		3
+#define RMPUPDATE_FAIL_OVERLAP		4
+
+struct rmpupdate {
+	u64 gpa;
+	u8 assigned;
+	u8 pagesize;
+	u8 immutable;
+	u8 rsvd;
+	u32 asid;
+} __packed;
+
+/* Return code of PSMASH */
+#define PSMASH_FAIL_INPUT		1
+#define PSMASH_FAIL_PERMISSION		2
+#define PSMASH_FAIL_INUSE		3
+#define PSMASH_FAIL_BADADDR		4
+
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 #include <linux/jump_label.h>
 
@@ -124,6 +147,8 @@ void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
 int snp_set_memory_shared(unsigned long vaddr, unsigned int npages);
 int snp_set_memory_private(unsigned long vaddr, unsigned int npages);
 rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level);
+int rmptable_psmash(struct page *page);
+int rmptable_rmpupdate(struct page *page, struct rmpupdate *e);
 
 extern struct static_key_false snp_enable_key;
 static inline bool snp_key_active(void)
@@ -155,6 +180,8 @@ static inline int snp_set_memory_shared(unsigned long vaddr, unsigned int npages
 static inline int snp_set_memory_private(unsigned long vaddr, unsigned int npages) { return 0; }
 static inline bool snp_key_active(void) { return false; }
 static inline rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level) { return NULL; }
+static inline int rmptable_psmash(struct page *page) { return -ENXIO; }
+static inline int rmptable_rmpupdate(struct page *page, struct rmpupdate *e) { return -ENXIO; }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 06394b6d56b2..7a0138cb3e17 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -644,3 +644,44 @@ rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level)
 	return entry;
 }
 EXPORT_SYMBOL_GPL(lookup_page_in_rmptable);
+
+int rmptable_psmash(struct page *page)
+{
+	unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
+	int ret;
+
+	if (!static_branch_unlikely(&snp_enable_key))
+		return -ENXIO;
+
+	/* Retry if another processor is modifying the RMP entry. */
+	do {
+		asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
+			      : "=a"(ret)
+			      : "a"(spa)
+			      : "memory", "cc");
+	} while (ret == PSMASH_FAIL_INUSE);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(rmptable_psmash);
+
+int rmptable_rmpupdate(struct page *page, struct rmpupdate *val)
+{
+	unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
+	bool flush = true;
+	int ret;
+
+	if (!static_branch_unlikely(&snp_enable_key))
+		return -ENXIO;
+
+	/* Retry if another processor is modifying the RMP entry. */
+	do {
+		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
+			     : "=a"(ret)
+			     : "a"(spa), "c"((unsigned long)val), "d"(flush)
+			     : "memory", "cc");
+	} while (ret == RMPUPDATE_FAIL_INUSE);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(rmptable_rmpupdate);
-- 
2.17.1



* [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (2 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 03/30] x86: add helper functions for RMPUPDATE and PSMASH instruction Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-25 15:17   ` Dave Hansen
  2021-04-19 12:32   ` Borislav Petkov
  2021-03-24 17:04 ` [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code Brijesh Singh
                   ` (25 subsequent siblings)
  29 siblings, 2 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

The integrity guarantee of SEV-SNP is enforced through the RMP table.
The RMP is used in conjunction with standard x86 and IOMMU page
tables to enforce memory restrictions and page access rights. The
RMP is indexed by system physical address, and is checked at the end
of CPU and IOMMU table walks. The RMP check is enforced as soon as
SEV-SNP is enabled globally in the system. Not every memory access
requires an RMP check. In particular, the read accesses from the
hypervisor do not require RMP checks because the data confidentiality
is already protected via memory encryption. When hardware encounters
an RMP check failure, it raises a page-fault exception. The RMP bit in
the fault error code can be used to determine if the fault was due to an
RMP check failure.

A write from the hypervisor goes through the RMP checks. When the
hypervisor writes to pages, hardware checks to ensure that the assigned
bit in the RMP is zero (i.e. the page is shared). If the page table entry
that gives the sPA indicates that the target page size is a large page,
then all RMP entries for the constituent 4KB pages of the target must have
the assigned bit 0. If any of those entries does not have the assigned bit
0, then hardware will raise an RMP violation. To resolve it, we must split
the page table entry leading to the target page into 4K.
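
The rule above can be illustrated with a toy software model. This is a
minimal sketch, not a description of the hardware implementation: the RMP is
reduced to an array of "assigned" bits, one per 4K page, and
hv_write_allowed() is a hypothetical helper that reports whether a
hypervisor write would pass the check for a given mapping size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_2M 512u

/*
 * assigned[i] models the RMP "assigned" bit of 4K page i. A write through
 * a 4K mapping only looks at the target page. A write through a 2M mapping
 * requires every constituent 4K page of the 2M region to be unassigned.
 */
static inline bool hv_write_allowed(const uint8_t *assigned, size_t npages,
				    size_t pfn, bool mapped_2m)
{
	size_t first = mapped_2m ? (pfn & ~(size_t)(PAGES_PER_2M - 1)) : pfn;
	size_t count = mapped_2m ? PAGES_PER_2M : 1;
	size_t i;

	assert(assigned != NULL);

	/* Ranges outside the toy table are treated as not allowed. */
	if (first + count > npages)
		return false;

	for (i = 0; i < count; i++) {
		if (assigned[first + i])
			return false;
	}
	return true;
}
```

With a single guest-owned page in a 2M region, a write to any other page of
that region is rejected when it goes through the 2M mapping but accepted
when it goes through a 4K mapping, which is why splitting the mapping
resolves the fault.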

This poses a challenge in the Linux memory model. The Linux kernel
creates a direct mapping of all the physical memory -- referred to as
the physmap. The physmap may contain a valid mapping of guest owned pages.
During the page table walk, we may get into the situation where one
of the pages within the large page is owned by the guest (i.e. the assigned
bit is set in the RMP). A write to a non-guest page within the large page
will raise an RMP violation. To work around it, we call set_memory_4k() to
split the physmap before adding the page in the RMP table. This ensures that
the pages added in the RMP table are mapped as 4K in the physmap.

The splitting of the physmap is a temporary solution until we work to
improve the kernel page fault handler to split the pages on demand.
One of the disadvantages of splitting is that eventually, we will end up
breaking down the entire physmap unless we combine the split pages back to
a large page. I am open to suggestions on various approaches we could
take to address this problem.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/mm/mem_encrypt.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 7a0138cb3e17..4047acb37c30 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -674,6 +674,12 @@ int rmptable_rmpupdate(struct page *page, struct rmpupdate *val)
 	if (!static_branch_unlikely(&snp_enable_key))
 		return -ENXIO;
 
+	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
+	if (ret) {
+		pr_err("SEV-SNP: failed to split physical address 0x%lx (%d)\n", spa, ret);
+		return ret;
+	}
+
 	/* Retry if another processor is modifying the RMP entry. */
 	do {
 		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
-- 
2.17.1



* [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (3 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 18:03   ` Dave Hansen
  2021-04-20 10:32   ` Borislav Petkov
  2021-03-24 17:04 ` [RFC Part2 PATCH 06/30] x86/fault: dump the RMP entry on #PF Brijesh Singh
                   ` (24 subsequent siblings)
  29 siblings, 2 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

Bit 31 of the page-fault error code is set when the processor encounters
an RMP violation.
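
A minimal userspace sketch of how the new bit could be tested is shown
below. The bit values match the enum in trap_pf.h, but the helper names
pf_is_rmp_violation() and pf_reason() are hypothetical. pf_reason() follows
the same precedence as the ternary chain in show_fault_oops() after this
patch, so the RMP reason is only reported when the protection bit is set
and neither the reserved-bit nor the protection-key bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Subset of the x86 page-fault error code bits from trap_pf.h. */
#define X86_PF_PROT (1ULL << 0)
#define X86_PF_RSVD (1ULL << 3)
#define X86_PF_PK   (1ULL << 5)
#define X86_PF_RMP  (1ULL << 31)

/* True when the fault error code reports an RMP violation (bit 31). */
static inline bool pf_is_rmp_violation(uint64_t error_code)
{
	return (error_code & X86_PF_RMP) != 0;
}

/* Same ordering of checks as show_fault_oops() in the patched fault.c. */
static inline const char *pf_reason(uint64_t error_code)
{
	return !(error_code & X86_PF_PROT) ? "not-present page" :
	       (error_code & X86_PF_RSVD)  ? "reserved bit violation" :
	       (error_code & X86_PF_PK)    ? "protection keys violation" :
	       (error_code & X86_PF_RMP)   ? "rmp violation" :
					     "permissions violation";
}
```

Keeping the error code in a 64-bit type avoids any sign or width surprises
when testing bit 31.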

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/trap_pf.h | 2 ++
 arch/x86/mm/fault.c            | 1 +
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..107f9d947e8d 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -12,6 +12,7 @@
  *   bit 4 ==				1: fault was an instruction fetch
  *   bit 5 ==				1: protection keys block access
  *   bit 15 ==				1: SGX MMU page-fault
+ *   bit 31 ==				1: fault was an RMP violation
  */
 enum x86_pf_error_code {
 	X86_PF_PROT	=		1 << 0,
@@ -21,6 +22,7 @@ enum x86_pf_error_code {
 	X86_PF_INSTR	=		1 << 4,
 	X86_PF_PK	=		1 << 5,
 	X86_PF_SGX	=		1 << 15,
+	X86_PF_RMP	=		1ull << 31,
 };
 
 #endif /* _ASM_X86_TRAP_PF_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f1f1b5a0956a..f39b551f89a6 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -547,6 +547,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
 		 !(error_code & X86_PF_PROT) ? "not-present page" :
 		 (error_code & X86_PF_RSVD)  ? "reserved bit violation" :
 		 (error_code & X86_PF_PK)    ? "protection keys violation" :
+		 (error_code & X86_PF_RMP)   ? "rmp violation" :
 					       "permissions violation");
 
 	if (!(error_code & X86_PF_USER) && user_mode(regs)) {
-- 
2.17.1



* [RFC Part2 PATCH 06/30] x86/fault: dump the RMP entry on #PF
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (4 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:47   ` Andy Lutomirski
  2021-03-24 17:04 ` [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation Brijesh Singh
                   ` (23 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

If hardware detects an RMP violation, it raises a page-fault exception
with the RMP bit set. To aid debugging, dump the RMP entry of the faulting
address.
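
The 2MB scan that the dump falls back to can be sketched in userspace C
as below. This is a minimal model, not the kernel code: struct rmp_entry
and scan_2mb_region() are hypothetical, and the RMP table is replaced by
a plain array indexed by pfn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(UINT64_C(1) << PAGE_SHIFT)
#define PMD_SIZE	(UINT64_C(1) << 21)
#define PMD_MASK	(~(PMD_SIZE - 1))

/* Hypothetical stand-in for a 16-byte RMP entry. */
struct rmp_entry {
	uint64_t high;
	uint64_t low;
};

/*
 * Walk the 2MB region that contains paddr and record the physical address
 * of every 4K page whose mock RMP entry has at least one bit set.
 * Returns the number of addresses stored in out[].
 */
static size_t scan_2mb_region(const struct rmp_entry *table, size_t nr_entries,
			      uint64_t paddr, uint64_t *out, size_t max)
{
	uint64_t start = paddr & PMD_MASK;
	uint64_t end = start + PMD_SIZE;
	size_t n = 0;

	for (; start < end; start += PAGE_SIZE) {
		uint64_t pfn = start >> PAGE_SHIFT;

		if (pfn >= nr_entries)	/* lookup failed: stop, as the patch does */
			break;
		if ((table[pfn].high || table[pfn].low) && n < max)
			out[n++] = start;	/* address of the page being visited */
	}
	return n;
}

/* Usage: only the entry inside the faulting 2MB region is reported. */
void scan_2mb_region_example(void)
{
	static struct rmp_entry table[1024];	/* covers 4MB, zero-initialized */
	uint64_t hits[4];

	table[5].high = 1;	/* pfn 5: outside the region scanned below */
	table[0x300].low = 1;	/* pfn 0x300: inside 0x200000..0x3fffff */

	assert(scan_2mb_region(table, 1024, 0x345678, hits, 4) == 1);
	assert(hits[0] == 0x300000);
}
```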

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/mm/fault.c | 75 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f39b551f89a6..7605e06a6dd9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -31,6 +31,7 @@
 #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
 #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
+#include <asm/sev-snp.h>		/* lookup_rmpentry ...		*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -147,6 +148,76 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 DEFINE_SPINLOCK(pgd_lock);
 LIST_HEAD(pgd_list);
 
+static void dump_rmpentry(struct page *page, rmpentry_t *e)
+{
+	unsigned long paddr = page_to_pfn(page) << PAGE_SHIFT;
+
+	pr_alert("RMPEntry paddr 0x%lx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx asid=%d "
+		"vmsa=%d validated=%d]\n", paddr, rmpentry_assigned(e), rmpentry_immutable(e),
+		rmpentry_pagesize(e), rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
+		rmpentry_validated(e));
+	pr_alert("RMPEntry paddr 0x%lx %016llx %016llx\n", paddr, e->high, e->low);
+}
+
+static void show_rmpentry(unsigned long address)
+{
+	struct page *page = virt_to_page(address);
+	rmpentry_t *entry, *large_entry;
+	int level, rmp_level;
+	pgd_t *pgd;
+	pte_t *pte;
+
+	/* Get the RMP entry for the fault address */
+	entry = lookup_page_in_rmptable(page, &rmp_level);
+	if (!entry) {
+		pr_alert("SEV-SNP: failed to read RMP entry for address 0x%lx\n", address);
+		return;
+	}
+
+	dump_rmpentry(page, entry);
+
+	/*
+	 * If fault occurred during the large page walk, dump the RMP entry at base of 2MB page.
+	 */
+	pgd = __va(read_cr3_pa());
+	pgd += pgd_index(address);
+	pte = lookup_address_in_pgd(pgd, address, &level);
+	if ((level > PG_LEVEL_4K) && (!IS_ALIGNED(address, PMD_SIZE))) {
+		address = address & PMD_MASK;
+		large_entry = lookup_page_in_rmptable(virt_to_page(address), &rmp_level);
+		if (!large_entry) {
+			pr_alert("SEV-SNP: failed to read large RMP entry 0x%lx\n",
+				address & PMD_MASK);
+			return;
+		}
+
+		dump_rmpentry(virt_to_page(address), large_entry);
+	}
+
+	/*
+	 * If the RMP entry at the faulting address is not assigned, its dump may not provide
+	 * any useful debug information. Iterate through the entire 2MB region and dump every
+	 * RMP entry that has at least one bit set.
+	 */
+	if (!rmpentry_assigned(entry)) {
+		unsigned long start, end;
+
+		start = address & PMD_MASK;
+		end = start + PMD_SIZE;
+
+		for (; start < end; start += PAGE_SIZE) {
+			entry = lookup_page_in_rmptable(virt_to_page(start), &rmp_level);
+			if (!entry)
+				return;
+
+			if (entry->high || entry->low)	/* dump only entries with a bit set */
+				pr_alert("RMPEntry paddr %lx: %016llx %016llx\n",
+					 page_to_pfn(virt_to_page(start)) << PAGE_SHIFT,
+					 entry->high, entry->low);
+		}
+	}
+}
+
 #ifdef CONFIG_X86_32
 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
 {
@@ -580,6 +651,10 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
 	}
 
 	dump_pagetable(address);
+
+	if (error_code & X86_PF_RMP)
+		show_rmpentry(address);
+
 }
 
 static noinline void
-- 
2.17.1



* [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (5 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 06/30] x86/fault: dump the RMP entry on #PF Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-25 14:30   ` Dave Hansen
  2021-03-25 14:48   ` Dave Hansen
  2021-03-24 17:04 ` [RFC Part2 PATCH 08/30] crypto:ccp: define the SEV-SNP commands Brijesh Singh
                   ` (22 subsequent siblings)
  29 siblings, 2 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

When SEV-SNP is enabled globally in the system, a write from the hypervisor
can raise an RMP violation. We can resolve the RMP violation by splitting
the virtual address to a lower page level.

For example:
- guest made a page shared in the RMP entry so that the hypervisor
  can write to it.
- the hypervisor has mapped the pfn as a large page. A write access
  will cause an RMP violation if one of the pages within the 2MB region
  is a guest private page.

The above RMP violation can be resolved by simply splitting the large
page.

The architecture-specific code reads the RMP entry to determine whether
the fault can be resolved by splitting and, if so, propagates the request
to split the page by setting the newly introduced fault flag
(FAULT_FLAG_PAGE_SPLIT). If the fault cannot be resolved by splitting,
a SIGBUS signal is sent to terminate the process.
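
The decision described above can be sketched in userspace C as follows.
This is a simplified model under stated assumptions: struct
mock_rmp_entry, classify_rmp_fault() and pfn_in_2m_mapping() are
hypothetical names, the RMP entry is reduced to the two fields the
handler consults, and only the 2MB case of the pfn adjustment is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PTRS_PER_PMD	512	/* 4K pages per 2MB mapping */

enum pg_level { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2 };
enum rmp_fault_action { RMP_FAULT_RETRY, RMP_FAULT_KILL, RMP_FAULT_PAGE_SPLIT };

/* Hypothetical, simplified view of the two RMP fields the handler consults. */
struct mock_rmp_entry {
	bool assigned;	/* page is guest-owned */
	int level;	/* page size recorded in the RMP */
};

/* pfn of the 4K page that 'address' hits inside a 2MB mapping at base_pfn. */
static uint64_t pfn_in_2m_mapping(uint64_t base_pfn, uint64_t address)
{
	return base_pfn | ((address >> PAGE_SHIFT) & (PTRS_PER_PMD - 1));
}

/* Same ordering of checks as handle_rmp_page_fault() in this patch. */
static enum rmp_fault_action classify_rmp_fault(int map_level,
						const struct mock_rmp_entry *e)
{
	if (!e || e->assigned)
		return RMP_FAULT_KILL;		/* no entry, or guest-owned page */
	if (map_level > e->level)
		return RMP_FAULT_PAGE_SPLIT;	/* mapping is larger than the RMP entry */
	return RMP_FAULT_RETRY;
}

/* Usage: a 2MB mapping over a 4K shared page asks for a split. */
void classify_rmp_fault_example(void)
{
	struct mock_rmp_entry shared_4k = { .assigned = false, .level = PG_LEVEL_4K };
	struct mock_rmp_entry private_4k = { .assigned = true, .level = PG_LEVEL_4K };

	assert(classify_rmp_fault(PG_LEVEL_2M, &shared_4k) == RMP_FAULT_PAGE_SPLIT);
	assert(classify_rmp_fault(PG_LEVEL_2M, &private_4k) == RMP_FAULT_KILL);
	assert(pfn_in_2m_mapping(0x1200, UINT64_C(0x7f0000203000)) == 0x1203);
}
```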

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/mm/fault.c | 81 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h  |  6 +++-
 mm/memory.c         | 11 ++++++
 3 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7605e06a6dd9..f6571563f433 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1305,6 +1305,70 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 }
 NOKPROBE_SYMBOL(do_kern_addr_fault);
 
+#define RMP_FAULT_RETRY		0
+#define RMP_FAULT_KILL		1
+#define RMP_FAULT_PAGE_SPLIT	2
+
+static inline size_t pages_per_hpage(int level)
+{
+	return page_level_size(level) / PAGE_SIZE;
+}
+
+/*
+ * An RMP fault can happen when the hypervisor attempts to write to:
+ * 1. a guest-owned page, or
+ * 2. a large page in which at least one 4K page is guest-owned.
+ *
+ * #1 will happen only when a process or VMM is attempting to modify the guest page
+ * without the guest's cooperation. If a guest wants a VMM to be able to write to its memory
+ * then it should make the page shared. If we detect #1, kill the process because we cannot
+ * resolve the fault.
+ *
+ * #2 can happen when the page level does not match between the RMP entry and the x86
+ * page table walk, e.g. the page is mapped as a large page in the x86 page table but it is
+ * added as a 4K shared page in the RMP entry. This can be resolved by splitting the mapping
+ * into a smaller page level.
+ */
+static int handle_rmp_page_fault(unsigned long hw_error_code, unsigned long address)
+{
+	unsigned long pfn, mask;
+	int rmp_level, level;
+	rmpentry_t *e;
+	pte_t *pte;
+
+	/* Get the native page level */
+	pte = lookup_address_in_mm(current->mm, address, &level);
+	if (unlikely(!pte))
+		return RMP_FAULT_KILL;
+
+	pfn = pte_pfn(*pte);
+	if (level > PG_LEVEL_4K) {
+		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
+		pfn |= (address >> PAGE_SHIFT) & mask;
+	}
+
+	/* Get the page level from the RMP entry. */
+	e = lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
+	if (!e) {
+		pr_alert("SEV-SNP: failed to lookup RMP entry for address 0x%lx pfn 0x%lx\n",
+			 address, pfn);
+		return RMP_FAULT_KILL;
+	}
+
+	/* It's a guest-owned page */
+	if (rmpentry_assigned(e))
+		return RMP_FAULT_KILL;
+
+	/*
+	 * It's a shared page but the page level does not match between the native walk
+	 * and the RMP entry.
+	 */
+	if (level > rmp_level)
+		return RMP_FAULT_PAGE_SPLIT;
+
+	return RMP_FAULT_RETRY;
+}
+
 /* Handle faults in the user portion of the address space */
 static inline
 void do_user_addr_fault(struct pt_regs *regs,
@@ -1315,6 +1379,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	vm_fault_t fault;
+	int ret;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
@@ -1377,6 +1442,22 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (hw_error_code & X86_PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
+	/*
+	 * If it's an RMP violation, see if we can resolve it.
+	 */
+	if ((hw_error_code & X86_PF_RMP)) {
+		ret = handle_rmp_page_fault(hw_error_code, address);
+		if (ret == RMP_FAULT_PAGE_SPLIT) {
+			flags |= FAULT_FLAG_PAGE_SPLIT;
+		} else if (ret == RMP_FAULT_KILL) {
+			fault = VM_FAULT_SIGBUS;
+			mm_fault_error(regs, hw_error_code, address, fault);
+			return;
+		} else {
+			return;
+		}
+	}
+
 #ifdef CONFIG_X86_64
 	/*
 	 * Faults in the vsyscall page might need emulation.  The
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..1be3218f3738 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -434,6 +434,8 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_PAGE_SPLIT: The fault was due to a page size mismatch; split the region into a
+ *   smaller page size and retry.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -464,6 +466,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE			0x80
 #define FAULT_FLAG_INSTRUCTION  		0x100
 #define FAULT_FLAG_INTERRUPTIBLE		0x200
+#define FAULT_FLAG_PAGE_SPLIT			0x400
 
 /*
  * The default fault flags that should be used by most of the
@@ -501,7 +504,8 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags)
 	{ FAULT_FLAG_USER,		"USER" }, \
 	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
 	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
-	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
+	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
+	{ FAULT_FLAG_PAGE_SPLIT,	"PAGESPLIT" }
 
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..c9dcf9b30719 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4427,6 +4427,12 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+static int handle_split_page_fault(struct vm_fault *vmf)
+{
+	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
+	return 0;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
@@ -4448,6 +4454,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	pgd_t *pgd;
 	p4d_t *p4d;
 	vm_fault_t ret;
+	int split_page = flags & FAULT_FLAG_PAGE_SPLIT;
 
 	pgd = pgd_offset(mm, address);
 	p4d = p4d_alloc(mm, pgd, address);
@@ -4504,6 +4511,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
 		}
+
+		if (split_page)
+			return handle_split_page_fault(&vmf);
+
 		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
-- 
2.17.1



* [RFC Part2 PATCH 08/30] crypto:ccp: define the SEV-SNP commands
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (6 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 09/30] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP Brijesh Singh
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

AMD introduced the next generation of SEV called SEV-SNP (Secure Nested
Paging). SEV-SNP builds upon existing SEV and SEV-ES functionality
while adding new hardware security protection.

Define the commands and structures used to communicate with the AMD-SP
when creating and managing the SEV-SNP guests. The SEV-SNP firmware spec
is available at developer.amd.com/sev.
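
Two properties of these command buffers can be illustrated with a small
userspace sketch: packing removes tail padding from the structures, and
SNP_PAGE_RECLAIM / SNP_PAGE_UNSMASH carry the page size in bit 0 of the
physical address. The struct below is a userspace copy of
sev_data_snp_activate; SNP_PADDR_SIZE_2M and the snp_*() helpers are
hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace copy of one command buffer from the patch, packed as in the kernel. */
struct sev_data_snp_activate {
	uint64_t gctx_paddr;
	uint32_t asid;
} __attribute__((packed));

/* Packing drops the tail padding: 8 + 4 bytes, where x86-64 would otherwise use 16. */
_Static_assert(sizeof(struct sev_data_snp_activate) == 12, "unexpected padding");

#define SNP_PADDR_SIZE_2M	UINT64_C(1)	/* hypothetical name for bit 0 */

/* Fold the page-size flag into a page-aligned system physical address. */
static uint64_t snp_encode_paddr(uint64_t paddr, bool is_2m)
{
	assert((paddr & 0xfff) == 0);	/* low bits must be free for the flag */
	return is_2m ? (paddr | SNP_PADDR_SIZE_2M) : paddr;
}

static bool snp_paddr_is_2m(uint64_t encoded)
{
	return (encoded & SNP_PADDR_SIZE_2M) != 0;
}

static uint64_t snp_paddr_addr(uint64_t encoded)
{
	return encoded & ~SNP_PADDR_SIZE_2M;
}

/* Usage: encode a 2MB page at 0x200000 and decode it again. */
void snp_paddr_example(void)
{
	uint64_t enc = snp_encode_paddr(0x200000, true);

	assert(enc == 0x200001);
	assert(snp_paddr_is_2m(enc));
	assert(snp_paddr_addr(enc) == 0x200000);
}
```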

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c |  11 ++
 include/linux/psp-sev.h      | 210 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/psp-sev.h |  27 +++++
 3 files changed, 248 insertions(+)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 476113e12489..8a9fd843ad9e 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -128,6 +128,17 @@ static int sev_cmd_buffer_len(int cmd)
 	case SEV_CMD_LAUNCH_UPDATE_SECRET:	return sizeof(struct sev_data_launch_secret);
 	case SEV_CMD_DOWNLOAD_FIRMWARE:		return sizeof(struct sev_data_download_firmware);
 	case SEV_CMD_GET_ID:			return sizeof(struct sev_data_get_id);
+	case SEV_CMD_SNP_GCTX_CREATE:		return sizeof(struct sev_data_snp_gctx_create);
+	case SEV_CMD_SNP_LAUNCH_START:		return sizeof(struct sev_data_snp_launch_start);
+	case SEV_CMD_SNP_LAUNCH_UPDATE:		return sizeof(struct sev_data_snp_launch_update);
+	case SEV_CMD_SNP_ACTIVATE:		return sizeof(struct sev_data_snp_activate);
+	case SEV_CMD_SNP_DECOMMISSION:		return sizeof(struct sev_data_snp_decommission);
+	case SEV_CMD_SNP_PAGE_RECLAIM:		return sizeof(struct sev_data_snp_page_reclaim);
+	case SEV_CMD_SNP_GUEST_STATUS:		return sizeof(struct sev_data_snp_guest_status);
+	case SEV_CMD_SNP_LAUNCH_FINISH:		return sizeof(struct sev_data_snp_launch_finish);
+	case SEV_CMD_SNP_PAGE_UNSMASH:		return sizeof(struct sev_data_snp_page_unsmash);
+	case SEV_CMD_SNP_PLATFORM_STATUS:	return sizeof(struct sev_data_snp_platform_status_buf);
+	case SEV_CMD_SNP_GUEST_REQUEST:		return sizeof(struct sev_data_snp_guest_request);
 	default:				return 0;
 	}
 
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 49d155cd2dfe..df89f0207099 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -83,6 +83,34 @@ enum sev_cmd {
 	SEV_CMD_DBG_DECRYPT		= 0x060,
 	SEV_CMD_DBG_ENCRYPT		= 0x061,
 
+	/* SNP specific commands */
+	SEV_CMD_SNP_INIT		= 0x81,
+	SEV_CMD_SNP_SHUTDOWN		= 0x82,
+	SEV_CMD_SNP_PLATFORM_STATUS	= 0x83,
+	SEV_CMD_SNP_DF_FLUSH		= 0x84,
+	SEV_CMD_SNP_DOWNLOAD_FIRMWARE	= 0x85,
+	SEV_CMD_SNP_GET_ID		= 0x86,
+	SEV_CMD_SNP_DECOMMISSION	= 0x90,
+	SEV_CMD_SNP_ACTIVATE		= 0x91,
+	SEV_CMD_SNP_GUEST_STATUS	= 0x92,
+	SEV_CMD_SNP_GCTX_CREATE		= 0x93,
+	SEV_CMD_SNP_GUEST_REQUEST	= 0x94,
+	SEV_CMD_SNP_ACTIVATE_EX		= 0x95,
+	SEV_CMD_SNP_LAUNCH_START	= 0xA0,
+	SEV_CMD_SNP_LAUNCH_UPDATE	= 0xA1,
+	SEV_CMD_SNP_LAUNCH_FINISH	= 0xA2,
+	SEV_CMD_SNP_DBG_DECRYPT		= 0xB0,
+	SEV_CMD_SNP_DBG_ENCRYT		= 0xB1,
+	SEV_CMD_SNP_PAGE_SWAP_OUT	= 0xC0,
+	SEV_CMD_SNP_PAGE_SWAP_IN	= 0xC1,
+	SEV_CMD_SNP_PAGE_MOVE		= 0xC2,
+	SEV_CMD_SNP_PAGE_MD_INIT	= 0xC3,
+	SEV_CMD_SNP_PAGE_MD_RECLAIM	= 0xC4,
+	SEV_CMD_SNP_PAGE_RO_RECLAIM	= 0xC5,
+	SEV_CMD_SNP_PAGE_RO_RESTORE	= 0xC6,
+	SEV_CMD_SNP_PAGE_RECLAIM	= 0xC7,
+	SEV_CMD_SNP_PAGE_UNSMASH	= 0xC8,
+
 	SEV_CMD_MAX,
 };
 
@@ -483,6 +511,188 @@ struct sev_data_dbg {
 	u32 len;				/* In */
 } __packed;
 
+/**
+ * struct sev_data_snp_platform_status_buf - SNP_PLATFORM_STATUS command params
+ *
+ * @status_paddr: system physical address where the status should be copied
+ */
+struct sev_data_snp_platform_status_buf {
+	u64 status_paddr;			/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_download_firmware - SNP_DOWNLOAD_FIRMWARE command params
+ *
+ * @address: physical address of firmware image
+ * @len: len of the firmware image
+ */
+struct sev_data_snp_download_firmware {
+	u64 address;				/* In */
+	u32 len;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_gctx_create - SNP_GCTX_CREATE command params
+ *
+ * @gctx_paddr: system physical address of the page donated to firmware by
+ *		the hypervisor to contain the guest context.
+ */
+struct sev_data_snp_gctx_create {
+	u64 gctx_paddr;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_activate - SNP_ACTIVATE command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @asid: ASID to bind to the guest
+ */
+struct sev_data_snp_activate {
+	u64 gctx_paddr;				/* In */
+	u32 asid;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_decommission - SNP_DECOMMISSION command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ */
+struct sev_data_snp_decommission {
+	u64 gctx_paddr;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_launch_start - SNP_LAUNCH_START command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @policy: guest policy
+ * @ma_gctx_paddr: system physical address of migration agent
+ * @imi_en: launch flow is launching an IMI for the purpose of
+ *   guest-assisted migration.
+ * @ma_en: the guest is associated with a migration agent
+ */
+struct sev_data_snp_launch_start {
+	u64 gctx_paddr;				/* In */
+	u64 policy;				/* In */
+	u64 ma_gctx_paddr;			/* In */
+	u32 ma_en:1;				/* In */
+	u32 imi_en:1;				/* In */
+	u32 rsvd:30;
+} __packed;
+
+/* SNP support page type */
+enum {
+	SNP_PAGE_TYPE_NORMAL		= 0x1,
+	SNP_PAGE_TYPE_VMSA		= 0x2,
+	SNP_PAGE_TYPE_ZERO		= 0x3,
+	SNP_PAGE_TYPE_UNMEASURED	= 0x4,
+	SNP_PAGE_TYPE_SECRET		= 0x5,
+	SNP_PAGE_TYPE_CPUID		= 0x6,
+
+	SNP_PAGE_TYPE_MAX
+};
+
+/**
+ * struct sev_data_snp_launch_update - SNP_LAUNCH_UPDATE command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @imi_page: indicates that this page is part of the IMI of the guest
+ * @page_type: encoded page type
+ * @page_size: page size 0 indicates 4K and 1 indicates 2MB page
+ * @address: system physical address of destination page to encrypt
+ * @vmpl3_perms: VMPL permission mask for VMPL3
+ * @vmpl2_perms: VMPL permission mask for VMPL2
+ * @vmpl1_perms: VMPL permission mask for VMPL1
+ */
+struct sev_data_snp_launch_update {
+	u64 gctx_paddr;				/* In */
+	u32 page_size:1;			/* In */
+	u32 page_type:3;			/* In */
+	u32 imi_page:1;				/* In */
+	u32 rsvd:27;
+	u32 rsvd2;
+	u64 address;				/* In */
+	u32 rsvd3:8;
+	u32 vmpl3_perms:8;			/* In */
+	u32 vmpl2_perms:8;			/* In */
+	u32 vmpl1_perms:8;			/* In */
+	u32 rsvd4;
+} __packed;
+
+/**
+ * struct sev_data_snp_launch_finish - SNP_LAUNCH_FINISH command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ */
+struct sev_data_snp_launch_finish {
+	u64 gctx_paddr;
+	u64 id_block_paddr;
+	u64 id_auth_paddr;
+	u8 id_block_en:1;
+	u8 auth_key_en:1;
+	u64 rsvd:62;
+	u8 host_data[32];
+} __packed;
+
+/**
+ * struct sev_data_snp_guest_status - SNP_GUEST_STATUS command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @address: system physical address of guest status page
+ */
+struct sev_data_snp_guest_status {
+	u64 gctx_paddr;
+	u64 address;
+} __packed;
+
+/**
+ * struct sev_data_snp_page_reclaim - SNP_PAGE_RECLAIM command params
+ *
+ * @paddr: system physical address of page to be reclaimed. Bit 0 indicates
+ *	the page size: 0h indicates a 4 kB page and 1h indicates a 2 MB page.
+ */
+struct sev_data_snp_page_reclaim {
+	u64 paddr;
+} __packed;
+
+/**
+ * struct sev_data_snp_page_unsmash - SNP_PAGE_UNSMASH command params
+ *
+ * @paddr: system physical address of page to be unsmashed. Bit 0 indicates
+ *	the page size: 0h indicates a 4 kB page and 1h indicates a 2 MB page.
+ */
+struct sev_data_snp_page_unsmash {
+	u64 paddr;
+} __packed;
+
+/**
+ * struct sev_data_snp_dbg - SNP_DBG_ENCRYPT/SNP_DBG_DECRYPT command parameters
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @src_addr: source address of data to operate on
+ * @dst_addr: destination address of data to operate on
+ * @len: len of data to operate on
+ */
+struct sev_data_snp_dbg {
+	u64 gctx_paddr;				/* In */
+	u64 src_addr;				/* In */
+	u64 dst_addr;				/* In */
+	u32 len;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_guest_request - SNP_GUEST_REQUEST command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @req_paddr: system physical address of request page
+ * @res_paddr: system physical address of response page
+ */
+struct sev_data_snp_guest_request {
+	u64 gctx_paddr;				/* In */
+	u64 req_paddr;				/* In */
+	u64 res_paddr;				/* In */
+} __packed;
+
 #ifdef CONFIG_CRYPTO_DEV_SP_PSP
 
 /**
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index 91b4c63d5cbf..9bc63b026091 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -61,6 +61,11 @@ typedef enum {
 	SEV_RET_INVALID_PARAM,
 	SEV_RET_RESOURCE_LIMIT,
 	SEV_RET_SECURE_DATA_INVALID,
+	SEV_RET_INVALID_PAGE_SIZE,
+	SEV_RET_INVALID_PAGE_STATE,
+	SEV_RET_INVALID_MDATA_ENTRY,
+	SEV_RET_INVALID_PAGE_OWNER,
+	SEV_RET_INVALID_PAGE_AEAD_OFLOW,
 	SEV_RET_MAX,
 } sev_ret_code;
 
@@ -147,6 +152,28 @@ struct sev_user_data_get_id2 {
 	__u32 length;				/* In/Out */
 } __packed;
 
+/**
+ * struct sev_user_snp_status - SNP platform status
+ *
+ * @api_major: API major version
+ * @api_minor: API minor version
+ * @state: current platform state
+ * @build_id: firmware build id for the API version
+ * @guest_count: the number of guests currently managed by the firmware
+ * @tcb_version: current TCB version
+ */
+struct sev_user_snp_status {
+	__u8 api_major;		/* Out */
+	__u8 api_minor;		/* Out */
+	__u8 state;		/* Out */
+	__u8 rsvd;
+	__u32 build_id;		/* Out */
+	__u32 rsvd1;
+	__u32 guest_count;	/* Out */
+	__u64 tcb_version;	/* Out */
+	__u64 rsvd2;
+} __packed;
+
 /**
  * struct sev_issue_cmd - SEV ioctl parameters
  *
-- 
2.17.1



* [RFC Part2 PATCH 09/30] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (7 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 08/30] crypto:ccp: define the SEV-SNP commands Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 10/30] crypto: ccp: shutdown SNP firmware on kexec Brijesh Singh
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

Before SNP VMs can be launched, the platform must be appropriately
configured and initialized. Platform initialization is accomplished via
the SNP_INIT command.
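
The control flow of the init path can be modelled in userspace C as a
rough sketch. struct mock_sev and mock_snp_init() are hypothetical; the
per-CPU MSR write, the cache flush and the firmware command are replaced
by a counter so that the idempotent behaviour is visible.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum { SEV_STATE_UNINIT = 0, SEV_STATE_INIT = 1 };

/* Hypothetical model of the sev_device fields that the init path touches. */
struct mock_sev {
	bool snp_enabled;	/* stands in for snp_key_active() */
	bool snp_inited;
	int state;
	int init_cmds_issued;	/* counts simulated SNP_INIT commands */
};

/* Mirrors the order of checks in __sev_snp_init_locked(). */
static int mock_snp_init(struct mock_sev *sev)
{
	if (!sev)
		return -ENODEV;

	/* Already initialized: nothing to do. */
	if (sev->snp_inited && sev->state >= SEV_STATE_INIT)
		return 0;

	if (!sev->snp_enabled)
		return -ENODEV;

	/*
	 * The real code clears MSR_VM_HSAVE_PA on every CPU and flushes
	 * the caches here, before issuing SNP_INIT to the firmware.
	 */
	sev->init_cmds_issued++;

	sev->snp_inited = true;
	sev->state = SEV_STATE_INIT;
	return 0;
}

/* Usage: a second call does not issue the command again. */
void mock_snp_init_example(void)
{
	struct mock_sev sev = { .snp_enabled = true, .state = SEV_STATE_UNINIT };

	assert(mock_snp_init(&sev) == 0);
	assert(mock_snp_init(&sev) == 0);
	assert(sev.init_cmds_issued == 1);
}
```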

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 164 ++++++++++++++++++++++++++++++++++-
 drivers/crypto/ccp/sev-dev.h |   2 +
 include/linux/psp-sev.h      |  16 ++++
 3 files changed, 179 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 8a9fd843ad9e..c983a8b040c3 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -21,8 +21,10 @@
 #include <linux/ccp.h>
 #include <linux/firmware.h>
 #include <linux/gfp.h>
+#include <linux/mem_encrypt.h>
 
 #include <asm/smp.h>
+#include <asm/sev-snp.h>
 
 #include "psp-dev.h"
 #include "sev-dev.h"
@@ -30,6 +32,7 @@
 #define DEVICE_NAME		"sev"
 #define SEV_FW_FILE		"amd/sev.fw"
 #define SEV_FW_NAME_SIZE	64
+#define __sme_page_pa(x) __sme_set(page_to_pfn(x) << PAGE_SHIFT)
 
 static DEFINE_MUTEX(sev_cmd_mutex);
 static struct sev_misc_dev *misc_dev;
@@ -574,6 +577,93 @@ static int sev_update_firmware(struct device *dev)
 	return ret;
 }
 
+static void snp_set_hsave_pa(void *arg)
+{
+	wrmsrl(MSR_VM_HSAVE_PA, 0);
+}
+
+static int __sev_snp_init_locked(int *error)
+{
+	struct psp_device *psp = psp_master;
+	struct sev_device *sev;
+	int rc = 0;
+
+	if (!psp || !psp->sev_data)
+		return -ENODEV;
+
+	sev = psp->sev_data;
+
+	if (sev->snp_inited && sev->state >= SEV_STATE_INIT)
+		return 0;
+
+	if (!snp_key_active()) {
+		dev_notice(sev->dev, "SNP is not enabled\n");
+		return -ENODEV;
+	}
+
+	/* SNP_INIT requires MSR_VM_HSAVE_PA to be set to 0h on all cores. */
+	on_each_cpu(snp_set_hsave_pa, NULL, 1);
+
+	/* Prepare for first SEV guest launch after INIT */
+	wbinvd_on_all_cpus();
+
+	/* Issue the SNP_INIT firmware command. */
+	rc = __sev_do_cmd_locked(SEV_CMD_SNP_INIT, NULL, error);
+	if (rc)
+		return rc;
+
+	sev->snp_inited = true;
+	sev->state = SEV_STATE_INIT;
+	dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
+
+	return rc;
+}
+
+int sev_snp_init(int *error)
+{
+	int rc;
+
+	mutex_lock(&sev_cmd_mutex);
+	rc = __sev_snp_init_locked(error);
+	mutex_unlock(&sev_cmd_mutex);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(sev_snp_init);
+
+static int __sev_snp_shutdown_locked(int *error)
+{
+	struct sev_device *sev = psp_master->sev_data;
+	int ret;
+
+	ret = __sev_do_cmd_locked(SEV_CMD_SNP_SHUTDOWN, NULL, error);
+	if (ret)
+		return ret;
+
+	wbinvd_on_all_cpus();
+
+	ret = __sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, error);
+	if (ret)
+		dev_err(sev->dev, "SEV-SNP firmware DF_FLUSH failed\n");
+
+	sev->snp_inited = false;
+	sev->state = SEV_STATE_UNINIT;
+	dev_dbg(sev->dev, "SEV-SNP firmware shutdown\n");
+
+	return ret;
+}
+
+static int sev_snp_shutdown(int *error)
+{
+	int rc;
+
+	mutex_lock(&sev_cmd_mutex);
+	rc = __sev_snp_shutdown_locked(error);
+	mutex_unlock(&sev_cmd_mutex);
+
+	return rc;
+}
+
 static int sev_ioctl_do_pek_import(struct sev_issue_cmd *argp, bool writable)
 {
 	struct sev_device *sev = psp_master->sev_data;
@@ -1043,6 +1133,42 @@ int sev_issue_cmd_external_user(struct file *filep, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(sev_issue_cmd_external_user);
 
+static int sev_snp_firmware_state(void)
+{
+	struct sev_data_snp_platform_status_buf *buf = NULL;
+	struct page *status_page = NULL;
+	int state = SEV_STATE_UNINIT;
+	int rc, error;
+
+	status_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	if (!status_page)
+		return -ENOMEM;
+
+	buf = kzalloc(sizeof(*buf), GFP_KERNEL_ACCOUNT);
+	if (!buf) {
+		__free_page(status_page);
+		return -ENOMEM;
+	}
+
+	buf->status_paddr = __sme_page_pa(status_page);
+	rc = sev_do_cmd(SEV_CMD_SNP_PLATFORM_STATUS, buf, &error);
+
+	/*
+	 * The status buffer is allocated as a hypervisor page. As per the SEV spec,
+	 * if the firmware is in the INIT state then the status buffer must be either
+	 * a firmware page or a default page. Since our status buffer is a hypervisor
+	 * page, the command is expected to fail with INVALID_PAGE_STATE when the
+	 * firmware is in the INIT state.
+	 */
+	if (rc && error == SEV_RET_INVALID_PAGE_STATE)
+		state = SEV_STATE_INIT;
+
+	kfree(buf);
+	__free_page(status_page);
+
+	return state;
+}
+
 void sev_pci_init(void)
 {
 	struct sev_device *sev = psp_master->sev_data;
@@ -1052,8 +1178,20 @@ void sev_pci_init(void)
 	if (!sev)
 		return;
 
+	/* Check if the SEV feature is enabled */
+	if (!boot_cpu_has(X86_FEATURE_SEV)) {
+		dev_err(sev->dev, "SEV: Memory Encryption is disabled by the BIOS\n");
+		return;
+	}
+
 	psp_timeout = psp_probe_timeout;
 
+	/* Check if the SNP firmware is in INIT state; if so, shut it down */
+	if (boot_cpu_has(X86_FEATURE_SEV_SNP)) {
+		if (sev_snp_firmware_state() == SEV_STATE_INIT)
+			sev_snp_shutdown(NULL);
+	}
+
 	if (sev_get_api_version())
 		goto err;
 
@@ -1086,6 +1224,21 @@ void sev_pci_init(void)
 			 "SEV: TMR allocation failed, SEV-ES support unavailable\n");
 	}
 
+	/*
+	 * If the boot CPU supports SNP, then first attempt to initialize
+	 * the SNP firmware.
+	 */
+	if (boot_cpu_has(X86_FEATURE_SEV_SNP)) {
+		rc = sev_snp_init(&error);
+		if (rc) {
+			/*
+			 * If we failed to INIT SNP then don't abort the probe.
+			 * Continue to initialize the legacy SEV firmware.
+			 */
+			dev_err(sev->dev, "SEV-SNP: failed to INIT error %#x\n", error);
+		}
+	}
+
 	/* Initialize the platform */
 	rc = sev_platform_init(&error);
 	if (rc && (error == SEV_RET_SECURE_DATA_INVALID)) {
@@ -1105,8 +1258,8 @@ void sev_pci_init(void)
 		return;
 	}
 
-	dev_info(sev->dev, "SEV API:%d.%d build:%d\n", sev->api_major,
-		 sev->api_minor, sev->build);
+	dev_info(sev->dev, "SEV%s API:%d.%d build:%d\n", sev->snp_inited ?
+		"-SNP" : "", sev->api_major, sev->api_minor, sev->build);
 
 	return;
 
@@ -1119,7 +1272,12 @@ void sev_pci_exit(void)
 	if (!psp_master->sev_data)
 		return;
 
-	sev_platform_shutdown(NULL);
+	if (boot_cpu_has(X86_FEATURE_SEV))
+		sev_platform_shutdown(NULL);
+
+	if (boot_cpu_has(X86_FEATURE_SEV_SNP))
+		sev_snp_shutdown(NULL);
+
 
 	if (sev_es_tmr) {
 		/* The TMR area was encrypted, flush it from the cache */
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index dd5c4fe82914..18b116a817ff 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -52,6 +52,8 @@ struct sev_device {
 	u8 api_major;
 	u8 api_minor;
 	u8 build;
+
+	bool snp_inited;
 };
 
 int sev_dev_init(struct psp_device *psp);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index df89f0207099..ec45c18c3b0a 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -709,6 +709,20 @@ struct sev_data_snp_guest_request {
  */
 int sev_platform_init(int *error);
 
+/**
+ * sev_snp_init - perform SEV SNP_INIT command
+ *
+ * @error: SEV command return code
+ *
+ * Returns:
+ * 0 if the SEV successfully processed the command
+ * -%ENODEV    if the SEV device is not available
+ * -%ENOTSUPP  if the SEV does not support SEV
+ * -%ETIMEDOUT if the SEV command timed out
+ * -%EIO       if the SEV returned a non-zero return code
+ */
+int sev_snp_init(int *error);
+
 /**
  * sev_platform_status - perform SEV PLATFORM_STATUS command
  *
@@ -816,6 +830,8 @@ sev_platform_status(struct sev_user_data_status *status, int *error) { return -E
 
 static inline int sev_platform_init(int *error) { return -ENODEV; }
 
+static inline int sev_snp_init(int *error) { return -ENODEV; }
+
 static inline int
 sev_guest_deactivate(struct sev_data_deactivate *data, int *error) { return -ENODEV; }
 
-- 
2.17.1



* [RFC Part2 PATCH 10/30] crypto: ccp: shutdown SNP firmware on kexec
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (8 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 09/30] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 11/30] crypto:ccp: provide APIs to issue SEV-SNP commands Brijesh Singh
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

When the kernel is getting ready to kexec, it calls device_shutdown() to
allow drivers to clean up before the kexec. If the SEV firmware is
initialized, shut it down before kexec'ing the new kernel.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 18 ++++++++++++------
 drivers/crypto/ccp/sp-pci.c  | 12 ++++++++++++
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index c983a8b040c3..562501c43d8f 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1110,6 +1110,15 @@ int sev_dev_init(struct psp_device *psp)
 	return ret;
 }
 
+static void sev_firmware_shutdown(void)
+{
+	if (boot_cpu_has(X86_FEATURE_SEV))
+		sev_platform_shutdown(NULL);
+
+	if (boot_cpu_has(X86_FEATURE_SEV_SNP))
+		sev_snp_shutdown(NULL);
+}
+
 void sev_dev_destroy(struct psp_device *psp)
 {
 	struct sev_device *sev = psp->sev_data;
@@ -1117,6 +1126,8 @@ void sev_dev_destroy(struct psp_device *psp)
 	if (!sev)
 		return;
 
+	sev_firmware_shutdown();
+
 	if (sev->misc)
 		kref_put(&misc_dev->refcount, sev_exit);
 
@@ -1272,12 +1283,7 @@ void sev_pci_exit(void)
 	if (!psp_master->sev_data)
 		return;
 
-	if (boot_cpu_has(X86_FEATURE_SEV))
-		sev_platform_shutdown(NULL);
-
-	if (boot_cpu_has(X86_FEATURE_SEV_SNP))
-		sev_snp_shutdown(NULL);
-
+	sev_firmware_shutdown();
 
 	if (sev_es_tmr) {
 		/* The TMR area was encrypted, flush it from the cache */
diff --git a/drivers/crypto/ccp/sp-pci.c b/drivers/crypto/ccp/sp-pci.c
index f471dbaef1fb..9210bfda91a2 100644
--- a/drivers/crypto/ccp/sp-pci.c
+++ b/drivers/crypto/ccp/sp-pci.c
@@ -239,6 +239,17 @@ static int sp_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	return ret;
 }
 
+static void sp_pci_shutdown(struct pci_dev *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct sp_device *sp = dev_get_drvdata(dev);
+
+	if (!sp)
+		return;
+
+	sp_destroy(sp);
+}
+
 static void sp_pci_remove(struct pci_dev *pdev)
 {
 	struct device *dev = &pdev->dev;
@@ -368,6 +379,7 @@ static struct pci_driver sp_pci_driver = {
 	.id_table = sp_pci_table,
 	.probe = sp_pci_probe,
 	.remove = sp_pci_remove,
+	.shutdown = sp_pci_shutdown,
 	.driver.pm = &sp_pci_pm_ops,
 };
 
-- 
2.17.1



* [RFC Part2 PATCH 11/30] crypto:ccp: provide APIs to issue SEV-SNP commands
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (9 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 10/30] crypto: ccp: shutdown SNP firmware on kexec Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 12/30] crypto ccp: handle the legacy SEV command when SNP is enabled Brijesh Singh
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

Provide the APIs for the hypervisor to manage an SEV-SNP guest. The
commands for SEV-SNP are defined in the SEV-SNP firmware specification.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 41 +++++++++++++++++
 include/linux/psp-sev.h      | 85 ++++++++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 562501c43d8f..242c4775eb56 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1019,6 +1019,47 @@ int sev_guest_df_flush(int *error)
 }
 EXPORT_SYMBOL_GPL(sev_guest_df_flush);
 
+int sev_guest_snp_decommission(struct sev_data_snp_decommission *data, int *error)
+{
+	return sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, data, error);
+}
+EXPORT_SYMBOL_GPL(sev_guest_snp_decommission);
+
+int sev_guest_snp_df_flush(int *error)
+{
+	return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error);
+}
+EXPORT_SYMBOL_GPL(sev_guest_snp_df_flush);
+
+int sev_snp_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
+{
+	return sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, data, error);
+}
+EXPORT_SYMBOL_GPL(sev_snp_reclaim);
+
+int sev_snp_unsmash(unsigned long paddr, int *error)
+{
+	struct sev_data_snp_page_unsmash *data;
+	int rc;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+	if (!data)
+		return -ENOMEM;
+
+	data->paddr = paddr;
+	rc = sev_do_cmd(SEV_CMD_SNP_PAGE_UNSMASH, data, error);
+
+	kfree(data);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(sev_snp_unsmash);
+
+int sev_guest_snp_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
+{
+	return sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, data, error);
+}
+EXPORT_SYMBOL_GPL(sev_guest_snp_dbg_decrypt);
+
 static void sev_exit(struct kref *ref)
 {
 	misc_deregister(&misc_dev->misc);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index ec45c18c3b0a..32532df37446 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -821,6 +821,80 @@ int sev_guest_df_flush(int *error);
  */
 int sev_guest_decommission(struct sev_data_decommission *data, int *error);
 
+/**
+ * sev_guest_snp_df_flush - perform SNP DF_FLUSH command
+ *
+ * @error: sev command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV    if the sev device is not available
+ * -%ENOTSUPP  if the sev does not support SEV
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO       if the sev returned a non-zero return code
+ */
+int sev_guest_snp_df_flush(int *error);
+
+/**
+ * sev_guest_snp_decommission - perform SNP_DECOMMISSION command
+ *
+ * @data: sev_data_snp_decommission structure to be processed
+ * @error: sev command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV    if the sev device is not available
+ * -%ENOTSUPP  if the sev does not support SEV
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO       if the sev returned a non-zero return code
+ */
+int sev_guest_snp_decommission(struct sev_data_snp_decommission *data, int *error);
+
+/**
+ * sev_snp_reclaim - perform SNP_PAGE_RECLAIM command
+ *
+ * @data: sev_data_snp_page_reclaim structure to be processed
+ * @error: sev command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV    if the sev device is not available
+ * -%ENOTSUPP  if the sev does not support SEV
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO       if the sev returned a non-zero return code
+ */
+int sev_snp_reclaim(struct sev_data_snp_page_reclaim *data, int *error);
+
+/**
+ * sev_snp_unsmash - perform SNP_PAGE_UNSMASH command
+ *
+ * @paddr: physical address of the page to be unsmashed
+ * @error: sev command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV    if the sev device is not available
+ * -%ENOTSUPP  if the sev does not support SEV
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO       if the sev returned a non-zero return code
+ */
+int sev_snp_unsmash(unsigned long paddr, int *error);
+
+/**
+ * sev_guest_snp_dbg_decrypt - perform SEV SNP_DBG_DECRYPT command
+ *
+ * @data: sev_data_snp_dbg structure to be processed
+ * @error: sev command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV    if the sev device is not available
+ * -%ENOTSUPP  if the sev does not support SEV
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO       if the sev returned a non-zero return code
+ */
+int sev_guest_snp_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);
+
 void *psp_copy_user_blob(u64 uaddr, u32 len);
 
 #else	/* !CONFIG_CRYPTO_DEV_SP_PSP */
@@ -848,6 +922,17 @@ sev_issue_cmd_external_user(struct file *filep, unsigned int id, void *data, int
 
 static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_PTR(-EINVAL); }
 
+static inline int
+sev_guest_snp_decommission(struct sev_data_snp_decommission *data, int *error) { return -ENODEV; }
+
+static inline int sev_guest_snp_df_flush(int *error) { return -ENODEV; }
+
+static inline int sev_snp_reclaim(struct sev_data_snp_page_reclaim *data, int *error) { return -ENODEV; }
+
+static inline int sev_snp_unsmash(unsigned long paddr, int *error) { return -ENODEV; }
+
+static inline int sev_guest_snp_dbg_decrypt(struct sev_data_snp_dbg *data, int *error) { return -ENODEV; }
+
 #endif	/* CONFIG_CRYPTO_DEV_SP_PSP */
 
 #endif	/* __PSP_SEV_H__ */
-- 
2.17.1



* [RFC Part2 PATCH 12/30] crypto ccp: handle the legacy SEV command when SNP is enabled
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (10 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 11/30] crypto:ccp: provide APIs to issue SEV-SNP commands Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 13/30] KVM: SVM: add initial SEV-SNP support Brijesh Singh
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

The behavior of the SEV-legacy commands is altered when the SNP firmware
is in the INIT state. When SNP is in INIT state, any memory that an
SEV-legacy command causes the firmware to write must be in the firmware
state before the command is issued.

See SEV-SNP spec section 5.3.7 for more detail.
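
The intermediate-page handling can be pictured with a small userspace
model. This is only a sketch under stated assumptions, not the driver
code: the model_* names, the two-state page enum and the 0xA5 marker are
hypothetical stand-ins for the RMP state change and the firmware write.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical page states, standing in for RMP entries. */
enum model_page_state { MODEL_PAGE_HYPERVISOR, MODEL_PAGE_FIRMWARE };

struct model_page {
	enum model_page_state state;
	unsigned char bytes[MODEL_PAGE_SIZE];
};

/*
 * Stand-in for the firmware. With SNP initialized it refuses to write to
 * a buffer that is not in the firmware state; otherwise it fills the
 * buffer with a marker byte.
 */
static int model_firmware_write(unsigned char *buf, size_t len,
				enum model_page_state state, bool snp_inited)
{
	if (snp_inited && state != MODEL_PAGE_FIRMWARE)
		return -1;
	memset(buf, 0xA5, len);
	return 0;
}

/* Issue a "command" whose buffer is written by the model firmware. */
static int model_do_cmd(unsigned char *data, size_t len, bool snp_inited)
{
	struct model_page *page;
	int ret;

	if (!data || len > MODEL_PAGE_SIZE)
		return -1;

	/* Caller buffers are always hypervisor-owned in this model. */
	if (!snp_inited)
		return model_firmware_write(data, len, MODEL_PAGE_HYPERVISOR, false);

	page = calloc(1, sizeof(*page));
	if (!page)
		return -1;

	/* Copy in, then move the page to the firmware state. */
	memcpy(page->bytes, data, len);
	page->state = MODEL_PAGE_FIRMWARE;

	ret = model_firmware_write(page->bytes, len, page->state, true);

	/* Move the page back to the hypervisor state, then copy out. */
	page->state = MODEL_PAGE_HYPERVISOR;
	memcpy(data, page->bytes, len);

	free(page);
	return ret;
}
```

In the model, a direct firmware write to a hypervisor-owned buffer fails
once SNP is initialized, while the same request routed through
model_do_cmd() succeeds and the caller sees the result in its own buffer.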

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 90 +++++++++++++++++++++++++++++++++---
 drivers/crypto/ccp/sev-dev.h |  1 +
 2 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 242c4775eb56..4aa9d4505d71 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -148,12 +148,35 @@ static int sev_cmd_buffer_len(int cmd)
 	return 0;
 }
 
+static bool sev_legacy_cmd_buf_writable(int cmd)
+{
+	switch (cmd) {
+	case SEV_CMD_PLATFORM_STATUS:
+	case SEV_CMD_GUEST_STATUS:
+	case SEV_CMD_LAUNCH_START:
+	case SEV_CMD_RECEIVE_START:
+	case SEV_CMD_LAUNCH_MEASURE:
+	case SEV_CMD_SEND_START:
+	case SEV_CMD_SEND_UPDATE_DATA:
+	case SEV_CMD_SEND_UPDATE_VMSA:
+	case SEV_CMD_PEK_CSR:
+	case SEV_CMD_PDH_CERT_EXPORT:
+	case SEV_CMD_GET_ID:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
 {
+	size_t cmd_buf_len = sev_cmd_buffer_len(cmd);
 	struct psp_device *psp = psp_master;
 	struct sev_device *sev;
 	unsigned int phys_lsb, phys_msb;
 	unsigned int reg, ret = 0;
+	struct page *cmd_page = NULL;
+	struct rmpupdate e = {};
 
 	if (!psp || !psp->sev_data)
 		return -ENODEV;
@@ -163,15 +186,47 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
 
 	sev = psp->sev_data;
 
+	/*
+	 * Check if SNP is initialized and we are asked to execute a legacy
+	 * command that requires the firmware to write to the command buffer.
+	 * In that case use an intermediate command buffer page to complete the
+	 * operation.
+	 *
+	 * NOTE: If the command buffer contains a pointer which will be modified
+	 * by the firmware then the caller must take care of it.
+	 */
+	if (sev->snp_inited && sev_legacy_cmd_buf_writable(cmd)) {
+		cmd_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+		if (!cmd_page)
+			return -ENOMEM;
+
+		memcpy(page_address(cmd_page), data, cmd_buf_len);
+
+		/* Mark it as a firmware page */
+		e.immutable = true;
+		e.assigned = true;
+		ret = rmptable_rmpupdate(cmd_page, &e);
+		if (ret) {
+			dev_err(sev->dev, "sev cmd id %#x, failed to change to firmware state (spa 0x%lx ret %d).\n",
+				cmd, page_to_pfn(cmd_page) << PAGE_SHIFT, ret);
+			goto e_free;
+		}
+	}
+
 	/* Get the physical address of the command buffer */
-	phys_lsb = data ? lower_32_bits(__psp_pa(data)) : 0;
-	phys_msb = data ? upper_32_bits(__psp_pa(data)) : 0;
+	if (cmd_page) {
+		phys_lsb = data ? lower_32_bits(__sme_page_pa(cmd_page)) : 0;
+		phys_msb = data ? upper_32_bits(__sme_page_pa(cmd_page)) : 0;
+	} else {
+		phys_lsb = data ? lower_32_bits(__psp_pa(data)) : 0;
+		phys_msb = data ? upper_32_bits(__psp_pa(data)) : 0;
+	}
 
 	dev_dbg(sev->dev, "sev command id %#x buffer 0x%08x%08x timeout %us\n",
 		cmd, phys_msb, phys_lsb, psp_timeout);
 
 	print_hex_dump_debug("(in):  ", DUMP_PREFIX_OFFSET, 16, 2, data,
-			     sev_cmd_buffer_len(cmd), false);
+			     cmd_buf_len, false);
 
 	iowrite32(phys_lsb, sev->io_regs + sev->vdata->cmdbuff_addr_lo_reg);
 	iowrite32(phys_msb, sev->io_regs + sev->vdata->cmdbuff_addr_hi_reg);
@@ -185,6 +240,24 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
 
 	/* wait for command completion */
 	ret = sev_wait_cmd_ioc(sev, &reg, psp_timeout);
+
+	/* if an intermediate page is used then copy the data back to original. */
+	if (cmd_page) {
+		int rc;
+
+		/* Mark it as a hypervisor page */
+		memset(&e, 0, sizeof(struct rmpupdate));
+		rc = rmptable_rmpupdate(cmd_page, &e);
+		if (rc) {
+			dev_err(sev->dev, "sev cmd id %#x, failed to change to hypervisor state ret=%d.\n",
+				cmd, rc);
+			/* Set the command page to NULL so that the page is leaked. */
+			cmd_page = NULL;
+		} else {
+			memcpy(data, page_address(cmd_page), cmd_buf_len);
+		}
+	}
+
 	if (ret) {
 		if (psp_ret)
 			*psp_ret = 0;
@@ -192,7 +265,7 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
 		dev_err(sev->dev, "sev command %#x timed out, disabling PSP\n", cmd);
 		psp_dead = true;
 
-		return ret;
+		goto e_free;
 	}
 
 	psp_timeout = psp_cmd_timeout;
@@ -207,8 +280,11 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
 	}
 
 	print_hex_dump_debug("(out): ", DUMP_PREFIX_OFFSET, 16, 2, data,
-			     sev_cmd_buffer_len(cmd), false);
+			     cmd_buf_len, false);
 
+e_free:
+	if (cmd_page)
+		__free_page(cmd_page);
 	return ret;
 }
 
@@ -234,7 +310,7 @@ static int __sev_platform_init_locked(int *error)
 
 	sev = psp->sev_data;
 
-	if (sev->state == SEV_STATE_INIT)
+	if (sev->legacy_inited && (sev->state == SEV_STATE_INIT))
 		return 0;
 
 	if (sev_es_tmr) {
@@ -255,6 +331,7 @@ static int __sev_platform_init_locked(int *error)
 	if (rc)
 		return rc;
 
+	sev->legacy_inited = true;
 	sev->state = SEV_STATE_INIT;
 
 	/* Prepare for first SEV guest launch after INIT */
@@ -289,6 +366,7 @@ static int __sev_platform_shutdown_locked(int *error)
 	if (ret)
 		return ret;
 
+	sev->legacy_inited = false;
 	sev->state = SEV_STATE_UNINIT;
 	dev_dbg(sev->dev, "SEV firmware shutdown\n");
 
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index 18b116a817ff..2ee9665a901d 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -54,6 +54,7 @@ struct sev_device {
 	u8 build;
 
 	bool snp_inited;
+	bool legacy_inited;
 };
 
 int sev_dev_init(struct psp_device *psp);
-- 
2.17.1



* [RFC Part2 PATCH 13/30] KVM: SVM: add initial SEV-SNP support
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (11 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 12/30] crypto ccp: handle the legacy SEV command when SNP is enabled Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 14/30] KVM: SVM: make AVIC backing, VMSA and VMCB memory allocation SNP safe Brijesh Singh
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

The next generation of SEV is called SEV-SNP (Secure Nested Paging).
SEV-SNP builds upon existing SEV and SEV-ES functionality while adding new
hardware based security protection. SEV-SNP adds strong memory encryption
integrity protection to help prevent malicious hypervisor-based attacks
such as data replay, memory re-mapping, and more, to create an isolated
execution environment.

The SNP feature can be enabled in KVM by passing the sev_snp module
parameter.
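
The gating added by this patch can be summarized with a small userspace
model. It is a sketch only: struct model_host and the model_* helpers are
hypothetical, and the dependency on SEV-ES being enabled first is an
assumption based on where the new checks sit in sev_hardware_setup().

```c
#include <stdbool.h>

/* Hypothetical snapshot of the inputs consulted during hardware setup. */
struct model_host {
	bool sev_es_supported;	/* SEV-ES was enabled (assumed prerequisite) */
	bool snp_requested;	/* value of the sev_snp module parameter */
	bool cpu_has_snp;	/* X86_FEATURE_SEV_SNP is set */
	bool snp_key_active;	/* result of snp_key_active() */
};

/* SNP is reported as supported only if every check passes. */
static bool model_snp_supported(const struct model_host *h)
{
	if (!h->sev_es_supported)
		return false;
	if (!h->snp_requested)
		return false;
	if (!h->cpu_has_snp)
		return false;
	return h->snp_key_active;
}

/* Mirrors sev_snp_guest(): an SNP guest is an SEV-ES guest with SNP active. */
static bool model_snp_guest(bool es_guest, bool snp_active)
{
	return es_guest && snp_active;
}
```

Any single failing check leaves sev_snp cleared, so the rest of KVM can
test one flag instead of repeating the individual conditions.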

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c | 17 +++++++++++++++++
 arch/x86/kvm/svm/svm.c |  5 +++++
 arch/x86/kvm/svm/svm.h | 13 +++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 48017fef1cd9..b720837faf5a 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -19,6 +19,7 @@
 #include <asm/fpu/internal.h>
 
 #include <asm/trapnr.h>
+#include <asm/sev-snp.h>
 
 #include "x86.h"
 #include "svm.h"
@@ -1249,6 +1250,7 @@ void sev_vm_destroy(struct kvm *kvm)
 void __init sev_hardware_setup(void)
 {
 	unsigned int eax, ebx, ecx, edx;
+	bool sev_snp_supported = false;
 	bool sev_es_supported = false;
 	bool sev_supported = false;
 
@@ -1298,9 +1300,24 @@ void __init sev_hardware_setup(void)
 	pr_info("SEV-ES supported: %u ASIDs\n", min_sev_asid - 1);
 	sev_es_supported = true;
 
+	/* SEV-SNP support requested? */
+	if (!sev_snp)
+		goto out;
+
+	/* Does the CPU support SEV-SNP? */
+	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
+		goto out;
+
+	if (!snp_key_active())
+		goto out;
+
+	pr_info("SEV-SNP supported: %u ASIDs\n", min_sev_asid - 1);
+	sev_snp_supported = true;
+
 out:
 	sev = sev_supported;
 	sev_es = sev_es_supported;
+	sev_snp = sev_snp_supported;
 }
 
 void sev_hardware_teardown(void)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3442d44ca53b..aa7ff4685c87 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -197,6 +197,10 @@ module_param(sev, int, 0444);
 int sev_es = IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT);
 module_param(sev_es, int, 0444);
 
+/* enable/disable SEV-SNP support */
+int sev_snp = IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT);
+module_param(sev_snp, int, 0444);
+
 bool __read_mostly dump_invalid_vmcb;
 module_param(dump_invalid_vmcb, bool, 0644);
 
@@ -986,6 +990,7 @@ static __init int svm_hardware_setup(void)
 	} else {
 		sev = false;
 		sev_es = false;
+		sev_snp = false;
 	}
 
 	svm_adjust_mmio_mask();
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 6e7d070f8b86..3dd60d2a567a 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -73,6 +73,7 @@ enum {
 struct kvm_sev_info {
 	bool active;		/* SEV enabled guest */
 	bool es_active;		/* SEV-ES enabled guest */
+	bool snp_active;	/* SEV-SNP enabled guest */
 	unsigned int asid;	/* ASID used for this guest */
 	unsigned int handle;	/* SEV firmware handle */
 	int fd;			/* SEV device fd */
@@ -241,6 +242,17 @@ static inline bool sev_es_guest(struct kvm *kvm)
 #endif
 }
 
+static inline bool sev_snp_guest(struct kvm *kvm)
+{
+#ifdef CONFIG_KVM_AMD_SEV
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+
+	return sev_es_guest(kvm) && sev->snp_active;
+#else
+	return false;
+#endif
+}
+
 static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
 {
 	vmcb->control.clean = 0;
@@ -407,6 +419,7 @@ static inline bool gif_set(struct vcpu_svm *svm)
 
 extern int sev;
 extern int sev_es;
+extern int sev_snp;
 extern bool dump_invalid_vmcb;
 
 u32 svm_msrpm_offset(u32 msr);
-- 
2.17.1



* [RFC Part2 PATCH 14/30] KVM: SVM: make AVIC backing, VMSA and VMCB memory allocation SNP safe
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (12 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 13/30] KVM: SVM: add initial SEV-SNP support Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 15/30] KVM: SVM: define new SEV_FEATURES field in the VMCB Save State Area Brijesh Singh
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Marc Orr

When SEV-SNP is globally enabled on a system, the VMRUN instruction
performs additional security checks on the AVIC backing, VMSA, and VMCB
pages. On a successful VMRUN, these pages are marked "in-use" by the
hardware in the RMP entry, and any attempt to modify the RMP entry for
these pages will result in a page fault (RMP violation check).

While performing the RMP check, hardware will try to create a 2MB TLB
entry for the large page accesses. When it does this, it first reads
the RMP for the base of 2MB region and verifies that all this memory is
safe. If the AVIC backing, VMSA, or VMCB memory happens to be the base of
a 2MB region, then the RMP check will fail because of the "in-use" marking
for the base entry of this 2MB region.

e.g.

1. A VMCB was allocated on a 2MB-aligned address.
2. The VMRUN instruction marks this RMP entry as "in-use".
3. Another process allocated some other page of memory that happened to be
   within the same 2MB region.
4. That process tried to write its page using physmap.

If the physmap entry in step #4 uses a large (1G/2M) page, then the
hardware will attempt to create a 2M TLB entry. The hardware will find
that the "in-use" bit is set in the RMP entry (because it was a
VMCB page) and will cause an RMP violation check.

See APM2 section 15.36.12 for more information on VMRUN checks when
SEV-SNP is globally active.

A generic allocator can return a page which is 2M aligned, and such a
page is not safe to use when SEV-SNP is globally enabled. Add a
snp_safe_alloc_page() helper that can be used for allocating
SNP-safe memory. The helper makes an order-1 (two-page) allocation and
splits it into two order-0 pages. It then frees one page and keeps the
page which is not 2M aligned.
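
The page selection can be sketched in plain C as follows. This is a
userspace illustration, not the helper itself: the model_* names and
constants are hypothetical, and it assumes 4K pages and that an order-1
allocation starts on an even page frame number.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT	12
#define MODEL_PMD_SIZE		(UINT64_C(1) << 21)	/* 2MB */

/* True if the page frame starts on a 2MB boundary. */
static bool model_pfn_is_2m_aligned(uint64_t pfn)
{
	return ((pfn << MODEL_PAGE_SHIFT) & (MODEL_PMD_SIZE - 1)) == 0;
}

/*
 * Given the first pfn of a two-page allocation, return the pfn to keep
 * and store the pfn to be freed in *to_free.
 */
static uint64_t model_pick_snp_safe_pfn(uint64_t first_pfn, uint64_t *to_free)
{
	if (model_pfn_is_2m_aligned(first_pfn)) {
		/* The first page is 2MB aligned: keep the second one. */
		*to_free = first_pfn;
		return first_pfn + 1;
	}

	/* The first page is already safe: give the second one back. */
	*to_free = first_pfn + 1;
	return first_pfn;
}
```

Because the two pages are adjacent, at most one of them can sit on a 2MB
boundary, so one of the pair is always usable. The cost is a temporary
two-page allocation for every page handed out.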

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Co-developed-by: Marc Orr <marcorr@google.com>
Signed-off-by: Marc Orr <marcorr@google.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/lapic.c            |  5 ++++-
 arch/x86/kvm/svm/sev.c          | 27 +++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          | 16 ++++++++++++++--
 arch/x86/kvm/svm/svm.h          |  1 +
 5 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3d6616f6f6ef..ccd5f8090ff6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1301,6 +1301,7 @@ struct kvm_x86_ops {
 	int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
 
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 43cceadd073e..7e0151838273 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2425,7 +2425,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
 
 	vcpu->arch.apic = apic;
 
-	apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+	if (kvm_x86_ops.alloc_apic_backing_page)
+		apic->regs = kvm_x86_ops.alloc_apic_backing_page(vcpu);
+	else
+		apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 	if (!apic->regs) {
 		printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
 		       vcpu->vcpu_id);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index b720837faf5a..4d5be5d2b05c 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2079,3 +2079,30 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
 	 */
 	ghcb_set_sw_exit_info_2(svm->ghcb, 1);
 }
+
+struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
+{
+	unsigned long pfn;
+	struct page *p;
+
+	if (!snp_key_active())
+		return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+
+	p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
+	if (!p)
+		return NULL;
+
+	/* split the page order */
+	split_page(p, 1);
+
+	/* Find a non-2M aligned page */
+	pfn = page_to_pfn(p);
+	if (IS_ALIGNED(__pfn_to_phys(pfn), PMD_SIZE)) {
+		pfn++;
+		__free_page(p);
+	} else {
+		__free_page(pfn_to_page(pfn + 1));
+	}
+
+	return pfn_to_page(pfn);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index aa7ff4685c87..13df2cbfc361 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1324,7 +1324,7 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
 	svm = to_svm(vcpu);
 
 	err = -ENOMEM;
-	vmcb_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	vmcb_page = snp_safe_alloc_page(vcpu);
 	if (!vmcb_page)
 		goto out;
 
@@ -1333,7 +1333,7 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
 		 * SEV-ES guests require a separate VMSA page used to contain
 		 * the encrypted register state of the guest.
 		 */
-		vmsa_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+		vmsa_page = snp_safe_alloc_page(vcpu);
 		if (!vmsa_page)
 			goto error_free_vmcb_page;
 
@@ -4423,6 +4423,16 @@ static int svm_vm_init(struct kvm *kvm)
 	return 0;
 }
 
+static void *svm_alloc_apic_backing_page(struct kvm_vcpu *vcpu)
+{
+	struct page *page = snp_safe_alloc_page(vcpu);
+
+	if (!page)
+		return NULL;
+
+	return page_address(page);
+}
+
 static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.hardware_unsetup = svm_hardware_teardown,
 	.hardware_enable = svm_hardware_enable,
@@ -4546,6 +4556,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.complete_emulated_msr = svm_complete_emulated_msr,
 
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
+
+	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 3dd60d2a567a..9e8cd39bd703 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -603,6 +603,7 @@ void sev_es_create_vcpu(struct vcpu_svm *svm);
 void sev_es_vcpu_load(struct vcpu_svm *svm, int cpu);
 void sev_es_vcpu_put(struct vcpu_svm *svm);
 void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
+struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
 
 /* vmenter.S */
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [RFC Part2 PATCH 15/30] KVM: SVM: define new SEV_FEATURES field in the VMCB Save State Area
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (13 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 14/30] KVM: SVM: make AVIC backing, VMSA and VMCB memory allocation SNP safe Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 16/30] KVM: SVM: add KVM_SNP_INIT command Brijesh Singh
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

The hypervisor uses the SEV_FEATURES field (offset 3B0h) in the Save State
Area to control SEV-SNP guest features such as SNPActive, vTOM and
ReflectVC. An SEV-SNP guest can read the SEV_FEATURES field through
the SEV_STATUS MSR.

See APM2 Table 15-34 and B-4 for more details.
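
As a minimal sketch of how such a feature bitmap is set and tested, the
snippet below reuses the bit positions from the defines added by this
patch. The demo_* names and the one-field structure are illustrative
stand-ins, not kernel types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions copied from the SVM_SEV_FEATURES_* defines in this patch. */
#define SEV_FEATURES_SNP_ACTIVE	(UINT64_C(1) << 0)
#define SEV_FEATURES_VTOM	(UINT64_C(1) << 1)
#define SEV_FEATURES_REFLECT_VC	(UINT64_C(1) << 2)

/*
 * The patch carves the new u64 out of a 56-byte reserved area, so the
 * remaining reserved bytes must shrink to 48 to keep later offsets fixed.
 */
_Static_assert(sizeof(uint64_t) + 48 == 56, "save area layout must not move");

/* Hypothetical stand-in holding only the field this example touches. */
struct demo_save_area {
	uint64_t sev_features;
};

/* Hypervisor side: mark the guest as SNP-active. */
static void demo_enable_snp(struct demo_save_area *save)
{
	save->sev_features |= SEV_FEATURES_SNP_ACTIVE;
}

/* True only if every bit in 'feature' is set. */
static bool demo_feature_enabled(const struct demo_save_area *save,
				 uint64_t feature)
{
	return (save->sev_features & feature) == feature;
}
```

After demo_enable_snp() only bit 0 is set, so a query for vTOM or
ReflectVC still reports the feature as disabled.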

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/svm.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 1c561945b426..c38783a1d24f 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -212,6 +212,15 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define SVM_NESTED_CTL_SEV_ENABLE	BIT(1)
 #define SVM_NESTED_CTL_SEV_ES_ENABLE	BIT(2)
 
+#define SVM_SEV_FEATURES_SNP_ACTIVE		BIT(0)
+#define SVM_SEV_FEATURES_VTOM			BIT(1)
+#define SVM_SEV_FEATURES_REFLECT_VC		BIT(2)
+#define SVM_SEV_FEATURES_RESTRICTED_INJECTION	BIT(3)
+#define SVM_SEV_FEATURES_ALTERNATE_INJECTION	BIT(4)
+#define SVM_SEV_FEATURES_DEBUG_SWAP		BIT(5)
+#define SVM_SEV_FEATURES_PREVENT_HOST_IBS	BIT(6)
+#define SVM_SEV_FEATURES_BTB_ISOLATION		BIT(7)
+
 struct vmcb_seg {
 	u16 selector;
 	u16 attrib;
@@ -293,7 +302,8 @@ struct vmcb_save_area {
 	u64 sw_exit_info_1;
 	u64 sw_exit_info_2;
 	u64 sw_scratch;
-	u8 reserved_11[56];
+	u64 sev_features;
+	u8 reserved_11[48];
 	u64 xcr0;
 	u8 valid_bitmap[16];
 	u64 x87_state_gpa;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [RFC Part2 PATCH 16/30] KVM: SVM: add KVM_SNP_INIT command
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (14 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 15/30] KVM: SVM: define new SEV_FEATURES field in the VMCB Save State Area Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 17/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_START command Brijesh Singh
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

The KVM_SEV_SNP_INIT command is used by the hypervisor to initialize the
SEV-SNP platform context. In a typical workflow, this command should be the
first command issued. When creating an SEV-SNP guest, the VMM must use this
command instead of KVM_SEV_INIT or KVM_SEV_ES_INIT.
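
The init path in this patch is layered: the SNP init calls the SEV-ES
init, which calls the plain SEV init and rolls back es_active on failure.
The following is a simplified model of that chaining, assuming stub
functions (all demo_* names and the 'fail' flag are hypothetical) in place
of the real firmware and ASID handling.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the flags kept in struct kvm_sev_info. */
struct demo_sev_info {
	bool active;
	bool es_active;
	bool snp_active;
};

/* Stand-in for sev_guest_init(); 'fail' simulates a firmware or ASID error. */
static int demo_sev_guest_init(struct demo_sev_info *sev, bool fail)
{
	if (fail)
		return -EIO;

	sev->active = true;
	return 0;
}

static int demo_sev_es_guest_init(struct demo_sev_info *sev, bool fail)
{
	int ret;

	/* Set first so the ASID would be taken from the SEV-ES range. */
	sev->es_active = true;

	ret = demo_sev_guest_init(sev, fail);
	if (ret)
		sev->es_active = false;	/* roll back on failure */

	return ret;
}

static int demo_sev_snp_guest_init(struct demo_sev_info *sev,
				   bool snp_supported, bool fail)
{
	int rc;

	if (!snp_supported)
		return -ENOTTY;

	rc = demo_sev_es_guest_init(sev, fail);
	if (rc)
		return rc;

	/* As in the patch, snp_active is set only after the ES init succeeds. */
	sev->snp_active = true;
	return 0;
}
```

A failed init leaves all three flags clear, so a later retry starts from a
clean state.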

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c   | 41 ++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/svm/svm.c   |  5 +++++
 arch/x86/kvm/svm/svm.h   |  1 +
 include/uapi/linux/kvm.h |  3 +++
 4 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 4d5be5d2b05c..36042a2b19b3 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -189,7 +189,10 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	if (asid < 0)
 		return ret;
 
-	ret = sev_platform_init(&argp->error);
+	if (sev->snp_active)
+		ret = sev_snp_init(&argp->error);
+	else
+		ret = sev_platform_init(&argp->error);
 	if (ret)
 		goto e_free;
 
@@ -206,12 +209,19 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
 
 static int sev_es_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
 {
+	int ret;
+
 	if (!sev_es)
 		return -ENOTTY;
 
+	/* Must be set so that sev_asid_new() allocates ASID from the ES ASID range. */
 	to_kvm_svm(kvm)->sev_info.es_active = true;
 
-	return sev_guest_init(kvm, argp);
+	ret = sev_guest_init(kvm, argp);
+	if (ret)
+		to_kvm_svm(kvm)->sev_info.es_active = false;
+
+	return ret;
 }
 
 static int sev_bind_asid(struct kvm *kvm, unsigned int handle, int *error)
@@ -1042,6 +1052,23 @@ static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	return ret;
 }
 
+static int sev_snp_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	int rc;
+
+	if (!sev_snp)
+		return -ENOTTY;
+
+	rc = sev_es_guest_init(kvm, argp);
+	if (rc)
+		return rc;
+
+	sev->snp_active = true;
+
+	return 0;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -1092,6 +1119,9 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_LAUNCH_SECRET:
 		r = sev_launch_secret(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_INIT:
+		r = sev_snp_guest_init(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -1955,6 +1985,13 @@ int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in)
 				    svm->ghcb_sa, svm->ghcb_sa_len, in);
 }
 
+void sev_snp_init_vmcb(struct vcpu_svm *svm)
+{
+	struct vmcb_save_area *save = &svm->vmcb->save;
+
+	save->sev_features |= SVM_SEV_FEATURES_SNP_ACTIVE;
+}
+
 void sev_es_init_vmcb(struct vcpu_svm *svm)
 {
 	struct kvm_vcpu *vcpu = &svm->vcpu;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 13df2cbfc361..72fc1bd8737c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1281,6 +1281,11 @@ static void init_vmcb(struct vcpu_svm *svm)
 			/* Perform SEV-ES specific VMCB updates */
 			sev_es_init_vmcb(svm);
 		}
+
+		if (sev_snp_guest(svm->vcpu.kvm)) {
+			/* Perform SEV-SNP specific VMCB Updates */
+			sev_snp_init_vmcb(svm);
+		}
 	}
 
 	vmcb_mark_all_dirty(svm->vmcb);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9e8cd39bd703..9d41735699c6 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -604,6 +604,7 @@ void sev_es_vcpu_load(struct vcpu_svm *svm, int cpu);
 void sev_es_vcpu_put(struct vcpu_svm *svm);
 void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
+void sev_snp_init_vmcb(struct vcpu_svm *svm);
 
 /* vmenter.S */
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 374c67875cdb..e0e7dd71a863 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1594,6 +1594,9 @@ enum sev_cmd_id {
 	/* Guest certificates commands */
 	KVM_SEV_CERT_EXPORT,
 
+	/* SNP specific commands */
+	KVM_SEV_SNP_INIT,
+
 	KVM_SEV_NR_MAX,
 };
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [RFC Part2 PATCH 17/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_START command
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (15 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 16/30] KVM: SVM: add KVM_SNP_INIT command Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 18/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_UPDATE command Brijesh Singh
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct the
measurement of the guest. If the guest is expected to be migrated, the
command also binds a migration agent (MA) to the guest.

For more information see the SEV-SNP spec sections 4.5 and 8.11.
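
The ASID binding step in this patch retries SNP_ACTIVATE exactly once if
the firmware reports that a DF_FLUSH is required. A minimal model of that
retry-once logic is sketched below; the demo_* names, the fake firmware
state and the error code values are assumptions made for illustration.

```c
#include <stdbool.h>

#define DEMO_RET_SUCCESS		0
#define DEMO_RET_DFFLUSH_REQUIRED	1	/* hypothetical value */

/* Fake firmware state used to drive the example. */
struct demo_fw {
	bool needs_flush;	/* ACTIVATE fails until a DF_FLUSH is done */
	bool sticky;		/* if set, DF_FLUSH does not clear needs_flush */
	int activate_calls;
	int flush_calls;
};

static int demo_activate(struct demo_fw *fw, int *error)
{
	fw->activate_calls++;
	if (fw->needs_flush) {
		*error = DEMO_RET_DFFLUSH_REQUIRED;
		return -1;
	}

	*error = DEMO_RET_SUCCESS;
	return 0;
}

static int demo_df_flush(struct demo_fw *fw, int *error)
{
	fw->flush_calls++;
	if (!fw->sticky)
		fw->needs_flush = false;

	*error = DEMO_RET_SUCCESS;
	return 0;
}

/* Activate, and on DFFLUSH_REQUIRED flush and try one more time. */
static int demo_bind_asid(struct demo_fw *fw, int *error)
{
	bool retried = false;
	int ret;

	for (;;) {
		ret = demo_activate(fw, error);
		if (!ret || *error != DEMO_RET_DFFLUSH_REQUIRED || retried)
			return ret;

		ret = demo_df_flush(fw, error);
		if (ret)
			return ret;

		retried = true;	/* only one retry */
	}
}
```

Bounding the loop to a single retry keeps a misbehaving firmware from
stalling the ioctl indefinitely.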

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c   | 221 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/svm/svm.h   |   1 +
 include/uapi/linux/kvm.h |   8 ++
 3 files changed, 229 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 36042a2b19b3..7652e57f7e01 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -37,6 +37,9 @@ static unsigned int min_sev_asid;
 static unsigned long *sev_asid_bitmap;
 static unsigned long *sev_reclaim_asid_bitmap;
 
+static void snp_free_context_page(struct page *page);
+static int snp_decommission_context(struct kvm *kvm);
+
 struct enc_region {
 	struct list_head list;
 	unsigned long npages;
@@ -1069,6 +1072,181 @@ static int sev_snp_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	return 0;
 }
 
+static int snp_page_reclaim(struct page *page, int rmppage_size)
+{
+	struct sev_data_snp_page_reclaim *data;
+	struct rmpupdate e = {};
+	int rc, error;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+	if (!data)
+		return -ENOMEM;
+
+	data->paddr = __sme_page_pa(page) | rmppage_size;
+	rc = sev_snp_reclaim(data, &error);
+	if (rc)
+		goto e_free;
+
+	rc = rmptable_rmpupdate(page, &e);
+
+e_free:
+	kfree(data);
+
+	return rc;
+}
+
+static void snp_free_context_page(struct page *page)
+{
+	/* Reclaim the page before changing the attribute */
+	if (snp_page_reclaim(page, RMP_PG_SIZE_4K)) {
+		pr_info("SEV-SNP: failed to reclaim page, leaking it.\n");
+		return;
+	}
+
+	__free_page(page);
+}
+
+static struct page *snp_alloc_context_page(void)
+{
+	struct rmpupdate val = {};
+	struct page *page = NULL;
+	int rc;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return NULL;
+
+	/* Transition the context page to the firmware state.*/
+	val.immutable = 1;
+	val.assigned = 1;
+	val.pagesize = RMP_PG_SIZE_4K;
+	rc = rmptable_rmpupdate(page, &val);
+	if (rc)
+		goto e_free;
+
+	return page;
+
+e_free:
+	__free_page(page);
+
+	return NULL;
+}
+
+static struct page *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct sev_data_snp_gctx_create *data;
+	struct page *context = NULL;
+	int rc;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+	if (!data)
+		return NULL;
+
+	/* Allocate memory for context page */
+	context = snp_alloc_context_page();
+	if (!context)
+		goto e_free;
+
+	data->gctx_paddr = __sme_page_pa(context);
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, data, &argp->error);
+	if (rc) {
+		snp_free_context_page(context);
+		context = NULL;
+	}
+
+e_free:
+	kfree(data);
+
+	return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_activate *data;
+	int asid = sev_get_asid(kvm);
+	int ret, retry_count = 0;
+
+	data = kmalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+	if (!data)
+		return -ENOMEM;
+
+	/* Activate ASID on the given context */
+	data->gctx_paddr = __sme_page_pa(sev->snp_context);
+	data->asid   = asid;
+again:
+	ret = sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, data, error);
+
+	/* Check if the DF_FLUSH is required, and try again */
+	if (ret && (*error == SEV_RET_DFFLUSH_REQUIRED) && (!retry_count)) {
+		/* Guard DEACTIVATE against WBINVD/DF_FLUSH used in ASID recycling */
+		down_read(&sev_deactivate_lock);
+		wbinvd_on_all_cpus();
+		ret = sev_guest_snp_df_flush(error);
+		up_read(&sev_deactivate_lock);
+
+		if (ret)
+			goto e_free;
+
+		/* only one retry */
+		retry_count = 1;
+
+		goto again;
+	}
+
+e_free:
+	kfree(data);
+
+	return ret;
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_start *start;
+	struct kvm_sev_snp_launch_start params;
+	int rc;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+		return -EFAULT;
+
+	/* Initialize the guest context */
+	sev->snp_context = snp_context_create(kvm, argp);
+	if (!sev->snp_context)
+		return -ENOTTY;
+
+	rc = -ENOMEM;
+	start = kzalloc(sizeof(*start), GFP_KERNEL_ACCOUNT);
+	if (!start)
+		goto e_free_context;
+
+	/* Issue the LAUNCH_START command */
+	start->gctx_paddr = __sme_page_pa(sev->snp_context);
+	start->policy = params.policy;
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, start, &argp->error);
+	if (rc)
+		goto e_free_context;
+
+	/* Bind ASID to this guest */
+	sev->fd = argp->sev_fd;
+	rc = snp_bind_asid(kvm, &argp->error);
+	if (rc)
+		goto e_free_context;
+
+	goto e_free_start;
+
+e_free_context:
+	snp_decommission_context(kvm);
+
+e_free_start:
+	kfree(start);
+
+	return rc;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -1122,6 +1300,9 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_SNP_INIT:
 		r = sev_snp_guest_init(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_START:
+		r = snp_launch_start(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -1241,6 +1422,36 @@ int svm_unregister_enc_region(struct kvm *kvm,
 	return ret;
 }
 
+static int snp_decommission_context(struct kvm *kvm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_decommission *data;
+	int ret;
+
+	/* If context is not created then do nothing */
+	if (!sev->snp_context)
+		return 0;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+	if (!data)
+		return -ENOMEM;
+
+	data->gctx_paddr = __sme_page_pa(sev->snp_context);
+	ret = sev_guest_snp_decommission(data, NULL);
+	if (ret) {
+		pr_err("SEV-SNP: failed to decommission context, leaking the context page\n");
+		return ret;
+	}
+
+	/* free the context page now */
+	snp_free_context_page(sev->snp_context);
+	sev->snp_context = NULL;
+
+	kfree(data);
+
+	return 0;
+}
+
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -1273,7 +1484,15 @@ void sev_vm_destroy(struct kvm *kvm)
 
 	mutex_unlock(&kvm->lock);
 
-	sev_unbind_asid(kvm, sev->handle);
+	if (sev_snp_guest(kvm)) {
+		if (snp_decommission_context(kvm)) {
+			pr_err("SEV-SNP: failed to free guest context, leaking asid!\n");
+			return;
+		}
+	} else {
+		sev_unbind_asid(kvm, sev->handle);
+	}
+
 	sev_asid_free(sev->asid);
 }
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9d41735699c6..97efdca498ed 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -80,6 +80,7 @@ struct kvm_sev_info {
 	unsigned long pages_locked; /* Number of pages locked */
 	struct list_head regions_list;  /* List of registered regions */
 	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
+	struct page *snp_context;      /* SNP guest context page */
 };
 
 struct kvm_svm {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e0e7dd71a863..84a242597d81 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1596,6 +1596,7 @@ enum sev_cmd_id {
 
 	/* SNP specific commands */
 	KVM_SEV_SNP_INIT,
+	KVM_SEV_SNP_LAUNCH_START,
 
 	KVM_SEV_NR_MAX,
 };
@@ -1648,6 +1649,13 @@ struct kvm_sev_dbg {
 	__u32 len;
 };
 
+struct kvm_sev_snp_launch_start {
+	__u64 policy;
+	__u64 ma_uaddr;
+	__u8 ma_en;
+	__u8 imi_en;
+};
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [RFC Part2 PATCH 18/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_UPDATE command
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (16 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 17/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_START command Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 19/30] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates Brijesh Singh
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into the
guest's memory. The data is encrypted with the cryptographic context
created by KVM_SEV_SNP_LAUNCH_START.

In addition to inserting data, it can insert two special pages into the
guest's memory: the secrets page and the CPUID page.

For more information see the SEV-SNP spec sections 4.5 and 8.12.
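
Because userspace passes a host virtual address, the command first has to
translate it to a guest physical address by finding the memslot that
contains it. The sketch below models that lookup outside the kernel; the
demo_* types are simplified stand-ins for struct kvm_memory_slot and the
4K page shift is assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* 4 KiB pages assumed */

/* Simplified stand-in for the memslot fields the lookup needs. */
struct demo_memslot {
	uint64_t base_gfn;		/* first guest frame number */
	uint64_t npages;		/* slot length in pages */
	uint64_t userspace_addr;	/* host virtual address of the slot */
};

/* Translate 'hva' to a GPA; returns false if no slot contains it. */
static bool demo_hva_to_gpa(const struct demo_memslot *slots, size_t nslots,
			    uint64_t hva, uint64_t *gpa)
{
	for (size_t i = 0; i < nslots; i++) {
		const struct demo_memslot *s = &slots[i];
		uint64_t size = s->npages << DEMO_PAGE_SHIFT;

		if (hva >= s->userspace_addr && hva - s->userspace_addr < size) {
			uint64_t offset = hva - s->userspace_addr;

			*gpa = (s->base_gfn << DEMO_PAGE_SHIFT) + offset;
			return true;
		}
	}

	return false;
}
```

For a slot with base_gfn 0x100 mapped at host address 0x7f0000000000, the
address 0x7f0000003abc translates to GPA 0x103abc.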

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c   | 136 +++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h |  18 ++++++
 2 files changed, 154 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 7652e57f7e01..1a0c8c95d178 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1247,6 +1247,139 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	return rc;
 }
 
+static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm, unsigned long hva)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *memslot;
+
+	kvm_for_each_memslot(memslot, slots) {
+		if (hva >= memslot->userspace_addr &&
+		    hva < memslot->userspace_addr + (memslot->npages << PAGE_SHIFT))
+			return memslot;
+	}
+
+	return NULL;
+}
+
+static bool hva_to_gpa(struct kvm *kvm, unsigned long hva, gpa_t *gpa)
+{
+	struct kvm_memory_slot *memslot;
+	gpa_t gpa_offset;
+
+	memslot = hva_to_memslot(kvm, hva);
+	if (!memslot)
+		return false;
+
+	gpa_offset = hva - memslot->userspace_addr;
+	*gpa = ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset);
+
+	return true;
+}
+
+static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	unsigned long npages, vaddr, vaddr_end, i, next_vaddr;
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_update *data;
+	struct kvm_sev_snp_launch_update params;
+	int *error = &argp->error;
+	struct kvm_vcpu *vcpu;
+	struct page **inpages;
+	struct rmpupdate e;
+	int ret;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (!sev->snp_context)
+		return -EINVAL;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+		return -EFAULT;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+	if (!data)
+		return -ENOMEM;
+
+	data->gctx_paddr = __sme_page_pa(sev->snp_context);
+	data->vmpl1_perms = 0xf;
+	data->vmpl2_perms = 0xf;
+	data->vmpl3_perms = 0xf;
+
+	/* Lock the user memory. */
+	inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);
+	if (!inpages) {
+		ret = -ENOMEM;
+		goto e_free;
+	}
+
+	vcpu = kvm_get_vcpu(kvm, 0);
+	vaddr = params.uaddr;
+	vaddr_end = vaddr + params.len;
+
+	for (i = 0; vaddr < vaddr_end; vaddr = next_vaddr, i++) {
+		unsigned long psize, pmask;
+		int level = PG_LEVEL_4K;
+		gpa_t gpa;
+
+		if (!hva_to_gpa(kvm, vaddr, &gpa)) {
+			ret = -EINVAL;
+			goto e_unpin;
+		}
+
+		psize = page_level_size(level);
+		pmask = page_level_mask(level);
+		gpa = gpa & pmask;
+
+		/* Transition the page state to pre-guest */
+		memset(&e, 0, sizeof(e));
+		e.assigned = 1;
+		e.gpa = gpa;
+		e.asid = sev_get_asid(kvm);
+		e.immutable = true;
+		e.pagesize = X86_RMP_PG_LEVEL(level);
+		ret = rmptable_rmpupdate(inpages[i], &e);
+		if (ret) {
+			ret = -EFAULT;
+			goto e_unpin;
+		}
+
+		data->address = __sme_page_pa(inpages[i]);
+		data->page_size = e.pagesize;
+		data->page_type = params.page_type;
+		ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, data, error);
+		if (ret) {
+			snp_page_reclaim(inpages[i], e.pagesize);
+			goto e_unpin;
+		}
+
+		next_vaddr = (vaddr & pmask) + psize;
+	}
+
+e_unpin:
+	/* Content of memory is updated, mark pages dirty */
+	memset(&e, 0, sizeof(e));
+	for (i = 0; i < npages; i++) {
+		set_page_dirty_lock(inpages[i]);
+		mark_page_accessed(inpages[i]);
+
+		/*
+		 * If it's an error, then update RMP entry to change page ownership
+		 * to the hypervisor.
+		 */
+		if (ret)
+			rmptable_rmpupdate(inpages[i], &e);
+	}
+
+	/* Unlock the user pages */
+	sev_unpin_memory(kvm, inpages, npages);
+
+e_free:
+	kfree(data);
+
+	return ret;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -1303,6 +1436,9 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_SNP_LAUNCH_START:
 		r = snp_launch_start(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_UPDATE:
+		r = snp_launch_update(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 84a242597d81..a9f7aa9e412d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1597,6 +1597,7 @@ enum sev_cmd_id {
 	/* SNP specific commands */
 	KVM_SEV_SNP_INIT,
 	KVM_SEV_SNP_LAUNCH_START,
+	KVM_SEV_SNP_LAUNCH_UPDATE,
 
 	KVM_SEV_NR_MAX,
 };
@@ -1656,6 +1657,23 @@ struct kvm_sev_snp_launch_start {
 	__u8 imi_en;
 };
 
+#define KVM_SEV_SNP_PAGE_TYPE_NORMAL		0x1
+#define KVM_SEV_SNP_PAGE_TYPE_VMSA		0x2
+#define KVM_SEV_SNP_PAGE_TYPE_ZERO		0x3
+#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED	0x4
+#define KVM_SEV_SNP_PAGE_TYPE_SECRETS		0x5
+#define KVM_SEV_SNP_PAGE_TYPE_CPUID		0x6
+
+struct kvm_sev_snp_launch_update {
+	__u64 uaddr;
+	__u32 len;
+	__u8 imi_page;
+	__u8 page_type;
+	__u8 vmpl3_perms;
+	__u8 vmpl2_perms;
+	__u8 vmpl1_perms;
+};
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [RFC Part2 PATCH 19/30] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (17 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 18/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_UPDATE command Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 20/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_FINISH command Brijesh Singh
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

The guest pages of an SEV-SNP VM may be added as private pages in the
RMP table (assigned bit is set). While terminating the guest we must
unassign those pages so that they are transitioned to the hypervisor
state before they can be freed.
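
When a page is covered by a 2MB RMP entry, the update has to target the
first pfn of that 2MB range rather than the 4K page itself. A small sketch
of that rounding follows; the demo_* names and enum values are
illustrative and not the kernel's RMP_PG_SIZE_* definitions.

```c
#include <stdint.h>

#define DEMO_PAGES_PER_2M 512	/* 2 MiB / 4 KiB */

/* Hypothetical size tags for an RMP entry. */
enum demo_rmp_size {
	DEMO_RMP_4K,
	DEMO_RMP_2M,
};

/* Return the pfn at which the RMP entry covering 'pfn' starts. */
static uint64_t demo_rmp_base_pfn(uint64_t pfn, enum demo_rmp_size size)
{
	if (size == DEMO_RMP_2M)
		return pfn & ~(uint64_t)(DEMO_PAGES_PER_2M - 1);

	return pfn;
}
```

For example, pfn 0x12345 inside a 2MB entry rounds down to 0x12200, while
the same pfn under a 4K entry is used unchanged.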

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 1a0c8c95d178..4037430b8d56 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1517,6 +1517,47 @@ find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
 static void __unregister_enc_region_locked(struct kvm *kvm,
 					   struct enc_region *region)
 {
+	struct rmpupdate val = {};
+	unsigned long i, pfn;
+	rmpentry_t *e;
+	int level, rc;
+
+	/*
+	 * On SEV-SNP, the guest memory pages are assigned in the RMP table. Unassign them
+	 * before releasing the memory.
+	 */
+	if (sev_snp_guest(kvm)) {
+		for (i = 0; i < region->npages; i++) {
+			pfn = page_to_pfn(region->pages[i]);
+
+			if (need_resched())
+				schedule();
+
+			e = lookup_page_in_rmptable(region->pages[i], &level);
+			if (!e) {
+				pr_err("SEV-SNP: failed to read RMP entry (pfn 0x%lx)\n", pfn);
+				continue;
+			}
+
+			/* If it's not a guest assigned page then skip it */
+			if (!rmpentry_assigned(e))
+				continue;
+
+			/* Is the page part of a 2MB RMP entry? */
+			if (level == PG_LEVEL_2M) {
+				val.pagesize = RMP_PG_SIZE_2M;
+				pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+			} else {
+				val.pagesize = RMP_PG_SIZE_4K;
+			}
+
+			/* Transition the page to hypervisor owned. */
+			rc = rmptable_rmpupdate(pfn_to_page(pfn), &val);
+			if (rc)
+				pr_err("SEV-SNP: failed to release pfn 0x%lx ret=%d\n", pfn, rc);
+		}
+	}
+
 	sev_unpin_memory(kvm, region->pages, region->npages);
 	list_del(&region->list);
 	kfree(region);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [RFC Part2 PATCH 20/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_FINISH command
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (18 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 19/30] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 21/30] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP Brijesh Singh
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

The KVM_SEV_SNP_LAUNCH_FINISH command finalizes the cryptographic digest and
stores it as the measurement of the guest at launch.

While finalizing the launch flow, it also issues the LAUNCH_UPDATE command
to encrypt the VMSA pages.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c   | 131 +++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h |  13 ++++
 2 files changed, 144 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 4037430b8d56..810fd2b8a9ff 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1380,6 +1380,117 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	return ret;
 }
 
+static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_update *data;
+	int i, ret = 0;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+	if (!data)
+		return -ENOMEM;
+
+	data->gctx_paddr = __sme_page_pa(sev->snp_context);
+	data->page_type = SNP_PAGE_TYPE_VMSA;
+
+	for (i = 0; i < kvm->created_vcpus; i++) {
+		struct vcpu_svm *svm = to_svm(kvm->vcpus[i]);
+		struct rmpupdate e = {};
+
+		/* Perform some pre-encryption checks against the VMSA */
+		ret = sev_es_sync_vmsa(svm);
+		if (ret)
+			goto e_free;
+
+		/* Transition the VMSA page to a firmware state. */
+		e.assigned = 1;
+		e.immutable = 1;
+		e.asid = sev->asid;
+		e.gpa = -1;
+		e.pagesize = RMP_PG_SIZE_4K;
+		ret = rmptable_rmpupdate(virt_to_page(svm->vmsa), &e);
+		if (ret)
+			goto e_free;
+
+		/* Issue the SNP command to encrypt the VMSA */
+		data->address = __sme_pa(svm->vmsa);
+		ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, data, &argp->error);
+		if (ret) {
+			snp_page_reclaim(virt_to_page(svm->vmsa), RMP_PG_SIZE_4K);
+			goto e_free;
+		}
+
+		svm->vcpu.arch.guest_state_protected = true;
+	}
+
+e_free:
+	kfree(data);
+
+	return ret;
+}
+
+static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_finish *data;
+	void *id_block = NULL, *id_auth = NULL;
+	struct kvm_sev_snp_launch_finish params;
+	int ret;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (!sev->snp_context)
+		return -EINVAL;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+		return -EFAULT;
+
+	/* Measure all vCPUs using LAUNCH_UPDATE before we finalize the launch flow. */
+	ret = snp_launch_update_vmsa(kvm, argp);
+	if (ret)
+		return ret;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+	if (!data)
+		return -ENOMEM;
+
+	if (params.id_block_en) {
+		id_block = psp_copy_user_blob(params.id_block_uaddr, KVM_SEV_SNP_ID_BLOCK_SIZE);
+		if (IS_ERR(id_block)) {
+			ret = PTR_ERR(id_block);
+			goto e_free;
+		}
+
+		data->id_block_en = 1;
+		data->id_block_paddr = __sme_pa(id_block);
+	}
+
+	if (params.auth_key_en) {
+		id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
+		if (IS_ERR(id_auth)) {
+			ret = PTR_ERR(id_auth);
+			goto e_free_id_block;
+		}
+
+		data->auth_key_en = 1;
+		data->id_auth_paddr = __sme_pa(id_auth);
+	}
+
+	data->gctx_paddr = __sme_page_pa(sev->snp_context);
+	ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
+
+	kfree(id_auth);
+
+e_free_id_block:
+	kfree(id_block);
+
+e_free:
+	kfree(data);
+
+	return ret;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -1439,6 +1550,9 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_SNP_LAUNCH_UPDATE:
 		r = snp_launch_update(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_FINISH:
+		r = snp_launch_finish(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -1820,6 +1934,23 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
 
 	if (vcpu->arch.guest_state_protected)
 		sev_flush_guest_memory(svm, svm->vmsa, PAGE_SIZE);
+
+	/*
+	 * If it's an SNP guest, the VMSA was added to the RMP table as a guest-owned page.
+	 * Transition the page to the hypervisor state before releasing it back to the system.
+	 */
+	if (sev_snp_guest(vcpu->kvm)) {
+		struct rmpupdate e = {};
+		int rc;
+
+		rc = rmptable_rmpupdate(virt_to_page(svm->vmsa), &e);
+		if (rc) {
+			pr_err("SEV-SNP: failed to clear RMP entry error %d, leaking the page\n",
+				rc);
+			return;
+		}
+	}
+
 	__free_page(virt_to_page(svm->vmsa));
 
 	if (svm->ghcb_sa_free)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a9f7aa9e412d..bfd5340e153d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1598,6 +1598,7 @@ enum sev_cmd_id {
 	KVM_SEV_SNP_INIT,
 	KVM_SEV_SNP_LAUNCH_START,
 	KVM_SEV_SNP_LAUNCH_UPDATE,
+	KVM_SEV_SNP_LAUNCH_FINISH,
 
 	KVM_SEV_NR_MAX,
 };
@@ -1609,6 +1610,18 @@ struct kvm_sev_cmd {
 	__u32 sev_fd;
 };
 
+#define KVM_SEV_SNP_ID_BLOCK_SIZE	96
+#define KVM_SEV_SNP_ID_AUTH_SIZE	4096
+#define KVM_SEV_SNP_FINISH_DATA_SIZE	32
+
+struct kvm_sev_snp_launch_finish {
+	__u64 id_block_uaddr;
+	__u64 id_auth_uaddr;
+	__u8 id_block_en;
+	__u8 auth_key_en;
+	__u8 host_data[KVM_SEV_SNP_FINISH_DATA_SIZE];
+};
+
 struct kvm_sev_launch_start {
 	__u32 handle;
 	__u32 policy;
-- 
2.17.1


* [RFC Part2 PATCH 21/30] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (19 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 20/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_FINISH command Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 22/30] x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV Brijesh Singh
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

When running an SEV-SNP VM, the sPA used to index the RMP entry is
obtained through the TDP translation (gva->gpa->spa). The TDP page
level is checked against the page level programmed in the RMP entry.
If the page level does not match, then it will cause a nested page
fault with the RMP bit set to indicate the RMP violation.

To resolve the fault, we must match the page levels between the TDP
and RMP entry. Add a new kvm_x86_op (get_tdp_max_page_level) that
can be used to query the current RMP page size. The page fault
handler will call the architecture code to get the maximum allowed
page level for the GPA and limit the TDP page level.
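
The level limiting described above boils down to taking the smaller of the
RMP page level and the caller's maximum. A minimal standalone sketch
follows; limit_tdp_level() is a hypothetical stand-in for the new hook, and
the PG_LEVEL_* values are copied locally to mirror the x86 enum pg_level.

```c
#include <assert.h>

/* Local copies mirroring the x86 enum pg_level values. */
enum { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2, PG_LEVEL_1G = 3 };

/*
 * Hypothetical stand-in for the new hook. rmp_level is the page level
 * recorded in the RMP entry, or 0 when no entry could be looked up.
 */
static int limit_tdp_level(int rmp_level, int max_level)
{
	assert(max_level >= PG_LEVEL_4K);

	/* No RMP entry: keep the caller's limit unchanged. */
	if (rmp_level < PG_LEVEL_4K)
		return max_level;

	/* Never map the GPA with a larger page than the RMP entry allows. */
	return rmp_level < max_level ? rmp_level : max_level;
}
```

With this rule, a GPA whose RMP entry is 4K is always mapped at 4K in the
TDP, even when the fault handler would otherwise allow a 2M or 1G mapping.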

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu/mmu.c          |  6 ++++--
 arch/x86/kvm/svm/sev.c          | 20 ++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  1 +
 arch/x86/kvm/svm/svm.h          |  1 +
 arch/x86/kvm/vmx/vmx.c          |  8 ++++++++
 6 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ccd5f8090ff6..93dc4f232964 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1302,6 +1302,7 @@ struct kvm_x86_ops {
 
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
+	int (*get_tdp_max_page_level)(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d16481aa29d..e55df7b4e297 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3747,11 +3747,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
 				u32 error_code, bool prefault)
 {
+	int max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, PG_LEVEL_2M);
+
 	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
 
 	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
 	return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code, prefault,
-				 PG_LEVEL_2M, false);
+				 max_level, false);
 }
 
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
@@ -3792,7 +3794,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 {
 	int max_level;
 
-	for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
+	for (max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, KVM_MAX_HUGEPAGE_LEVEL);
 	     max_level > PG_LEVEL_4K;
 	     max_level--) {
 		int page_num = KVM_PAGES_PER_HPAGE(max_level);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 810fd2b8a9ff..e66be4d305b9 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2670,3 +2670,23 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return pfn_to_page(pfn);
 }
+
+int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
+{
+	rmpentry_t *e;
+	kvm_pfn_t pfn;
+	int level;
+
+	if (!sev_snp_guest(vcpu->kvm))
+		return max_level;
+
+	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
+	if (is_error_noslot_pfn(pfn))
+		return max_level;
+
+	e = lookup_page_in_rmptable(pfn_to_page(pfn), &level);
+	if (unlikely(!e))
+		return max_level;
+
+	return min_t(uint32_t, level, max_level);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 72fc1bd8737c..73259a3564eb 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4563,6 +4563,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
 
 	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
+	.get_tdp_max_page_level = sev_get_tdp_max_page_level,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 97efdca498ed..9b095f8fc0cf 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -606,6 +606,7 @@ void sev_es_vcpu_put(struct vcpu_svm *svm);
 void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
 void sev_snp_init_vmcb(struct vcpu_svm *svm);
+int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
 
 /* vmenter.S */
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index eb69fef57485..ca0c1c1fbf92 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7587,6 +7587,12 @@ static int vmx_cpu_dirty_log_size(void)
 	return enable_pml ? PML_ENTITY_NUM : 0;
 }
 
+
+static int vmx_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
+{
+	return max_level;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.hardware_unsetup = hardware_unsetup,
 
@@ -7720,6 +7726,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
 
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+	.get_tdp_max_page_level = vmx_get_tdp_max_page_level,
 };
 
 static __init int hardware_setup(void)
-- 
2.17.1


* [RFC Part2 PATCH 22/30] x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (20 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 21/30] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 23/30] KVM: X86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use Brijesh Singh
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

Introduce a helper to directly fault in a TDP page without going
through the full page fault path. This allows SEV-SNP to build
the nested page table while handling the page state change VMGEXIT.
A guest may issue a page state change VMGEXIT before accessing the
page. By faulting in the page, we can get the TDP page level and PFN,
which are used to calculate the RMP page size.

An SEV-SNP guest issues the page state change VMGEXIT followed by
PVALIDATE. If the page is not present in the TDP, PVALIDATE causes a
nested page fault. Building the TDP while handling the page state change
VMGEXIT also avoids that nested page fault.
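
The helper's retry behaviour can be modelled outside the kernel. The sketch
below is illustrative only: the RET_PF_* values are local placeholders,
map_with_retry() is a hypothetical name, and the fault handler is a stub
that reports a retry a fixed number of times before succeeding.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder result codes for this sketch; not taken from kernel headers. */
enum { RET_PF_RETRY = 0, RET_PF_FIXED = 1 };

typedef int (*fault_fn)(void *ctx);

/*
 * Keep calling the fault handler while it asks for a retry. The inputs
 * are fixed by the caller, so retrying with the same arguments is safe.
 */
static int map_with_retry(fault_fn fault, void *ctx)
{
	int r;

	assert(fault != NULL);
	do {
		r = fault(ctx);
	} while (r == RET_PF_RETRY);

	return r;
}

/* Stub handler: reports RETRY until the counter in ctx reaches zero. */
static int flaky_fault(void *ctx)
{
	int *retries_left = ctx;

	if (*retries_left > 0) {
		(*retries_left)--;
		return RET_PF_RETRY;
	}

	return RET_PF_FIXED;
}
```

The caller only ever sees a final result, which matches the expectation
that this path fails if and only if the fault cannot be fixed.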

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/mmu.h     |  2 ++
 arch/x86/kvm/mmu/mmu.c | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 261be1d2032b..70dce26a5882 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -109,6 +109,8 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		       bool prefault);
 
+int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, int max_level);
+
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 					u32 err, bool prefault)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e55df7b4e297..33104943904b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3808,6 +3808,26 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 				 max_level, true);
 }
 
+int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, int max_level)
+{
+	int r;
+
+	/*
+	 * Loop on the page fault path to handle the case where an mmu_notifier
+	 * invalidation triggers RET_PF_RETRY.  In the normal page fault path,
+	 * KVM needs to resume the guest in case the invalidation changed any
+	 * of the page fault properties, i.e. the gpa or error code.  For this
+	 * path, the gpa and error code are fixed by the caller, and the caller
+	 * expects failure if and only if the page fault can't be fixed.
+	 */
+	do {
+		r = direct_page_fault(vcpu, gpa, error_code, false, max_level, true);
+	} while (r == RET_PF_RETRY);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
 static void nonpaging_init_context(struct kvm_vcpu *vcpu,
 				   struct kvm_mmu *context)
 {
-- 
2.17.1


* [RFC Part2 PATCH 23/30] KVM: X86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (21 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 22/30] x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 24/30] KVM: X86: define new RMP check related #NPF error bits Brijesh Singh
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

SEV-SNP VMs may issue the page state change VMGEXIT to add a GPA
as private or shared in the RMP table. The page state change VMGEXIT
contains the RMP page level to be used in the RMP entry. If the
page levels in the TDP and RMP do not match, the access results in a
nested page fault (RMP violation).

The SEV-SNP VMGEXIT handler will use kvm_mmu_get_tdp_walk() to get
the current page level in the TDP for the given GPA and calculate a
workable page level. If a GPA is mapped as a 4K page in the TDP, but
the guest requested to add the GPA as a 2M page in the RMP entry, the
2M request will be broken into 4K pages to keep the RMP and TDP
page levels in sync.
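
To make the level arithmetic concrete, here is a small standalone sketch.
All helper names are hypothetical. It assumes 9 address bits per paging
level and that the PFN stored in a huge-page leaf is aligned to the size of
that leaf, so the PFN of an individual 4K page is the base PFN plus the
GFN's offset within the leaf.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies mirroring the x86 enum pg_level values. */
enum { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2, PG_LEVEL_1G = 3 };

/* Number of 4K pages covered by one mapping at the given level. */
static uint64_t pages_per_level(int level)
{
	assert(level >= PG_LEVEL_4K && level <= PG_LEVEL_1G);
	return 1ULL << ((level - 1) * 9);
}

/* The RMP entry should not use a larger page than the TDP mapping. */
static int workable_rmp_level(int tdp_level, int requested_level)
{
	return requested_level < tdp_level ? requested_level : tdp_level;
}

/* PFN of the 4K page backing gfn, given the aligned base PFN of the leaf. */
static uint64_t leaf_pfn(uint64_t base_pfn, uint64_t gfn, int level)
{
	return base_pfn | (gfn & (pages_per_level(level) - 1));
}
```

For example, a 2M request against a GPA that the TDP maps at 4K yields a
workable level of 4K, so the request is carried out as 512 separate 4K
RMP updates.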

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/mmu.h     |  1 +
 arch/x86/kvm/mmu/mmu.c | 29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 70dce26a5882..e7c4e55215bf 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -110,6 +110,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		       bool prefault);
 
 int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, int max_level);
+bool kvm_mmu_get_tdp_walk(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level);
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 					u32 err, bool prefault)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 33104943904b..147f22bda6e7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3828,6 +3828,35 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, int m
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
 
+bool kvm_mmu_get_tdp_walk(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level)
+{
+	u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
+	int leaf, root;
+
+	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+		leaf = kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, &root);
+	else
+		leaf = get_walk(vcpu, gpa, sptes, &root);
+
+	if (unlikely(leaf < 0))
+		return false;
+
+	/* Check if the leaf SPTE is present */
+	if (!is_shadow_present_pte(sptes[leaf]))
+		return false;
+
+	*pfn = spte_to_pfn(sptes[leaf]);
+	if (leaf > PG_LEVEL_4K) {
+		u64 page_mask = KVM_PAGES_PER_HPAGE(leaf) - 1;
+		*pfn |= (gpa_to_gfn(gpa) & page_mask);
+	}
+
+	*level = leaf;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_get_tdp_walk);
+
 static void nonpaging_init_context(struct kvm_vcpu *vcpu,
 				   struct kvm_mmu *context)
 {
-- 
2.17.1


* [RFC Part2 PATCH 24/30] KVM: X86: define new RMP check related #NPF error bits
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (22 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 23/30] KVM: X86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 25/30] KVM: X86: update page-fault trace to log the 64-bit error code Brijesh Singh
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

When SEV-SNP is enabled globally, the hardware places restrictions on all
memory accesses based on the RMP entry, regardless of whether the
hypervisor or a VM performs the access. When hardware encounters an RMP
access violation during a guest access, it will cause a #VMEXIT(NPF).

See APM2 section 16.36.10 for more details.
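
A short sketch of how a handler might test the new bits. The masks are
local copies of the definitions added by this patch, the helper names are
hypothetical, and the reading of SIZEM as "page-size mismatch" is an
assumption based on the bit name and the cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the bit definitions added by this patch. */
#define PFERR_GUEST_RMP_MASK	(1ULL << 31)
#define PFERR_GUEST_ENC_MASK	(1ULL << 34)
#define PFERR_GUEST_SIZEM_MASK	(1ULL << 35)
#define PFERR_GUEST_VMPL_MASK	(1ULL << 36)

/* The detail bits live above bit 31, so a u32 error code cannot hold them. */
static_assert(PFERR_GUEST_SIZEM_MASK > UINT32_MAX,
	      "SIZEM lives in the upper 32 bits");

/* True if the #NPF was caused by an RMP check failure. */
static bool npf_is_rmp_violation(uint64_t error_code)
{
	return (error_code & PFERR_GUEST_RMP_MASK) != 0;
}

/* True if the RMP violation is flagged as a page-size mismatch. */
static bool npf_is_rmp_size_mismatch(uint64_t error_code)
{
	const uint64_t mask = PFERR_GUEST_RMP_MASK | PFERR_GUEST_SIZEM_MASK;

	return (error_code & mask) == mask;
}
```

Truncating the error code to 32 bits keeps the RMP bit but drops the SIZEM,
ENC and VMPL bits, which is why the full 64-bit value has to be carried
through the fault path.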

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 93dc4f232964..074605408970 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -216,8 +216,12 @@ enum x86_intercept_stage;
 #define PFERR_RSVD_BIT 3
 #define PFERR_FETCH_BIT 4
 #define PFERR_PK_BIT 5
+#define PFERR_GUEST_RMP_BIT 31
 #define PFERR_GUEST_FINAL_BIT 32
 #define PFERR_GUEST_PAGE_BIT 33
+#define PFERR_GUEST_ENC_BIT 34
+#define PFERR_GUEST_SIZEM_BIT 35
+#define PFERR_GUEST_VMPL_BIT 36
 
 #define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
 #define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
@@ -227,6 +231,10 @@ enum x86_intercept_stage;
 #define PFERR_PK_MASK (1U << PFERR_PK_BIT)
 #define PFERR_GUEST_FINAL_MASK (1ULL << PFERR_GUEST_FINAL_BIT)
 #define PFERR_GUEST_PAGE_MASK (1ULL << PFERR_GUEST_PAGE_BIT)
+#define PFERR_GUEST_RMP_MASK (1ULL << PFERR_GUEST_RMP_BIT)
+#define PFERR_GUEST_ENC_MASK (1ULL << PFERR_GUEST_ENC_BIT)
+#define PFERR_GUEST_SIZEM_MASK (1ULL << PFERR_GUEST_SIZEM_BIT)
+#define PFERR_GUEST_VMPL_MASK (1ULL << PFERR_GUEST_VMPL_BIT)
 
 #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK |	\
 				 PFERR_WRITE_MASK |		\
-- 
2.17.1


* [RFC Part2 PATCH 25/30] KVM: X86: update page-fault trace to log the 64-bit error code
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (23 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 24/30] KVM: X86: define new RMP check related #NPF error bits Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 26/30] KVM: SVM: add support to handle GHCB GPA register VMGEXIT Brijesh Singh
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

The page-fault error code is a 64-bit value, but the trace prints only
the lower 32 bits. Some of the SEV-SNP RMP fault error codes are
reported in the upper 32 bits.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/trace.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 2de30c20bc26..16236a8b42eb 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -329,12 +329,12 @@ TRACE_EVENT(kvm_inj_exception,
  * Tracepoint for page fault.
  */
 TRACE_EVENT(kvm_page_fault,
-	TP_PROTO(unsigned long fault_address, unsigned int error_code),
+	TP_PROTO(unsigned long fault_address, u64 error_code),
 	TP_ARGS(fault_address, error_code),
 
 	TP_STRUCT__entry(
 		__field(	unsigned long,	fault_address	)
-		__field(	unsigned int,	error_code	)
+		__field(	u64,		error_code	)
 	),
 
 	TP_fast_assign(
@@ -342,7 +342,7 @@ TRACE_EVENT(kvm_page_fault,
 		__entry->error_code	= error_code;
 	),
 
-	TP_printk("address %lx error_code %x",
+	TP_printk("address %lx error_code %llx",
 		  __entry->fault_address, __entry->error_code)
 );
 
-- 
2.17.1


* [RFC Part2 PATCH 26/30] KVM: SVM: add support to handle GHCB GPA register VMGEXIT
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (24 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 25/30] KVM: X86: update page-fault trace to log the 64-bit error code Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 27/30] KVM: SVM: add support to handle MSR based Page State Change VMGEXIT Brijesh Singh
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

SEV-SNP guests are required to perform a GHCB GPA registration (see
section 2.5.2 in GHCB specification). Before using a GHCB GPA for a vCPU
for the first time, a guest must register the vCPU GHCB GPA. If the
hypervisor can work with the guest-requested GPA, it must respond with
the same GPA; otherwise it must return -1.

On every VMGEXIT, we verify that the GHCB GPA matches the registered value.
If a mismatch is detected then abort the guest.
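
The MSR encoding used for the registration can be sketched as follows. The
constants are local copies of the definitions added by this patch plus a
12-bit info mask (an assumption about the existing GHCB_MSR_INFO_MASK),
and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the MSR protocol definitions used by this patch. */
#define GHCB_MSR_INFO_MASK			0xfffULL
#define GHCB_MSR_GHCB_GPA_REGISTER_REQ		0x012ULL
#define GHCB_MSR_GHCB_GPA_REGISTER_RESP		0x013ULL
#define GHCB_MSR_GHCB_GPA_REGISTER_VALUE_POS	12
#define GHCB_MSR_GHCB_GPA_REGISTER_VALUE_MASK	0xfffffffffffffULL
#define GHCB_MSR_GHCB_GPA_REGISTER_ERROR	0xfffffffffffffULL

/* Guest side: encode a registration request for a page-aligned GHCB GPA. */
static uint64_t ghcb_gpa_register_req(uint64_t gpa)
{
	assert((gpa & 0xfff) == 0);
	return ((gpa >> 12) << GHCB_MSR_GHCB_GPA_REGISTER_VALUE_POS) |
	       GHCB_MSR_GHCB_GPA_REGISTER_REQ;
}

/*
 * Hypervisor side: echo the GFN back when the GPA is acceptable,
 * otherwise return the all-ones error value in the GFN field.
 */
static uint64_t ghcb_gpa_register_resp(uint64_t req, int accepted)
{
	uint64_t gfn = (req >> GHCB_MSR_GHCB_GPA_REGISTER_VALUE_POS) &
		       GHCB_MSR_GHCB_GPA_REGISTER_VALUE_MASK;

	assert((req & GHCB_MSR_INFO_MASK) == GHCB_MSR_GHCB_GPA_REGISTER_REQ);

	if (!accepted)
		gfn = GHCB_MSR_GHCB_GPA_REGISTER_ERROR;

	return (gfn << GHCB_MSR_GHCB_GPA_REGISTER_VALUE_POS) |
	       GHCB_MSR_GHCB_GPA_REGISTER_RESP;
}
```

The guest compares the GFN field of the response with the one it sent; an
all-ones field means the hypervisor rejected the registration.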

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c | 28 ++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h | 15 +++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e66be4d305b9..7c242c470eba 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2378,6 +2378,28 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				  GHCB_MSR_INFO_POS);
 		break;
 	}
+	case GHCB_MSR_GHCB_GPA_REGISTER_REQ: {
+		kvm_pfn_t pfn;
+		u64 gfn;
+
+		gfn = get_ghcb_msr_bits(svm,
+					GHCB_MSR_GHCB_GPA_REGISTER_VALUE_MASK,
+					GHCB_MSR_GHCB_GPA_REGISTER_VALUE_POS);
+
+		pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);
+		if (is_error_noslot_pfn(pfn))
+			gfn = GHCB_MSR_GHCB_GPA_REGISTER_ERROR;
+		else
+			svm->ghcb_registered_gpa = gfn_to_gpa(gfn);
+
+		set_ghcb_msr_bits(svm, gfn,
+				  GHCB_MSR_GHCB_GPA_REGISTER_VALUE_MASK,
+				  GHCB_MSR_GHCB_GPA_REGISTER_VALUE_POS);
+		set_ghcb_msr_bits(svm, GHCB_MSR_GHCB_GPA_REGISTER_RESP,
+				  GHCB_MSR_INFO_MASK,
+				  GHCB_MSR_INFO_POS);
+		break;
+	}
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
@@ -2418,6 +2440,12 @@ int sev_handle_vmgexit(struct vcpu_svm *svm)
 		return -EINVAL;
 	}
 
+	/* An SEV-SNP guest must register the GHCB GPA before using it. */
+	if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) {
+		vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa);
+		return -EINVAL;
+	}
+
 	if (kvm_vcpu_map(&svm->vcpu, ghcb_gpa >> PAGE_SHIFT, &svm->ghcb_map)) {
 		/* Unable to map GHCB from guest */
 		vcpu_unimpl(&svm->vcpu, "vmgexit: error mapping GHCB [%#llx] from guest\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9b095f8fc0cf..0de7c77b0d59 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -194,6 +194,8 @@ struct vcpu_svm {
 	u64 ghcb_sa_len;
 	bool ghcb_sa_sync;
 	bool ghcb_sa_free;
+
+	u64 ghcb_registered_gpa;
 };
 
 struct svm_cpu_data {
@@ -254,6 +256,13 @@ static inline bool sev_snp_guest(struct kvm *kvm)
 #endif
 }
 
+#define GHCB_GPA_INVALID	0xffffffffffffffff
+
+static inline bool ghcb_gpa_is_registered(struct vcpu_svm *svm, u64 val)
+{
+	return svm->ghcb_registered_gpa == val;
+}
+
 static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
 {
 	vmcb->control.clean = 0;
@@ -574,6 +583,12 @@ void svm_vcpu_unblocking(struct kvm_vcpu *vcpu);
 #define GHCB_MSR_CPUID_REG_POS		30
 #define GHCB_MSR_CPUID_REG_MASK		0x3
 
+#define GHCB_MSR_GHCB_GPA_REGISTER_REQ		0x012
+#define GHCB_MSR_GHCB_GPA_REGISTER_VALUE_POS	12
+#define GHCB_MSR_GHCB_GPA_REGISTER_VALUE_MASK	0xfffffffffffff
+#define GHCB_MSR_GHCB_GPA_REGISTER_RESP		0x013
+#define GHCB_MSR_GHCB_GPA_REGISTER_ERROR	0xfffffffffffff
+
 #define GHCB_MSR_TERM_REQ		0x100
 #define GHCB_MSR_TERM_REASON_SET_POS	12
 #define GHCB_MSR_TERM_REASON_SET_MASK	0xf
-- 
2.17.1



* [RFC Part2 PATCH 27/30] KVM: SVM: add support to handle MSR based Page State Change VMGEXIT
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (25 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 26/30] KVM: SVM: add support to handle GHCB GPA register VMGEXIT Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 28/30] KVM: SVM: add support to handle " Brijesh Singh
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change MSR protocol
as defined in the GHCB specification section 2.5.1.
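
As a rough illustration of the MSR layout described above, the sketch below
decodes a request and builds a response in plain userspace C. The positions
and masks are copied from the svm.h hunk of this patch. The helper names
(psc_req_gfn(), psc_req_op(), psc_make_resp()) and the assumption that the
GHCB info code sits in bits 11:0 of the MSR value are illustrative only and
are not part of the series.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the svm.h hunk of this patch. */
#define PSC_REQ		0x014ULL
#define PSC_RESP	0x015ULL
#define PSC_GFN_POS	12
#define PSC_GFN_MASK	0xffffffffffULL
#define PSC_OP_POS	52
#define PSC_OP_MASK	0xfULL
#define PSC_ERROR_POS	32
#define PSC_ERROR_MASK	0xffffffffULL

/* Assumption: the GHCB info code occupies bits 11:0 of the MSR value. */
#define INFO_MASK	0xfffULL

/* Hypothetical helper: extract the guest frame number from a request. */
static inline uint64_t psc_req_gfn(uint64_t msr)
{
	/* The caller is expected to have matched the info code already. */
	assert((msr & INFO_MASK) == PSC_REQ);
	return (msr >> PSC_GFN_POS) & PSC_GFN_MASK;
}

/* Hypothetical helper: extract the requested operation (private/shared). */
static inline uint64_t psc_req_op(uint64_t msr)
{
	assert((msr & INFO_MASK) == PSC_REQ);
	return (msr >> PSC_OP_POS) & PSC_OP_MASK;
}

/*
 * Hypothetical helper: build the response value. The error code goes in
 * bits 63:32, bits 31:12 are reserved and left as zero, and the response
 * info code goes in the low bits.
 */
static inline uint64_t psc_make_resp(uint32_t error)
{
	return (((uint64_t)error & PSC_ERROR_MASK) << PSC_ERROR_POS) | PSC_RESP;
}
```

With this layout, a request for GFN 0x12345 with operation 1 decodes back to
those two values, and a successful response is simply 0x015.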

Before changing the page state in the RMP entry, we look up the page in
the TDP to make sure that there is a valid mapping for it. If the mapping
exists, then try to find a workable page level between the TDP and RMP for
the page. If the page is not mapped in the TDP, then create a fault such
that it gets mapped before we change the page state in the RMP entry.
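
The level selection described above can be sketched in userspace as follows.
This is only a model of the loop in __snp_handle_page_state_change(): the
requested level is clamped to the level of the backing TDP mapping, and the
walk advances by the size of the level actually used. The tdp_level_fn
callback and count_rmp_updates() are hypothetical stand-ins; the real code
walks the nested page table and issues RMPUPDATE for each step.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the kernel's enum pg_level values for 4K and 2M. */
enum { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2 };

static inline uint64_t level_size(int level)
{
	assert(level == PG_LEVEL_4K || level == PG_LEVEL_2M);
	return level == PG_LEVEL_2M ? (2ULL << 20) : (4ULL << 10);
}

/* Hypothetical stand-in for the TDP walk: mapping level backing a gpa. */
typedef int (*tdp_level_fn)(uint64_t gpa);

/*
 * Count how many RMP updates are needed to cover one request of the
 * given level, never using a level higher than the backing TDP mapping.
 */
static unsigned int count_rmp_updates(uint64_t gpa, int level,
				      tdp_level_fn tdp_level)
{
	uint64_t end = gpa + level_size(level);
	unsigned int updates = 0;

	while (gpa < end) {
		int tdp = tdp_level(gpa);

		/* Do not go higher than the backing page level. */
		if (tdp < level)
			level = tdp;

		updates++;
		gpa += level_size(level);
	}

	return updates;
}

/* Two trivial mappings used to exercise the model. */
static int tdp_always_4k(uint64_t gpa) { (void)gpa; return PG_LEVEL_4K; }
static int tdp_always_2m(uint64_t gpa) { (void)gpa; return PG_LEVEL_2M; }
```

In this model a 2M request backed by a 2M mapping needs a single update,
while the same request backed by 4K mappings is processed as 512 separate
4K updates.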

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c | 148 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h |  11 +++
 2 files changed, 159 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 7c242c470eba..8f046b45c424 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -25,6 +25,7 @@
 #include "svm.h"
 #include "cpuid.h"
 #include "trace.h"
+#include "mmu.h"
 
 #define __ex(x) __kvm_handle_fault_on_reboot(x)
 
@@ -2322,6 +2323,128 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(struct kvm_vcpu *vcpu, kvm_pfn_t pfn)
+{
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	/* Split the 2MB-page RMP entry into a corresponding set of contiguous 4KB-page RMP entries */
+	return rmptable_psmash(pfn_to_page(pfn));
+}
+
+static int snp_make_page_shared(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn, int level)
+{
+	struct rmpupdate val;
+	int rc, rmp_level;
+	rmpentry_t *e;
+
+	e = lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
+	if (!e)
+		return -EINVAL;
+
+	if (!rmpentry_assigned(e))
+		return 0;
+
+	/* Log if the entry is validated */
+	if (rmpentry_validated(e))
+		pr_debug_ratelimited("Remove RMP entry for a validated gpa 0x%llx\n", gpa);
+
+	/*
+	 * If the page is part of an existing 2M RMP entry, split that entry into 4K entries
+	 * before making the memory shared.
+	 */
+	if ((level == PG_LEVEL_4K) && (rmp_level == PG_LEVEL_2M)) {
+		rc = snp_rmptable_psmash(vcpu, pfn);
+		if (rc)
+			return rc;
+	}
+
+	memset(&val, 0, sizeof(val));
+	val.pagesize = X86_RMP_PG_LEVEL(level);
+	return rmptable_rmpupdate(pfn_to_page(pfn), &val);
+}
+
+static int snp_make_page_private(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn, int level)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
+	struct rmpupdate val;
+	int rmp_level;
+	rmpentry_t *e;
+
+	e = lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
+	if (!e)
+		return -EINVAL;
+
+	/* Log if the entry is validated */
+	if (rmpentry_validated(e))
+		pr_err_ratelimited("Asked to make a pre-validated gpa %llx private\n", gpa);
+
+	memset(&val, 0, sizeof(val));
+	val.gpa = gpa;
+	val.asid = sev->asid;
+	val.pagesize = X86_RMP_PG_LEVEL(level);
+	val.assigned = true;
+
+	return rmptable_rmpupdate(pfn_to_page(pfn), &val);
+}
+
+static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level)
+{
+	struct kvm *kvm = vcpu->kvm;
+	gpa_t end, next_gpa;
+	int rc, tdp_level;
+	kvm_pfn_t pfn;
+
+	end = gpa + page_level_size(level);
+
+	for (; end > gpa; gpa = next_gpa) {
+		/*
+		 * Get the pfn and level for the gpa from the nested page table.
+		 *
+		 * If the TDP walk failed, then it's safe to say that we don't have a valid
+		 * mapping for the gpa in the nested page table. Create a fault to map the
+		 * page in the nested page table.
+		 */
+		if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) {
+			pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_USER_MASK, level);
+			if (is_error_noslot_pfn(pfn))
+				goto out;
+
+			if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level))
+				goto out;
+		}
+
+		/* Adjust the level so that we don't go higher than the backing page level */
+		level = min_t(size_t, level, tdp_level);
+
+		spin_lock(&kvm->mmu_lock);
+
+		switch (op) {
+		case SNP_PAGE_STATE_SHARED:
+			rc = snp_make_page_shared(vcpu, gpa, pfn, level);
+			break;
+		case SNP_PAGE_STATE_PRIVATE:
+			rc = snp_make_page_private(vcpu, gpa, pfn, level);
+			break;
+		default:
+			rc = -EINVAL;
+			break;
+		}
+
+		spin_unlock(&kvm->mmu_lock);
+
+		if (rc) {
+			pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
+					   op, gpa, pfn, level, rc);
+			goto out;
+		}
+
+		next_gpa = gpa + page_level_size(level);
+	}
+
+out:
+	return rc;
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -2400,6 +2523,31 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				  GHCB_MSR_INFO_POS);
 		break;
 	}
+	case GHCB_MSR_PAGE_STATE_CHANGE_REQ: {
+		gfn_t gfn;
+		int ret;
+		u8 op;
+
+		gfn = get_ghcb_msr_bits(svm,
+					GHCB_MSR_PAGE_STATE_CHANGE_GFN_MASK,
+					GHCB_MSR_PAGE_STATE_CHANGE_GFN_POS);
+		op = get_ghcb_msr_bits(svm,
+					GHCB_MSR_PAGE_STATE_CHANGE_OP_MASK,
+					GHCB_MSR_PAGE_STATE_CHANGE_OP_POS);
+
+		ret = __snp_handle_page_state_change(vcpu, op, gfn_to_gpa(gfn), PG_LEVEL_4K);
+
+		set_ghcb_msr_bits(svm, ret,
+				  GHCB_MSR_PAGE_STATE_CHANGE_ERROR_MASK,
+				  GHCB_MSR_PAGE_STATE_CHANGE_ERROR_POS);
+		set_ghcb_msr_bits(svm, 0,
+				  GHCB_MSR_PAGE_STATE_CHANGE_RSVD_MASK,
+				  GHCB_MSR_PAGE_STATE_CHANGE_RSVD_POS);
+		set_ghcb_msr_bits(svm, GHCB_MSR_PAGE_STATE_CHANGE_RESP,
+				  GHCB_MSR_INFO_MASK,
+				  GHCB_MSR_INFO_POS);
+		break;
+	}
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 0de7c77b0d59..31bc9cc12c44 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -589,6 +589,17 @@ void svm_vcpu_unblocking(struct kvm_vcpu *vcpu);
 #define GHCB_MSR_GHCB_GPA_REGISTER_RESP		0x013
 #define GHCB_MSR_GHCB_GPA_REGISTER_ERROR	0xfffffffffffff
 
+#define GHCB_MSR_PAGE_STATE_CHANGE_REQ		0x014
+#define	GHCB_MSR_PAGE_STATE_CHANGE_GFN_POS	12
+#define	GHCB_MSR_PAGE_STATE_CHANGE_GFN_MASK	0xffffffffff
+#define	GHCB_MSR_PAGE_STATE_CHANGE_OP_POS	52
+#define	GHCB_MSR_PAGE_STATE_CHANGE_OP_MASK	0xf
+#define GHCB_MSR_PAGE_STATE_CHANGE_RESP		0x015
+#define	GHCB_MSR_PAGE_STATE_CHANGE_ERROR_POS	32
+#define	GHCB_MSR_PAGE_STATE_CHANGE_ERROR_MASK	0xffffffff
+#define	GHCB_MSR_PAGE_STATE_CHANGE_RSVD_POS	12
+#define	GHCB_MSR_PAGE_STATE_CHANGE_RSVD_MASK	0xfffff
+
 #define GHCB_MSR_TERM_REQ		0x100
 #define GHCB_MSR_TERM_REASON_SET_POS	12
 #define GHCB_MSR_TERM_REASON_SET_MASK	0xf
-- 
2.17.1



* [RFC Part2 PATCH 28/30] KVM: SVM: add support to handle Page State Change VMGEXIT
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (26 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 27/30] KVM: SVM: add support to handle MSR based Page State Change VMGEXIT Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 29/30] KVM: X86: export the kvm_zap_gfn_range() for the SNP use Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 30/30] KVM: X86: Add support to handle the RMP nested page fault Brijesh Singh
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change NAE event
as defined in the GHCB specification section 4.1.6.
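
For readers following the NAE path, here is a small userspace sketch of the
sanity checks applied to the request before any entry is processed. The two
error values are copied from the svm.h hunk of this patch. Everything else is
an assumption made for illustration: struct psc_header is a simplified
stand-in for the real header (defined outside this patch), PSC_MAX_ENTRY is a
placeholder for SNP_PAGE_STATE_CHANGE_MAX_ENTRY, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the svm.h hunk of this patch. */
#define VMGEXIT_PAGE_STATE_INVALID_HEADER	0x100000001ULL
#define VMGEXIT_PAGE_STATE_INVALID_ENTRY	0x100000002ULL

/* Assumption: placeholder for SNP_PAGE_STATE_CHANGE_MAX_ENTRY. */
#define PSC_MAX_ENTRY	253

/* Hypothetical, simplified header; the real struct is defined elsewhere. */
struct psc_header {
	uint16_t cur_entry;
	uint16_t end_entry;
};

/* Reject headers whose indices are out of range or out of order. */
static uint64_t psc_check_header(const struct psc_header *hdr)
{
	assert(hdr);

	if (hdr->cur_entry >= PSC_MAX_ENTRY ||
	    hdr->end_entry >= PSC_MAX_ENTRY ||
	    hdr->cur_entry > hdr->end_entry)
		return VMGEXIT_PAGE_STATE_INVALID_HEADER;

	return 0;
}

/* An entry's GPA must be aligned to the page size it claims. */
static uint64_t psc_check_entry(uint64_t gpa, uint64_t page_size)
{
	/* page_size is expected to be a power of two (4K or 2M). */
	assert(page_size && !(page_size & (page_size - 1)));

	return (gpa & (page_size - 1)) ? VMGEXIT_PAGE_STATE_INVALID_ENTRY : 0;
}
```

One possible ordering is to run the header check before indexing the entry
array at all, so that an out-of-range cur_entry is never used to form a
pointer.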

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c | 58 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h |  4 +++
 2 files changed, 62 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 8f046b45c424..35e7a7bbf878 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2141,6 +2141,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 	case SVM_VMGEXIT_AP_HLT_LOOP:
 	case SVM_VMGEXIT_AP_JUMP_TABLE:
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
+	case SVM_VMGEXIT_PAGE_STATE_CHANGE:
 		break;
 	default:
 		goto vmgexit_err;
@@ -2425,6 +2426,7 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, int op, gpa_t g
 		case SNP_PAGE_STATE_PRIVATE:
 			rc = snp_make_page_private(vcpu, gpa, pfn, level);
 			break;
+		/* TODO: Add USMASH and PSMASH support */
 		default:
 			rc = -EINVAL;
 			break;
@@ -2445,6 +2447,53 @@ static int __snp_handle_page_state_change(struct kvm_vcpu *vcpu, int op, gpa_t g
 	return rc;
 }
 
+static unsigned long snp_handle_page_state_change(struct vcpu_svm *svm, struct ghcb *ghcb)
+{
+	struct snp_page_state_entry *entry;
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct snp_page_state_change *info;
+	unsigned long rc;
+	int level, op;
+	gpa_t gpa;
+
+	if (!sev_snp_guest(vcpu->kvm))
+		return -ENXIO;
+
+	if (!setup_vmgexit_scratch(svm, true, sizeof(ghcb->save.sw_scratch))) {
+		pr_err("vmgexit: scratch area is not setup.\n");
+		return -EINVAL;
+	}
+
+	info = (struct snp_page_state_change *)svm->ghcb_sa;
+	entry = &info->entry[info->header.cur_entry];
+
+	if ((info->header.cur_entry >= SNP_PAGE_STATE_CHANGE_MAX_ENTRY) ||
+	    (info->header.end_entry >= SNP_PAGE_STATE_CHANGE_MAX_ENTRY) ||
+	    (info->header.cur_entry > info->header.end_entry))
+		return VMGEXIT_PAGE_STATE_INVALID_HEADER;
+
+	while (info->header.cur_entry <= info->header.end_entry) {
+		entry = &info->entry[info->header.cur_entry];
+		gpa = gfn_to_gpa(entry->gfn);
+		level = RMP_X86_PG_LEVEL(entry->pagesize);
+		op = entry->operation;
+
+		if (!IS_ALIGNED(gpa, page_level_size(level))) {
+			rc = VMGEXIT_PAGE_STATE_INVALID_ENTRY;
+			goto out;
+		}
+
+		rc = __snp_handle_page_state_change(vcpu, op, gpa, level);
+		if (rc)
+			goto out;
+
+		info->header.cur_entry++;
+	}
+
+out:
+	return rc;
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -2667,6 +2716,15 @@ int sev_handle_vmgexit(struct vcpu_svm *svm)
 		ret = 1;
 		break;
 	}
+	case SVM_VMGEXIT_PAGE_STATE_CHANGE: {
+		unsigned long rc;
+
+		ret = 1;
+
+		rc = snp_handle_page_state_change(svm, ghcb);
+		ghcb_set_sw_exit_info_2(ghcb, rc);
+		break;
+	}
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(&svm->vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 31bc9cc12c44..9fcfceb4d71e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -600,6 +600,10 @@ void svm_vcpu_unblocking(struct kvm_vcpu *vcpu);
 #define	GHCB_MSR_PAGE_STATE_CHANGE_RSVD_POS	12
 #define	GHCB_MSR_PAGE_STATE_CHANGE_RSVD_MASK	0xfffff
 
+#define VMGEXIT_PAGE_STATE_INVALID_HEADER	0x100000001
+#define VMGEXIT_PAGE_STATE_INVALID_ENTRY	0x100000002
+#define VMGEXIT_PAGE_STATE_FIRMWARE_ERROR(x)	(((x) & 0xffffffff) | 0x200000000)
+
 #define GHCB_MSR_TERM_REQ		0x100
 #define GHCB_MSR_TERM_REASON_SET_POS	12
 #define GHCB_MSR_TERM_REASON_SET_MASK	0xf
-- 
2.17.1



* [RFC Part2 PATCH 29/30] KVM: X86: export the kvm_zap_gfn_range() for the SNP use
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (27 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 28/30] KVM: SVM: add support to handle " Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  2021-03-24 17:04 ` [RFC Part2 PATCH 30/30] KVM: X86: Add support to handle the RMP nested page fault Brijesh Singh
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

While resolving the RMP page fault, we may run into cases where the page
level between the RMP entry and TDP does not match and the 2M RMP entry
must be split into 4K RMP entries. Or a 2M TDP page needs to be broken
into multiple 4K pages.

To keep the RMP and TDP page level in sync, we will zap the gfn range
after splitting the pages in the RMP entry. The zap should force the
TDP to get rebuilt with the new page level.
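
As a minimal sketch of which range ends up being zapped, the userspace
snippet below computes the 2M-aligned gfn window that contains a faulting
gpa, mirroring the arithmetic used by the RMP fault handler later in the
series (gfn rounded down to a 512-page boundary, then 512 pages from there).
struct gfn_range and zap_range_for() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* 2 MiB / 4 KiB, i.e. the value of KVM_PAGES_PER_HPAGE(PG_LEVEL_2M). */
#define PAGES_PER_2M	512ULL

/* Hypothetical helper type: half-open gfn range [start, end). */
struct gfn_range {
	uint64_t start;
	uint64_t end;
};

/* Return the 2M-aligned gfn range that contains the given gpa. */
static struct gfn_range zap_range_for(uint64_t gpa)
{
	uint64_t gfn = gpa >> 12;	/* 4K pages */
	struct gfn_range r;

	r.start = gfn & ~(PAGES_PER_2M - 1);
	r.end = r.start + PAGES_PER_2M;

	/* The faulting gfn always falls inside the range. */
	assert(r.start <= gfn && gfn < r.end);

	return r;
}
```

For a fault at gpa 0x201000 this yields gfns 0x200 through 0x3ff, so the
whole 2M region is rebuilt rather than only the faulting 4K page.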

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/mmu.h              | 2 --
 arch/x86/kvm/mmu/mmu.c          | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 074605408970..5ea584606885 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1397,6 +1397,8 @@ void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+
 
 int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
 bool pdptrs_changed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e7c4e55215bf..5f7ebe4afd63 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -223,8 +223,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return -(u32)fault & errcode;
 }
 
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 147f22bda6e7..1e057e046ca4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5569,6 +5569,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	spin_unlock(&kvm->mmu_lock);
 }
+EXPORT_SYMBOL_GPL(kvm_zap_gfn_range);
 
 static bool slot_rmap_write_protect(struct kvm *kvm,
 				    struct kvm_rmap_head *rmap_head)
-- 
2.17.1



* [RFC Part2 PATCH 30/30] KVM: X86: Add support to handle the RMP nested page fault
  2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (28 preceding siblings ...)
  2021-03-24 17:04 ` [RFC Part2 PATCH 29/30] KVM: X86: export the kvm_zap_gfn_range() for the SNP use Brijesh Singh
@ 2021-03-24 17:04 ` Brijesh Singh
  29 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 17:04 UTC (permalink / raw)
  To: linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Brijesh Singh, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson

Follow the recommendations from APM2 sections 15.36.10 and 15.36.11 to
resolve the RMP violation encountered during the NPT table walk.
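
The decision logic this patch implements can be modeled as a small pure
function, shown below as a userspace sketch. It follows the order of the
checks in snp_handle_rmp_page_fault(): a size mismatch (or a private access
through a 4K NPT mapping into a 2M RMP entry) leads to PSMASH, a private
access to an unassigned page creates a private RMP entry, and a shared access
to an assigned page makes it shared. In the patch every path is followed by a
zap of the surrounding 2M gfn range. The enum, struct, and function names are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the kernel's enum pg_level values for 4K and 2M. */
enum { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2 };

/* Hypothetical action codes for the model. */
enum rmp_action {
	RMP_ACT_NONE,		/* nothing to change in the RMP, zap only */
	RMP_ACT_PSMASH,		/* split the 2M RMP entry into 4K entries */
	RMP_ACT_MAKE_PRIVATE,	/* add a new private RMP entry */
	RMP_ACT_MAKE_SHARED,	/* clear the assignment in the RMP */
};

/* Hypothetical summary of the fault state examined by the handler. */
struct rmp_fault {
	int  npt_level;		/* level of the NPT mapping */
	int  rmp_level;		/* level recorded in the RMP entry */
	bool size_mismatch;	/* PFERR_GUEST_SIZEM_MASK set */
	bool private_access;	/* PFERR_GUEST_ENC_MASK set (C=1) */
	bool assigned;		/* rmpentry_assigned() */
};

static enum rmp_action rmp_fault_action(const struct rmp_fault *f)
{
	assert(f);

	if (f->size_mismatch ||
	    (f->npt_level == PG_LEVEL_4K && f->rmp_level == PG_LEVEL_2M &&
	     f->private_access))
		return RMP_ACT_PSMASH;

	if (!f->assigned && f->private_access)
		return RMP_ACT_MAKE_PRIVATE;

	if (f->assigned && !f->private_access)
		return RMP_ACT_MAKE_SHARED;

	return RMP_ACT_NONE;
}
```

Keeping the decision separate from the side effects makes the table in the
code comment easy to check case by case.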

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu/mmu.c          | 20 ++++++++++++
 arch/x86/kvm/svm/sev.c          | 57 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  1 +
 arch/x86/kvm/svm/svm.h          |  2 ++
 5 files changed, 82 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5ea584606885..79dec4f93808 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1311,6 +1311,8 @@ struct kvm_x86_ops {
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
 	int (*get_tdp_max_page_level)(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
+	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
+			int level, u64 error_code);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e057e046ca4..ec396169706f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5105,6 +5105,18 @@ int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page_virt);
 
+static int handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	kvm_pfn_t pfn;
+	int level;
+
+	if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &level)))
+		return RET_PF_RETRY;
+
+	kvm_x86_ops.handle_rmp_page_fault(vcpu, gpa, pfn, level, error_code);
+	return RET_PF_RETRY;
+}
+
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 		       void *insn, int insn_len)
 {
@@ -5121,6 +5133,14 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 			goto emulate;
 	}
 
+	if (unlikely(error_code & PFERR_GUEST_RMP_MASK)) {
+		r = handle_rmp_page_fault(vcpu, cr2_or_gpa, error_code);
+		if (r == RET_PF_RETRY)
+			return 1;
+		else
+			return r;
+	}
+
 	if (r == RET_PF_INVALID) {
 		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
 					  lower_32_bits(error_code), false);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 35e7a7bbf878..dbb4f15de9ba 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2924,3 +2924,60 @@ int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
 
 	return min_t(uint32_t, level, max_level);
 }
+
+int snp_handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
+			      int level, u64 error_code)
+{
+	int rlevel, rc = 0;
+	rmpentry_t *e;
+	bool private;
+	gfn_t gfn;
+
+	e = lookup_page_in_rmptable(pfn_to_page(pfn), &rlevel);
+	if (!e)
+		return 1;
+
+	private = !!(error_code & PFERR_GUEST_ENC_MASK);
+
+	/*
+	 * See APM section 15.36.11 on how to handle the RMP fault for the large pages.
+	 *
+	 *  npt	     rmp    access      action
+	 *  --------------------------------------------------
+	 *  4k       2M     C=1       psmash
+	 *  x        x      C=1       if page is not private then add a new RMP entry
+	 *  x        x      C=0       if page is private then make it shared
+	 *  2M       4k     C=x       zap
+	 */
+	if ((error_code & PFERR_GUEST_SIZEM_MASK) ||
+	    ((level == PG_LEVEL_4K) && (rlevel == PG_LEVEL_2M) && private)) {
+		rc = snp_rmptable_psmash(vcpu, pfn);
+		goto zap_gfn;
+	}
+
+	/*
+	 * If it's a private access, and the page is not assigned in the RMP table, create a
+	 * new private RMP entry.
+	 */
+	if (!rmpentry_assigned(e) && private) {
+		rc = snp_make_page_private(vcpu, gpa, pfn, PG_LEVEL_4K);
+		goto zap_gfn;
+	}
+
+	/*
+	 * If it's a shared access, then make the page shared in the RMP table.
+	 */
+	if (rmpentry_assigned(e) && !private)
+		rc = snp_make_page_shared(vcpu, gpa, pfn, PG_LEVEL_4K);
+
+zap_gfn:
+	/*
+	 * Now that we have updated the RMP pagesize, zap the existing rmaps for
+	 * large entry ranges so that nested page table gets rebuilt with the updated RMP
+	 * pagesize.
+	 */
+	gfn = gpa_to_gfn(gpa) & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+	kvm_zap_gfn_range(vcpu->kvm, gfn, gfn + 512);
+
+	return 0;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 73259a3564eb..ab30c3e3956f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4564,6 +4564,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 
 	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
 	.get_tdp_max_page_level = sev_get_tdp_max_page_level,
+	.handle_rmp_page_fault = snp_handle_rmp_page_fault,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9fcfceb4d71e..eacae54de9b5 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -637,6 +637,8 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
 void sev_snp_init_vmcb(struct vcpu_svm *svm);
 int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
+int snp_handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
+			      int level, u64 error_code);
 
 /* vmenter.S */
 
-- 
2.17.1



* Re: [RFC Part2 PATCH 06/30] x86/fault: dump the RMP entry on #PF
  2021-03-24 17:04 ` [RFC Part2 PATCH 06/30] x86/fault: dump the RMP entry on #PF Brijesh Singh
@ 2021-03-24 17:47   ` Andy Lutomirski
  2021-03-24 20:35     ` Brijesh Singh
  0 siblings, 1 reply; 69+ messages in thread
From: Andy Lutomirski @ 2021-03-24 17:47 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: LKML, X86 ML, kvm list, Linux Crypto Mailing List, Andi Kleen,
	Herbert Xu, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Dave Hansen,
	Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On Wed, Mar 24, 2021 at 10:04 AM Brijesh Singh <brijesh.singh@amd.com> wrote:
>
> If hardware detects an RMP violation, it will raise a page-fault exception
> with the RMP bit set. To help the debug, dump the RMP entry of the faulting
> address.
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: x86@kernel.org
> Cc: kvm@vger.kernel.org
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/mm/fault.c | 75 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 75 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f39b551f89a6..7605e06a6dd9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -31,6 +31,7 @@
>  #include <asm/pgtable_areas.h>         /* VMALLOC_START, ...           */
>  #include <asm/kvm_para.h>              /* kvm_handle_async_pf          */
>  #include <asm/vdso.h>                  /* fixup_vdso_exception()       */
> +#include <asm/sev-snp.h>               /* lookup_rmpentry ...          */
>
>  #define CREATE_TRACE_POINTS
>  #include <asm/trace/exceptions.h>
> @@ -147,6 +148,76 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
>  DEFINE_SPINLOCK(pgd_lock);
>  LIST_HEAD(pgd_list);
>
> +static void dump_rmpentry(struct page *page, rmpentry_t *e)
> +{
> +       unsigned long paddr = page_to_pfn(page) << PAGE_SHIFT;
> +
> +       pr_alert("RMPEntry paddr 0x%lx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx asid=%d "
> +               "vmsa=%d validated=%d]\n", paddr, rmpentry_assigned(e), rmpentry_immutable(e),
> +               rmpentry_pagesize(e), rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
> +               rmpentry_validated(e));
> +       pr_alert("RMPEntry paddr 0x%lx %016llx %016llx\n", paddr, e->high, e->low);
> +}
> +
> +static void show_rmpentry(unsigned long address)
> +{
> +       struct page *page = virt_to_page(address);

This is an error path, and I don't think you have any particular
guarantee that virt_to_page(address) is valid.  Please add appropriate
validation or use one of the slow lookup helpers.


* Re: [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code
  2021-03-24 17:04 ` [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code Brijesh Singh
@ 2021-03-24 18:03   ` Dave Hansen
  2021-03-25 14:32     ` Brijesh Singh
  2021-04-20 10:32   ` Borislav Petkov
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Hansen @ 2021-03-24 18:03 UTC (permalink / raw)
  To: Brijesh Singh, linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

> diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
> index 10b1de500ab1..107f9d947e8d 100644
> --- a/arch/x86/include/asm/trap_pf.h
> +++ b/arch/x86/include/asm/trap_pf.h
> @@ -12,6 +12,7 @@
>   *   bit 4 ==				1: fault was an instruction fetch
>   *   bit 5 ==				1: protection keys block access
>   *   bit 15 ==				1: SGX MMU page-fault
> + *   bit 31 ==				1: fault was an RMP violation
>   */
>  enum x86_pf_error_code {
>  	X86_PF_PROT	=		1 << 0,
> @@ -21,6 +22,7 @@ enum x86_pf_error_code {
>  	X86_PF_INSTR	=		1 << 4,
>  	X86_PF_PK	=		1 << 5,
>  	X86_PF_SGX	=		1 << 15,
> +	X86_PF_RMP	=		1ull << 31,
>  };

Man, I hope AMD and Intel are talking to each other about these bits.  :)

Either way, this is hitting the limits of what I know about how enums
are implemented.  I had internalized that they are just an 'int', but
that doesn't seem quite right.  It sounds like they must be implemented
using *an* integer type, but not necessarily 'int' itself.

Either way, '1<<31' doesn't fit in a 32-bit signed int.  But, gcc at
least doesn't seem to blow the enum up into a 64-bit type, which is nice.

Could we at least start declaring these with BIT()?
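
A minimal userspace sketch of what BIT()-style declarations could look like
is below. It is not the kernel header: BIT() is redefined locally as a
stand-in for the kernel macro, only a few of the error-code bits are shown,
and plain defines are used instead of an enum so that the example does not
depend on how the compiler sizes enum types. The helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's BIT() macro. */
#define BIT(nr)		(1UL << (nr))

#define X86_PF_PROT	BIT(0)
#define X86_PF_SGX	BIT(15)
#define X86_PF_RMP	BIT(31)

/* Hypothetical helper: test the RMP bit of a page-fault error code. */
static inline int pf_is_rmp_violation(unsigned long error_code)
{
	return (error_code & X86_PF_RMP) != 0;
}

/*
 * Because the constant is unsigned, widening it to 64 bits does not
 * sign-extend: the upper 32 bits stay clear.
 */
static inline uint64_t pf_rmp_as_u64(void)
{
	uint64_t v = X86_PF_RMP;

	assert(v == 0x80000000ULL);
	return v;
}
```

The point of the unsigned constant is that bit 31 stays a plain 0x80000000
wherever it is used, which is not guaranteed for a signed "1 << 31".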


* Re: [RFC Part2 PATCH 06/30] x86/fault: dump the RMP entry on #PF
  2021-03-24 17:47   ` Andy Lutomirski
@ 2021-03-24 20:35     ` Brijesh Singh
  0 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-24 20:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: brijesh.singh, LKML, X86 ML, kvm list, Linux Crypto Mailing List,
	Andi Kleen, Herbert Xu, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 3/24/21 12:47 PM, Andy Lutomirski wrote:
> On Wed, Mar 24, 2021 at 10:04 AM Brijesh Singh <brijesh.singh@amd.com> wrote:
>> If hardware detects an RMP violation, it will raise a page-fault exception
>> with the RMP bit set. To help the debug, dump the RMP entry of the faulting
>> address.
>>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Joerg Roedel <jroedel@suse.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Tony Luck <tony.luck@intel.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Tom Lendacky <thomas.lendacky@amd.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Sean Christopherson <seanjc@google.com>
>> Cc: x86@kernel.org
>> Cc: kvm@vger.kernel.org
>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> ---
>>  arch/x86/mm/fault.c | 75 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 75 insertions(+)
>>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index f39b551f89a6..7605e06a6dd9 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -31,6 +31,7 @@
>>  #include <asm/pgtable_areas.h>         /* VMALLOC_START, ...           */
>>  #include <asm/kvm_para.h>              /* kvm_handle_async_pf          */
>>  #include <asm/vdso.h>                  /* fixup_vdso_exception()       */
>> +#include <asm/sev-snp.h>               /* lookup_rmpentry ...          */
>>
>>  #define CREATE_TRACE_POINTS
>>  #include <asm/trace/exceptions.h>
>> @@ -147,6 +148,76 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
>>  DEFINE_SPINLOCK(pgd_lock);
>>  LIST_HEAD(pgd_list);
>>
>> +static void dump_rmpentry(struct page *page, rmpentry_t *e)
>> +{
>> +       unsigned long paddr = page_to_pfn(page) << PAGE_SHIFT;
>> +
>> +       pr_alert("RMPEntry paddr 0x%lx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx asid=%d "
>> +               "vmsa=%d validated=%d]\n", paddr, rmpentry_assigned(e), rmpentry_immutable(e),
>> +               rmpentry_pagesize(e), rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
>> +               rmpentry_validated(e));
>> +       pr_alert("RMPEntry paddr 0x%lx %016llx %016llx\n", paddr, e->high, e->low);
>> +}
>> +
>> +static void show_rmpentry(unsigned long address)
>> +{
>> +       struct page *page = virt_to_page(address);
> This is an error path, and I don't think you have any particular
> guarantee that virt_to_page(address) is valid.  Please add appropriate
> validation or use one of the slow lookup helpers.


Noted, thanks for the quick feedback.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation
  2021-03-24 17:04 ` [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation Brijesh Singh
@ 2021-03-25 14:30   ` Dave Hansen
  2021-03-25 14:48   ` Dave Hansen
  1 sibling, 0 replies; 69+ messages in thread
From: Dave Hansen @ 2021-03-25 14:30 UTC (permalink / raw)
  To: Brijesh Singh, linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On 3/24/21 10:04 AM, Brijesh Singh wrote:
> @@ -1377,6 +1442,22 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	if (hw_error_code & X86_PF_INSTR)
>  		flags |= FAULT_FLAG_INSTRUCTION;
>  
> +	/*
> +	 * If its an RMP violation, see if we can resolve it.
> +	 */
> +	if ((hw_error_code & X86_PF_RMP)) {
> +		ret = handle_rmp_page_fault(hw_error_code, address);
> +		if (ret == RMP_FAULT_PAGE_SPLIT) {
> +			flags |= FAULT_FLAG_PAGE_SPLIT;
> +		} else if (ret == RMP_FAULT_KILL) {
> +			fault |= VM_FAULT_SIGBUS;
> +			mm_fault_error(regs, hw_error_code, address, fault);
> +			return;
> +		} else {
> +			return;
> +		}
> +	}

Won't khugepaged come right back around and coalesce this page again?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code
  2021-03-24 18:03   ` Dave Hansen
@ 2021-03-25 14:32     ` Brijesh Singh
  2021-03-25 14:34       ` Dave Hansen
  0 siblings, 1 reply; 69+ messages in thread
From: Brijesh Singh @ 2021-03-25 14:32 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86, kvm, linux-crypto
  Cc: brijesh.singh, ak, herbert, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 3/24/21 1:03 PM, Dave Hansen wrote:
>> diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
>> index 10b1de500ab1..107f9d947e8d 100644
>> --- a/arch/x86/include/asm/trap_pf.h
>> +++ b/arch/x86/include/asm/trap_pf.h
>> @@ -12,6 +12,7 @@
>>   *   bit 4 ==				1: fault was an instruction fetch
>>   *   bit 5 ==				1: protection keys block access
>>   *   bit 15 ==				1: SGX MMU page-fault
>> + *   bit 31 ==				1: fault was an RMP violation
>>   */
>>  enum x86_pf_error_code {
>>  	X86_PF_PROT	=		1 << 0,
>> @@ -21,6 +22,7 @@ enum x86_pf_error_code {
>>  	X86_PF_INSTR	=		1 << 4,
>>  	X86_PF_PK	=		1 << 5,
>>  	X86_PF_SGX	=		1 << 15,
>> +	X86_PF_RMP	=		1ull << 31,
>>  };
> Man, I hope AMD and Intel are talking to each other about these bits.  :)
>
> Either way, this is hitting the limits of what I know about how enums
> are implemented.  I had internalized that they are just an 'int', but
> that doesn't seem quite right.  It sounds like they must be implemented
> using *an* integer type, but not necessarily 'int' itself.
>
> Either way, '1<<31' doesn't fit in a 32-bit signed int.  But, gcc at
> least doesn't seem to blow the enum up into a 64-bit type, which is nice.
>
> Could we at least start declaring these with BIT()?


Sure, I can use the BIT() macro to define the bits. Do you want me to
update all of the fault codes to use BIT(), or just the one I am adding
in this patch?



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code
  2021-03-25 14:32     ` Brijesh Singh
@ 2021-03-25 14:34       ` Dave Hansen
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Hansen @ 2021-03-25 14:34 UTC (permalink / raw)
  To: Brijesh Singh, linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On 3/25/21 7:32 AM, Brijesh Singh wrote:
>>>  enum x86_pf_error_code {
>>>  	X86_PF_PROT	=		1 << 0,
>>> @@ -21,6 +22,7 @@ enum x86_pf_error_code {
>>>  	X86_PF_INSTR	=		1 << 4,
>>>  	X86_PF_PK	=		1 << 5,
>>>  	X86_PF_SGX	=		1 << 15,
>>> +	X86_PF_RMP	=		1ull << 31,
>>>  };
...
>> Could we at least start declaring these with BIT()?
> 
> Sure, I can use the BIT() macro to define the bits. Do you want me to
> update all of the fault codes to use BIT(), or just the one I am adding
> in this patch?

Please update all of them for consistency.
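
For reference, here is a minimal userspace sketch of why the shift form
matters. It assumes BIT(nr) expands to (1UL << (nr)) and a 32-bit int;
the PF_* names and is_rmp_fault() are hypothetical stand-ins for the
enum in the quoted patch, not the actual kernel definitions.

```c
#include <assert.h>
#include <limits.h>

/* Userspace stand-in for the kernel's BIT() macro (assumed: 1UL << nr). */
#define BIT(nr)		(1UL << (nr))

/* Hypothetical mirror of the fault-code bits listed in the quoted patch. */
#define PF_PROT		BIT(0)
#define PF_INSTR	BIT(4)
#define PF_PK		BIT(5)
#define PF_SGX		BIT(15)
#define PF_RMP		BIT(31)

/*
 * With a plain int, 1 << 31 shifts into the sign bit, which ISO C leaves
 * undefined. BIT(31) is unsigned, so the value is well defined.
 */
static_assert(PF_RMP == 0x80000000UL, "bit 31 must be representable");
static_assert(PF_RMP > (unsigned long)INT_MAX, "bit 31 exceeds a signed int");

/* Returns nonzero when the hardware error code carries the RMP bit. */
static int is_rmp_fault(unsigned long hw_error_code)
{
	return (hw_error_code & PF_RMP) != 0;
}
```

Calling is_rmp_fault(PF_RMP | PF_PROT) returns 1, while
is_rmp_fault(PF_PROT) returns 0.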

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation
  2021-03-24 17:04 ` [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation Brijesh Singh
  2021-03-25 14:30   ` Dave Hansen
@ 2021-03-25 14:48   ` Dave Hansen
  2021-03-25 15:24     ` Brijesh Singh
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Hansen @ 2021-03-25 14:48 UTC (permalink / raw)
  To: Brijesh Singh, linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On 3/24/21 10:04 AM, Brijesh Singh wrote:
> When SEV-SNP is enabled globally in the system, a write from the hypervisor
> can raise an RMP violation. We can resolve the RMP violation by splitting
> the virtual address to a lower page level.
> 
> e.g
> - guest made a page shared in the RMP entry so that the hypervisor
>   can write to it.
> - the hypervisor has mapped the pfn as a large page. A write access
>   will cause an RMP violation if one of the pages within the 2MB region
>   is a guest private page.
> 
> The above RMP violation can be resolved by simply splitting the large
> page.

What if the large page is provided by hugetlbfs?

What if the kernel uses the direct map to access the page instead of the
userspace mapping?

> The architecture specific code will read the RMP entry to determine
> if the fault can be resolved by splitting and propagating the request
> to split the page by setting newly introduced fault flag
> (FAULT_FLAG_PAGE_SPLIT). If the fault cannot be resolved by splitting,
> then a SIGBUS signal is sent to terminate the process.

Are users just supposed to know what memory types are compatible with
SEV-SNP?  Basically, don't use anything that might map a guest using
non-4k entries, except THP?

This does seem like a rather nasty aspect of the hardware.  For
everything else, if the virtualization page tables and the x86 tables
disagree, the TLB just sees the smallest page size.

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 7605e06a6dd9..f6571563f433 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1305,6 +1305,70 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>  }
>  NOKPROBE_SYMBOL(do_kern_addr_fault);
>  
> +#define RMP_FAULT_RETRY		0
> +#define RMP_FAULT_KILL		1
> +#define RMP_FAULT_PAGE_SPLIT	2
> +
> +static inline size_t pages_per_hpage(int level)
> +{
> +	return page_level_size(level) / PAGE_SIZE;
> +}
> +
> +/*
> + * The RMP fault can happen when a hypervisor attempts to write to:
> + * 1. a guest owned page or
> + * 2. any pages in the large page is a guest owned page.
> + *
> + * #1 will happen only when a process or VMM is attempting to modify the guest page
> + * without the guests cooperation. If a guest wants a VMM to be able to write to its memory
> + * then it should make the page shared. If we detect #1, kill the process because we can not
> + * resolve the fault.
> + *
> + * #2 can happen when the page level does not match between the RMP entry and x86
> + * page table walk, e.g the page is mapped as a large page in the x86 page table but its
> + * added as a 4K shared page in the RMP entry. This can be resolved by splitting the address
> + * into a smaller page level.
> + */

These comments need to get wrapped a bit sooner.  Could you try to match
some of the others in the file?

> +static int handle_rmp_page_fault(unsigned long hw_error_code, unsigned long address)
> +{
> +	unsigned long pfn, mask;
> +	int rmp_level, level;
> +	rmpentry_t *e;
> +	pte_t *pte;
> +
> +	/* Get the native page level */
> +	pte = lookup_address_in_mm(current->mm, address, &level);
> +	if (unlikely(!pte))
> +		return RMP_FAULT_KILL;
> +
> +	pfn = pte_pfn(*pte);
> +	if (level > PG_LEVEL_4K) {
> +		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> +		pfn |= (address >> PAGE_SHIFT) & mask;
> +	}

What is this trying to do, exactly?

> +	/* Get the page level from the RMP entry. */
> +	e = lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
> +	if (!e) {
> +		pr_alert("SEV-SNP: failed to lookup RMP entry for address 0x%lx pfn 0x%lx\n",
> +			 address, pfn);
> +		return RMP_FAULT_KILL;
> +	}
> +
> +	/* Its a guest owned page */
> +	if (rmpentry_assigned(e))
> +		return RMP_FAULT_KILL;
> +
> +	/*
> +	 * Its a shared page but the page level does not match between the native walk
> +	 * and RMP entry.
> +	 */

For these two-line comments, please try to split the text fairly evenly
between the lines.

> +	if (level > rmp_level)
> +		return RMP_FAULT_PAGE_SPLIT;
> +
> +	return RMP_FAULT_RETRY;
> +}
> +
>  /* Handle faults in the user portion of the address space */
>  static inline
>  void do_user_addr_fault(struct pt_regs *regs,
> @@ -1315,6 +1379,7 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	struct task_struct *tsk;
>  	struct mm_struct *mm;
>  	vm_fault_t fault;
> +	int ret;
>  	unsigned int flags = FAULT_FLAG_DEFAULT;
>  
>  	tsk = current;
> @@ -1377,6 +1442,22 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	if (hw_error_code & X86_PF_INSTR)
>  		flags |= FAULT_FLAG_INSTRUCTION;
>  
> +	/*
> +	 * If its an RMP violation, see if we can resolve it.
> +	 */
> +	if ((hw_error_code & X86_PF_RMP)) {
> +		ret = handle_rmp_page_fault(hw_error_code, address);
> +		if (ret == RMP_FAULT_PAGE_SPLIT) {
> +			flags |= FAULT_FLAG_PAGE_SPLIT;
> +		} else if (ret == RMP_FAULT_KILL) {
> +			fault |= VM_FAULT_SIGBUS;
> +			mm_fault_error(regs, hw_error_code, address, fault);
> +			return;
> +		} else {
> +			return;
> +		}
> +	}
> +
>  #ifdef CONFIG_X86_64
>  	/*
>  	 * Faults in the vsyscall page might need emulation.  The
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ecdf8a8cd6ae..1be3218f3738 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -434,6 +434,8 @@ extern pgprot_t protection_map[16];
>   * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
>   * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
>   * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
> + * @FAULT_FLAG_PAGE_SPLIT: The fault was due page size mismatch, split the region to smaller
> + *   page size and retry.
>   *
>   * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
>   * whether we would allow page faults to retry by specifying these two
> @@ -464,6 +466,7 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_REMOTE			0x80
>  #define FAULT_FLAG_INSTRUCTION  		0x100
>  #define FAULT_FLAG_INTERRUPTIBLE		0x200
> +#define FAULT_FLAG_PAGE_SPLIT			0x400
>  
>  /*
>   * The default fault flags that should be used by most of the
> @@ -501,7 +504,8 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags)
>  	{ FAULT_FLAG_USER,		"USER" }, \
>  	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
>  	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
> -	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
> +	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
> +	{ FAULT_FLAG_PAGE_SPLIT,	"PAGESPLIT" }
>  
>  /*
>   * vm_fault is filled by the pagefault handler and passed to the vma's
> diff --git a/mm/memory.c b/mm/memory.c
> index feff48e1465a..c9dcf9b30719 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4427,6 +4427,12 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +static int handle_split_page_fault(struct vm_fault *vmf)
> +{
> +	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> +	return 0;
> +}

Wait a sec, I thought this could fail.  Where's the "failed to split"
path?  Why does this even return an error code if it's always 0?

>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
> @@ -4448,6 +4454,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>  	pgd_t *pgd;
>  	p4d_t *p4d;
>  	vm_fault_t ret;
> +	int split_page = flags & FAULT_FLAG_PAGE_SPLIT;
>  
>  	pgd = pgd_offset(mm, address);
>  	p4d = p4d_alloc(mm, pgd, address);
> @@ -4504,6 +4511,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>  				pmd_migration_entry_wait(mm, vmf.pmd);
>  			return 0;
>  		}
> +
> +		if (split_page)
> +			return handle_split_page_fault(&vmf);
> +
>  		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
>  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
>  				return do_huge_pmd_numa_page(&vmf, orig_pmd);

Is there a reason for the 'split_page' variable?  It seems like a waste
of space.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support
  2021-03-24 17:04 ` [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support Brijesh Singh
@ 2021-03-25 14:58   ` Dave Hansen
  2021-03-25 15:31     ` Brijesh Singh
  2021-04-14  7:27   ` Borislav Petkov
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Hansen @ 2021-03-25 14:58 UTC (permalink / raw)
  To: Brijesh Singh, linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

> +static int __init mem_encrypt_snp_init(void)
> +{
> +	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
> +		return 1;
> +
> +	if (rmptable_init()) {
> +		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
> +		return 1;
> +	}
> +
> +	static_branch_enable(&snp_enable_key);
> +
> +	return 0;
> +}

Could you explain a bit why 'snp_enable_key' is needed in addition to
X86_FEATURE_SEV_SNP?

For a lot of features, we just use cpu_feature_enabled(), which does
both compile-time and static_cpu_has().  This whole series seems to lack
compile-time disables for the code that it adds, like the code it adds
to arch/x86/mm/fault.c or even mm/memory.c.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
  2021-03-24 17:04 ` [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table Brijesh Singh
@ 2021-03-25 15:17   ` Dave Hansen
  2021-04-19 12:32   ` Borislav Petkov
  1 sibling, 0 replies; 69+ messages in thread
From: Dave Hansen @ 2021-03-25 15:17 UTC (permalink / raw)
  To: Brijesh Singh, linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On 3/24/21 10:04 AM, Brijesh Singh wrote:
> The splitting of the physmap is a temporary solution until we work to
> improve the kernel page fault handler to split the pages on demand.
> One of the disadvantages of splitting is that eventually, we will end up
> breaking down the entire physmap unless we combine the split pages back to
> a large page. I am open to suggestions on various approaches we could
> take to address this problem.

Other than suggesting that the hardware be fixed to do the fracturing
itself?  :)

I suspect that this code is trying to be *too* generic.  I would expect
that very little of guest memory is actually shared with the host.  It's
also not going to be random guest pages.  The guest and the host have to
agree on these things, and I *think* the host is free to move the
physical backing around once it's shared.

So, let's say that there is a guest->host paravirt interface where the
guest says in advance, "I want to share this page."  The host can split
at *that* point and *only* split that one page's mapping.  Any page
faults would occur only if the host screws up, and would result in an oops.

That also gives a point where the host can say, "nope, that's hugetlbfs,
I can't split it".


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation
  2021-03-25 14:48   ` Dave Hansen
@ 2021-03-25 15:24     ` Brijesh Singh
  2021-03-25 15:59       ` Dave Hansen
  0 siblings, 1 reply; 69+ messages in thread
From: Brijesh Singh @ 2021-03-25 15:24 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86, kvm, linux-crypto
  Cc: brijesh.singh, ak, herbert, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 3/25/21 9:48 AM, Dave Hansen wrote:
> On 3/24/21 10:04 AM, Brijesh Singh wrote:
>> When SEV-SNP is enabled globally in the system, a write from the hypervisor
>> can raise an RMP violation. We can resolve the RMP violation by splitting
>> the virtual address to a lower page level.
>>
>> e.g
>> - guest made a page shared in the RMP entry so that the hypervisor
>>   can write to it.
>> - the hypervisor has mapped the pfn as a large page. A write access
>>   will cause an RMP violation if one of the pages within the 2MB region
>>   is a guest private page.
>>
>> The above RMP violation can be resolved by simply splitting the large
>> page.
> What if the large page is provided by hugetlbfs?

I was not able to find a method to split large pages in hugetlbfs.
Unfortunately, at this time a VMM cannot use backing memory from the
hugetlbfs pool. An SEV-SNP aware VMM can use either transparent
hugepages or small pages.


>
> What if the kernel uses the direct map to access the page instead of the
> userspace mapping?


See patch 04/30. Currently, we split the kernel direct map to 4K
before adding the page to the RMP table, which avoids having to split
pages in response to an RMP fault.


>
>> The architecture specific code will read the RMP entry to determine
>> if the fault can be resolved by splitting and propagating the request
>> to split the page by setting newly introduced fault flag
>> (FAULT_FLAG_PAGE_SPLIT). If the fault cannot be resolved by splitting,
>> then a SIGBUS signal is sent to terminate the process.
> Are users just supposed to know what memory types are compatible with
> SEV-SNP?  Basically, don't use anything that might map a guest using
> non-4k entries, except THP?


Currently, the VMM needs to know which memory types are compatible and
use one of them for allocating the backing pages.

>
> This does seem like a rather nasty aspect of the hardware.  For
> everything else, if the virtualization page tables and the x86 tables
> disagree, the TLB just sees the smallest page size.
>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 7605e06a6dd9..f6571563f433 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -1305,6 +1305,70 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>>  }
>>  NOKPROBE_SYMBOL(do_kern_addr_fault);
>>  
>> +#define RMP_FAULT_RETRY		0
>> +#define RMP_FAULT_KILL		1
>> +#define RMP_FAULT_PAGE_SPLIT	2
>> +
>> +static inline size_t pages_per_hpage(int level)
>> +{
>> +	return page_level_size(level) / PAGE_SIZE;
>> +}
>> +
>> +/*
>> + * The RMP fault can happen when a hypervisor attempts to write to:
>> + * 1. a guest owned page or
>> + * 2. any pages in the large page is a guest owned page.
>> + *
>> + * #1 will happen only when a process or VMM is attempting to modify the guest page
>> + * without the guests cooperation. If a guest wants a VMM to be able to write to its memory
>> + * then it should make the page shared. If we detect #1, kill the process because we can not
>> + * resolve the fault.
>> + *
>> + * #2 can happen when the page level does not match between the RMP entry and x86
>> + * page table walk, e.g the page is mapped as a large page in the x86 page table but its
>> + * added as a 4K shared page in the RMP entry. This can be resolved by splitting the address
>> + * into a smaller page level.
>> + */
> These comments need to get wrapped a bit sooner.  Could you try to match
> some of the others in the file?


Noted.


>
>> +static int handle_rmp_page_fault(unsigned long hw_error_code, unsigned long address)
>> +{
>> +	unsigned long pfn, mask;
>> +	int rmp_level, level;
>> +	rmpentry_t *e;
>> +	pte_t *pte;
>> +
>> +	/* Get the native page level */
>> +	pte = lookup_address_in_mm(current->mm, address, &level);
>> +	if (unlikely(!pte))
>> +		return RMP_FAULT_KILL;
>> +
>> +	pfn = pte_pfn(*pte);
>> +	if (level > PG_LEVEL_4K) {
>> +		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
>> +		pfn |= (address >> PAGE_SHIFT) & mask;
>> +	}
> What is this trying to do, exactly?


It is trying to calculate the pfn within the large entry.

The lookup above returns the base pfn of the large page, so we need the
index of the faulting 4K page within the large page to calculate its PFN.
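
The index calculation can be sketched in userspace as below. This is an
illustration under assumptions, not the patch itself: the PG_LEVEL_*
values and page_level_size() are local stand-ins assumed to match x86,
and patch_mask(), full_mask() and pfn_in_hpage() are hypothetical names.
patch_mask() reproduces the expression from the quoted hunk, full_mask()
is the plain "pages - 1" form shown for comparison.

```c
#include <stddef.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Local stand-ins, assumed to match x86's page levels. */
enum { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2, PG_LEVEL_1G = 3 };

/* Size in bytes of a mapping at the given level: 4K, 2M or 1G. */
static unsigned long page_level_size(int level)
{
	return 1UL << (PAGE_SHIFT + 9 * (level - 1));
}

/* Number of 4K pages covered by a mapping at the given level. */
static size_t pages_per_hpage(int level)
{
	return page_level_size(level) / PAGE_SIZE;
}

/* Mask exactly as computed in the quoted hunk. */
static unsigned long patch_mask(int level)
{
	return pages_per_hpage(level) - pages_per_hpage(level - 1);
}

/* Hypothetical alternative: keep every 4K index bit inside the mapping. */
static unsigned long full_mask(int level)
{
	return pages_per_hpage(level) - 1;
}

/* Combine the base pfn of the large page with the 4K index of the address. */
static unsigned long pfn_in_hpage(unsigned long base_pfn,
				  unsigned long address, unsigned long mask)
{
	return base_pfn | ((address >> PAGE_SHIFT) & mask);
}
```

For a 2M mapping both masks are 511, so the result is the same. For a 1G
mapping the hunk's expression evaluates to 0x3fe00, which leaves out the
low nine index bits, while the "pages - 1" form gives 0x3ffff; whether
the 2M-aligned result is intended there is worth confirming.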

>
>> +	/* Get the page level from the RMP entry. */
>> +	e = lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
>> +	if (!e) {
>> +		pr_alert("SEV-SNP: failed to lookup RMP entry for address 0x%lx pfn 0x%lx\n",
>> +			 address, pfn);
>> +		return RMP_FAULT_KILL;
>> +	}
>> +
>> +	/* Its a guest owned page */
>> +	if (rmpentry_assigned(e))
>> +		return RMP_FAULT_KILL;
>> +
>> +	/*
>> +	 * Its a shared page but the page level does not match between the native walk
>> +	 * and RMP entry.
>> +	 */
> For these two-line comments, please try to split the text fairly evenly
> between the lines.


Noted.

>
>> +	if (level > rmp_level)
>> +		return RMP_FAULT_PAGE_SPLIT;
>> +
>> +	return RMP_FAULT_RETRY;
>> +}
>> +
>>  /* Handle faults in the user portion of the address space */
>>  static inline
>>  void do_user_addr_fault(struct pt_regs *regs,
>> @@ -1315,6 +1379,7 @@ void do_user_addr_fault(struct pt_regs *regs,
>>  	struct task_struct *tsk;
>>  	struct mm_struct *mm;
>>  	vm_fault_t fault;
>> +	int ret;
>>  	unsigned int flags = FAULT_FLAG_DEFAULT;
>>  
>>  	tsk = current;
>> @@ -1377,6 +1442,22 @@ void do_user_addr_fault(struct pt_regs *regs,
>>  	if (hw_error_code & X86_PF_INSTR)
>>  		flags |= FAULT_FLAG_INSTRUCTION;
>>  
>> +	/*
>> +	 * If its an RMP violation, see if we can resolve it.
>> +	 */
>> +	if ((hw_error_code & X86_PF_RMP)) {
>> +		ret = handle_rmp_page_fault(hw_error_code, address);
>> +		if (ret == RMP_FAULT_PAGE_SPLIT) {
>> +			flags |= FAULT_FLAG_PAGE_SPLIT;
>> +		} else if (ret == RMP_FAULT_KILL) {
>> +			fault |= VM_FAULT_SIGBUS;
>> +			mm_fault_error(regs, hw_error_code, address, fault);
>> +			return;
>> +		} else {
>> +			return;
>> +		}
>> +	}
>> +
>>  #ifdef CONFIG_X86_64
>>  	/*
>>  	 * Faults in the vsyscall page might need emulation.  The
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index ecdf8a8cd6ae..1be3218f3738 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -434,6 +434,8 @@ extern pgprot_t protection_map[16];
>>   * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
>>   * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
>>   * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
>> + * @FAULT_FLAG_PAGE_SPLIT: The fault was due page size mismatch, split the region to smaller
>> + *   page size and retry.
>>   *
>>   * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
>>   * whether we would allow page faults to retry by specifying these two
>> @@ -464,6 +466,7 @@ extern pgprot_t protection_map[16];
>>  #define FAULT_FLAG_REMOTE			0x80
>>  #define FAULT_FLAG_INSTRUCTION  		0x100
>>  #define FAULT_FLAG_INTERRUPTIBLE		0x200
>> +#define FAULT_FLAG_PAGE_SPLIT			0x400
>>  
>>  /*
>>   * The default fault flags that should be used by most of the
>> @@ -501,7 +504,8 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags)
>>  	{ FAULT_FLAG_USER,		"USER" }, \
>>  	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
>>  	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
>> -	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
>> +	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
>> +	{ FAULT_FLAG_PAGE_SPLIT,	"PAGESPLIT" }
>>  
>>  /*
>>   * vm_fault is filled by the pagefault handler and passed to the vma's
>> diff --git a/mm/memory.c b/mm/memory.c
>> index feff48e1465a..c9dcf9b30719 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4427,6 +4427,12 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>>  	return 0;
>>  }
>>  
>> +static int handle_split_page_fault(struct vm_fault *vmf)
>> +{
>> +	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>> +	return 0;
>> +}
> Wait a sec, I thought this could fail.  Where's the "failed to split"
> path?  Why does this even return an error code if it's always 0?


Ah, I missed it. Let me address this comment by propagating the error
code to the caller so that it can take corrective action. In this
particular case, if we fail to split then we will not be able to resolve
the fault and will need to send SIGBUS to abort the process.


>
>>  /*
>>   * By the time we get here, we already hold the mm semaphore
>>   *
>> @@ -4448,6 +4454,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>>  	pgd_t *pgd;
>>  	p4d_t *p4d;
>>  	vm_fault_t ret;
>> +	int split_page = flags & FAULT_FLAG_PAGE_SPLIT;
>>  
>>  	pgd = pgd_offset(mm, address);
>>  	p4d = p4d_alloc(mm, pgd, address);
>> @@ -4504,6 +4511,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>>  				pmd_migration_entry_wait(mm, vmf.pmd);
>>  			return 0;
>>  		}
>> +
>> +		if (split_page)
>> +			return handle_split_page_fault(&vmf);
>> +
>>  		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
>>  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
>>  				return do_huge_pmd_numa_page(&vmf, orig_pmd);
> Is there a reason for the 'split_page' variable?  It seems like a waste
> of space.


I can remove it to save some space.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support
  2021-03-25 14:58   ` Dave Hansen
@ 2021-03-25 15:31     ` Brijesh Singh
  2021-03-25 15:51       ` Dave Hansen
  0 siblings, 1 reply; 69+ messages in thread
From: Brijesh Singh @ 2021-03-25 15:31 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86, kvm, linux-crypto
  Cc: brijesh.singh, ak, herbert, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 3/25/21 9:58 AM, Dave Hansen wrote:
>> +static int __init mem_encrypt_snp_init(void)
>> +{
>> +	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
>> +		return 1;
>> +
>> +	if (rmptable_init()) {
>> +		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>> +		return 1;
>> +	}
>> +
>> +	static_branch_enable(&snp_enable_key);
>> +
>> +	return 0;
>> +}
> Could you explain a bit why 'snp_enable_key' is needed in addition to
> X86_FEATURE_SEV_SNP?


X86_FEATURE_SEV_SNP indicates that the hardware supports the feature --
this does not necessarily mean that SEV-SNP is enabled in the host. The
snp_enabled_key() helper is later used by the kernel and drivers to check
whether SEV-SNP is enabled, e.g. when a driver calls the RMPUPDATE
instruction, the rmpupdate helper routine checks whether SNP is
enabled. If SEV-SNP is not enabled then the instruction will cause a #UD.

>
> For a lot of features, we just use cpu_feature_enabled(), which does
> both compile-time and static_cpu_has().  This whole series seems to lack
> compile-time disables for the code that it adds, like the code it adds
> to arch/x86/mm/fault.c or even mm/memory.c.


Noted, I will add the #ifdef to make sure that it's compiled out when
the config does not have AMD_MEM_ENCRYPTION enabled.


>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support
  2021-03-25 15:31     ` Brijesh Singh
@ 2021-03-25 15:51       ` Dave Hansen
  2021-03-25 17:41         ` Brijesh Singh
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Hansen @ 2021-03-25 15:51 UTC (permalink / raw)
  To: Brijesh Singh, linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On 3/25/21 8:31 AM, Brijesh Singh wrote:
> 
> On 3/25/21 9:58 AM, Dave Hansen wrote:
>>> +static int __init mem_encrypt_snp_init(void)
>>> +{
>>> +	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
>>> +		return 1;
>>> +
>>> +	if (rmptable_init()) {
>>> +		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>>> +		return 1;
>>> +	}
>>> +
>>> +	static_branch_enable(&snp_enable_key);
>>> +
>>> +	return 0;
>>> +}
>> Could you explain a bit why 'snp_enable_key' is needed in addition to
>> X86_FEATURE_SEV_SNP?
> 
> 
> X86_FEATURE_SEV_SNP indicates that the hardware supports the feature --
> this does not necessarily mean that SEV-SNP is enabled in the host.

I think you're confusing the CPUID bit that initially populates
X86_FEATURE_SEV_SNP with the X86_FEATURE bit.  We clear X86_FEATURE bits
all the time for features that the kernel turns off, even while the
hardware supports it.

Look at what we do in init_ia32_feat_ctl() for SGX, for instance.  We
then go on to use X86_FEATURE_SGX at runtime to see if SGX was disabled,
even though the hardware supports it.

>> For a lot of features, we just use cpu_feature_enabled(), which does
>> both compile-time and static_cpu_has().  This whole series seems to lack
>> compile-time disables for the code that it adds, like the code it adds
>> to arch/x86/mm/fault.c or even mm/memory.c.
> 
> Noted, I will add the #ifdef to make sure that it's compiled out when
> the config does not have AMD_MEM_ENCRYPTION enabled.

IS_ENABLED() tends to be nicer for these things.

Even better is if you coordinate these with your X86_FEATURE_SEV_SNP
checks.  Then, put X86_FEATURE_SEV_SNP in disabled-features.h, and you
can use cpu_feature_enabled(X86_FEATURE_SEV_SNP) as both a
(statically-patched) runtime *AND* compile-time check without
explicit #ifdefs.
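
A rough userspace model of that idea, for illustration only. Every name
here (CONFIG_SNP_MODEL, FEATURE_SNP_MODEL, DISABLED_MASK, runtime_caps,
feature_enabled() and the helpers) is hypothetical; it is not the
kernel's implementation, and it does not model static patching. It only
shows a compile-time mask being consulted before a runtime bit, so one
check covers both cases.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a Kconfig symbol; build with 0 to compile out. */
#ifndef CONFIG_SNP_MODEL
#define CONFIG_SNP_MODEL 1
#endif

/* Hypothetical feature bit number. */
#define FEATURE_SNP_MODEL 0

/* Compile-time mask of features the build has switched off. */
#if CONFIG_SNP_MODEL
#define DISABLED_MASK 0UL
#else
#define DISABLED_MASK (1UL << FEATURE_SNP_MODEL)
#endif

/* Model of the capability bits that can be set or cleared at boot. */
static unsigned long runtime_caps;

static void set_cap(int bit)
{
	runtime_caps |= 1UL << bit;
}

static void clear_cap(int bit)
{
	runtime_caps &= ~(1UL << bit);
}

/*
 * The constant mask is tested first, so when the feature is configured
 * out the whole expression folds to false and guarded code can be dropped
 * by the compiler. Otherwise the runtime bit decides.
 */
#define feature_enabled(bit)					\
	((DISABLED_MASK & (1UL << (bit))) ?			\
		false : (((runtime_caps >> (bit)) & 1UL) != 0))

/* Example guard: only proceed when the feature is on in both senses. */
static bool snp_model_active(void)
{
	return feature_enabled(FEATURE_SNP_MODEL);
}
```

In this model, an init path that fails would call
clear_cap(FEATURE_SNP_MODEL), after which snp_model_active() returns
false without needing a second enable flag.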

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation
  2021-03-25 15:24     ` Brijesh Singh
@ 2021-03-25 15:59       ` Dave Hansen
  2021-04-21 12:59         ` Vlastimil Babka
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Hansen @ 2021-03-25 15:59 UTC (permalink / raw)
  To: Brijesh Singh, linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On 3/25/21 8:24 AM, Brijesh Singh wrote:
> On 3/25/21 9:48 AM, Dave Hansen wrote:
>> On 3/24/21 10:04 AM, Brijesh Singh wrote:
>>> When SEV-SNP is enabled globally in the system, a write from the hypervisor
>>> can raise an RMP violation. We can resolve the RMP violation by splitting
>>> the virtual address to a lower page level.
>>>
>>> e.g
>>> - guest made a page shared in the RMP entry so that the hypervisor
>>>   can write to it.
>>> - the hypervisor has mapped the pfn as a large page. A write access
>>>   will cause an RMP violation if one of the pages within the 2MB region
>>>   is a guest private page.
>>>
>>> The above RMP violation can be resolved by simply splitting the large
>>> page.
>> What if the large page is provided by hugetlbfs?
> I was not able to find a method to split the large pages in the
> hugetlbfs. Unfortunately, at this time a VMM cannot use the backing
> memory from the hugetlbfs pool. An SEV-SNP aware VMM can use either
> transparent hugepage or small pages.

That's really, really nasty.  Especially since it might not be evident
until long after boot and the guest is killed.

It's even nastier because hugetlbfs is actually a great fit for SEV-SNP
memory.  It's physically contiguous, so it would keep you from having to
fracture the direct map all the way down to 4k, it also can't be
reclaimed (just like all SEV memory).

I think the minimal thing you can do here is to fail to add memory to
the RMP in the first place if you can't split it.  That way, users will
at least fail to _start_ their VM versus dying randomly for no good reason.

Even better would be to come up with a stronger contract between host
and guest.  I really don't think the host should be exposed to random
RMP faults on the direct map.  If the guest wants to share memory, then
it needs to tell the host and give the host an opportunity to move the
guest physical memory.  It might, for instance, sequester all the shared
pages in a single spot to minimize direct map fragmentation.

I'll let the other x86 folks chime in on this, but I really think this
needs a different approach than what's being proposed.
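
A rough sketch of the "fail early" idea above, written as plain
user-space C. The enum, the function names and the choice of -EINVAL
are all hypothetical; the only facts taken from the thread are that THP
can be split and that no split method was found for hugetlbfs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical classification of how a guest memory range is backed. */
enum sketch_backing {
	SKETCH_BACKING_SMALL_PAGES,
	SKETCH_BACKING_THP,
	SKETCH_BACKING_HUGETLBFS,
};

/* Per the thread: THP can be split, hugetlbfs pages currently cannot. */
static bool sketch_backing_can_split(enum sketch_backing kind)
{
	return kind != SKETCH_BACKING_HUGETLBFS;
}

/*
 * Hypothetical gate run before a range is added to the RMP: reject backing
 * that cannot be split, so the failure shows up when the VM is started
 * instead of as an unresolvable RMP fault much later.
 */
static int sketch_rmp_precheck(enum sketch_backing kind)
{
	assert(kind <= SKETCH_BACKING_HUGETLBFS);

	return sketch_backing_can_split(kind) ? 0 : -EINVAL;
}
```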


* Re: [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support
  2021-03-25 15:51       ` Dave Hansen
@ 2021-03-25 17:41         ` Brijesh Singh
  0 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-03-25 17:41 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86, kvm, linux-crypto
  Cc: brijesh.singh, ak, herbert, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 3/25/21 10:51 AM, Dave Hansen wrote:
> On 3/25/21 8:31 AM, Brijesh Singh wrote:
>> On 3/25/21 9:58 AM, Dave Hansen wrote:
>>>> +static int __init mem_encrypt_snp_init(void)
>>>> +{
>>>> +	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
>>>> +		return 1;
>>>> +
>>>> +	if (rmptable_init()) {
>>>> +		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>>>> +		return 1;
>>>> +	}
>>>> +
>>>> +	static_branch_enable(&snp_enable_key);
>>>> +
>>>> +	return 0;
>>>> +}
>>> Could you explain a bit why 'snp_enable_key' is needed in addition to
>>> X86_FEATURE_SEV_SNP?
>>
>> The X86_FEATURE_SEV_SNP indicates that hardware supports the feature --
>> this does not necessary means that SEV-SNP is enabled in the host.
> I think you're confusing the CPUID bit that initially populates
> X86_FEATURE_SEV_SNP with the X86_FEATURE bit.  We clear X86_FEATURE bits
> all the time for features that the kernel turns off, even while the
> hardware supports it.


Ah, yes I was getting mixed up. I will see if we can remove the
snp_enable_key and use the feature check.


> Look at what we do in init_ia32_feat_ctl() for SGX, for instance.  We
> then go on to use X86_FEATURE_SGX at runtime to see if SGX was disabled,
> even though the hardware supports it.
>
>>> For a lot of features, we just use cpu_feature_enabled(), which does
>>> both compile-time and static_cpu_has().  This whole series seems to lack
>>> compile-time disables for the code that it adds, like the code it adds
>>> to arch/x86/mm/fault.c or even mm/memory.c.
>> Noted, I will add the #ifdef  to make sure that its compiled out when
>> the config does not have the AMD_MEM_ENCRYPTION enabled.
> IS_ENABLED() tends to be nicer for these things.
>
> Even better is if you coordinate these with your X86_FEATURE_SEV_SNP
> checks.  Then, put X86_FEATURE_SEV_SNP in disabled-features.h, and you
> can use cpu_feature_enabled(X86_FEATURE_SEV_SNP) as both a
> (statically-patched) runtime *AND* compile-time check without an
> explicit #ifdefs.

I will try to improve this in v2 and will try IS_ENABLED().




* Re: [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support
  2021-03-24 17:04 ` [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support Brijesh Singh
  2021-03-25 14:58   ` Dave Hansen
@ 2021-04-14  7:27   ` Borislav Petkov
  2021-04-14 22:48     ` Brijesh Singh
  1 sibling, 1 reply; 69+ messages in thread
From: Borislav Petkov @ 2021-04-14  7:27 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On Wed, Mar 24, 2021 at 12:04:07PM -0500, Brijesh Singh wrote:
> @@ -538,6 +540,10 @@
>  #define MSR_K8_SYSCFG			0xc0010010
>  #define MSR_K8_SYSCFG_MEM_ENCRYPT_BIT	23
>  #define MSR_K8_SYSCFG_MEM_ENCRYPT	BIT_ULL(MSR_K8_SYSCFG_MEM_ENCRYPT_BIT)
> +#define MSR_K8_SYSCFG_SNP_EN_BIT	24
> +#define MSR_K8_SYSCFG_SNP_EN		BIT_ULL(MSR_K8_SYSCFG_SNP_EN_BIT)
> +#define MSR_K8_SYSCFG_SNP_VMPL_EN_BIT	25
> +#define MSR_K8_SYSCFG_SNP_VMPL_EN	BIT_ULL(MSR_K8_SYSCFG_SNP_VMPL_EN_BIT)
>  #define MSR_K8_INT_PENDING_MSG		0xc0010055
>  /* C1E active bits in int pending message */
>  #define K8_INTP_C1E_ACTIVE_MASK		0x18000000

Ok, I believe it is finally time to make this MSR architectural and drop
this silliness with "K8" in the name. If you wanna send me a prepatch which
converts all like this:

MSR_K8_SYSCFG -> MSR_AMD64_SYSCFG

I'll gladly take it. If you prefer me to do it, I'll gladly do it.

> @@ -44,12 +45,16 @@ u64 sev_check_data __section(".data") = 0;
>  EXPORT_SYMBOL(sme_me_mask);
>  DEFINE_STATIC_KEY_FALSE(sev_enable_key);
>  EXPORT_SYMBOL_GPL(sev_enable_key);
> +DEFINE_STATIC_KEY_FALSE(snp_enable_key);
> +EXPORT_SYMBOL_GPL(snp_enable_key);
>  
>  bool sev_enabled __section(".data");
>  
>  /* Buffer used for early in-place encryption by BSP, no locking needed */
>  static char sme_early_buffer[PAGE_SIZE] __initdata __aligned(PAGE_SIZE);
>  
> +static unsigned long rmptable_start, rmptable_end;

__ro_after_init I guess.

> +
>  /*
>   * When SNP is active, this routine changes the page state from private to shared before
>   * copying the data from the source to destination and restore after the copy. This is required
> @@ -528,3 +533,82 @@ void __init mem_encrypt_init(void)
>  	print_mem_encrypt_feature_info();
>  }
>  
> +static __init void snp_enable(void *arg)
> +{
> +	u64 val;
> +
> +	rdmsrl_safe(MSR_K8_SYSCFG, &val);

Why is this one _safe but the wrmsr isn't? Also, _safe returns a value -
check it pls and return early.

> +
> +	val |= MSR_K8_SYSCFG_SNP_EN;
> +	val |= MSR_K8_SYSCFG_SNP_VMPL_EN;
> +
> +	wrmsrl(MSR_K8_SYSCFG, val);
> +}
> +
> +static __init int rmptable_init(void)
> +{
> +	u64 rmp_base, rmp_end;
> +	unsigned long sz;
> +	void *start;
> +	u64 val;
> +
> +	rdmsrl_safe(MSR_AMD64_RMP_BASE, &rmp_base);
> +	rdmsrl_safe(MSR_AMD64_RMP_END, &rmp_end);

Ditto, why _safe if you're checking CPUID?

> +
> +	if (!rmp_base || !rmp_end) {
> +		pr_info("SEV-SNP: Memory for the RMP table has not been reserved by BIOS\n");
> +		return 1;
> +	}
> +
> +	sz = rmp_end - rmp_base + 1;
> +
> +	start = memremap(rmp_base, sz, MEMREMAP_WB);
> +	if (!start) {
> +		pr_err("SEV-SNP: Failed to map RMP table 0x%llx-0x%llx\n", rmp_base, rmp_end);
			^^^^^^^

That prefix is done by doing

#undef pr_fmt
#define pr_fmt(fmt)     "SEV-SNP: " fmt

before the SNP-specific functions.
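
As a user-space illustration of how the prefix gets prepended by
string-literal concatenation, something like the following would do.
sketch_pr() and sketch_report_map_failure() are made-up stand-ins for
pr_err() and its caller, and ##__VA_ARGS__ is a GNU C extension.

```c
#include <assert.h>
#include <stdio.h>

/* Redefine the prefix before the functions that should use it. */
#undef pr_fmt
#define pr_fmt(fmt)	"SEV-SNP: " fmt

/*
 * Stand-in for pr_info()/pr_err(): formats into a caller-supplied buffer so
 * the result can be inspected. ##__VA_ARGS__ is a GNU C extension.
 */
#define sketch_pr(buf, len, fmt, ...) \
	snprintf((buf), (len), pr_fmt(fmt), ##__VA_ARGS__)

/* Hypothetical caller: the message itself no longer repeats the prefix. */
static int sketch_report_map_failure(char *buf, size_t len,
				     unsigned long long base,
				     unsigned long long end)
{
	assert(buf && len);

	return sketch_pr(buf, len, "Failed to map RMP table 0x%llx-0x%llx\n",
			 base, end);
}
```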

> +		return 1;
> +	}
> +
> +	/*
> +	 * Check if SEV-SNP is already enabled, this can happen if we are coming from kexec boot.
> +	 * Do not initialize the RMP table when SEV-SNP is already.
> +	 */

comment can be 80 cols wide.

> +	rdmsrl_safe(MSR_K8_SYSCFG, &val);

As above.

> +	if (val & MSR_K8_SYSCFG_SNP_EN)
> +		goto skip_enable;
> +
> +	/* Initialize the RMP table to zero */
> +	memset(start, 0, sz);
> +
> +	/* Flush the caches to ensure that data is written before we enable the SNP */
> +	wbinvd_on_all_cpus();
> +
> +	/* Enable the SNP feature */
> +	on_each_cpu(snp_enable, NULL, 1);

What happens if you boot only a subset of the CPUs and then others get
hotplugged later? IOW, you need a CPU hotplug notifier which enables the
feature bit on newly arrived CPUs.

Which makes me wonder whether it makes sense to have this in an initcall
and not put it instead in init_amd(): the BSP will do the init work
and the APs coming in will see that it has been enabled and only call
snp_enable().

Which solves the hotplug thing automagically.

> +
> +skip_enable:
> +	rmptable_start = (unsigned long)start;
> +	rmptable_end = rmptable_start + sz;
> +
> +	pr_info("SEV-SNP: RMP table physical address 0x%016llx - 0x%016llx\n", rmp_base, rmp_end);

			  "RMP table at ..."

also, why is this issued in skip_enable? You want to issue it only once,
on enable.

also, rmp_base and rmp_end look redundant - you can simply use
rmptable_start and rmptable_end.

Which reminds me - that function needs to check as the very first thing
on entry whether SNP is enabled and exit if so - there's no need to read
MSR_AMD64_RMP_BASE and MSR_AMD64_RMP_END unnecessarily.

> +
> +	return 0;
> +}
> +
> +static int __init mem_encrypt_snp_init(void)
> +{
> +	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
> +		return 1;
> +
> +	if (rmptable_init()) {
> +		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
> +		return 1;
> +	}
> +
> +	static_branch_enable(&snp_enable_key);
> +
> +	return 0;
> +}
> +/*
> + * SEV-SNP must be enabled across all CPUs, so make the initialization as a late initcall.

Is there any particular reason for this to be a late initcall?
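
For illustration, the BSP/AP split suggested above could be modelled
roughly as below. This is a user-space simulation, not kernel code: the
per-CPU MSR is an array, sketch_cpu_bringup() is a hypothetical hook,
and ordering requirements such as flushing caches before enabling are
ignored. Only the bit positions 24 and 25 come from the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_CPUS		8

/* Bit positions taken from the quoted patch (SYSCFG bits 24 and 25). */
#define SKETCH_SYSCFG_SNP_EN		(1ULL << 24)
#define SKETCH_SYSCFG_SNP_VMPL_EN	(1ULL << 25)

/* Simulated per-CPU SYSCFG MSR; real code would read and write the MSR. */
static uint64_t sketch_syscfg[SKETCH_NR_CPUS];

static bool sketch_rmp_ready;		/* set once the table has been set up */
static unsigned int sketch_rmp_inits;	/* how many times setup has run */

/* One-time work: stands in for mapping and zeroing the RMP table. */
static void sketch_rmp_table_init(void)
{
	sketch_rmp_inits++;
	sketch_rmp_ready = true;
}

/* Per-CPU work: set the enable bits in this CPU's copy of the MSR. */
static void sketch_snp_enable(unsigned int cpu)
{
	sketch_syscfg[cpu] |= SKETCH_SYSCFG_SNP_EN | SKETCH_SYSCFG_SNP_VMPL_EN;
}

/*
 * Hypothetical per-CPU bringup hook. The first CPU through (the BSP) does
 * the table setup; every later CPU, including hotplugged ones, only enables.
 */
static void sketch_cpu_bringup(unsigned int cpu)
{
	assert(cpu < SKETCH_NR_CPUS);

	if (!sketch_rmp_ready)
		sketch_rmp_table_init();

	sketch_snp_enable(cpu);
}
```

A CPU that arrives later goes through the same hook, so it picks up the
enable bits without a separate hotplug notifier.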

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support
  2021-04-14  7:27   ` Borislav Petkov
@ 2021-04-14 22:48     ` Brijesh Singh
  0 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-04-14 22:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: brijesh.singh, linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 4/14/21 2:27 AM, Borislav Petkov wrote:
> On Wed, Mar 24, 2021 at 12:04:07PM -0500, Brijesh Singh wrote:
>> @@ -538,6 +540,10 @@
>>  #define MSR_K8_SYSCFG			0xc0010010
>>  #define MSR_K8_SYSCFG_MEM_ENCRYPT_BIT	23
>>  #define MSR_K8_SYSCFG_MEM_ENCRYPT	BIT_ULL(MSR_K8_SYSCFG_MEM_ENCRYPT_BIT)
>> +#define MSR_K8_SYSCFG_SNP_EN_BIT	24
>> +#define MSR_K8_SYSCFG_SNP_EN		BIT_ULL(MSR_K8_SYSCFG_SNP_EN_BIT)
>> +#define MSR_K8_SYSCFG_SNP_VMPL_EN_BIT	25
>> +#define MSR_K8_SYSCFG_SNP_VMPL_EN	BIT_ULL(MSR_K8_SYSCFG_SNP_VMPL_EN_BIT)
>>  #define MSR_K8_INT_PENDING_MSG		0xc0010055
>>  /* C1E active bits in int pending message */
>>  #define K8_INTP_C1E_ACTIVE_MASK		0x18000000
> Ok, I believe it is finally time to make this MSR architectural and drop
> this silliness with "K8" in the name. If you wanna send me a prepatch which
> converts all like this:
>
> MSR_K8_SYSCFG -> MSR_AMD64_SYSCFG
>
> I'll gladly take it. If you prefer me to do it, I'll gladly do it.


I will send a patch to address it.

>
>> @@ -44,12 +45,16 @@ u64 sev_check_data __section(".data") = 0;
>>  EXPORT_SYMBOL(sme_me_mask);
>>  DEFINE_STATIC_KEY_FALSE(sev_enable_key);
>>  EXPORT_SYMBOL_GPL(sev_enable_key);
>> +DEFINE_STATIC_KEY_FALSE(snp_enable_key);
>> +EXPORT_SYMBOL_GPL(snp_enable_key);
>>  
>>  bool sev_enabled __section(".data");
>>  
>>  /* Buffer used for early in-place encryption by BSP, no locking needed */
>>  static char sme_early_buffer[PAGE_SIZE] __initdata __aligned(PAGE_SIZE);
>>  
>> +static unsigned long rmptable_start, rmptable_end;
> __ro_after_init I guess.

Yes.

>
>> +
>>  /*
>>   * When SNP is active, this routine changes the page state from private to shared before
>>   * copying the data from the source to destination and restore after the copy. This is required
>> @@ -528,3 +533,82 @@ void __init mem_encrypt_init(void)
>>  	print_mem_encrypt_feature_info();
>>  }
>>  
>> +static __init void snp_enable(void *arg)
>> +{
>> +	u64 val;
>> +
>> +	rdmsrl_safe(MSR_K8_SYSCFG, &val);
> Why is this one _safe but the wrmsr isn't? Also, _safe returns a value -
> check it pls and return early.
No strong reason to use _safe. We reached here after all the CPUID
checks. I will drop the _safe.
>
>> +
>> +	val |= MSR_K8_SYSCFG_SNP_EN;
>> +	val |= MSR_K8_SYSCFG_SNP_VMPL_EN;
>> +
>> +	wrmsrl(MSR_K8_SYSCFG, val);
>> +}
>> +
>> +static __init int rmptable_init(void)
>> +{
>> +	u64 rmp_base, rmp_end;
>> +	unsigned long sz;
>> +	void *start;
>> +	u64 val;
>> +
>> +	rdmsrl_safe(MSR_AMD64_RMP_BASE, &rmp_base);
>> +	rdmsrl_safe(MSR_AMD64_RMP_END, &rmp_end);
> Ditto, why _safe if you're checking CPUID?
>
>> +
>> +	if (!rmp_base || !rmp_end) {
>> +		pr_info("SEV-SNP: Memory for the RMP table has not been reserved by BIOS\n");
>> +		return 1;
>> +	}
>> +
>> +	sz = rmp_end - rmp_base + 1;
>> +
>> +	start = memremap(rmp_base, sz, MEMREMAP_WB);
>> +	if (!start) {
>> +		pr_err("SEV-SNP: Failed to map RMP table 0x%llx-0x%llx\n", rmp_base, rmp_end);
> 			^^^^^^^
>
> That prefix is done by doing
>
> #undef pr_fmt
> #define pr_fmt(fmt)     "SEV-SNP: " fmt
>
> before the SNP-specific functions.
Sure, I will use it.
>
>> +		return 1;
>> +	}
>> +
>> +	/*
>> +	 * Check if SEV-SNP is already enabled, this can happen if we are coming from kexec boot.
>> +	 * Do not initialize the RMP table when SEV-SNP is already.
>> +	 */
> comment can be 80 cols wide.
Noted.
>
>> +	rdmsrl_safe(MSR_K8_SYSCFG, &val);
> As above.
>
>> +	if (val & MSR_K8_SYSCFG_SNP_EN)
>> +		goto skip_enable;
>> +
>> +	/* Initialize the RMP table to zero */
>> +	memset(start, 0, sz);
>> +
>> +	/* Flush the caches to ensure that data is written before we enable the SNP */
>> +	wbinvd_on_all_cpus();
>> +
>> +	/* Enable the SNP feature */
>> +	on_each_cpu(snp_enable, NULL, 1);
> What happens if you boot only a subset of the CPUs and then others get
> hotplugged later? IOW, you need a CPU hotplug notifier which enables the
> feature bit on newly arrived CPUs.
>
> Which makes me wonder whether it makes sense to have this in an initcall
> and not put it instead in init_amd(): the BSP will do the init work
> and the APs coming in will see that it has been enabled and only call
> snp_enable().
>
> Which solves the hotplug thing automagically.

This function needs to be called after all the APs are up. But you have
raised a very good point regarding CPU hotplug. Let me look into the CPU
hotplug notifier and move the MSR initialization for the APs into the AP
bringup callbacks.

>
>> +
>> +skip_enable:
>> +	rmptable_start = (unsigned long)start;
>> +	rmptable_end = rmptable_start + sz;
>> +
>> +	pr_info("SEV-SNP: RMP table physical address 0x%016llx - 0x%016llx\n", rmp_base, rmp_end);
> 			  "RMP table at ..."
>
> also, why is this issued in skip_enable? You want to issue it only once,
> on enable.
>
> also, rmp_base and rmp_end look redundant - you can simply use
> rmptable_start and rmptable_end.
>
> Which reminds me - that function needs to check as the very first thing
> on entry whether SNP is enabled and exit if so - there's no need to read
> MSR_AMD64_RMP_BASE and MSR_AMD64_RMP_END unnecessarily.

We need to store the virtual address of the RMP table range so that the
RMP table can be accessed later. In the next patch we add some helpers to
read the RMP table.


>> +
>> +	return 0;
>> +}
>> +
>> +static int __init mem_encrypt_snp_init(void)
>> +{
>> +	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
>> +		return 1;
>> +
>> +	if (rmptable_init()) {
>> +		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>> +		return 1;
>> +	}
>> +
>> +	static_branch_enable(&snp_enable_key);
>> +
>> +	return 0;
>> +}
>> +/*
>> + * SEV-SNP must be enabled across all CPUs, so make the initialization as a late initcall.
> Is there any particular reason for this to be a late initcall?

I wanted this function to be called after APs are up but your hotplug
question is making me rethink this logic. I will work to improve it.


>


* Re: [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers
  2021-03-24 17:04 ` [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers Brijesh Singh
@ 2021-04-15 16:57   ` Borislav Petkov
  2021-04-15 18:08     ` Brijesh Singh
  2021-04-15 17:03   ` Borislav Petkov
  1 sibling, 1 reply; 69+ messages in thread
From: Borislav Petkov @ 2021-04-15 16:57 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On Wed, Mar 24, 2021 at 12:04:08PM -0500, Brijesh Singh wrote:
> The lookup_page_in_rmptable() can be used by the host to read the RMP
> entry for a given page. The RMP entry format is documented in PPR
> section 2.1.5.2.

I see

Table 15-36. Fields of an RMP Entry

in the APM.

Which PPR do you mean? Also, you know where to put those documents,
right?

> +/* RMP table entry format (PPR section 2.1.5.2) */
> +struct __packed rmpentry {
> +	union {
> +		struct {
> +			uint64_t assigned:1;
> +			uint64_t pagesize:1;
> +			uint64_t immutable:1;
> +			uint64_t rsvd1:9;
> +			uint64_t gpa:39;
> +			uint64_t asid:10;
> +			uint64_t vmsa:1;
> +			uint64_t validated:1;
> +			uint64_t rsvd2:1;
> +		} info;
> +		uint64_t low;
> +	};
> +	uint64_t high;
> +};
> +
> +typedef struct rmpentry rmpentry_t;

Eww, a typedef. Why?

struct rmpentry is just fine.

> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 39461b9cb34e..06394b6d56b2 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -34,6 +34,8 @@
>  
>  #include "mm_internal.h"
>  

<--- Needs a comment here to explain the magic 0x4000 and the magic
shift by 8.

> +#define rmptable_page_offset(x)	(0x4000 + (((unsigned long) x) >> 8))
> +
>  /*
>   * Since SME related variables are set early in the boot process they must
>   * reside in the .data section so as not to be zeroed out when the .bss
> @@ -612,3 +614,33 @@ static int __init mem_encrypt_snp_init(void)
>   * SEV-SNP must be enabled across all CPUs, so make the initialization as a late initcall.
>   */
>  late_initcall(mem_encrypt_snp_init);
> +
> +rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level)

snp_lookup_page_in_rmptable()

> +{
> +	unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
> +	rmpentry_t *entry, *large_entry;
> +	unsigned long vaddr;
> +
> +	if (!static_branch_unlikely(&snp_enable_key))
> +		return NULL;
> +
> +	vaddr = rmptable_start + rmptable_page_offset(phys);
> +	if (WARN_ON(vaddr > rmptable_end))

Do you really want to spew a warn on splat for each wrong vaddr? What
for?

> +		return NULL;
> +
> +	entry = (rmpentry_t *)vaddr;
> +
> +	/*
> +	 * Check if this page is covered by the large RMP entry. This is needed to get
> +	 * the page level used in the RMP entry.
> +	 *

No need for a new line in the comment and no need for the "e.g." thing
either.

Also, s/the large RMP entry/a large RMP entry/g.

> +	 * e.g. if the page is covered by the large RMP entry then page size is set in the
> +	 *       base RMP entry.
> +	 */
> +	vaddr = rmptable_start + rmptable_page_offset(phys & PMD_MASK);
> +	large_entry = (rmpentry_t *)vaddr;
> +	*level = rmpentry_pagesize(large_entry);
> +
> +	return entry;
> +}
> +EXPORT_SYMBOL_GPL(lookup_page_in_rmptable);

Exported for kvm?
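
For reference, the field layout in the quoted struct can also be read
with plain masks and shifts. The sketch below derives the bit positions
from the bitfield widths in the patch (1+1+1+9+39+10+1+1+1 = 64) and
assumes the usual low-bit-first bitfield order on x86. All sketch_
names are illustrative and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two 64-bit words per entry, as in the quoted struct (16 bytes total). */
struct sketch_rmpentry {
	uint64_t low;
	uint64_t high;
};

static_assert(sizeof(struct sketch_rmpentry) == 16, "RMP entry is 16 bytes");

/* Field positions in 'low', derived from the bitfield widths in the patch. */
#define SKETCH_RMP_ASSIGNED	(1ULL << 0)
#define SKETCH_RMP_PAGESIZE	(1ULL << 1)
#define SKETCH_RMP_IMMUTABLE	(1ULL << 2)
#define SKETCH_RMP_GPA_SHIFT	12
#define SKETCH_RMP_GPA_MASK	((1ULL << 39) - 1)
#define SKETCH_RMP_ASID_SHIFT	51
#define SKETCH_RMP_ASID_MASK	((1ULL << 10) - 1)
#define SKETCH_RMP_VMSA		(1ULL << 61)
#define SKETCH_RMP_VALIDATED	(1ULL << 62)

static inline bool sketch_rmp_assigned(const struct sketch_rmpentry *e)
{
	return (e->low & SKETCH_RMP_ASSIGNED) != 0;
}

/* Returns the raw pagesize bit; its meaning is defined by the hardware docs. */
static inline unsigned int sketch_rmp_pagesize(const struct sketch_rmpentry *e)
{
	return (e->low & SKETCH_RMP_PAGESIZE) ? 1 : 0;
}

/* Guest physical address as a byte address: bits [50:12] kept in place. */
static inline uint64_t sketch_rmp_gpa(const struct sketch_rmpentry *e)
{
	return ((e->low >> SKETCH_RMP_GPA_SHIFT) & SKETCH_RMP_GPA_MASK)
		<< SKETCH_RMP_GPA_SHIFT;
}

static inline unsigned int sketch_rmp_asid(const struct sketch_rmpentry *e)
{
	return (e->low >> SKETCH_RMP_ASID_SHIFT) & SKETCH_RMP_ASID_MASK;
}
```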

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers
  2021-03-24 17:04 ` [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers Brijesh Singh
  2021-04-15 16:57   ` Borislav Petkov
@ 2021-04-15 17:03   ` Borislav Petkov
  2021-04-15 18:09     ` Brijesh Singh
  1 sibling, 1 reply; 69+ messages in thread
From: Borislav Petkov @ 2021-04-15 17:03 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On Wed, Mar 24, 2021 at 12:04:08PM -0500, Brijesh Singh wrote:
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c

Also, why is all this SNP stuff landing in this file instead of in sev.c
or so which is AMD-specific?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [RFC Part2 PATCH 03/30] x86: add helper functions for RMPUPDATE and PSMASH instruction
  2021-03-24 17:04 ` [RFC Part2 PATCH 03/30] x86: add helper functions for RMPUPDATE and PSMASH instruction Brijesh Singh
@ 2021-04-15 18:00   ` Borislav Petkov
  2021-04-15 18:15     ` Brijesh Singh
  0 siblings, 1 reply; 69+ messages in thread
From: Borislav Petkov @ 2021-04-15 18:00 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On Wed, Mar 24, 2021 at 12:04:09PM -0500, Brijesh Singh wrote:
> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> index 06394b6d56b2..7a0138cb3e17 100644
> --- a/arch/x86/mm/mem_encrypt.c
> +++ b/arch/x86/mm/mem_encrypt.c
> @@ -644,3 +644,44 @@ rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level)
>  	return entry;
>  }
>  EXPORT_SYMBOL_GPL(lookup_page_in_rmptable);
> +
> +int rmptable_psmash(struct page *page)

psmash() should be enough like all those other wrappers around insns.

> +{
> +	unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
> +	int ret;
> +
> +	if (!static_branch_unlikely(&snp_enable_key))
> +		return -ENXIO;
> +
> +	/* Retry if another processor is modifying the RMP entry. */

Also, a comment here should say which binutils version supports the
insn mnemonic so that it can be converted to "psmash" later. Ditto for
rmpupdate below.

Looking at the binutils repo, it looks like since version 2.36.

/me rebuilds objdump...

> +	do {
> +		asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
> +			      : "=a"(ret)
> +			      : "a"(spa)
> +			      : "memory", "cc");
> +	} while (ret == PSMASH_FAIL_INUSE);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(rmptable_psmash);
> +
> +int rmptable_rmpupdate(struct page *page, struct rmpupdate *val)

rmpupdate()

> +{
> +	unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
> +	bool flush = true;
> +	int ret;
> +
> +	if (!static_branch_unlikely(&snp_enable_key))
> +		return -ENXIO;
> +
> +	/* Retry if another processor is modifying the RMP entry. */
> +	do {
> +		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
> +			     : "=a"(ret)
> +			     : "a"(spa), "c"((unsigned long)val), "d"(flush)
					    ^^^^^^^^^^^^^^^

what's the cast for?

"d"(flush)?

There's nothing in the APM talking about RMPUPDATE taking an input arg
in %rdx?
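
The retry pattern shared by both wrappers can be shown in isolation
with a small user-space model. The instruction is replaced by a
callback, and SKETCH_FAIL_INUSE is a placeholder value, not the real
PSMASH_FAIL_INUSE encoding, which is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder "entry busy" status; the real value is not given here. */
#define SKETCH_FAIL_INUSE	3

/* Stand-in for executing PSMASH/RMPUPDATE on a system physical address. */
typedef int (*sketch_rmp_op_fn)(uint64_t spa, void *ctx);

/* Retry while another processor is modifying the RMP entry. */
static int sketch_rmp_op_retry(sketch_rmp_op_fn op, uint64_t spa, void *ctx)
{
	int ret;

	assert(op);

	do {
		ret = op(spa, ctx);
	} while (ret == SKETCH_FAIL_INUSE);

	return ret;
}

/* Fake operation: reports "in use" a given number of times, then succeeds. */
static int sketch_fake_busy_op(uint64_t spa, void *ctx)
{
	int *remaining = ctx;

	(void)spa;

	if (*remaining > 0) {
		(*remaining)--;
		return SKETCH_FAIL_INUSE;
	}

	return 0;
}
```

Note that a loop like this spins without any bound, which is only
reasonable if the busy condition is guaranteed to be short-lived.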

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers
  2021-04-15 16:57   ` Borislav Petkov
@ 2021-04-15 18:08     ` Brijesh Singh
  2021-04-15 19:50       ` Borislav Petkov
  0 siblings, 1 reply; 69+ messages in thread
From: Brijesh Singh @ 2021-04-15 18:08 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: brijesh.singh, linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

Hi Boris,


On 4/15/21 11:57 AM, Borislav Petkov wrote:
> On Wed, Mar 24, 2021 at 12:04:08PM -0500, Brijesh Singh wrote:
>> The lookup_page_in_rmptable() can be used by the host to read the RMP
>> entry for a given page. The RMP entry format is documented in PPR
>> section 2.1.5.2.
> I see
>
> Table 15-36. Fields of an RMP Entry
>
> in the APM.
>
> Which PPR do you mean? Also, you know where to put those documents,
> right?

This is from the PPR for Family 19h Model 01h Rev B01, the processor which
introduces the SNP feature. Yes, I have already uploaded the PPR to the BZ.

The PPR is also available at AMD: https://www.amd.com/en/support/tech-docs


>> +/* RMP table entry format (PPR section 2.1.5.2) */
>> +struct __packed rmpentry {
>> +	union {
>> +		struct {
>> +			uint64_t assigned:1;
>> +			uint64_t pagesize:1;
>> +			uint64_t immutable:1;
>> +			uint64_t rsvd1:9;
>> +			uint64_t gpa:39;
>> +			uint64_t asid:10;
>> +			uint64_t vmsa:1;
>> +			uint64_t validated:1;
>> +			uint64_t rsvd2:1;
>> +		} info;
>> +		uint64_t low;
>> +	};
>> +	uint64_t high;
>> +};
>> +
>> +typedef struct rmpentry rmpentry_t;
> Eww, a typedef. Why?
>
> struct rmpentry is just fine.


I guess I was trying to shorten the name. I am good with struct rmpentry.


>> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
>> index 39461b9cb34e..06394b6d56b2 100644
>> --- a/arch/x86/mm/mem_encrypt.c
>> +++ b/arch/x86/mm/mem_encrypt.c
>> @@ -34,6 +34,8 @@
>>  
>>  #include "mm_internal.h"
>>  
> <--- Needs a comment here to explain the magic 0x4000 and the magic
> shift by 8.


All those magic numbers are documented in the PPR. The APM does not provide
the offset of the entry inside the RMP table. This is where we need to
refer to the PPR.

>> +#define rmptable_page_offset(x)	(0x4000 + (((unsigned long) x) >> 8))
>> +
>>  /*
>>   * Since SME related variables are set early in the boot process they must
>>   * reside in the .data section so as not to be zeroed out when the .bss
>> @@ -612,3 +614,33 @@ static int __init mem_encrypt_snp_init(void)
>>   * SEV-SNP must be enabled across all CPUs, so make the initialization as a late initcall.
>>   */
>>  late_initcall(mem_encrypt_snp_init);
>> +
>> +rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level)
> snp_lookup_page_in_rmptable()

Noted.


>> +{
>> +	unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
>> +	rmpentry_t *entry, *large_entry;
>> +	unsigned long vaddr;
>> +
>> +	if (!static_branch_unlikely(&snp_enable_key))
>> +		return NULL;
>> +
>> +	vaddr = rmptable_start + rmptable_page_offset(phys);
>> +	if (WARN_ON(vaddr > rmptable_end))
> Do you really want to spew a warn on splat for each wrong vaddr? What
> for?
I guess I was using it during my development and there is no need for
it. I will remove it.
>
>> +		return NULL;
>> +
>> +	entry = (rmpentry_t *)vaddr;
>> +
>> +	/*
>> +	 * Check if this page is covered by the large RMP entry. This is needed to get
>> +	 * the page level used in the RMP entry.
>> +	 *
> No need for a new line in the comment and no need for the "e.g." thing
> either.
>
> Also, s/the large RMP entry/a large RMP entry/g.
Noted.
>
>> +	 * e.g. if the page is covered by the large RMP entry then page size is set in the
>> +	 *       base RMP entry.
>> +	 */
>> +	vaddr = rmptable_start + rmptable_page_offset(phys & PMD_MASK);
>> +	large_entry = (rmpentry_t *)vaddr;
>> +	*level = rmpentry_pagesize(large_entry);
>> +
>> +	return entry;
>> +}
>> +EXPORT_SYMBOL_GPL(lookup_page_in_rmptable);
> Exported for kvm?

The current users of this are KVM, CCP and the page fault handler.

-Brijesh



* Re: [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers
  2021-04-15 17:03   ` Borislav Petkov
@ 2021-04-15 18:09     ` Brijesh Singh
  0 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-04-15 18:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: brijesh.singh, linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 4/15/21 12:03 PM, Borislav Petkov wrote:
> On Wed, Mar 24, 2021 at 12:04:08PM -0500, Brijesh Singh wrote:
>> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
> Also, why is all this SNP stuff landing in this file instead of in sev.c
> or so which is AMD-specific?
>
I don't have any strong reason to keep it in mem_encrypt.c. All of this can
go in sev.c. I will move it in the next version.


* Re: [RFC Part2 PATCH 03/30] x86: add helper functions for RMPUPDATE and PSMASH instruction
  2021-04-15 18:00   ` Borislav Petkov
@ 2021-04-15 18:15     ` Brijesh Singh
  0 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-04-15 18:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: brijesh.singh, linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 4/15/21 1:00 PM, Borislav Petkov wrote:
> On Wed, Mar 24, 2021 at 12:04:09PM -0500, Brijesh Singh wrote:
>> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
>> index 06394b6d56b2..7a0138cb3e17 100644
>> --- a/arch/x86/mm/mem_encrypt.c
>> +++ b/arch/x86/mm/mem_encrypt.c
>> @@ -644,3 +644,44 @@ rmpentry_t *lookup_page_in_rmptable(struct page *page, int *level)
>>  	return entry;
>>  }
>>  EXPORT_SYMBOL_GPL(lookup_page_in_rmptable);
>> +
>> +int rmptable_psmash(struct page *page)
> psmash() should be enough like all those other wrappers around insns.

Noted.


>
>> +{
>> +	unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
>> +	int ret;
>> +
>> +	if (!static_branch_unlikely(&snp_enable_key))
>> +		return -ENXIO;
>> +
>> +	/* Retry if another processor is modifying the RMP entry. */
> Also, a comment here should say which binutils version supports the
> insn mnemonic so that it can be converted to "psmash" later. Ditto for
> rmpupdate below.
>
> Looking at the binutils repo, it looks like since version 2.36.
>
> /me rebuilds objdump...

Sure, I will add a comment.

>> +	do {
>> +		asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
>> +			      : "=a"(ret)
>> +			      : "a"(spa)
>> +			      : "memory", "cc");
>> +	} while (ret == PSMASH_FAIL_INUSE);
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(rmptable_psmash);
>> +
>> +int rmptable_rmpupdate(struct page *page, struct rmpupdate *val)
> rmpupdate()
>
>> +{
>> +	unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
>> +	bool flush = true;
>> +	int ret;
>> +
>> +	if (!static_branch_unlikely(&snp_enable_key))
>> +		return -ENXIO;
>> +
>> +	/* Retry if another processor is modifying the RMP entry. */
>> +	do {
>> +		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
>> +			     : "=a"(ret)
>> +			     : "a"(spa), "c"((unsigned long)val), "d"(flush)
> 					    ^^^^^^^^^^^^^^^
>
> what's the cast for?
No need to cast it. I will drop it in the next round.
> "d"(flush)?

Hmm, either I copied this function from pvalidate or an old internal APM
may have had the flush. I will fix it in the next rev. Thanks for
pointing it out.


>
> There's nothing in the APM talking about RMPUPDATE taking an input arg
> in %rdx?
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers
  2021-04-15 18:08     ` Brijesh Singh
@ 2021-04-15 19:50       ` Borislav Petkov
  2021-04-15 22:18         ` Brijesh Singh
  0 siblings, 1 reply; 69+ messages in thread
From: Borislav Petkov @ 2021-04-15 19:50 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On Thu, Apr 15, 2021 at 01:08:09PM -0500, Brijesh Singh wrote:
> This is from Family 19h Model 01h Rev B01, the processor which
> introduces the SNP feature. Yes, I have already uploaded the PPR to the BZ.
> 
> The PPR is also available at AMD: https://www.amd.com/en/support/tech-docs

Please add the link in the bugzilla to the comments here - this is the
reason why stuff is being uploaded in the first place, because those
vendor sites tend to change and those links become stale with time.

> I guess I was trying to shorten the name. I am good with struct rmpentry;

Yes please - typedefs are used only in very specific cases.

> All those magic numbers are documented in the PPR.

We use defines - not magic numbers. For example

#define RMPTABLE_ENTRIES_OFFSET 0x4000

The 8 is probably

PAGE_SHIFT - RMPENTRY_SHIFT

because you have GPA bits [50:12] and an RMP entry is 16 bytes, i.e., 1 << 4.

With defines it is actually clear what the computation is doing - with
naked numbers not really.

> APM does not provide the offset of the entry inside the RMP table.

It does, kinda, but in the pseudocode of those new insns in APM v3. From
PVALIDATE pseudo:

	RMP_ENTRY_PA = RMP_BASE + 0x4000 + (SYSTEM_PA / 0x1000) * 16

and that last

	/ 0x1000 * 16

is actually

	>> 12 - 4

i.e., the >> 8 shift.
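
For illustration only, a minimal user-space sketch of that computation with
defines instead of naked numbers. Only the 0x4000 offset, the 4K page size
and the 16-byte entry size come from the discussion above; the function
names are hypothetical and this is not the code from the patch. Note that
the shift form matches the pseudocode only for page-aligned addresses,
which the sketch checks with an assert.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; values taken from the APM pseudocode quoted above. */
#define PAGE_SHIFT              12      /* 4K pages: SYSTEM_PA / 0x1000 */
#define RMPENTRY_SHIFT          4       /* 16-byte RMP entry == 1 << 4 */
#define RMPTABLE_ENTRIES_OFFSET 0x4000  /* entries start 16K into the table */

/* Literal form of the pseudocode: divide by the page size, scale by 16. */
uint64_t rmp_entry_pa_div(uint64_t rmp_base, uint64_t spa)
{
	return rmp_base + RMPTABLE_ENTRIES_OFFSET + (spa / 0x1000) * 16;
}

/* Shift form: ">> 12" then "<< 4" collapses to ">> 8" for aligned spa. */
uint64_t rmp_entry_pa_shift(uint64_t rmp_base, uint64_t spa)
{
	/* The equivalence only holds when the low 12 bits are zero. */
	assert((spa & ((1ULL << PAGE_SHIFT) - 1)) == 0);

	return rmp_base + RMPTABLE_ENTRIES_OFFSET +
	       (spa >> (PAGE_SHIFT - RMPENTRY_SHIFT));
}
```

With the defines in place, PAGE_SHIFT - RMPENTRY_SHIFT documents where the
8 comes from.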

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers
  2021-04-15 19:50       ` Borislav Petkov
@ 2021-04-15 22:18         ` Brijesh Singh
  0 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-04-15 22:18 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: brijesh.singh, linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 4/15/21 2:50 PM, Borislav Petkov wrote:
> On Thu, Apr 15, 2021 at 01:08:09PM -0500, Brijesh Singh wrote:
>> This is from Family 19h Model 01h Rev B01, the processor which
>> introduces the SNP feature. Yes, I have already uploaded the PPR to the BZ.
>>
>> The PPR is also available at AMD: https://www.amd.com/en/support/tech-docs
> Please add the link in the bugzilla to the comments here - this is the
> reason why stuff is being uploaded in the first place, because those
> vendor sites tend to change and those links become stale with time.

Will do.


>
>> I guess I was trying to shorten the name. I am good with struct rmpentry;
> Yes please - typedefs are used only in very specific cases.
>
>> All those magic numbers are documented in the PPR.
> We use defines - not magic numbers. For example
>
> #define RMPTABLE_ENTRIES_OFFSET 0x4000
>
> The 8 is probably
>
> PAGE_SHIFT - RMPENTRY_SHIFT
>
> because you have GPA bits [50:12] and an RMP entry is 16 bytes, i.e., 1 << 4.
>
> With defines it is actually clear what the computation is doing - with
> naked numbers not really.

Sure, I will add macros to make it more readable.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
  2021-03-24 17:04 ` [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table Brijesh Singh
  2021-03-25 15:17   ` Dave Hansen
@ 2021-04-19 12:32   ` Borislav Petkov
  2021-04-19 15:25     ` Brijesh Singh
  1 sibling, 1 reply; 69+ messages in thread
From: Borislav Petkov @ 2021-04-19 12:32 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vlastimil Babka

On Wed, Mar 24, 2021 at 12:04:10PM -0500, Brijesh Singh wrote:
> A write from the hypervisor goes through the RMP checks. When the
> hypervisor writes to pages, hardware checks to ensure that the assigned
> bit in the RMP is zero (i.e., the page is shared). If the page table entry
> that gives the sPA indicates that the target page size is a large page,
> then all RMP entries for the constituent 4KB pages of the target must have
> the assigned bit 0.

Hmm, so this is important: I read this such that we can have a 2M
page table entry but the RMP table can contain 4K entries for the
corresponding 512 4K pages. Is that correct?

If so, then there's a certain discrepancy here and I'd expect that if
the page gets split/collapsed, depending on the result, the RMP table
should be updated too, so that it remains in sync.

For example:

* mm decides to group all 512 4K entries into a 2M entry, RMP table gets
updated in the end to reflect that

* mm decides to split a page, RMP table gets updated too, for the same
reason.

In this way, RMP table will be always in sync with the pagetables.

I know, I probably am missing something but that makes most sense to
me instead of noticing the discrepancy and getting to work then, when
handling the RMP violation.

Or?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
  2021-04-19 12:32   ` Borislav Petkov
@ 2021-04-19 15:25     ` Brijesh Singh
  2021-04-19 16:52       ` Borislav Petkov
  0 siblings, 1 reply; 69+ messages in thread
From: Brijesh Singh @ 2021-04-19 15:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: brijesh.singh, linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vlastimil Babka


On 4/19/21 7:32 AM, Borislav Petkov wrote:
> On Wed, Mar 24, 2021 at 12:04:10PM -0500, Brijesh Singh wrote:
>> A write from the hypervisor goes through the RMP checks. When the
>> hypervisor writes to pages, hardware checks to ensure that the assigned
>> bit in the RMP is zero (i.e., the page is shared). If the page table entry
>> that gives the sPA indicates that the target page size is a large page,
>> then all RMP entries for the constituent 4KB pages of the target must have
>> the assigned bit 0.
> Hmm, so this is important: I read this such that we can have a 2M
> page table entry but the RMP table can contain 4K entries for the
> corresponding 512 4K pages. Is that correct?

Yes that is correct.


>
> If so, then there's a certain discrepancy here and I'd expect that if
> the page gets split/collapsed, depending on the result, the RMP table
> should be updated too, so that it remains in sync.

Yes that is correct. For write access to succeed we need both the x86
and RMP page tables in sync.

>
> For example:
>
> * mm decides to group all 512 4K entries into a 2M entry, RMP table gets
> updated in the end to reflect that

To my understanding, we don't group 512 4K entries into a 2M entry for
the kernel address range. We do this for userspace addresses through
the khugepaged daemon. If the page tables get out of sync, it will cause
an RMP violation; patch #7 adds support to split the pages on demand.


>
> * mm decides to split a page, RMP table gets updated too, for the same
> reason.
>
> In this way, RMP table will be always in sync with the pagetables.
>
> I know, I probably am missing something but that makes most sense to
> me instead of noticing the discrepancy and getting to work then, when
> handling the RMP violation.
>
> Or?
>
> Thx.
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
  2021-04-19 15:25     ` Brijesh Singh
@ 2021-04-19 16:52       ` Borislav Petkov
       [not found]         ` <30bff969-e8cf-a991-7660-054ea136855a@amd.com>
  0 siblings, 1 reply; 69+ messages in thread
From: Borislav Petkov @ 2021-04-19 16:52 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vlastimil Babka

On Mon, Apr 19, 2021 at 10:25:01AM -0500, Brijesh Singh wrote:
> To my understanding, we don't group 512 4K entries into a 2M entry for
> the kernel address range. We do this for userspace addresses through
> the khugepaged daemon. If the page tables get out of sync, it will cause
> an RMP violation; patch #7 adds support to split the pages on demand.

Ok. So I haven't reviewed the whole thing but, is it possible to keep
the RMP table in sync so that you don't have to split the physmap like
you do in this patch?

I.e., if the physmap page is 2M, then you have a corresponding RMP entry
of 2M so that you don't have to split. And if you have 4K, then the
corresponding RMP entry is 4K. You get the idea...

IOW, when does that happen: "During the page table walk, we may get into
the situation where one of the pages within the large page is owned by
the guest (i.e assigned bit is set in RMP)." In which case is a 4K page
- as part of a 2M physmap mapping - owned by a guest?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
       [not found]         ` <30bff969-e8cf-a991-7660-054ea136855a@amd.com>
@ 2021-04-19 17:58           ` Dave Hansen
  2021-04-19 18:10             ` Andy Lutomirski
  2021-04-20  9:47           ` Borislav Petkov
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Hansen @ 2021-04-19 17:58 UTC (permalink / raw)
  To: Brijesh Singh, Borislav Petkov
  Cc: linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vlastimil Babka

On 4/19/21 10:46 AM, Brijesh Singh wrote:
> - guest wants to make gpa 0x1000 a shared page. To support this, we
> need to psmash the large RMP entry into 512 4K entries. The psmash
> instruction breaks the large RMP entry into 512 4K entries without
> affecting the previous validation. Now we need to force the host to
> use the 4K page level instead of the 2MB.
> 
> To my understanding, the Linux kernel fault handler does not build the
> page tables on demand for kernel addresses. All kernel addresses are
> pre-mapped at boot. Currently, I am proactively splitting the physmap
> to avoid running into a situation where the x86 page level is greater
> than the RMP page level.

In other words, if the host maps guest memory with 2M mappings, the
guest can induce page faults in the host.  The only way the host can
avoid this is to map everything with 4k mappings.

If the host does not avoid this, it could end up in the situation where
it gets page faults on access to kernel data structures.  Imagine if a
kernel stack page ended up in the same 2M mapping as a guest page.  I
*think* the next write to the kernel stack would end up double-faulting.
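
To make the failure mode concrete, here is a toy user-space model of the
check described in this thread: a hypervisor write through a 2M mapping
only passes if all 512 constituent 4K RMP entries have assigned == 0. The
struct, enum and function names are made up for this sketch; real hardware
does this check itself and raises #PF on failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_2M 512	/* 4K pages covered by one 2M mapping */

/* Hypothetical, simplified RMP entry: only the assigned bit is modelled. */
struct rmp_entry_model {
	bool assigned;
};

enum map_level { MAP_4K, MAP_2M };

/*
 * Return true if a hypervisor write to @pfn through a mapping of size
 * @level would pass the RMP check in this model.
 */
bool hv_write_passes_rmp(const struct rmp_entry_model *rmp, size_t pfn,
			 enum map_level level)
{
	size_t start = pfn, count = 1;

	assert(rmp != NULL);

	if (level == MAP_2M) {
		/* A 2M mapping pulls in every 4K entry of the aligned region. */
		start = pfn & ~(size_t)(PAGES_PER_2M - 1);
		count = PAGES_PER_2M;
	}

	for (size_t i = start; i < start + count; i++) {
		if (rmp[i].assigned)
			return false;	/* would be an RMP violation (#PF) */
	}

	return true;
}
```

In this model a single assigned 4K page makes writes to any of its 511
shared neighbours fail when the host uses a 2M mapping, while the same
writes pass through 4K mappings.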

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
  2021-04-19 17:58           ` Dave Hansen
@ 2021-04-19 18:10             ` Andy Lutomirski
  2021-04-19 18:33               ` Dave Hansen
  2021-04-19 21:25               ` Brijesh Singh
  0 siblings, 2 replies; 69+ messages in thread
From: Andy Lutomirski @ 2021-04-19 18:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Brijesh Singh, Borislav Petkov, linux-kernel, x86, kvm,
	linux-crypto, ak, herbert, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vlastimil Babka



> On Apr 19, 2021, at 10:58 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/19/21 10:46 AM, Brijesh Singh wrote:
>> - guest wants to make gpa 0x1000 a shared page. To support this, we
>> need to psmash the large RMP entry into 512 4K entries. The psmash
>> instruction breaks the large RMP entry into 512 4K entries without
>> affecting the previous validation. Now we need to force the host to
>> use the 4K page level instead of the 2MB.
>> 
>> To my understanding, the Linux kernel fault handler does not build the
>> page tables on demand for kernel addresses. All kernel addresses are
>> pre-mapped at boot. Currently, I am proactively splitting the physmap
>> to avoid running into a situation where the x86 page level is greater
>> than the RMP page level.
> 
> In other words, if the host maps guest memory with 2M mappings, the
> guest can induce page faults in the host.  The only way the host can
> avoid this is to map everything with 4k mappings.
> 
> If the host does not avoid this, it could end up in the situation where
> it gets page faults on access to kernel data structures.  Imagine if a
> kernel stack page ended up in the same 2M mapping as a guest page.  I
> *think* the next write to the kernel stack would end up double-faulting.

I’m confused by this scenario. This should only affect physical pages that are in the 2M area that contains guest memory. But, if we have a 2M direct map PMD entry that contains kernel data and guest private memory, we’re already in a situation in which the kernel touching that memory would machine check, right?

ISTM we should fully unmap any guest private page from the kernel and all host user pagetables before actually making it be a guest private page.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
  2021-04-19 18:10             ` Andy Lutomirski
@ 2021-04-19 18:33               ` Dave Hansen
  2021-04-19 18:37                 ` Andy Lutomirski
  2021-04-20  9:51                 ` Borislav Petkov
  2021-04-19 21:25               ` Brijesh Singh
  1 sibling, 2 replies; 69+ messages in thread
From: Dave Hansen @ 2021-04-19 18:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brijesh Singh, Borislav Petkov, linux-kernel, x86, kvm,
	linux-crypto, ak, herbert, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vlastimil Babka

On 4/19/21 11:10 AM, Andy Lutomirski wrote:
> I’m confused by this scenario. This should only affect physical pages
> that are in the 2M area that contains guest memory. But, if we have a
> 2M direct map PMD entry that contains kernel data and guest private
> memory, we’re already in a situation in which the kernel touching
> that memory would machine check, right?

Not machine check, but page fault.  Do machine checks even play a
special role in SEV-SNP?  I thought that was only TDX?

My point was just that you can't _easily_ do the 2M->4k kernel mapping
demotion in a page fault handler, like I think Borislav was suggesting.

> ISTM we should fully unmap any guest private page from the kernel and
> all host user pagetables before actually making it be a guest private
> page.

Yes, that sounds attractive.  Then, we'd actually know if the host
kernel was doing stray reads somehow because we'd get a fault there too.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
  2021-04-19 18:33               ` Dave Hansen
@ 2021-04-19 18:37                 ` Andy Lutomirski
  2021-04-20  9:51                 ` Borislav Petkov
  1 sibling, 0 replies; 69+ messages in thread
From: Andy Lutomirski @ 2021-04-19 18:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Brijesh Singh, Borislav Petkov, linux-kernel, x86, kvm,
	linux-crypto, ak, herbert, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vlastimil Babka



> On Apr 19, 2021, at 11:33 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/19/21 11:10 AM, Andy Lutomirski wrote:
>> I’m confused by this scenario. This should only affect physical pages
>> that are in the 2M area that contains guest memory. But, if we have a
>> 2M direct map PMD entry that contains kernel data and guest private
>> memory, we’re already in a situation in which the kernel touching
>> that memory would machine check, right?
> 
> Not machine check, but page fault.  Do machine checks even play a
> special role in SEV-SNP?  I thought that was only TDX?

Brain fart.

> 
> My point was just that you can't _easily_ do the 2M->4k kernel mapping
> demotion in a page fault handler, like I think Borislav was suggesting.

We are certainly toast if this hits the stack.  Or if it hits a page table or the GDT or IDT :). The latter delightful choices would be triple faults.

I sure hope the code we use to split a mapping is properly NMI safe.

> 
>> ISTM we should fully unmap any guest private page from the kernel and
>> all host user pagetables before actually making it be a guest private
>> page.
> 
> Yes, that sounds attractive.  Then, we'd actually know if the host
> kernel was doing stray reads somehow because we'd get a fault there too.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
  2021-04-19 18:10             ` Andy Lutomirski
  2021-04-19 18:33               ` Dave Hansen
@ 2021-04-19 21:25               ` Brijesh Singh
  1 sibling, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-04-19 21:25 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: brijesh.singh, Borislav Petkov, linux-kernel, x86, kvm,
	linux-crypto, ak, herbert, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vlastimil Babka


On 4/19/21 1:10 PM, Andy Lutomirski wrote:
>
>> On Apr 19, 2021, at 10:58 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/19/21 10:46 AM, Brijesh Singh wrote:
>>> - guest wants to make gpa 0x1000 a shared page. To support this, we
>>> need to psmash the large RMP entry into 512 4K entries. The psmash
>>> instruction breaks the large RMP entry into 512 4K entries without
>>> affecting the previous validation. Now we need to force the host to
>>> use the 4K page level instead of the 2MB.
>>>
>>> To my understanding, the Linux kernel fault handler does not build the
>>> page tables on demand for kernel addresses. All kernel addresses are
>>> pre-mapped at boot. Currently, I am proactively splitting the physmap
>>> to avoid running into a situation where the x86 page level is greater
>>> than the RMP page level.
>> In other words, if the host maps guest memory with 2M mappings, the
>> guest can induce page faults in the host.  The only way the host can
>> avoid this is to map everything with 4k mappings.
>>
>> If the host does not avoid this, it could end up in the situation where
>> it gets page faults on access to kernel data structures.  Imagine if a
>> kernel stack page ended up in the same 2M mapping as a guest page.  I
>> *think* the next write to the kernel stack would end up double-faulting.
> I’m confused by this scenario. This should only affect physical pages that are in the 2M area that contains guest memory. But, if we have a 2M direct map PMD entry that contains kernel data and guest private memory, we’re already in a situation in which the kernel touching that memory would machine check, right?

When SEV-SNP is enabled in the host, a page can be in one of the
following states:

1. Hypervisor (assigned = 0, validated = 0)

2. Firmware (assigned = 1, immutable = 1)

3. Context/VMSA (assigned = 1, vmsa = 1)

4. Guest private (assigned = 1, validated = 1)
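
The four states above could be sketched as a small classifier. This is a
user-space model only: the struct layout, the names and the order in which
the bits are tested are assumptions made for illustration, not the real RMP
entry format from the PPR.

```c
#include <stdbool.h>

/* Hypothetical view of the RMP bits mentioned in the list above. */
struct rmp_bits {
	bool assigned;
	bool validated;
	bool immutable;
	bool vmsa;
};

enum snp_page_state {
	SNP_PAGE_HYPERVISOR,	/* assigned = 0, validated = 0 */
	SNP_PAGE_FIRMWARE,	/* assigned = 1, immutable = 1 */
	SNP_PAGE_VMSA,		/* assigned = 1, vmsa = 1 */
	SNP_PAGE_GUEST_PRIVATE,	/* assigned = 1, validated = 1 */
	SNP_PAGE_OTHER,		/* any combination not listed above */
};

/* Map the bit combination to one of the listed states. */
enum snp_page_state snp_classify(struct rmp_bits e)
{
	if (!e.assigned)
		return e.validated ? SNP_PAGE_OTHER : SNP_PAGE_HYPERVISOR;

	/* Test order below is an assumption of this sketch. */
	if (e.immutable)
		return SNP_PAGE_FIRMWARE;
	if (e.vmsa)
		return SNP_PAGE_VMSA;
	if (e.validated)
		return SNP_PAGE_GUEST_PRIVATE;

	return SNP_PAGE_OTHER;
}
```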


You are right that we should never run into a situation where kernel
data and a guest page are in the same PMD entry.

During the SEV-VM creation, KVM allocates one firmware page and one VMSA
page for each vCPU. The firmware page is used by the SEV-SNP firmware
to keep some private metadata. The VMSA page contains the guest register
state. I am more concerned about the pages allocated by KVM for the
VMSA and firmware. These pages are not guest private per se. To avoid
getting into this situation, we can probably create an SNP buffer pool.
All the firmware and VMSA pages should come from this pool.

Another challenging case: KVM maps a guest page and writes to it. One
such example is the GHCB page. If the mapped address points to a PMD
entry then we will get an RMP violation.


> ISTM we should fully unmap any guest private page from the kernel and all host user pagetables before actually making it be a guest private page.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
       [not found]         ` <30bff969-e8cf-a991-7660-054ea136855a@amd.com>
  2021-04-19 17:58           ` Dave Hansen
@ 2021-04-20  9:47           ` Borislav Petkov
  1 sibling, 0 replies; 69+ messages in thread
From: Borislav Petkov @ 2021-04-20  9:47 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vlastimil Babka

On Mon, Apr 19, 2021 at 12:46:53PM -0500, Brijesh Singh wrote:
> - KVM calls alloc_page() to allocate a VMSA page. The allocator returns
> 0xffffc80000200000 (PFN 0x200, page-level=2M). The VMSA page is a private
> page, so KVM will call RMPUPDATE to add the page as a private page in the
> RMP table. While adding the RMP entry, KVM will use page level=4K.

Right, and *here* we split the 2M page on the *host* so that there's no
discrepancy between the host pagetable and the RMP. I guess your patch
does exactly that. :)

And AFAIR, set_memory.c doesn't have the functionality to coalesce 4K
pages back into the corresponding 2M page. Which means, the RMP table is
in sync, more or less.

Thx and thanks for elaborating.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table
  2021-04-19 18:33               ` Dave Hansen
  2021-04-19 18:37                 ` Andy Lutomirski
@ 2021-04-20  9:51                 ` Borislav Petkov
  1 sibling, 0 replies; 69+ messages in thread
From: Borislav Petkov @ 2021-04-20  9:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Brijesh Singh, linux-kernel, x86, kvm,
	linux-crypto, ak, herbert, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson,
	Vlastimil Babka

On Mon, Apr 19, 2021 at 11:33:08AM -0700, Dave Hansen wrote:
> My point was just that you can't _easily_ do the 2M->4k kernel mapping
> demotion in a page fault handler, like I think Borislav was suggesting.

Yeah, see my reply to Brijesh. Not in the #PF handler but when the guest
does update the RMP table on page allocation, we should split the kernel
mapping too, so that it corresponds to what's being changed in the RMP
table.

Dunno how useful it would be if we also do coalescing of 4K pages into
their corresponding 2M pages... I haven't looked at set_memory.c for a
long time and have forgotten about all details...

In any case, my main goal here is to keep the tables in sync so that we
don't have to do crazy splitting in unsafe contexts like #PF. I hope I'm
making sense...
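
A toy model of what I mean, purely as a sketch: whenever a 4K page gets
assigned in the RMP, the covering 2M direct-map entry is demoted first, so
the invariant "no 2M mapping covers an assigned 4K page" holds without
touching anything in #PF. All names are invented for this example and the
"machine" is four 2M regions in user space, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_2M	512
#define NR_REGIONS	4	/* toy machine: 4 x 2M of memory */
#define NR_PAGES	(NR_REGIONS * PAGES_PER_2M)

struct host_model {
	bool map_is_2m[NR_REGIONS];	/* direct-map level per 2M region */
	bool rmp_assigned[NR_PAGES];	/* RMP assigned bit per 4K page */
};

/*
 * Hypothetical helper: assign one 4K page to a guest. The covering
 * direct-map entry is split before the RMP entry changes, so both
 * tables agree afterwards.
 */
void model_assign_4k(struct host_model *h, size_t pfn)
{
	assert(pfn < NR_PAGES);

	h->map_is_2m[pfn / PAGES_PER_2M] = false;	/* split 2M -> 4K */
	h->rmp_assigned[pfn] = true;			/* then update RMP */
}

/* True if no 2M direct-map entry covers an assigned 4K page. */
bool model_in_sync(const struct host_model *h)
{
	for (size_t pfn = 0; pfn < NR_PAGES; pfn++) {
		if (h->rmp_assigned[pfn] && h->map_is_2m[pfn / PAGES_PER_2M])
			return false;
	}

	return true;
}
```

Setting an assigned bit behind the helper's back is exactly the out-of-sync
case that would otherwise only be noticed at fault time.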

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code
  2021-03-24 17:04 ` [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code Brijesh Singh
  2021-03-24 18:03   ` Dave Hansen
@ 2021-04-20 10:32   ` Borislav Petkov
  2021-04-20 21:37     ` Brijesh Singh
  1 sibling, 1 reply; 69+ messages in thread
From: Borislav Petkov @ 2021-04-20 10:32 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On Wed, Mar 24, 2021 at 12:04:11PM -0500, Brijesh Singh wrote:

Btw, for all your patches where the subject prefix is only "x86:":

The tip tree preferred format for patch subject prefixes is
'subsys/component:', e.g. 'x86/apic:', 'x86/mm/fault:', 'sched/fair:',
'genirq/core:'. Please do not use file names or complete file paths as
prefix. 'git log path/to/file' should give you a reasonable hint in most
cases.

The condensed patch description in the subject line should start with a
uppercase letter and should be written in imperative tone.

Please go over them and fix that up.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code
  2021-04-20 10:32   ` Borislav Petkov
@ 2021-04-20 21:37     ` Brijesh Singh
  0 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-04-20 21:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: brijesh.singh, linux-kernel, x86, kvm, linux-crypto, ak, herbert,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Tony Luck, Dave Hansen, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 4/20/21 5:32 AM, Borislav Petkov wrote:
> On Wed, Mar 24, 2021 at 12:04:11PM -0500, Brijesh Singh wrote:
>
> Btw, for all your patches where the subject prefix is only "x86:":
>
> The tip tree preferred format for patch subject prefixes is
> 'subsys/component:', e.g. 'x86/apic:', 'x86/mm/fault:', 'sched/fair:',
> 'genirq/core:'. Please do not use file names or complete file paths as
> prefix. 'git log path/to/file' should give you a reasonable hint in most
> cases.
>
> The condensed patch description in the subject line should start with a
> uppercase letter and should be written in imperative tone.
>
> Please go over them and fix that up.

Sure, I will go over each patch and fix them.

thanks

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation
  2021-03-25 15:59       ` Dave Hansen
@ 2021-04-21 12:59         ` Vlastimil Babka
  2021-04-21 13:43           ` Brijesh Singh
  0 siblings, 1 reply; 69+ messages in thread
From: Vlastimil Babka @ 2021-04-21 12:59 UTC (permalink / raw)
  To: Dave Hansen, Brijesh Singh, linux-kernel, x86, kvm, linux-crypto
  Cc: ak, herbert, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Joerg Roedel, H. Peter Anvin, Tony Luck, Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson

On 3/25/21 4:59 PM, Dave Hansen wrote:
> On 3/25/21 8:24 AM, Brijesh Singh wrote:
>> On 3/25/21 9:48 AM, Dave Hansen wrote:
>>> On 3/24/21 10:04 AM, Brijesh Singh wrote:
>>>> When SEV-SNP is enabled globally in the system, a write from the hypervisor
>>>> can raise an RMP violation. We can resolve the RMP violation by splitting
>>>> the virtual address to a lower page level.
>>>>
>>>> e.g
>>>> - guest made a page shared in the RMP entry so that the hypervisor
>>>>   can write to it.
>>>> - the hypervisor has mapped the pfn as a large page. A write access
>>>>   will cause an RMP violation if one of the pages within the 2MB region
>>>>   is a guest private page.
>>>>
>>>> The above RMP violation can be resolved by simply splitting the large
>>>> page.
>>> What if the large page is provided by hugetlbfs?
>> I was not able to find a method to split the large pages in the
>> hugetlbfs. Unfortunately, at this time a VMM cannot use the backing
>> memory from the hugetlbfs pool. An SEV-SNP aware VMM can use either
>> transparent hugepage or small pages.
> 
> That's really, really nasty.  Especially since it might not be evident
> until long after boot and the guest is killed.

I'd assume an SNP-aware QEMU would be needed in the first place and thus this
QEMU would know not to use hugetlbfs?

> It's even nastier because hugetlbfs is actually a great fit for SEV-SNP
> memory.  It's physically contiguous, so it would keep you from having to

Maybe this could be solvable by remapping the hugetlbfs page with pte's when
needed (a guest wants to share 4k out of 2MB with the host temporarily). But
certainly never as flexibly as pte-mapped THP's as the complexity of that
(refcounting tail pages etc) is significant.

> fracture the direct map all the way down to 4k, it also can't be
> reclaimed (just like all SEV memory).

About that... the whitepaper I've seen [1] mentions support for swapping guest
pages. I'd expect the same mechanism could be used for their migration -
scattering 4kB unmovable SEV pages around would be terrible for fragmentation. I
assume neither swap nor migration support is part of the patchset(s) yet?

> I think the minimal thing you can do here is to fail to add memory to
> the RMP in the first place if you can't split it.  That way, users will
> at least fail to _start_ their VM versus dying randomly for no good reason.
> 
> Even better would be to come up with a stronger contract between host
> and guest.  I really don't think the host should be exposed to random
> RMP faults on the direct map.  If the guest wants to share memory, then
> it needs to tell the host and give the host an opportunity to move the
> guest physical memory.  It might, for instance, sequester all the shared
> pages in a single spot to minimize direct map fragmentation.

Agreed, and the contract should be elaborated before going to implementation
details (patches). Could a malicious guest violate such a contract unilaterally?
I guess not, because psmash is a hypervisor instruction? And if yes, the
RMP-specific page fault handlers would be used just to kill such a guest, not
to fix things up during page fault.

> I'll let the other x86 folks chime in on this, but I really think this
> needs a different approach than what's being proposed.

Not an x86 folk, but agreed :)

[1]
https://www.amd.com/system/files/TechDocs/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf


* Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation
  2021-04-21 12:59         ` Vlastimil Babka
@ 2021-04-21 13:43           ` Brijesh Singh
  0 siblings, 0 replies; 69+ messages in thread
From: Brijesh Singh @ 2021-04-21 13:43 UTC (permalink / raw)
  To: Vlastimil Babka, Dave Hansen, linux-kernel, x86, kvm, linux-crypto
  Cc: brijesh.singh, ak, herbert, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Joerg Roedel, H. Peter Anvin, Tony Luck,
	Peter Zijlstra (Intel),
	Paolo Bonzini, Tom Lendacky, David Rientjes, Sean Christopherson


On 4/21/21 7:59 AM, Vlastimil Babka wrote:
> On 3/25/21 4:59 PM, Dave Hansen wrote:
>> On 3/25/21 8:24 AM, Brijesh Singh wrote:
>>> On 3/25/21 9:48 AM, Dave Hansen wrote:
>>>> On 3/24/21 10:04 AM, Brijesh Singh wrote:
>>>>> When SEV-SNP is enabled globally in the system, a write from the hypervisor
>>>>> can raise an RMP violation. We can resolve the RMP violation by splitting
>>>>> the virtual address to a lower page level.
>>>>>
>>>>> e.g
>>>>> - guest made a page shared in the RMP entry so that the hypervisor
>>>>>   can write to it.
>>>>> - the hypervisor has mapped the pfn as a large page. A write access
>>>>>   will cause an RMP violation if one of the pages within the 2MB region
>>>>>   is a guest private page.
>>>>>
>>>>> The above RMP violation can be resolved by simply splitting the large
>>>>> page.
>>>> What if the large page is provided by hugetlbfs?
>>> I was not able to find a method to split the large pages in the
>>> hugetlbfs. Unfortunately, at this time a VMM cannot use the backing
>>> memory from the hugetlbfs pool. An SEV-SNP aware VMM can use either
>>> transparent hugepage or small pages.
>> That's really, really nasty.  Especially since it might not be evident
>> until long after boot and the guest is killed.
> I'd assume a SNP-aware QEMU would be needed in the first place and thus this
> QEMU would know not to use hugetlbfs?


Yes, that is correct. The QEMU patches will not launch an SEV-SNP guest when
hugetlbfs is used. I can also look at adding a check in the kernel to ensure
that the backing pages do not come from hugetlbfs, so that a non-QEMU VMM
will also fail to create the SNP guest.
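
As a rough illustration of the user-space side (a sketch only, not the
actual QEMU patch; path_is_hugetlbfs() is a made-up name), a VMM could
look at the filesystem type of its memory backend path with statfs(2)
and refuse to start an SNP guest when it reports HUGETLBFS_MAGIC:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/vfs.h>      /* statfs(), struct statfs */
#include <linux/magic.h>  /* HUGETLBFS_MAGIC */

/*
 * Hypothetical helper: returns true if 'path' lives on a hugetlbfs mount.
 * Returns false when statfs() fails, so a caller that must distinguish
 * "not hugetlbfs" from "could not check" should call statfs() itself.
 */
static bool path_is_hugetlbfs(const char *path)
{
    struct statfs fs;

    assert(path != NULL);
    if (statfs(path, &fs) != 0)
        return false;
    /* Compare as unsigned long to avoid sign issues on 32-bit builds. */
    return (unsigned long)fs.f_type == (unsigned long)HUGETLBFS_MAGIC;
}
```

A VMM would call this on the memory backend path before creating the
guest. The in-kernel check mentioned above would still be needed to
cover VMMs that skip such a test.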

>
>> It's even nastier because hugetlbfs is actually a great fit for SEV-SNP
>> memory.  It's physically contiguous, so it would keep you from having to
> Maybe this could be solvable by remapping the hugetlbfs page with pte's when
> needed (a guest wants to share 4k out of 2MB with the host temporarily). But
> certainly never as flexibly as pte-mapped THP's as the complexity of that
> (refcounting tail pages etc) is significant.
>
>> fracture the direct map all the way down to 4k, it also can't be
>> reclaimed (just like all SEV memory).
> About that... the whitepaper I've seen [1] mentions support for swapping guest
> pages. I'd expect the same mechanism could be used for their migration -
> scattering 4kB unmovable SEV pages around would be terrible for fragmentation. I
> assume neither swap nor migration support is part of the patchset(s) yet?


Yes, the patches do not support swapping guest pages yet. We want to
add the support incrementally. Swap/move can be implemented after we
have the base support enabled in the kernel. Both the SEV and SNP firmware
provide PSP commands that can be used to swap guest pages. I
believe the KVM mmu notifier can use them during a page move.

>
>> I think the minimal thing you can do here is to fail to add memory to
>> the RMP in the first place if you can't split it.  That way, users will
>> at least fail to _start_ their VM versus dying randomly for no good reason.
>>
>> Even better would be to come up with a stronger contract between host
>> and guest.  I really don't think the host should be exposed to random
>> RMP faults on the direct map.  If the guest wants to share memory, then
>> it needs to tell the host and give the host an opportunity to move the
>> guest physical memory.  It might, for instance, sequester all the shared
>> pages in a single spot to minimize direct map fragmentation.
> Agreed, and the contract should be elaborated before going to implementation
> details (patches). Could a malicious guest violate such a contract unilaterally? I
> guess not, because psmash is a hypervisor instruction? And if yes, the
> RMP-specific page fault handlers would be used just to kill such guest, not to
> fix things up during page fault.

Version 2 of the GHCB specification defines a contract between the guest
and the host. When the guest is ready to share a page with the host, it
issues a page state change request to the hypervisor. The hypervisor is
responsible for updating the RMP table accordingly using the RMPUPDATE
instruction. The page state change request includes an operation field.
The operation can be one of the following:

1. Add the page to the RMP table (make the guest page private)

2. Remove the page from the RMP table (make the guest page shared)

3. Psmash - split a large RMP entry

4. Unsmash - merge small RMP entries into a large one. The unsmash
operation requires PSP assistance.
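
To make the four operations concrete, here is a toy user-space model of
the RMP entries for a single 2MB region. It is only a sketch: the names
(psc_op, toy_rmp, toy_handle_psc) are made up, the numeric values simply
follow the list order above, and nothing here calls RMPUPDATE, PSMASH or
the PSP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_2M 512  /* number of 4K pages in one 2MB region */

/* Operation field values; numbering follows the list above (assumption). */
enum psc_op {
    PSC_OP_PRIVATE = 1,  /* add page to the RMP table */
    PSC_OP_SHARED  = 2,  /* remove page from the RMP table */
    PSC_OP_PSMASH  = 3,  /* split one 2MB entry into 512 4K entries */
    PSC_OP_UNSMASH = 4,  /* merge 512 4K entries back into one 2MB entry */
};

/* Toy model of the RMP entries covering a single 2MB-aligned region. */
struct toy_rmp {
    bool large;                   /* tracked as one private 2MB entry */
    bool assigned[PAGES_PER_2M];  /* per-4K state: true = guest private */
};

/* Start with the whole region private and tracked as one 2MB entry. */
static void toy_rmp_init_private_2m(struct toy_rmp *rmp)
{
    size_t i;

    assert(rmp != NULL);
    rmp->large = true;
    for (i = 0; i < PAGES_PER_2M; i++)
        rmp->assigned[i] = true;
}

/* Apply one operation to 4K page 'idx'; 0 on success, -1 on refusal. */
static int toy_handle_psc(struct toy_rmp *rmp, size_t idx, enum psc_op op)
{
    size_t i;

    assert(rmp != NULL);
    if (idx >= PAGES_PER_2M)
        return -1;

    switch (op) {
    case PSC_OP_PRIVATE:
    case PSC_OP_SHARED:
        /* In this model a 4K change inside a 2MB entry needs a psmash first. */
        if (rmp->large)
            return -1;
        rmp->assigned[idx] = (op == PSC_OP_PRIVATE);
        return 0;
    case PSC_OP_PSMASH:
        if (!rmp->large)
            return -1;
        rmp->large = false;  /* the 512 4K entries stay private */
        return 0;
    case PSC_OP_UNSMASH:
        if (rmp->large)
            return -1;
        for (i = 0; i < PAGES_PER_2M; i++)
            if (!rmp->assigned[i])
                return -1;  /* cannot merge while any 4K page is shared */
        rmp->large = true;
        return 0;
    }
    return -1;  /* unknown operation */
}
```

In this model a guest that wants to share one 4K page out of a private
2MB region has to request a psmash first, and only then the shared
transition for that page.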

The current RMP-specific fault handler checks whether the host is
attempting to write to a guest private page. If so, it kills the guest. I
guess that covers the case where a malicious guest violates the contract
that requires it to issue the page state change request.
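
Put together with the large-page case from the patch description, the
decision in the fault path can be sketched roughly as below. This is a
simplified user-space model, not the code from the patch: the struct,
the enum and rmp_fault_decide() are made-up names, and treating every
other case as "retry" is my simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rmp_fault_action {
    RMP_FAULT_RETRY,       /* nothing to fix up in this model */
    RMP_FAULT_SPLIT,       /* split the host's large mapping, then retry */
    RMP_FAULT_KILL_GUEST,  /* host write hit a guest private page */
};

/* Hypothetical summary of what the handler learned about the fault. */
struct rmp_fault_info {
    bool write;           /* faulting access was a write */
    bool target_private;  /* the 4K page written is guest private in the RMP */
    bool host_map_large;  /* host maps the address with a 2MB page */
};

static enum rmp_fault_action rmp_fault_decide(const struct rmp_fault_info *f)
{
    assert(f != NULL);

    /* Hypervisor reads are not subject to the RMP check. */
    if (!f->write)
        return RMP_FAULT_RETRY;

    /* Write to a guest private page: contract violated, kill the guest. */
    if (f->target_private)
        return RMP_FAULT_KILL_GUEST;

    /*
     * Write to a shared page through a 2MB mapping while some other page
     * in the region is private: a page-size mismatch, fixed by splitting.
     */
    if (f->host_map_large)
        return RMP_FAULT_SPLIT;

    return RMP_FAULT_RETRY;
}
```

The order matters: the private-page check comes before the split, so a
write to a private page never gets "fixed" by splitting the mapping.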


>> I'll let the other x86 folks chime in on this, but I really think this
>> needs a different approach than what's being proposed.
> Not an x86 folk, but agreed :)
>
> [1]
> https://www.amd.com/system/files/TechDocs/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf


Thread overview: 69+ messages
2021-03-24 17:04 [RFC Part2 PATCH 00/30] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 01/30] x86: Add the host SEV-SNP initialization support Brijesh Singh
2021-03-25 14:58   ` Dave Hansen
2021-03-25 15:31     ` Brijesh Singh
2021-03-25 15:51       ` Dave Hansen
2021-03-25 17:41         ` Brijesh Singh
2021-04-14  7:27   ` Borislav Petkov
2021-04-14 22:48     ` Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 02/30] x86/sev-snp: add RMP entry lookup helpers Brijesh Singh
2021-04-15 16:57   ` Borislav Petkov
2021-04-15 18:08     ` Brijesh Singh
2021-04-15 19:50       ` Borislav Petkov
2021-04-15 22:18         ` Brijesh Singh
2021-04-15 17:03   ` Borislav Petkov
2021-04-15 18:09     ` Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 03/30] x86: add helper functions for RMPUPDATE and PSMASH instruction Brijesh Singh
2021-04-15 18:00   ` Borislav Petkov
2021-04-15 18:15     ` Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table Brijesh Singh
2021-03-25 15:17   ` Dave Hansen
2021-04-19 12:32   ` Borislav Petkov
2021-04-19 15:25     ` Brijesh Singh
2021-04-19 16:52       ` Borislav Petkov
     [not found]         ` <30bff969-e8cf-a991-7660-054ea136855a@amd.com>
2021-04-19 17:58           ` Dave Hansen
2021-04-19 18:10             ` Andy Lutomirski
2021-04-19 18:33               ` Dave Hansen
2021-04-19 18:37                 ` Andy Lutomirski
2021-04-20  9:51                 ` Borislav Petkov
2021-04-19 21:25               ` Brijesh Singh
2021-04-20  9:47           ` Borislav Petkov
2021-03-24 17:04 ` [RFC Part2 PATCH 05/30] x86: define RMP violation #PF error code Brijesh Singh
2021-03-24 18:03   ` Dave Hansen
2021-03-25 14:32     ` Brijesh Singh
2021-03-25 14:34       ` Dave Hansen
2021-04-20 10:32   ` Borislav Petkov
2021-04-20 21:37     ` Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 06/30] x86/fault: dump the RMP entry on #PF Brijesh Singh
2021-03-24 17:47   ` Andy Lutomirski
2021-03-24 20:35     ` Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation Brijesh Singh
2021-03-25 14:30   ` Dave Hansen
2021-03-25 14:48   ` Dave Hansen
2021-03-25 15:24     ` Brijesh Singh
2021-03-25 15:59       ` Dave Hansen
2021-04-21 12:59         ` Vlastimil Babka
2021-04-21 13:43           ` Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 08/30] crypto:ccp: define the SEV-SNP commands Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 09/30] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 10/30] crypto: ccp: shutdown SNP firmware on kexec Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 11/30] crypto:ccp: provide APIs to issue SEV-SNP commands Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 12/30] crypto ccp: handle the legacy SEV command when SNP is enabled Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 13/30] KVM: SVM: add initial SEV-SNP support Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 14/30] KVM: SVM: make AVIC backing, VMSA and VMCB memory allocation SNP safe Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 15/30] KVM: SVM: define new SEV_FEATURES field in the VMCB Save State Area Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 16/30] KVM: SVM: add KVM_SNP_INIT command Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 17/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_START command Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 18/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_UPDATE command Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 19/30] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 20/30] KVM: SVM: add KVM_SEV_SNP_LAUNCH_FINISH command Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 21/30] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 22/30] x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by SEV Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 23/30] KVM: X86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 24/30] KVM: X86: define new RMP check related #NPF error bits Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 25/30] KVM: X86: update page-fault trace to log the 64-bit error code Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 26/30] KVM: SVM: add support to handle GHCB GPA register VMGEXIT Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 27/30] KVM: SVM: add support to handle MSR based Page State Change VMGEXIT Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 28/30] KVM: SVM: add support to handle " Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 29/30] KVM: X86: export the kvm_zap_gfn_range() for the SNP use Brijesh Singh
2021-03-24 17:04 ` [RFC Part2 PATCH 30/30] KVM: X86: Add support to handle the RMP nested page fault Brijesh Singh
