linux-coco.lists.linux.dev archive mirror
* [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support
@ 2021-07-07 18:35 Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 01/40] KVM: SVM: Add support to handle AP reset MSR protocol Brijesh Singh
                   ` (40 more replies)
  0 siblings, 41 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

This part of the AMD Secure Nested Paging (SEV-SNP) series focuses on the
changes required in a host OS for SEV-SNP support. The series builds upon
SEV-SNP Part-1.

This series provides the basic building blocks to support booting SEV-SNP
VMs; it does not yet cover all of the security enhancements introduced by
SEV-SNP, such as interrupt protection.

The CCP driver is enhanced to provide new APIs that use the SEV-SNP
specific commands defined in the SEV-SNP firmware specification. The KVM
driver uses those APIs to create and manage SEV-SNP guests.

Version 2 of the GHCB specification introduces a new set of Non-Automatic
Exit (NAE) events that are used by SEV-SNP guests to communicate with the
hypervisor. The series provides support to handle the following new NAE
events:
- Register GHCB GPA
- Page State Change Request
- Hypervisor feature
- Guest message request
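
Some of these events travel over the GHCB MSR protocol rather than the
shared GHCB page. As a rough, illustrative sketch of that protocol's bit
layout (the low-12-bit request/response code and upper GHCBData field
follow the GHCB v2 layout; the helpers below are not kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* Low 12 bits carry the request/response code (the "info" field);
 * the remaining upper bits carry GHCBData. */
#define GHCB_MSR_INFO_MASK	0xfffULL
#define GHCB_MSR_DATA_POS	12

static uint64_t ghcb_msr_compose(uint64_t info, uint64_t data)
{
	return (info & GHCB_MSR_INFO_MASK) | (data << GHCB_MSR_DATA_POS);
}

static uint64_t ghcb_msr_info(uint64_t val)
{
	return val & GHCB_MSR_INFO_MASK;
}

static uint64_t ghcb_msr_data(uint64_t val)
{
	return val >> GHCB_MSR_DATA_POS;
}
```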

RMP checks are enforced as soon as SEV-SNP is enabled. Not every memory
access requires an RMP check. In particular, read accesses from the
hypervisor do not require RMP checks because data confidentiality is
already protected via memory encryption. When the hardware encounters an
RMP check failure, it raises a page-fault exception. If the failure is
due to a page-size mismatch, the large page is split to resolve the
fault.
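
The page-size handling can be sketched as a simple decision (bit 31 as
the RMP-violation page-fault error-code bit follows the encoding added
later in the series; the helper itself is hypothetical, not KVM code):

```c
#include <assert.h>
#include <stdbool.h>

/* #PF error-code bit indicating an RMP violation (bit 31 in this series). */
#define X86_PF_RMP	(1UL << 31)

/*
 * Hypothetical helper: an RMP fault on a hypervisor 2MB mapping whose
 * RMP entry tracks 4K ownership is resolved by splitting the mapping.
 */
static bool rmp_fault_needs_split(unsigned long error_code,
				  bool mapping_is_2mb, bool rmp_entry_is_4k)
{
	if (!(error_code & X86_PF_RMP))
		return false;		/* not an RMP violation */

	return mapping_is_2mb && rmp_entry_is_4k;
}
```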

The series does not yet provide support for the following SEV-SNP
specific NAE events:

* Interrupt security

The series is based on the commit:
 a4345a7cecfb (origin/next, next) Merge tag 'kvmarm-fixes-5.13-1' 

Changes since v3:
 * Add support for extended guest message request.
 * Add ioctl to query the SNP Platform status.
 * Add ioctl to get and set the SNP config.
 * Add check to verify that memory reserved for the RMP covers the full system RAM.
 * Start the SNP specific commands from 256 instead of 255.
 * Multiple cleanups and fixes based on review feedback.

Changes since v2:
 * Add AP creation support.
 * Drop the patch to handle the RMP fault for the kernel address.
 * Add functions to track the write access from the hypervisor.
 * Do not enable the SNP feature when IOMMU is disabled or is in passthrough mode.
 * Dump the RMP entry on RMP violation for debugging.
 * Shorten the GHCB macro names.
 * Start the SNP_INIT command id from 255 to give some gap for the legacy SEV.
 * Sync the header with the latest 0.9 SNP spec.
 
Changes since v1:
 * Add AP reset MSR protocol VMGEXIT NAE.
 * Add Hypervisor features VMGEXIT NAE.
 * Move the RMP table initialization and RMPUPDATE/PSMASH helper in
   arch/x86/kernel/sev.c.
 * Add support to map/unmap SEV legacy command buffer to firmware state when
   SNP is active.
 * Enhance PSP driver to provide helper to allocate/free memory used for the
   firmware context page.
 * Add support to handle RMP fault for the kernel address.
 * Add support to handle GUEST_REQUEST NAE event for attestation.
 * Rename RMP table lookup helper.
 * Drop typedef from rmpentry struct definition.
 * Drop SNP static key and use cpu_feature_enabled() to check whether SEV-SNP
   is active.
 * Multiple cleanups/fixes to address Boris' review feedback.

Brijesh Singh (37):
  KVM: SVM: Provide the Hypervisor Feature support VMGEXIT
  x86/cpufeatures: Add SEV-SNP CPU feature
  x86/sev: Add the host SEV-SNP initialization support
  x86/sev: Add RMP entry lookup helpers
  x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
  x86/sev: Split the physmap when adding the page in RMP table
  x86/traps: Define RMP violation #PF error code
  x86/fault: Add support to dump RMP entry on fault
  x86/fault: Add support to handle the RMP fault for user address
  crypto:ccp: Define the SEV-SNP commands
  crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP
  crypto: ccp: Shutdown SNP firmware on kexec
  crypto:ccp: Provide APIs to issue SEV-SNP commands
  crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  crypto: ccp: Handle the legacy SEV command when SNP is enabled
  crypto: ccp: Add the SNP_PLATFORM_STATUS command
  crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command
  crypto: ccp: provide APIs to query extended attestation report
  KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
  KVM: SVM: Add initial SEV-SNP support
  KVM: SVM: Add KVM_SNP_INIT command
  KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command
  KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
  KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
  KVM: X86: Add kvm_x86_ops to get the max page level for the TDP
  KVM: X86: Introduce kvm_mmu_map_tdp_page() for use by SEV
  KVM: X86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use
  KVM: X86: Define new RMP check related #NPF error bits
  KVM: X86: update page-fault trace to log the 64-bit error code
  KVM: SVM: Add support to handle GHCB GPA register VMGEXIT
  KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
  KVM: SVM: Add support to handle Page State Change VMGEXIT
  KVM: Add arch hooks to track the host write to guest memory
  KVM: X86: Export the kvm_zap_gfn_range() for the SNP use
  KVM: SVM: Add support to handle the RMP nested page fault
  KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event

Tom Lendacky (3):
  KVM: SVM: Add support to handle AP reset MSR protocol
  KVM: SVM: Use a VMSA physical address variable for populating VMCB
  KVM: SVM: Support SEV-SNP AP Creation NAE event

 Documentation/virt/coco/sevguest.rst          |   55 +
 .../virt/kvm/amd-memory-encryption.rst        |   91 ++
 arch/x86/include/asm/cpufeatures.h            |    1 +
 arch/x86/include/asm/disabled-features.h      |    8 +-
 arch/x86/include/asm/kvm_host.h               |   24 +
 arch/x86/include/asm/msr-index.h              |    6 +
 arch/x86/include/asm/sev-common.h             |   18 +
 arch/x86/include/asm/sev.h                    |    4 +-
 arch/x86/include/asm/svm.h                    |    3 +
 arch/x86/include/asm/trap_pf.h                |   18 +-
 arch/x86/include/uapi/asm/svm.h               |    4 +-
 arch/x86/kernel/cpu/amd.c                     |    3 +-
 arch/x86/kernel/sev.c                         |  217 +++
 arch/x86/kvm/lapic.c                          |    5 +-
 arch/x86/kvm/mmu.h                            |    5 +-
 arch/x86/kvm/mmu/mmu.c                        |   76 +-
 arch/x86/kvm/svm/sev.c                        | 1321 ++++++++++++++++-
 arch/x86/kvm/svm/svm.c                        |   37 +-
 arch/x86/kvm/svm/svm.h                        |   48 +-
 arch/x86/kvm/trace.h                          |    6 +-
 arch/x86/kvm/vmx/vmx.c                        |    8 +
 arch/x86/kvm/x86.c                            |   89 +-
 arch/x86/mm/fault.c                           |  149 ++
 drivers/crypto/ccp/sev-dev.c                  |  863 ++++++++++-
 drivers/crypto/ccp/sev-dev.h                  |   18 +
 drivers/crypto/ccp/sp-pci.c                   |   12 +
 include/linux/kvm_host.h                      |    3 +
 include/linux/mm.h                            |    6 +-
 include/linux/psp-sev.h                       |  347 +++++
 include/linux/sev.h                           |   76 +
 include/uapi/linux/kvm.h                      |   47 +
 include/uapi/linux/psp-sev.h                  |   60 +
 mm/memory.c                                   |   13 +
 tools/arch/x86/include/asm/cpufeatures.h      |    1 +
 virt/kvm/kvm_main.c                           |   21 +-
 35 files changed, 3571 insertions(+), 92 deletions(-)
 create mode 100644 include/linux/sev.h

-- 
2.17.1



* [PATCH Part2 RFC v4 01/40] KVM: SVM: Add support to handle AP reset MSR protocol
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-14 20:17   ` Sean Christopherson
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 02/40] KVM: SVM: Provide the Hypervisor Feature support VMGEXIT Brijesh Singh
                   ` (39 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

From: Tom Lendacky <thomas.lendacky@amd.com>

Add support for AP Reset Hold being invoked using the GHCB MSR protocol,
available in version 2 of the GHCB specification.
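
The response value composed by the handler can be sketched like this
(constants mirror those added by this patch; the helper is illustrative
only, with 0 meaning "still holding" and non-zero meaning a SIPI was
delivered):

```c
#include <assert.h>
#include <stdint.h>

/* Values added by this patch (GHCB spec v2). */
#define GHCB_MSR_AP_RESET_HOLD_RESP		0x007ULL
#define GHCB_MSR_AP_RESET_HOLD_RESULT_POS	12
#define GHCB_MSR_AP_RESET_HOLD_RESULT_MASK	((1ULL << 52) - 1)

/*
 * Compose the MSR value returned to the guest: response code in the
 * low 12 bits, result field above it.
 */
static uint64_t ap_reset_hold_resp(uint64_t result)
{
	return GHCB_MSR_AP_RESET_HOLD_RESP |
	       ((result & GHCB_MSR_AP_RESET_HOLD_RESULT_MASK)
		<< GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
}
```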

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/sev-common.h |  6 ++++
 arch/x86/kvm/svm/sev.c            | 56 ++++++++++++++++++++++++++-----
 arch/x86/kvm/svm/svm.h            |  1 +
 3 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index e14d24f0950c..466baa9cd0f5 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -45,6 +45,12 @@
 		(((unsigned long)reg & GHCB_MSR_CPUID_REG_MASK) << GHCB_MSR_CPUID_REG_POS) | \
 		(((unsigned long)fn) << GHCB_MSR_CPUID_FUNC_POS))
 
+/* AP Reset Hold */
+#define GHCB_MSR_AP_RESET_HOLD_REQ		0x006
+#define GHCB_MSR_AP_RESET_HOLD_RESP		0x007
+#define GHCB_MSR_AP_RESET_HOLD_RESULT_POS	12
+#define GHCB_MSR_AP_RESET_HOLD_RESULT_MASK	GENMASK_ULL(51, 0)
+
 /* GHCB GPA Register */
 #define GHCB_MSR_GPA_REG_REQ		0x012
 #define GHCB_MSR_GPA_REG_VALUE_POS	12
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index d93a1c368b61..7d0b98dbe523 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -57,6 +57,10 @@ module_param_named(sev_es, sev_es_enabled, bool, 0444);
 #define sev_es_enabled false
 #endif /* CONFIG_KVM_AMD_SEV */
 
+#define AP_RESET_HOLD_NONE		0
+#define AP_RESET_HOLD_NAE_EVENT		1
+#define AP_RESET_HOLD_MSR_PROTO		2
+
 static u8 sev_enc_bit;
 static DECLARE_RWSEM(sev_deactivate_lock);
 static DEFINE_MUTEX(sev_bitmap_lock);
@@ -2199,6 +2203,9 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 
 void sev_es_unmap_ghcb(struct vcpu_svm *svm)
 {
+	/* Clear any indication that the vCPU is in a type of AP Reset Hold */
+	svm->ap_reset_hold_type = AP_RESET_HOLD_NONE;
+
 	if (!svm->ghcb)
 		return;
 
@@ -2404,6 +2411,22 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				  GHCB_MSR_INFO_POS);
 		break;
 	}
+	case GHCB_MSR_AP_RESET_HOLD_REQ:
+		svm->ap_reset_hold_type = AP_RESET_HOLD_MSR_PROTO;
+		ret = kvm_emulate_ap_reset_hold(&svm->vcpu);
+
+		/*
+		 * Preset the result to a non-SIPI return and then only set
+		 * the result to non-zero when delivering a SIPI.
+		 */
+		set_ghcb_msr_bits(svm, 0,
+				  GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
+				  GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
+
+		set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
+				  GHCB_MSR_INFO_MASK,
+				  GHCB_MSR_INFO_POS);
+		break;
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
@@ -2491,6 +2514,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 		ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_IRET);
 		break;
 	case SVM_VMGEXIT_AP_HLT_LOOP:
+		svm->ap_reset_hold_type = AP_RESET_HOLD_NAE_EVENT;
 		ret = kvm_emulate_ap_reset_hold(vcpu);
 		break;
 	case SVM_VMGEXIT_AP_JUMP_TABLE: {
@@ -2628,13 +2652,29 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
 		return;
 	}
 
-	/*
-	 * Subsequent SIPI: Return from an AP Reset Hold VMGEXIT, where
-	 * the guest will set the CS and RIP. Set SW_EXIT_INFO_2 to a
-	 * non-zero value.
-	 */
-	if (!svm->ghcb)
-		return;
+	/* Subsequent SIPI */
+	switch (svm->ap_reset_hold_type) {
+	case AP_RESET_HOLD_NAE_EVENT:
+		/*
+		 * Return from an AP Reset Hold VMGEXIT, where the guest will
+		 * set the CS and RIP. Set SW_EXIT_INFO_2 to a non-zero value.
+		 */
+		ghcb_set_sw_exit_info_2(svm->ghcb, 1);
+		break;
+	case AP_RESET_HOLD_MSR_PROTO:
+		/*
+		 * Return from an AP Reset Hold VMGEXIT, where the guest will
+		 * set the CS and RIP. Set GHCB data field to a non-zero value.
+		 */
+		set_ghcb_msr_bits(svm, 1,
+				  GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
+				  GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
 
-	ghcb_set_sw_exit_info_2(svm->ghcb, 1);
+		set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
+				  GHCB_MSR_INFO_MASK,
+				  GHCB_MSR_INFO_POS);
+		break;
+	default:
+		break;
+	}
 }
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 0b89aee51b74..ad12ca26b2d8 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -174,6 +174,7 @@ struct vcpu_svm {
 	struct ghcb *ghcb;
 	struct kvm_host_map ghcb_map;
 	bool received_first_sipi;
+	unsigned int ap_reset_hold_type;
 
 	/* SEV-ES scratch area support */
 	void *ghcb_sa;
-- 
2.17.1



* [PATCH Part2 RFC v4 02/40] KVM: SVM: Provide the Hypervisor Feature support VMGEXIT
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 01/40] KVM: SVM: Add support to handle AP reset MSR protocol Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-14 20:37   ` Sean Christopherson
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 03/40] x86/cpufeatures: Add SEV-SNP CPU feature Brijesh Singh
                   ` (38 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

Version 2 of the GHCB specification introduced advertisement of the
features that are supported by the hypervisor.

Now that KVM supports version 2 of the GHCB specification, bump the
maximum supported protocol version.
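
On the guest side, the MSR-protocol variant of this exchange could be
decoded roughly as follows (the 0x081 response code and the bits-63:12
bitmap placement follow the GHCB v2 layout; the helpers are an
illustrative sketch, not guest code from this series):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GHCB v2 hypervisor-feature response code (low 12 bits of the MSR). */
#define GHCB_MSR_HV_FT_RESP	0x081ULL

static bool hv_ft_resp_valid(uint64_t msr_val)
{
	return (msr_val & 0xfff) == GHCB_MSR_HV_FT_RESP;
}

/* The supported-feature bitmap travels in GHCBData (bits 63:12). */
static uint64_t hv_ft_bitmap(uint64_t msr_val)
{
	return msr_val >> 12;
}
```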

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/uapi/asm/svm.h |  4 ++--
 arch/x86/kvm/svm/sev.c          | 14 ++++++++++++++
 arch/x86/kvm/svm/svm.h          |  3 ++-
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 9aaf0ab386ef..ba4137abf012 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -115,7 +115,7 @@
 #define SVM_VMGEXIT_AP_CREATE_ON_INIT		0
 #define SVM_VMGEXIT_AP_CREATE			1
 #define SVM_VMGEXIT_AP_DESTROY			2
-#define SVM_VMGEXIT_HYPERVISOR_FEATURES		0x8000fffd
+#define SVM_VMGEXIT_HV_FT			0x8000fffd
 #define SVM_VMGEXIT_UNSUPPORTED_EVENT		0x8000ffff
 
 #define SVM_EXIT_ERR           -1
@@ -227,7 +227,7 @@
 	{ SVM_VMGEXIT_EXT_GUEST_REQUEST,	"vmgexit_ext_guest_request" }, \
 	{ SVM_VMGEXIT_PSC,	"vmgexit_page_state_change" }, \
 	{ SVM_VMGEXIT_AP_CREATION,	"vmgexit_ap_creation" }, \
-	{ SVM_VMGEXIT_HYPERVISOR_FEATURES,	"vmgexit_hypervisor_feature" }, \
+	{ SVM_VMGEXIT_HV_FT,      "vmgexit_hypervisor_feature" }, \
 	{ SVM_EXIT_ERR,         "invalid_guest_state" }
 
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 7d0b98dbe523..b8505710c36b 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2173,6 +2173,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 	case SVM_VMGEXIT_AP_HLT_LOOP:
 	case SVM_VMGEXIT_AP_JUMP_TABLE:
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
+	case SVM_VMGEXIT_HV_FT:
 		break;
 	default:
 		goto vmgexit_err;
@@ -2427,6 +2428,13 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				  GHCB_MSR_INFO_MASK,
 				  GHCB_MSR_INFO_POS);
 		break;
+	case GHCB_MSR_HV_FT_REQ: {
+		set_ghcb_msr_bits(svm, GHCB_HV_FT_SUPPORTED,
+				GHCB_MSR_HV_FT_MASK, GHCB_MSR_HV_FT_POS);
+		set_ghcb_msr_bits(svm, GHCB_MSR_HV_FT_RESP,
+				GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
+		break;
+	}
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
@@ -2542,6 +2550,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 		ret = 1;
 		break;
 	}
+	case SVM_VMGEXIT_HV_FT: {
+		ghcb_set_sw_exit_info_2(ghcb, GHCB_HV_FT_SUPPORTED);
+
+		ret = 1;
+		break;
+	}
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index ad12ca26b2d8..5f874168551b 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -527,9 +527,10 @@ void svm_vcpu_unblocking(struct kvm_vcpu *vcpu);
 
 /* sev.c */
 
-#define GHCB_VERSION_MAX	1ULL
+#define GHCB_VERSION_MAX	2ULL
 #define GHCB_VERSION_MIN	1ULL
 
+#define GHCB_HV_FT_SUPPORTED	0
 
 extern unsigned int max_sev_asid;
 
-- 
2.17.1



* [PATCH Part2 RFC v4 03/40] x86/cpufeatures: Add SEV-SNP CPU feature
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 01/40] KVM: SVM: Add support to handle AP reset MSR protocol Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 02/40] KVM: SVM: Provide the Hypervisor Feature support VMGEXIT Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 04/40] x86/sev: Add the host SEV-SNP initialization support Brijesh Singh
                   ` (37 subsequent siblings)
  40 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

Add CPU feature detection for Secure Encrypted Virtualization with
Secure Nested Paging. This feature adds strong memory integrity
protection to help prevent malicious hypervisor-based attacks such as
data replay and memory re-mapping.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/cpufeatures.h       | 1 +
 arch/x86/kernel/cpu/amd.c                | 3 ++-
 tools/arch/x86/include/asm/cpufeatures.h | 1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ac37830ae941..433d00323b36 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -397,6 +397,7 @@
 #define X86_FEATURE_SEV			(19*32+ 1) /* AMD Secure Encrypted Virtualization */
 #define X86_FEATURE_VM_PAGE_FLUSH	(19*32+ 2) /* "" VM Page Flush MSR is supported */
 #define X86_FEATURE_SEV_ES		(19*32+ 3) /* AMD Secure Encrypted Virtualization - Encrypted State */
+#define X86_FEATURE_SEV_SNP		(19*32+4)  /* AMD Secure Encrypted Virtualization - Secure Nested Paging */
 #define X86_FEATURE_SME_COHERENT	(19*32+10) /* "" AMD hardware-enforced cache coherency */
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 0adb0341cd7c..19567f976996 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -586,7 +586,7 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 	 *	      If BIOS has not enabled SME then don't advertise the
 	 *	      SME feature (set in scattered.c).
 	 *   For SEV: If BIOS has not enabled SEV then don't advertise the
-	 *            SEV and SEV_ES feature (set in scattered.c).
+	 *            SEV, SEV_ES and SEV_SNP feature.
 	 *
 	 *   In all cases, since support for SME and SEV requires long mode,
 	 *   don't advertise the feature under CONFIG_X86_32.
@@ -618,6 +618,7 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
 clear_sev:
 		setup_clear_cpu_cap(X86_FEATURE_SEV);
 		setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
+		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
 	}
 }
 
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index cc96e26d69f7..e78ac4011ec8 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -390,6 +390,7 @@
 #define X86_FEATURE_SEV			(19*32+ 1) /* AMD Secure Encrypted Virtualization */
 #define X86_FEATURE_VM_PAGE_FLUSH	(19*32+ 2) /* "" VM Page Flush MSR is supported */
 #define X86_FEATURE_SEV_ES		(19*32+ 3) /* AMD Secure Encrypted Virtualization - Encrypted State */
+#define X86_FEATURE_SEV_SNP		(19*32+4)  /* AMD Secure Encrypted Virtualization - Secure Nested Paging */
 #define X86_FEATURE_SME_COHERENT	(19*32+10) /* "" AMD hardware-enforced cache coherency */
 
 /*
-- 
2.17.1



* [PATCH Part2 RFC v4 04/40] x86/sev: Add the host SEV-SNP initialization support
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (2 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 03/40] x86/cpufeatures: Add SEV-SNP CPU feature Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-14 21:07   ` Sean Christopherson
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 05/40] x86/sev: Add RMP entry lookup helpers Brijesh Singh
                   ` (36 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The memory integrity guarantees of SEV-SNP are enforced through a new
structure called the Reverse Map Table (RMP). The RMP is a single data
structure shared across the system that contains one entry for every 4K
page of DRAM that may be used by SEV-SNP VMs. The goal of the RMP is to
track the owner of each page of memory. Pages of memory can be owned by
the hypervisor, by a specific VM, or by the AMD-SP. See APM2
section 15.36.3 for more detail on the RMP.

The RMP table is used to enforce access control to memory. The table
itself is not directly writable by software. New CPU instructions
(RMPUPDATE, PVALIDATE, RMPADJUST) are used to manipulate RMP entries.

Based on the platform configuration, the BIOS reserves the memory used
for the RMP table. The start and end addresses of the RMP table are
queried by reading the RMP_BASE and RMP_END MSRs. If RMP_BASE and
RMP_END are not set, the SEV-SNP feature is disabled.

The SEV-SNP feature is enabled only after the RMP table is successfully
initialized.
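
The coverage check on the BIOS reservation reduces to simple arithmetic:
16 bytes of RMP entry per 4K page of system RAM and of the RMP table
itself, plus a fixed 16KB bookkeeping area. A minimal sketch, mirroring
the RMPTABLE_ENTRIES_OFFSET and <<4 used in this patch (the helper is
illustrative, not the in-kernel function):

```c
#include <assert.h>
#include <stdint.h>

#define RMP_ENTRY_SHIFT		4	/* 16-byte entry per 4K page */
#define RMPTABLE_ENTRIES_OFFSET	0x4000	/* fixed area before the entries */

/*
 * Expected RMP reservation: entries for every 4K page of system RAM and
 * of the RMP table itself, plus the fixed offset.
 */
static uint64_t rmp_expected_size(uint64_t nr_ram_pages, uint64_t rmp_sz)
{
	return (((rmp_sz >> 12) + nr_ram_pages) << RMP_ENTRY_SHIFT) +
	       RMPTABLE_ENTRIES_OFFSET;
}
```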

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/msr-index.h         |   6 +
 arch/x86/kernel/sev.c                    | 143 +++++++++++++++++++++++
 3 files changed, 156 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index b7dd944dc867..0d5c8d08185c 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -68,6 +68,12 @@
 # define DISABLE_SGX	(1 << (X86_FEATURE_SGX & 31))
 #endif
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+# define DISABLE_SEV_SNP	0
+#else
+# define DISABLE_SEV_SNP	(1 << (X86_FEATURE_SEV_SNP & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -91,7 +97,7 @@
 			 DISABLE_ENQCMD)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
-#define DISABLED_MASK19	0
+#define DISABLED_MASK19	(DISABLE_SEV_SNP)
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)
 
 #endif /* _ASM_X86_DISABLED_FEATURES_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 69ce50fa3565..e8d45929010a 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -485,6 +485,8 @@
 #define MSR_AMD64_SEV_ENABLED		BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
 #define MSR_AMD64_SEV_ES_ENABLED	BIT_ULL(MSR_AMD64_SEV_ES_ENABLED_BIT)
 #define MSR_AMD64_SEV_SNP_ENABLED	BIT_ULL(MSR_AMD64_SEV_SNP_ENABLED_BIT)
+#define MSR_AMD64_RMP_BASE		0xc0010132
+#define MSR_AMD64_RMP_END		0xc0010133
 
 #define MSR_AMD64_VIRT_SPEC_CTRL	0xc001011f
 
@@ -542,6 +544,10 @@
 #define MSR_AMD64_SYSCFG		0xc0010010
 #define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT	23
 #define MSR_AMD64_SYSCFG_MEM_ENCRYPT	BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
+#define MSR_AMD64_SYSCFG_SNP_EN_BIT		24
+#define MSR_AMD64_SYSCFG_SNP_EN		BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
+#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT	25
+#define MSR_AMD64_SYSCFG_SNP_VMPL_EN	BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
 #define MSR_K8_INT_PENDING_MSG		0xc0010055
 /* C1E active bits in int pending message */
 #define K8_INTP_C1E_ACTIVE_MASK		0x18000000
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index aa7e37631447..f9d813d498fa 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -24,6 +24,8 @@
 #include <linux/sev-guest.h>
 #include <linux/platform_device.h>
 #include <linux/io.h>
+#include <linux/io.h>
+#include <linux/iommu.h>
 
 #include <asm/cpu_entry_area.h>
 #include <asm/stacktrace.h>
@@ -40,11 +42,14 @@
 #include <asm/efi.h>
 #include <asm/cpuid-indexed.h>
 #include <asm/setup.h>
+#include <asm/iommu.h>
 
 #include "sev-internal.h"
 
 #define DR7_RESET_VALUE        0x400
 
+#define RMPTABLE_ENTRIES_OFFSET        0x4000
+
 /* For early boot hypervisor communication in SEV-ES enabled guests */
 static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
 
@@ -56,6 +61,9 @@ static struct ghcb __initdata *boot_ghcb;
 
 static u64 snp_secrets_phys;
 
+static unsigned long rmptable_start __ro_after_init;
+static unsigned long rmptable_end __ro_after_init;
+
 /* #VC handler runtime per-CPU data */
 struct sev_es_runtime_data {
 	struct ghcb ghcb_page;
@@ -2176,3 +2184,138 @@ static int __init add_snp_guest_request(void)
 	return 0;
 }
 device_initcall(add_snp_guest_request);
+
+#undef pr_fmt
+#define pr_fmt(fmt)	"SEV-SNP: " fmt
+
+static int __snp_enable(unsigned int cpu)
+{
+	u64 val;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+		return 0;
+
+	rdmsrl(MSR_AMD64_SYSCFG, val);
+
+	val |= MSR_AMD64_SYSCFG_SNP_EN;
+	val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;
+
+	wrmsrl(MSR_AMD64_SYSCFG, val);
+
+	return 0;
+}
+
+static __init void snp_enable(void *arg)
+{
+	__snp_enable(smp_processor_id());
+}
+
+static bool get_rmptable_info(u64 *start, u64 *len)
+{
+	u64 calc_rmp_sz, rmp_sz, rmp_base, rmp_end, nr_pages;
+
+	rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
+	rdmsrl(MSR_AMD64_RMP_END, rmp_end);
+
+	if (!rmp_base || !rmp_end) {
+		pr_info("Memory for the RMP table has not been reserved by BIOS\n");
+		return false;
+	}
+
+	rmp_sz = rmp_end - rmp_base + 1;
+
+	/*
+	 * Calculate the amount of memory that must be reserved by the BIOS to
+	 * address the full system RAM. The reserved memory should also cover the
+	 * RMP table itself.
+	 *
+	 * See PPR section 2.1.5.2 for more information on memory requirement.
+	 */
+	nr_pages = totalram_pages();
+	calc_rmp_sz = (((rmp_sz >> PAGE_SHIFT) + nr_pages) << 4) + RMPTABLE_ENTRIES_OFFSET;
+
+	if (calc_rmp_sz > rmp_sz) {
+		pr_info("Memory reserved for the RMP table does not cover the full system "
+			"RAM (expected 0x%llx got 0x%llx)\n", calc_rmp_sz, rmp_sz);
+		return false;
+	}
+
+	*start = rmp_base;
+	*len = rmp_sz;
+
+	pr_info("RMP table physical address 0x%016llx - 0x%016llx\n", rmp_base, rmp_end);
+
+	return true;
+}
+
+static __init int __snp_rmptable_init(void)
+{
+	u64 rmp_base, sz;
+	void *start;
+	u64 val;
+
+	if (!get_rmptable_info(&rmp_base, &sz))
+		return 1;
+
+	start = memremap(rmp_base, sz, MEMREMAP_WB);
+	if (!start) {
+		pr_err("Failed to map RMP table 0x%llx+0x%llx\n", rmp_base, sz);
+		return 1;
+	}
+
+	/*
+	 * Check if SEV-SNP is already enabled, this can happen if we are coming from
+	 * kexec boot.
+	 */
+	rdmsrl(MSR_AMD64_SYSCFG, val);
+	if (val & MSR_AMD64_SYSCFG_SNP_EN)
+		goto skip_enable;
+
+	/* Initialize the RMP table to zero */
+	memset(start, 0, sz);
+
+	/* Flush the caches to ensure that data is written before SNP is enabled. */
+	wbinvd_on_all_cpus();
+
+	/* Enable SNP on all CPUs. */
+	on_each_cpu(snp_enable, NULL, 1);
+
+skip_enable:
+	rmptable_start = (unsigned long)start;
+	rmptable_end = rmptable_start + sz;
+
+	return 0;
+}
+
+static int __init snp_rmptable_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
+		return 0;
+
+	/*
+	 * SEV-SNP support requires that the IOMMU is enabled and not
+	 * configured in passthrough mode.
+	 */
+	if (no_iommu || iommu_default_passthrough()) {
+		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
+		pr_err("IOMMU is either disabled or configured in passthrough mode.\n");
+		return 0;
+	}
+
+	if (__snp_rmptable_init()) {
+		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
+		return 1;
+	}
+
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
+
+	return 0;
+}
+
+/*
+ * This must be called after the PCI subsystem. This is because before enabling
+ * the SNP feature we need to ensure that IOMMU is not configured in the
+ * passthrough mode. The iommu_default_passthrough() is used for checking the
+ * passthrough state, and it is available after subsys_initcall().
+ */
+fs_initcall(snp_rmptable_init);
-- 
2.17.1



* [PATCH Part2 RFC v4 05/40] x86/sev: Add RMP entry lookup helpers
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (3 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 04/40] x86/sev: Add the host SEV-SNP initialization support Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-15 18:37   ` Sean Christopherson
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 06/40] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction Brijesh Singh
                   ` (35 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

snp_lookup_page_in_rmptable() can be used by the host to read the RMP
entry for a given page. The RMP entry format is documented in the AMD PPR;
see https://bugzilla.kernel.org/attachment.cgi?id=296015.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/sev.h |  4 +--
 arch/x86/kernel/sev.c      | 26 +++++++++++++++++++
 include/linux/sev.h        | 51 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 78 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/sev.h

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 6c23e694a109..9e7e7e737f55 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -9,6 +9,7 @@
 #define __ASM_ENCRYPTED_STATE_H
 
 #include <linux/types.h>
+#include <linux/sev.h>
 #include <asm/insn.h>
 #include <asm/sev-common.h>
 #include <asm/bootparam.h>
@@ -75,9 +76,6 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 /* Software defined (when rFlags.CF = 1) */
 #define PVALIDATE_FAIL_NOUPDATE		255
 
-/* RMP page size */
-#define RMP_PG_SIZE_4K			0
-
 #define RMPADJUST_VMSA_PAGE_BIT		BIT(16)
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index f9d813d498fa..1aed3d53f59f 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -49,6 +49,8 @@
 #define DR7_RESET_VALUE        0x400
 
 #define RMPTABLE_ENTRIES_OFFSET        0x4000
+#define RMPENTRY_SHIFT			8
+#define rmptable_page_offset(x)	(RMPTABLE_ENTRIES_OFFSET + (((unsigned long)x) >> RMPENTRY_SHIFT))
 
 /* For early boot hypervisor communication in SEV-ES enabled guests */
 static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
@@ -2319,3 +2321,27 @@ static int __init snp_rmptable_init(void)
  * passthrough state, and it is available after subsys_initcall().
  */
 fs_initcall(snp_rmptable_init);
+
+struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level)
+{
+	unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
+	struct rmpentry *entry, *large_entry;
+	unsigned long vaddr;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+		return NULL;
+
+	vaddr = rmptable_start + rmptable_page_offset(phys);
+	if (unlikely(vaddr > rmptable_end))
+		return NULL;
+
+	entry = (struct rmpentry *)vaddr;
+
+	/* Read the large RMP entry to get the page level used in the RMP entry. */
+	vaddr = rmptable_start + rmptable_page_offset(phys & PMD_MASK);
+	large_entry = (struct rmpentry *)vaddr;
+	*level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));
+
+	return entry;
+}
+EXPORT_SYMBOL_GPL(snp_lookup_page_in_rmptable);
diff --git a/include/linux/sev.h b/include/linux/sev.h
new file mode 100644
index 000000000000..83c89e999999
--- /dev/null
+++ b/include/linux/sev.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD Secure Encrypted Virtualization
+ *
+ * Author: Brijesh Singh <brijesh.singh@amd.com>
+ */
+
+#ifndef __LINUX_SEV_H
+#define __LINUX_SEV_H
+
+struct __packed rmpentry {
+	union {
+		struct {
+			u64	assigned	: 1,
+				pagesize	: 1,
+				immutable	: 1,
+				rsvd1		: 9,
+				gpa		: 39,
+				asid		: 10,
+				vmsa		: 1,
+				validated	: 1,
+				rsvd2		: 1;
+		} info;
+		u64 low;
+	};
+	u64 high;
+};
+
+#define rmpentry_assigned(x)	((x)->info.assigned)
+#define rmpentry_pagesize(x)	((x)->info.pagesize)
+#define rmpentry_vmsa(x)	((x)->info.vmsa)
+#define rmpentry_asid(x)	((x)->info.asid)
+#define rmpentry_validated(x)	((x)->info.validated)
+#define rmpentry_gpa(x)		((unsigned long)(x)->info.gpa)
+#define rmpentry_immutable(x)	((x)->info.immutable)
+
+/* RMP page size */
+#define RMP_PG_SIZE_4K			0
+
+#define RMP_TO_X86_PG_LEVEL(level)	(((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level);
+#else
+static inline struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level)
+{
+	return NULL;
+}
+
+#endif /* CONFIG_AMD_MEM_ENCRYPT */
+#endif /* __LINUX_SEV_H */
-- 
2.17.1



* [PATCH Part2 RFC v4 06/40] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (4 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 05/40] x86/sev: Add RMP entry lookup helpers Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-12 18:44   ` Peter Gonda
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table Brijesh Singh
                   ` (34 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The RMPUPDATE instruction writes a new RMP entry in the RMP Table. The
hypervisor will use the instruction to add pages to the RMP table. See
APM3 for details on the instruction operations.

The PSMASH instruction expands a 2MB RMP entry into a corresponding set of
contiguous 4KB-Page RMP entries. The hypervisor will use this instruction
to adjust the RMP entry without invalidating the previous RMP entry.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kernel/sev.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/sev.h   | 20 ++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 1aed3d53f59f..949efe530319 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2345,3 +2345,45 @@ struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level)
 	return entry;
 }
 EXPORT_SYMBOL_GPL(snp_lookup_page_in_rmptable);
+
+int psmash(struct page *page)
+{
+	unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
+	int ret;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+		return -ENXIO;
+
+	/* Retry if another processor is modifying the RMP entry. */
+	do {
+		/* Binutils version 2.36 supports the PSMASH mnemonic. */
+		asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
+			      : "=a"(ret)
+			      : "a"(spa)
+			      : "memory", "cc");
+	} while (ret == FAIL_INUSE);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(psmash);
+
+int rmpupdate(struct page *page, struct rmpupdate *val)
+{
+	unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
+	int ret;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+		return -ENXIO;
+
+	/* Retry if another processor is modifying the RMP entry. */
+	do {
+		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
+		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
+			     : "=a"(ret)
+			     : "a"(spa), "c"((unsigned long)val)
+			     : "memory", "cc");
+	} while (ret == FAIL_INUSE);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(rmpupdate);
diff --git a/include/linux/sev.h b/include/linux/sev.h
index 83c89e999999..bcd4d75d87c8 100644
--- a/include/linux/sev.h
+++ b/include/linux/sev.h
@@ -39,13 +39,33 @@ struct __packed rmpentry {
 
 #define RMP_TO_X86_PG_LEVEL(level)	(((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
 
+struct rmpupdate {
+	u64 gpa;
+	u8 assigned;
+	u8 pagesize;
+	u8 immutable;
+	u8 rsvd;
+	u32 asid;
+} __packed;
+
+
+/*
+ * psmash() and rmpupdate() return FAIL_INUSE when another processor is
+ * modifying the RMP entry.
+ */
+#define FAIL_INUSE              3
+
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level);
+int psmash(struct page *page);
+int rmpupdate(struct page *page, struct rmpupdate *e);
 #else
 static inline struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level)
 {
 	return NULL;
 }
+static inline int psmash(struct page *page) { return -ENXIO; }
+static inline int rmpupdate(struct page *page, struct rmpupdate *e) { return -ENXIO; }
 
 #endif /* CONFIG_AMD_MEM_ENCRYPT */
 #endif /* __LINUX_SEV_H */
-- 
2.17.1



* [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (5 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 06/40] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-14 22:25   ` Sean Christopherson
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 08/40] x86/traps: Define RMP violation #PF error code Brijesh Singh
                   ` (33 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The integrity guarantee of SEV-SNP is enforced through the RMP table.
The RMP is used in conjunction with standard x86 and IOMMU page
tables to enforce memory restrictions and page access rights. The
RMP is indexed by system physical address, and is checked at the end
of CPU and IOMMU table walks. The RMP check is enforced as soon as
SEV-SNP is enabled globally in the system. Not every memory access
requires an RMP check. In particular, read accesses from the
hypervisor do not require RMP checks because data confidentiality is
already protected via memory encryption. When the hardware encounters
an RMP check failure, it raises a page-fault exception. The RMP bit
in the fault error code can be used to determine if the fault was due
to an RMP check failure.

A write from the hypervisor goes through the RMP checks. When the
hypervisor writes to pages, hardware checks to ensure that the assigned
bit in the RMP is zero (i.e. the page is shared). If the page table
entry that gives the sPA indicates that the target page size is a large
page, then all RMP entries for the 4KB constituent pages of the target
must have the assigned bit 0. If any of those entries has the assigned
bit set, the hardware raises an RMP violation. To resolve it, split the
page table entry leading to the target page into 4K entries.

This poses a challenge for the Linux memory model. The Linux kernel
creates a direct mapping of all physical memory -- referred to as the
physmap. The physmap may contain valid mappings of guest-owned pages.
During a page table walk, a host access may run into the situation
where one of the pages within the large page is owned by the guest
(i.e. the assigned bit is set in the RMP). A write to a non-guest page
within the large page will then raise an RMP violation. Call
set_memory_4k() to split the physmap before adding the page to the RMP
table. This ensures that pages added to the RMP table are mapped as 4K
pages in the physmap.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kernel/sev.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 949efe530319..a482e01f880a 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2375,6 +2375,12 @@ int rmpupdate(struct page *page, struct rmpupdate *val)
 	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
 		return -ENXIO;
 
+	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
+	if (ret) {
+		pr_err("Failed to split physical address 0x%lx (%d)\n", spa, ret);
+		return ret;
+	}
+
 	/* Retry if another processor is modifying the RMP entry. */
 	do {
 		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
-- 
2.17.1



* [PATCH Part2 RFC v4 08/40] x86/traps: Define RMP violation #PF error code
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (6 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-15 19:02   ` Sean Christopherson
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault Brijesh Singh
                   ` (32 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

Bit 31 in the page fault error code will be set when the processor
encounters an RMP violation.

While at it, use the BIT() macro.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/trap_pf.h | 18 +++++++++++-------
 arch/x86/mm/fault.c            |  1 +
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..29f678701753 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_TRAP_PF_H
 #define _ASM_X86_TRAP_PF_H
 
+#include <vdso/bits.h>  /* BIT() macro */
+
 /*
  * Page fault error code bits:
  *
@@ -12,15 +14,17 @@
  *   bit 4 ==				1: fault was an instruction fetch
  *   bit 5 ==				1: protection keys block access
  *   bit 15 ==				1: SGX MMU page-fault
+ *   bit 31 ==				1: fault was an RMP violation
  */
 enum x86_pf_error_code {
-	X86_PF_PROT	=		1 << 0,
-	X86_PF_WRITE	=		1 << 1,
-	X86_PF_USER	=		1 << 2,
-	X86_PF_RSVD	=		1 << 3,
-	X86_PF_INSTR	=		1 << 4,
-	X86_PF_PK	=		1 << 5,
-	X86_PF_SGX	=		1 << 15,
+	X86_PF_PROT	=		BIT(0),
+	X86_PF_WRITE	=		BIT(1),
+	X86_PF_USER	=		BIT(2),
+	X86_PF_RSVD	=		BIT(3),
+	X86_PF_INSTR	=		BIT(4),
+	X86_PF_PK	=		BIT(5),
+	X86_PF_SGX	=		BIT(15),
+	X86_PF_RMP	=		BIT(31),
 };
 
 #endif /* _ASM_X86_TRAP_PF_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1c548ad00752..2715240c757e 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -545,6 +545,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
 		 !(error_code & X86_PF_PROT) ? "not-present page" :
 		 (error_code & X86_PF_RSVD)  ? "reserved bit violation" :
 		 (error_code & X86_PF_PK)    ? "protection keys violation" :
+		 (error_code & X86_PF_RMP)   ? "rmp violation" :
 					       "permissions violation");
 
 	if (!(error_code & X86_PF_USER) && user_mode(regs)) {
-- 
2.17.1



* [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (7 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 08/40] x86/traps: Define RMP violation #PF error code Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-07 19:21   ` Dave Hansen
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address Brijesh Singh
                   ` (31 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

When SEV-SNP is enabled globally, a write from the host goes through the
RMP check. If the hardware encounters a check failure, it raises a #PF
(with the RMP bit set). Dump the RMP entry to help with debugging.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/mm/fault.c | 79 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 2715240c757e..195149eae9b6 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_crash_gracefully_on_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/sev.h>			/* snp_lookup_page_in_rmptable() */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -502,6 +503,81 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index)
 		 name, index, addr, (desc.limit0 | (desc.limit1 << 16)));
 }
 
+static void dump_rmpentry(unsigned long address)
+{
+	struct rmpentry *e;
+	unsigned long pfn;
+	pgd_t *pgd;
+	pte_t *pte;
+	int level;
+
+	pgd = __va(read_cr3_pa());
+	pgd += pgd_index(address);
+
+	pte = lookup_address_in_pgd(pgd, address, &level);
+	if (unlikely(!pte))
+		return;
+
+	switch (level) {
+	case PG_LEVEL_4K: {
+		pfn = pte_pfn(*pte);
+		break;
+	}
+	case PG_LEVEL_2M: {
+		pfn = pmd_pfn(*(pmd_t *)pte);
+		break;
+	}
+	case PG_LEVEL_1G: {
+		pfn = pud_pfn(*(pud_t *)pte);
+		break;
+	}
+	case PG_LEVEL_512G: {
+		pfn = p4d_pfn(*(p4d_t *)pte);
+		break;
+	}
+	default:
+		return;
+	}
+
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
+	if (unlikely(!e))
+		return;
+
+	/*
+	 * If the RMP entry at the faulting address was not assigned, then
+	 * the dump may not provide any useful debug information. Iterate
+	 * through the entire 2MB region, and dump an RMP entry if any
+	 * bit in it is set.
+	 */
+	if (rmpentry_assigned(e)) {
+		pr_alert("RMPEntry paddr 0x%lx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx"
+			" asid=%d vmsa=%d validated=%d]\n", pfn << PAGE_SHIFT,
+			rmpentry_assigned(e), rmpentry_immutable(e), rmpentry_pagesize(e),
+			rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
+			rmpentry_validated(e));
+
+		pr_alert("RMPEntry paddr 0x%lx %016llx %016llx\n", pfn << PAGE_SHIFT,
+			e->high, e->low);
+	} else {
+		unsigned long pfn_end;
+
+		pfn = pfn & ~0x1ff;
+		pfn_end = pfn + PTRS_PER_PMD;
+
+		while (pfn < pfn_end) {
+			e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
+
+			if (unlikely(!e))
+				return;
+
+			if (e->low || e->high)
+				pr_alert("RMPEntry paddr 0x%lx: %016llx %016llx\n",
+					pfn << PAGE_SHIFT, e->high, e->low);
+			pfn++;
+		}
+	}
+}
+
 static void
 show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 {
@@ -578,6 +654,9 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
 	}
 
 	dump_pagetable(address);
+
+	if (error_code & X86_PF_RMP)
+		dump_rmpentry(address);
 }
 
 static noinline void
-- 
2.17.1



* [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (8 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-08 16:16   ` Dave Hansen
  2021-07-30 16:00   ` Vlastimil Babka
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 11/40] crypto:ccp: Define the SEV-SNP commands Brijesh Singh
                   ` (30 subsequent siblings)
  40 siblings, 2 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

When SEV-SNP is enabled globally, a write from the host goes through the
RMP check. When the host writes to pages, hardware checks the following
conditions at the end of the page walk:

1. The assigned bit in the RMP table is zero (i.e. the page is shared).
2. If the page table entry that gives the sPA indicates that the target
   page size is a large page, then all RMP entries for the 4KB
   constituent pages of the target must have the assigned bit 0.
3. The immutable bit in the RMP table is zero.

The hardware will raise a page fault if one of the above conditions is
not met. Try to resolve the fault instead of taking it again and again.
If the host attempts to write to guest private memory, then send the
SIGBUS signal to kill the process. If the page levels of the host
mapping and the RMP entry do not match, then split the host mapping to
keep the RMP and host page levels in sync.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/mm/fault.c | 69 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h  |  6 +++-
 mm/memory.c         | 13 +++++++++
 3 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 195149eae9b6..cdf48019c1a7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1281,6 +1281,58 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 }
 NOKPROBE_SYMBOL(do_kern_addr_fault);
 
+#define RMP_FAULT_RETRY		0
+#define RMP_FAULT_KILL		1
+#define RMP_FAULT_PAGE_SPLIT	2
+
+static inline size_t pages_per_hpage(int level)
+{
+	return page_level_size(level) / PAGE_SIZE;
+}
+
+static int handle_user_rmp_page_fault(unsigned long hw_error_code, unsigned long address)
+{
+	unsigned long pfn, mask;
+	int rmp_level, level;
+	struct rmpentry *e;
+	pte_t *pte;
+
+	if (unlikely(!cpu_feature_enabled(X86_FEATURE_SEV_SNP)))
+		return RMP_FAULT_KILL;
+
+	/* Get the native page level */
+	pte = lookup_address_in_mm(current->mm, address, &level);
+	if (unlikely(!pte))
+		return RMP_FAULT_KILL;
+
+	pfn = pte_pfn(*pte);
+	if (level > PG_LEVEL_4K) {
+		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
+		pfn |= (address >> PAGE_SHIFT) & mask;
+	}
+
+	/* Get the page level from the RMP entry. */
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
+	if (!e)
+		return RMP_FAULT_KILL;
+
+	/*
+	 * Check if the RMP violation is due to a guest private page access.
+	 * We cannot resolve this RMP fault, so kill the process.
+	 */
+	if (rmpentry_assigned(e))
+		return RMP_FAULT_KILL;
+
+	/*
+	 * The backing page level is higher than the RMP page level, request
+	 * to split the page.
+	 */
+	if (level > rmp_level)
+		return RMP_FAULT_PAGE_SPLIT;
+
+	return RMP_FAULT_RETRY;
+}
+
 /*
  * Handle faults in the user portion of the address space.  Nothing in here
  * should check X86_PF_USER without a specific justification: for almost
@@ -1298,6 +1350,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	vm_fault_t fault;
+	int ret;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
@@ -1378,6 +1431,22 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (error_code & X86_PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
+	/*
+	 * If it's an RMP violation, try to resolve it.
+	 */
+	if (error_code & X86_PF_RMP) {
+		ret = handle_user_rmp_page_fault(error_code, address);
+		if (ret == RMP_FAULT_PAGE_SPLIT) {
+			flags |= FAULT_FLAG_PAGE_SPLIT;
+		} else if (ret == RMP_FAULT_KILL) {
+			fault = VM_FAULT_SIGBUS;
+			do_sigbus(regs, error_code, address, fault);
+			return;
+		} else {
+			return;
+		}
+	}
+
 #ifdef CONFIG_X86_64
 	/*
 	 * Faults in the vsyscall page might need emulation.  The
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 322ec61d0da7..211dfe5d3b1d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -450,6 +450,8 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_PAGE_SPLIT: The fault was due to a page size mismatch; split
+ *  the region to a smaller page size and retry.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -481,6 +483,7 @@ enum fault_flag {
 	FAULT_FLAG_REMOTE =		1 << 7,
 	FAULT_FLAG_INSTRUCTION =	1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
+	FAULT_FLAG_PAGE_SPLIT =		1 << 10,
 };
 
 /*
@@ -520,7 +523,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
 	{ FAULT_FLAG_USER,		"USER" }, \
 	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
 	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
-	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
+	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
+	{ FAULT_FLAG_PAGE_SPLIT,	"PAGESPLIT" }
 
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/mm/memory.c b/mm/memory.c
index 730daa00952b..aef261d94e33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4407,6 +4407,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+static int handle_split_page_fault(struct vm_fault *vmf)
+{
+	if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
+		return VM_FAULT_SIGBUS;
+
+	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
+	return 0;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
@@ -4484,6 +4493,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
 		}
+
+		if (flags & FAULT_FLAG_PAGE_SPLIT)
+			return handle_split_page_fault(&vmf);
+
 		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
-- 
2.17.1



* [PATCH Part2 RFC v4 11/40] crypto:ccp: Define the SEV-SNP commands
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (9 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 12/40] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP Brijesh Singh
                   ` (29 subsequent siblings)
  40 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

AMD introduced the next generation of SEV called SEV-SNP (Secure Nested
Paging). SEV-SNP builds upon existing SEV and SEV-ES functionality
while adding new hardware security protection.

Define the commands and structures used to communicate with the AMD-SP
when creating and managing SEV-SNP guests. The SEV-SNP firmware spec
is available at developer.amd.com/sev.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c |  16 ++-
 include/linux/psp-sev.h      | 222 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/psp-sev.h |  43 +++++++
 3 files changed, 280 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 3506b2050fb8..32884d2bf4e5 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -130,7 +130,21 @@ static int sev_cmd_buffer_len(int cmd)
 	case SEV_CMD_DOWNLOAD_FIRMWARE:		return sizeof(struct sev_data_download_firmware);
 	case SEV_CMD_GET_ID:			return sizeof(struct sev_data_get_id);
 	case SEV_CMD_ATTESTATION_REPORT:	return sizeof(struct sev_data_attestation_report);
-	case SEV_CMD_SEND_CANCEL:			return sizeof(struct sev_data_send_cancel);
+	case SEV_CMD_SEND_CANCEL:		return sizeof(struct sev_data_send_cancel);
+	case SEV_CMD_SNP_GCTX_CREATE:		return sizeof(struct sev_data_snp_gctx_create);
+	case SEV_CMD_SNP_LAUNCH_START:		return sizeof(struct sev_data_snp_launch_start);
+	case SEV_CMD_SNP_LAUNCH_UPDATE:		return sizeof(struct sev_data_snp_launch_update);
+	case SEV_CMD_SNP_ACTIVATE:		return sizeof(struct sev_data_snp_activate);
+	case SEV_CMD_SNP_DECOMMISSION:		return sizeof(struct sev_data_snp_decommission);
+	case SEV_CMD_SNP_PAGE_RECLAIM:		return sizeof(struct sev_data_snp_page_reclaim);
+	case SEV_CMD_SNP_GUEST_STATUS:		return sizeof(struct sev_data_snp_guest_status);
+	case SEV_CMD_SNP_LAUNCH_FINISH:		return sizeof(struct sev_data_snp_launch_finish);
+	case SEV_CMD_SNP_DBG_DECRYPT:		return sizeof(struct sev_data_snp_dbg);
+	case SEV_CMD_SNP_DBG_ENCRYPT:		return sizeof(struct sev_data_snp_dbg);
+	case SEV_CMD_SNP_PAGE_UNSMASH:		return sizeof(struct sev_data_snp_page_unsmash);
+	case SEV_CMD_SNP_PLATFORM_STATUS:	return sizeof(struct sev_data_snp_platform_status_buf);
+	case SEV_CMD_SNP_GUEST_REQUEST:		return sizeof(struct sev_data_snp_guest_request);
+	case SEV_CMD_SNP_CONFIG:		return sizeof(struct sev_user_data_snp_config);
 	default:				return 0;
 	}
 
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index d48a7192e881..c3755099ab55 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -85,6 +85,34 @@ enum sev_cmd {
 	SEV_CMD_DBG_DECRYPT		= 0x060,
 	SEV_CMD_DBG_ENCRYPT		= 0x061,
 
+	/* SNP specific commands */
+	SEV_CMD_SNP_INIT		= 0x81,
+	SEV_CMD_SNP_SHUTDOWN		= 0x82,
+	SEV_CMD_SNP_PLATFORM_STATUS	= 0x83,
+	SEV_CMD_SNP_DF_FLUSH		= 0x84,
+	SEV_CMD_SNP_INIT_EX		= 0x85,
+	SEV_CMD_SNP_DECOMMISSION	= 0x90,
+	SEV_CMD_SNP_ACTIVATE		= 0x91,
+	SEV_CMD_SNP_GUEST_STATUS	= 0x92,
+	SEV_CMD_SNP_GCTX_CREATE		= 0x93,
+	SEV_CMD_SNP_GUEST_REQUEST	= 0x94,
+	SEV_CMD_SNP_ACTIVATE_EX		= 0x95,
+	SEV_CMD_SNP_LAUNCH_START	= 0xA0,
+	SEV_CMD_SNP_LAUNCH_UPDATE	= 0xA1,
+	SEV_CMD_SNP_LAUNCH_FINISH	= 0xA2,
+	SEV_CMD_SNP_DBG_DECRYPT		= 0xB0,
+	SEV_CMD_SNP_DBG_ENCRYPT		= 0xB1,
+	SEV_CMD_SNP_PAGE_SWAP_OUT	= 0xC0,
+	SEV_CMD_SNP_PAGE_SWAP_IN	= 0xC1,
+	SEV_CMD_SNP_PAGE_MOVE		= 0xC2,
+	SEV_CMD_SNP_PAGE_MD_INIT	= 0xC3,
+	SEV_CMD_SNP_PAGE_MD_RECLAIM	= 0xC4,
+	SEV_CMD_SNP_PAGE_RO_RECLAIM	= 0xC5,
+	SEV_CMD_SNP_PAGE_RO_RESTORE	= 0xC6,
+	SEV_CMD_SNP_PAGE_RECLAIM	= 0xC7,
+	SEV_CMD_SNP_PAGE_UNSMASH	= 0xC8,
+	SEV_CMD_SNP_CONFIG		= 0xC9,
+
 	SEV_CMD_MAX,
 };
 
@@ -510,6 +538,200 @@ struct sev_data_attestation_report {
 	u32 len;				/* In/Out */
 } __packed;
 
+/**
+ * struct sev_data_snp_platform_status_buf - SNP_PLATFORM_STATUS command params
+ *
+ * @status_paddr: physical address where the status should be copied
+ */
+struct sev_data_snp_platform_status_buf {
+	u64 status_paddr;			/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_download_firmware - SNP_DOWNLOAD_FIRMWARE command params
+ *
+ * @address: physical address of firmware image
+ * @len: length of the firmware image
+ */
+struct sev_data_snp_download_firmware {
+	u64 address;				/* In */
+	u32 len;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_gctx_create - SNP_GCTX_CREATE command params
+ *
+ * @gctx_paddr: system physical address of the page donated to firmware by
+ *		the hypervisor to contain the guest context.
+ */
+struct sev_data_snp_gctx_create {
+	u64 gctx_paddr;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_activate - SNP_ACTIVATE command params
+ *
+ * @gctx_paddr: system physical address of the guest context page
+ * @asid: ASID to bind to the guest
+ */
+struct sev_data_snp_activate {
+	u64 gctx_paddr;				/* In */
+	u32 asid;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_decommission - SNP_DECOMMISSION command params
+ *
+ * @gctx_paddr: system physical address of the guest context page
+ */
+struct sev_data_snp_decommission {
+	u64 gctx_paddr;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_launch_start - SNP_LAUNCH_START command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @policy: guest policy
+ * @ma_gctx_paddr: system physical address of migration agent
+ * @imi_en: launch flow is launching an IMI for the purpose of
+ *   guest-assisted migration.
+ * @ma_en: the guest is associated with a migration agent
+ */
+struct sev_data_snp_launch_start {
+	u64 gctx_paddr;				/* In */
+	u64 policy;				/* In */
+	u64 ma_gctx_paddr;			/* In */
+	u32 ma_en:1;				/* In */
+	u32 imi_en:1;				/* In */
+	u32 rsvd:30;
+	u8 gosvw[16];				/* In */
+} __packed;
+
+/* SNP support page type */
+enum {
+	SNP_PAGE_TYPE_NORMAL		= 0x1,
+	SNP_PAGE_TYPE_VMSA		= 0x2,
+	SNP_PAGE_TYPE_ZERO		= 0x3,
+	SNP_PAGE_TYPE_UNMEASURED	= 0x4,
+	SNP_PAGE_TYPE_SECRET		= 0x5,
+	SNP_PAGE_TYPE_CPUID		= 0x6,
+
+	SNP_PAGE_TYPE_MAX
+};
+
+/**
+ * struct sev_data_snp_launch_update - SNP_LAUNCH_UPDATE command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @imi_page: indicates that this page is part of the IMI of the guest
+ * @page_type: encoded page type
+ * @page_size: page size 0 indicates 4K and 1 indicates 2MB page
+ * @address: system physical address of destination page to encrypt
+ * @vmpl3_perms: VMPL permission mask for VMPL3
+ * @vmpl2_perms: VMPL permission mask for VMPL2
+ * @vmpl1_perms: VMPL permission mask for VMPL1
+ */
+struct sev_data_snp_launch_update {
+	u64 gctx_paddr;				/* In */
+	u32 page_size:1;			/* In */
+	u32 page_type:3;			/* In */
+	u32 imi_page:1;				/* In */
+	u32 rsvd:27;
+	u32 rsvd2;
+	u64 address;				/* In */
+	u32 rsvd3:8;
+	u32 vmpl3_perms:8;			/* In */
+	u32 vmpl2_perms:8;			/* In */
+	u32 vmpl1_perms:8;			/* In */
+	u32 rsvd4;
+} __packed;
+
+/**
+ * struct sev_data_snp_launch_finish - SNP_LAUNCH_FINISH command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ */
+struct sev_data_snp_launch_finish {
+	u64 gctx_paddr;
+	u64 id_block_paddr;
+	u64 id_auth_paddr;
+	u64 id_block_en:1;
+	u64 auth_key_en:1;
+	u64 rsvd:62;
+	u8 host_data[32];
+} __packed;
+
+/**
+ * struct sev_data_snp_guest_status - SNP_GUEST_STATUS command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @address: system physical address of guest status page
+ */
+struct sev_data_snp_guest_status {
+	u64 gctx_paddr;
+	u64 address;
+} __packed;
+
+/**
+ * struct sev_data_snp_page_reclaim - SNP_PAGE_RECLAIM command params
+ *
+ * @paddr: system physical address of page to be reclaimed. Bit 0 indicates
+ *	the page size: 0h indicates a 4KB page and 1h indicates a 2MB page.
+ */
+struct sev_data_snp_page_reclaim {
+	u64 paddr;
+} __packed;
+
+/**
+ * struct sev_data_snp_page_unsmash - SNP_PAGE_UNSMASH command params
+ *
+ * @paddr: system physical address of page to be unsmashed. Bit 0 indicates
+ *	the page size: 0h indicates a 4KB page and 1h indicates a 2MB page.
+ */
+struct sev_data_snp_page_unsmash {
+	u64 paddr;
+} __packed;
+
+/**
+ * struct sev_data_snp_dbg - SNP_DBG_ENCRYPT/SNP_DBG_DECRYPT command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @src_addr: source address of data to operate on
+ * @dst_addr: destination address of data to operate on
+ * @len: length of data to operate on
+ */
+struct sev_data_snp_dbg {
+	u64 gctx_paddr;				/* In */
+	u64 src_addr;				/* In */
+	u64 dst_addr;				/* In */
+	u32 len;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_guest_request - SNP_GUEST_REQUEST command params
+ *
+ * @gctx_paddr: system physical address of guest context page
+ * @req_paddr: system physical address of request page
+ * @res_paddr: system physical address of response page
+ */
+struct sev_data_snp_guest_request {
+	u64 gctx_paddr;				/* In */
+	u64 req_paddr;				/* In */
+	u64 res_paddr;				/* In */
+} __packed;
+
+/**
+ * struct sev_data_snp_init_ex - SNP_INIT_EX command params
+ *
+ * @init_rmp: indicates that the RMP should be initialized.
+ */
+struct sev_data_snp_init_ex {
+	u32 init_rmp:1;
+	u32 rsvd:31;
+	u8 rsvd1[60];
+} __packed;
+
 #ifdef CONFIG_CRYPTO_DEV_SP_PSP
 
 /**
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index 91b4c63d5cbf..226de6330a18 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -61,6 +61,13 @@ typedef enum {
 	SEV_RET_INVALID_PARAM,
 	SEV_RET_RESOURCE_LIMIT,
 	SEV_RET_SECURE_DATA_INVALID,
+	SEV_RET_INVALID_PAGE_SIZE,
+	SEV_RET_INVALID_PAGE_STATE,
+	SEV_RET_INVALID_MDATA_ENTRY,
+	SEV_RET_INVALID_PAGE_OWNER,
+	SEV_RET_INVALID_PAGE_AEAD_OFLOW,
+	SEV_RET_RMP_INIT_REQUIRED,
+
 	SEV_RET_MAX,
 } sev_ret_code;
 
@@ -147,6 +154,42 @@ struct sev_user_data_get_id2 {
 	__u32 length;				/* In/Out */
 } __packed;
 
+/**
+ * struct sev_user_data_snp_status - SNP status
+ *
+ * @major: API major version
+ * @minor: API minor version
+ * @state: current platform state
+ * @build: firmware build id for the API version
+ * @guest_count: the number of guests currently managed by the firmware
+ * @tcb_version: current TCB version
+ */
+struct sev_user_data_snp_status {
+	__u8 api_major;		/* Out */
+	__u8 api_minor;		/* Out */
+	__u8 state;		/* Out */
+	__u8 rsvd;
+	__u32 build_id;		/* Out */
+	__u32 rsvd1;
+	__u32 guest_count;	/* Out */
+	__u64 tcb_version;	/* Out */
+	__u64 rsvd2;
+} __packed;
+
+/*
+ * struct sev_user_data_snp_config - system wide configuration value for SNP.
+ *
+ * @reported_tcb: The TCB version to report in the guest attestation report.
+ * @mask_chip_id: Indicates that the CHIP_ID field in the attestation report
+ * will always be zero.
+ */
+struct sev_user_data_snp_config {
+	__u64 reported_tcb;     /* In */
+	__u32 mask_chip_id;     /* In */
+	__u8 rsvd[52];
+} __packed;
+
+
 /**
  * struct sev_issue_cmd - SEV ioctl parameters
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 12/40] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (10 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 11/40] crypto:ccp: Define the SEV-SNP commands Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 13/40] crypto: ccp: Shutdown SNP firmware on kexec Brijesh Singh
                   ` (28 subsequent siblings)
  40 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

Before SNP VMs can be launched, the platform must be appropriately
configured and initialized. Platform initialization is accomplished via
the SNP_INIT command.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 114 +++++++++++++++++++++++++++++++++--
 drivers/crypto/ccp/sev-dev.h |   2 +
 include/linux/psp-sev.h      |  16 +++++
 3 files changed, 127 insertions(+), 5 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 32884d2bf4e5..d3c717bb5b50 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -591,6 +591,95 @@ static int sev_update_firmware(struct device *dev)
 	return ret;
 }
 
+static void snp_set_hsave_pa(void *arg)
+{
+	wrmsrl(MSR_VM_HSAVE_PA, 0);
+}
+
+static int __sev_snp_init_locked(int *error)
+{
+	struct psp_device *psp = psp_master;
+	struct sev_device *sev;
+	int rc = 0;
+
+	if (!psp || !psp->sev_data)
+		return -ENODEV;
+
+	sev = psp->sev_data;
+
+	if (sev->snp_inited)
+		return 0;
+
+	/*
+	 * SNP_INIT requires MSR_VM_HSAVE_PA to be set to 0h across all
+	 * cores.
+	 */
+	on_each_cpu(snp_set_hsave_pa, NULL, 1);
+
+	/* Prepare for first SEV guest launch after INIT */
+	wbinvd_on_all_cpus();
+
+	/* Issue the SNP_INIT firmware command. */
+	rc = __sev_do_cmd_locked(SEV_CMD_SNP_INIT, NULL, error);
+	if (rc)
+		return rc;
+
+	sev->snp_inited = true;
+	dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
+
+	return rc;
+}
+
+int sev_snp_init(int *error)
+{
+	int rc;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+		return -ENODEV;
+
+	mutex_lock(&sev_cmd_mutex);
+	rc = __sev_snp_init_locked(error);
+	mutex_unlock(&sev_cmd_mutex);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(sev_snp_init);
+
+static int __sev_snp_shutdown_locked(int *error)
+{
+	struct sev_device *sev = psp_master->sev_data;
+	int ret;
+
+	if (!sev->snp_inited)
+		return 0;
+
+	/* SHUTDOWN requires the DF_FLUSH */
+	wbinvd_on_all_cpus();
+	__sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, NULL);
+
+	ret = __sev_do_cmd_locked(SEV_CMD_SNP_SHUTDOWN, NULL, error);
+	if (ret) {
+		dev_err(sev->dev, "SEV-SNP firmware shutdown failed\n");
+		return ret;
+	}
+
+	sev->snp_inited = false;
+	dev_dbg(sev->dev, "SEV-SNP firmware shutdown\n");
+
+	return ret;
+}
+
+static int sev_snp_shutdown(int *error)
+{
+	int rc;
+
+	mutex_lock(&sev_cmd_mutex);
+	rc = __sev_snp_shutdown_locked(error);
+	mutex_unlock(&sev_cmd_mutex);
+
+	return rc;
+}
+
 static int sev_ioctl_do_pek_import(struct sev_issue_cmd *argp, bool writable)
 {
 	struct sev_device *sev = psp_master->sev_data;
@@ -1095,6 +1184,21 @@ void sev_pci_init(void)
 			 "SEV: TMR allocation failed, SEV-ES support unavailable\n");
 	}
 
+	/*
+	 * If the boot CPU supports SNP, then first attempt to initialize
+	 * the SNP firmware.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SEV_SNP)) {
+		rc = sev_snp_init(&error);
+		if (rc) {
+			/*
+			 * If we failed to INIT SNP then don't abort the probe.
+			 * Continue to initialize the legacy SEV firmware.
+			 */
+			dev_err(sev->dev, "SEV-SNP: failed to INIT error %#x\n", error);
+		}
+	}
+
 	/* Initialize the platform */
 	rc = sev_platform_init(&error);
 	if (rc && (error == SEV_RET_SECURE_DATA_INVALID)) {
@@ -1109,13 +1213,11 @@ void sev_pci_init(void)
 		rc = sev_platform_init(&error);
 	}
 
-	if (rc) {
+	if (rc)
 		dev_err(sev->dev, "SEV: failed to INIT error %#x\n", error);
-		return;
-	}
 
-	dev_info(sev->dev, "SEV API:%d.%d build:%d\n", sev->api_major,
-		 sev->api_minor, sev->build);
+	dev_info(sev->dev, "SEV%s API:%d.%d build:%d\n", sev->snp_inited ?
+		"-SNP" : "", sev->api_major, sev->api_minor, sev->build);
 
 	return;
 
@@ -1138,4 +1240,6 @@ void sev_pci_exit(void)
 			   get_order(SEV_ES_TMR_SIZE));
 		sev_es_tmr = NULL;
 	}
+
+	sev_snp_shutdown(NULL);
 }
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index 666c21eb81ab..186ad20cbd24 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -52,6 +52,8 @@ struct sev_device {
 	u8 build;
 
 	void *cmd_buf;
+
+	bool snp_inited;
 };
 
 int sev_dev_init(struct psp_device *psp);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index c3755099ab55..1b53e8782250 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -748,6 +748,20 @@ struct sev_data_snp_init_ex {
  */
 int sev_platform_init(int *error);
 
+/**
+ * sev_snp_init - perform SEV SNP_INIT command
+ *
+ * @error: SEV command return code
+ *
+ * Returns:
+ * 0 if the SEV successfully processed the command
+ * -%ENODEV    if the SEV device is not available
+ * -%ENOTSUPP  if the SEV does not support SNP
+ * -%ETIMEDOUT if the SEV command timed out
+ * -%EIO       if the SEV returned a non-zero return code
+ */
+int sev_snp_init(int *error);
+
 /**
  * sev_platform_status - perform SEV PLATFORM_STATUS command
  *
@@ -855,6 +869,8 @@ sev_platform_status(struct sev_user_data_status *status, int *error) { return -E
 
 static inline int sev_platform_init(int *error) { return -ENODEV; }
 
+static inline int sev_snp_init(int *error) { return -ENODEV; }
+
 static inline int
 sev_guest_deactivate(struct sev_data_deactivate *data, int *error) { return -ENODEV; }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 13/40] crypto: ccp: Shutdown SNP firmware on kexec
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (11 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 12/40] crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 14/40] crypto:ccp: Provide APIs to issue SEV-SNP commands Brijesh Singh
                   ` (27 subsequent siblings)
  40 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

When the kernel is getting ready to kexec, it calls device_shutdown()
to allow drivers to clean up before the kexec. If the SEV firmware is
initialized, shut it down before kexec'ing the new kernel.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 53 +++++++++++++++++-------------------
 drivers/crypto/ccp/sp-pci.c  | 12 ++++++++
 2 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index d3c717bb5b50..84c91bab00bd 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -310,6 +310,9 @@ static int __sev_platform_shutdown_locked(int *error)
 	struct sev_device *sev = psp_master->sev_data;
 	int ret;
 
+	if (sev->state == SEV_STATE_UNINIT)
+		return 0;
+
 	ret = __sev_do_cmd_locked(SEV_CMD_SHUTDOWN, NULL, error);
 	if (ret)
 		return ret;
@@ -1118,6 +1121,22 @@ int sev_dev_init(struct psp_device *psp)
 	return ret;
 }
 
+static void sev_firmware_shutdown(struct sev_device *sev)
+{
+	sev_platform_shutdown(NULL);
+
+	if (sev_es_tmr) {
+		/* The TMR area was encrypted, flush it from the cache */
+		wbinvd_on_all_cpus();
+
+		free_pages((unsigned long)sev_es_tmr,
+			   get_order(SEV_ES_TMR_SIZE));
+		sev_es_tmr = NULL;
+	}
+
+	sev_snp_shutdown(NULL);
+}
+
 void sev_dev_destroy(struct psp_device *psp)
 {
 	struct sev_device *sev = psp->sev_data;
@@ -1125,6 +1144,8 @@ void sev_dev_destroy(struct psp_device *psp)
 	if (!sev)
 		return;
 
+	sev_firmware_shutdown(sev);
+
 	if (sev->misc)
 		kref_put(&misc_dev->refcount, sev_exit);
 
@@ -1155,21 +1176,6 @@ void sev_pci_init(void)
 	if (sev_get_api_version())
 		goto err;
 
-	/*
-	 * If platform is not in UNINIT state then firmware upgrade and/or
-	 * platform INIT command will fail. These command require UNINIT state.
-	 *
-	 * In a normal boot we should never run into case where the firmware
-	 * is not in UNINIT state on boot. But in case of kexec boot, a reboot
-	 * may not go through a typical shutdown sequence and may leave the
-	 * firmware in INIT or WORKING state.
-	 */
-
-	if (sev->state != SEV_STATE_UNINIT) {
-		sev_platform_shutdown(NULL);
-		sev->state = SEV_STATE_UNINIT;
-	}
-
 	if (sev_version_greater_or_equal(0, 15) &&
 	    sev_update_firmware(sev->dev) == 0)
 		sev_get_api_version();
@@ -1227,19 +1233,10 @@ void sev_pci_init(void)
 
 void sev_pci_exit(void)
 {
-	if (!psp_master->sev_data)
-		return;
-
-	sev_platform_shutdown(NULL);
-
-	if (sev_es_tmr) {
-		/* The TMR area was encrypted, flush it from the cache */
-		wbinvd_on_all_cpus();
+	struct sev_device *sev = psp_master->sev_data;
 
-		free_pages((unsigned long)sev_es_tmr,
-			   get_order(SEV_ES_TMR_SIZE));
-		sev_es_tmr = NULL;
-	}
+	if (!sev)
+		return;
 
-	sev_snp_shutdown(NULL);
+	sev_firmware_shutdown(sev);
 }
diff --git a/drivers/crypto/ccp/sp-pci.c b/drivers/crypto/ccp/sp-pci.c
index f468594ef8af..fb1b499bf04d 100644
--- a/drivers/crypto/ccp/sp-pci.c
+++ b/drivers/crypto/ccp/sp-pci.c
@@ -239,6 +239,17 @@ static int sp_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	return ret;
 }
 
+static void sp_pci_shutdown(struct pci_dev *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct sp_device *sp = dev_get_drvdata(dev);
+
+	if (!sp)
+		return;
+
+	sp_destroy(sp);
+}
+
 static void sp_pci_remove(struct pci_dev *pdev)
 {
 	struct device *dev = &pdev->dev;
@@ -369,6 +380,7 @@ static struct pci_driver sp_pci_driver = {
 	.id_table = sp_pci_table,
 	.probe = sp_pci_probe,
 	.remove = sp_pci_remove,
+	.shutdown = sp_pci_shutdown,
 	.driver.pm = &sp_pci_pm_ops,
 };
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 14/40] crypto:ccp: Provide APIs to issue SEV-SNP commands
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (12 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 13/40] crypto: ccp: Shutdown SNP firmware on kexec Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-08 18:56   ` Dr. David Alan Gilbert
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled Brijesh Singh
                   ` (26 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

Provide the APIs for the hypervisor to manage an SEV-SNP guest. The
SEV-SNP commands are defined in the SEV-SNP firmware specification.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 24 ++++++++++++
 include/linux/psp-sev.h      | 74 ++++++++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 84c91bab00bd..ad9a0c8111e0 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1017,6 +1017,30 @@ int sev_guest_df_flush(int *error)
 }
 EXPORT_SYMBOL_GPL(sev_guest_df_flush);
 
+int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error)
+{
+	return sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, data, error);
+}
+EXPORT_SYMBOL_GPL(snp_guest_decommission);
+
+int snp_guest_df_flush(int *error)
+{
+	return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error);
+}
+EXPORT_SYMBOL_GPL(snp_guest_df_flush);
+
+int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
+{
+	return sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, data, error);
+}
+EXPORT_SYMBOL_GPL(snp_guest_page_reclaim);
+
+int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
+{
+	return sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, data, error);
+}
+EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);
+
 static void sev_exit(struct kref *ref)
 {
 	misc_deregister(&misc_dev->misc);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 1b53e8782250..63ef766cbd7a 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -860,6 +860,65 @@ int sev_guest_df_flush(int *error);
  */
 int sev_guest_decommission(struct sev_data_decommission *data, int *error);
 
+/**
+ * snp_guest_df_flush - perform SNP DF_FLUSH command
+ *
+ * @error: SEV command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV    if the sev device is not available
+ * -%ENOTSUPP  if the SEV does not support SNP
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO       if the sev returned a non-zero return code
+ */
+int snp_guest_df_flush(int *error);
+
+/**
+ * snp_guest_decommission - perform SNP_DECOMMISSION command
+ *
+ * @data: sev_data_snp_decommission structure to be processed
+ * @error: SEV command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV    if the sev device is not available
+ * -%ENOTSUPP  if the SEV does not support SNP
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO       if the sev returned a non-zero return code
+ */
+int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error);
+
+/**
+ * snp_guest_page_reclaim - perform SNP_PAGE_RECLAIM command
+ *
+ * @data: sev_data_snp_page_reclaim structure to be processed
+ * @error: SEV command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV    if the sev device is not available
+ * -%ENOTSUPP  if the SEV does not support SNP
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO       if the sev returned a non-zero return code
+ */
+int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error);
+
+/**
+ * snp_guest_dbg_decrypt - perform SEV SNP_DBG_DECRYPT command
+ *
+ * @error: SEV command return code
+ *
+ * Returns:
+ * 0 if the sev successfully processed the command
+ * -%ENODEV    if the sev device is not available
+ * -%ENOTSUPP  if the SEV does not support SNP
+ * -%ETIMEDOUT if the sev command timed out
+ * -%EIO       if the sev returned a non-zero return code
+ */
+int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);
+
+
 void *psp_copy_user_blob(u64 uaddr, u32 len);
 
 #else	/* !CONFIG_CRYPTO_DEV_SP_PSP */
@@ -887,6 +946,21 @@ sev_issue_cmd_external_user(struct file *filep, unsigned int id, void *data, int
 
 static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_PTR(-EINVAL); }
 
+static inline int
+snp_guest_decommission(struct sev_data_snp_decommission *data, int *error) { return -ENODEV; }
+
+static inline int snp_guest_df_flush(int *error) { return -ENODEV; }
+
+static inline int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
+{
+	return -ENODEV;
+}
+
+static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
+{
+	return -ENODEV;
+}
+
 #endif	/* CONFIG_CRYPTO_DEV_SP_PSP */
 
 #endif	/* __PSP_SEV_H__ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (13 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 14/40] crypto:ccp: Provide APIs to issue SEV-SNP commands Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-14 13:22   ` Marc Orr
  2021-07-15 23:48   ` Sean Christopherson
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 16/40] crypto: ccp: Handle the legacy SEV command " Brijesh Singh
                   ` (25 subsequent siblings)
  40 siblings, 2 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The behavior and requirements for the SEV-legacy commands are altered
when the SNP firmware is in the INIT state. See the SEV-SNP firmware
specification for more details.

When SNP is in the INIT state, all memory that the SEV-legacy commands
cause the firmware to write must be in the firmware state. The TMR memory
is allocated by the host but updated by the firmware, so it must be in
the firmware state. Additionally, the TMR memory must be 2MB aligned
instead of 1MB, and the TMR length needs to be 2MB instead of 1MB. The
helpers __snp_{alloc,free}_firmware_pages() can be used for allocating
and freeing the memory used by the firmware.

While at it, provide an API that others can use to allocate a page for
use by the firmware. The immediate user for this API will be the KVM
driver, which needs to allocate a firmware context page during guest
creation; the context page is updated by the firmware. See the SEV-SNP
specification for further details.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 144 +++++++++++++++++++++++++++++++----
 include/linux/psp-sev.h      |  11 +++
 2 files changed, 142 insertions(+), 13 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index ad9a0c8111e0..bb07c68834a6 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -54,6 +54,14 @@ static int psp_timeout;
 #define SEV_ES_TMR_SIZE		(1024 * 1024)
 static void *sev_es_tmr;
 
+/* When SEV-SNP is enabled, the TMR needs to be 2MB aligned and 2MB in size. */
+#define SEV_SNP_ES_TMR_SIZE	(2 * 1024 * 1024)
+
+static size_t sev_es_tmr_size = SEV_ES_TMR_SIZE;
+
+static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret);
+static int sev_do_cmd(int cmd, void *data, int *psp_ret);
+
 static inline bool sev_version_greater_or_equal(u8 maj, u8 min)
 {
 	struct sev_device *sev = psp_master->sev_data;
@@ -151,6 +159,112 @@ static int sev_cmd_buffer_len(int cmd)
 	return 0;
 }
 
+static int snp_reclaim_page(struct page *page, bool locked)
+{
+	struct sev_data_snp_page_reclaim data = {};
+	int ret, err;
+
+	data.paddr = page_to_pfn(page) << PAGE_SHIFT;
+
+	if (locked)
+		ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
+	else
+		ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
+
+	return ret;
+}
+
+static int snp_set_rmptable_state(unsigned long paddr, int npages,
+				  struct rmpupdate *val, bool locked, bool need_reclaim)
+{
+	unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
+	unsigned long pfn_end = pfn + npages;
+	struct psp_device *psp = psp_master;
+	struct sev_device *sev;
+	int rc;
+
+	if (!psp || !psp->sev_data)
+		return 0;
+
+	/* If SEV-SNP is initialized then add the page to the RMP table. */
+	sev = psp->sev_data;
+	if (!sev->snp_inited)
+		return 0;
+
+	while (pfn < pfn_end) {
+		if (need_reclaim)
+			if (snp_reclaim_page(pfn_to_page(pfn), locked))
+				return -EFAULT;
+
+		rc = rmpupdate(pfn_to_page(pfn), val);
+		if (rc)
+			return rc;
+
+		pfn++;
+	}
+
+	return 0;
+}
+
+static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int order, bool locked)
+{
+	struct rmpupdate val = {};
+	unsigned long paddr;
+	struct page *page;
+
+	page = alloc_pages(gfp_mask, order);
+	if (!page)
+		return NULL;
+
+	val.assigned = 1;
+	val.immutable = 1;
+	paddr = __pa((unsigned long)page_address(page));
+
+	if (snp_set_rmptable_state(paddr, 1 << order, &val, locked, false)) {
+		pr_warn("Failed to set page state (leaking it)\n");
+		return NULL;
+	}
+
+	return page;
+}
+
+void *snp_alloc_firmware_page(gfp_t gfp_mask)
+{
+	struct page *page;
+
+	page = __snp_alloc_firmware_pages(gfp_mask, 0, false);
+
+	return page ? page_address(page) : NULL;
+}
+EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);
+
+static void __snp_free_firmware_pages(struct page *page, int order, bool locked)
+{
+	struct rmpupdate val = {};
+	unsigned long paddr;
+
+	if (!page)
+		return;
+
+	paddr = __pa((unsigned long)page_address(page));
+
+	if (snp_set_rmptable_state(paddr, 1 << order, &val, locked, true)) {
+		pr_warn("Failed to set page state (leaking it)\n");
+		return;
+	}
+
+	__free_pages(page, order);
+}
+
+void snp_free_firmware_page(void *addr)
+{
+	if (!addr)
+		return;
+
+	__snp_free_firmware_pages(virt_to_page(addr), 0, false);
+}
+EXPORT_SYMBOL(snp_free_firmware_page);
+
 static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
 {
 	struct psp_device *psp = psp_master;
@@ -273,7 +387,7 @@ static int __sev_platform_init_locked(int *error)
 
 		data.flags |= SEV_INIT_FLAGS_SEV_ES;
 		data.tmr_address = tmr_pa;
-		data.tmr_len = SEV_ES_TMR_SIZE;
+		data.tmr_len = sev_es_tmr_size;
 	}
 
 	rc = __sev_do_cmd_locked(SEV_CMD_INIT, &data, error);
@@ -630,6 +744,8 @@ static int __sev_snp_init_locked(int *error)
 	sev->snp_inited = true;
 	dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
 
+	sev_es_tmr_size = SEV_SNP_ES_TMR_SIZE;
+
 	return rc;
 }
 
@@ -1153,8 +1269,10 @@ static void sev_firmware_shutdown(struct sev_device *sev)
 		/* The TMR area was encrypted, flush it from the cache */
 		wbinvd_on_all_cpus();
 
-		free_pages((unsigned long)sev_es_tmr,
-			   get_order(SEV_ES_TMR_SIZE));
+
+		__snp_free_firmware_pages(virt_to_page(sev_es_tmr),
+					  get_order(sev_es_tmr_size),
+					  false);
 		sev_es_tmr = NULL;
 	}
 
@@ -1204,16 +1322,6 @@ void sev_pci_init(void)
 	    sev_update_firmware(sev->dev) == 0)
 		sev_get_api_version();
 
-	/* Obtain the TMR memory area for SEV-ES use */
-	tmr_page = alloc_pages(GFP_KERNEL, get_order(SEV_ES_TMR_SIZE));
-	if (tmr_page) {
-		sev_es_tmr = page_address(tmr_page);
-	} else {
-		sev_es_tmr = NULL;
-		dev_warn(sev->dev,
-			 "SEV: TMR allocation failed, SEV-ES support unavailable\n");
-	}
-
 	/*
 	 * If boot CPU supports the SNP, then first attempt to initialize
 	 * the SNP firmware.
@@ -1229,6 +1337,16 @@ void sev_pci_init(void)
 		}
 	}
 
+	/* Obtain the TMR memory area for SEV-ES use */
+	tmr_page = __snp_alloc_firmware_pages(GFP_KERNEL, get_order(sev_es_tmr_size), false);
+	if (tmr_page) {
+		sev_es_tmr = page_address(tmr_page);
+	} else {
+		sev_es_tmr = NULL;
+		dev_warn(sev->dev,
+			 "SEV: TMR allocation failed, SEV-ES support unavailable\n");
+	}
+
 	/* Initialize the platform */
 	rc = sev_platform_init(&error);
 	if (rc && (error == SEV_RET_SECURE_DATA_INVALID)) {
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index 63ef766cbd7a..b72a74f6a4e9 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -12,6 +12,8 @@
 #ifndef __PSP_SEV_H__
 #define __PSP_SEV_H__
 
+#include <linux/sev.h>
+
 #include <uapi/linux/psp-sev.h>
 
 #ifdef CONFIG_X86
@@ -920,6 +922,8 @@ int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);
 
 
 void *psp_copy_user_blob(u64 uaddr, u32 len);
+void *snp_alloc_firmware_page(gfp_t mask);
+void snp_free_firmware_page(void *addr);
 
 #else	/* !CONFIG_CRYPTO_DEV_SP_PSP */
 
@@ -961,6 +965,13 @@ static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *erro
 	return -ENODEV;
 }
 
+static inline void *snp_alloc_firmware_page(gfp_t mask)
+{
+	return NULL;
+}
+
+static inline void snp_free_firmware_page(void *addr) { }
+
 #endif	/* CONFIG_CRYPTO_DEV_SP_PSP */
 
 #endif	/* __PSP_SEV_H__ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 16/40] crypto: ccp: Handle the legacy SEV command when SNP is enabled
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (14 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 17/40] crypto: ccp: Add the SNP_PLATFORM_STATUS command Brijesh Singh
                   ` (24 subsequent siblings)
  40 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The behavior of the SEV-legacy commands is altered when the SNP firmware
is in the INIT state. When SNP is in the INIT state, any memory that the
firmware may write to while processing a SEV-legacy command must be placed
in the firmware state before the command is issued.

A command buffer may contain a system physical address that the firmware
may write to. There are two cases that need to be handled:

1) the system physical address points to guest memory
2) the system physical address points to host memory

To handle case #1, the map_firmware_writeable() helper simply changes the
page state in the RMP table before and after the command is sent to the
firmware.

For case #2, map_firmware_writeable() replaces the host system physical
memory with a pre-allocated firmware page, and after the command
completes, unmap_firmware_writeable() copies the contents of the
pre-allocated firmware page back to the original host memory.

The unmap_firmware_writeable() calls __sev_do_cmd_locked() to clear the
immutable bit from the memory page. To support this nested call, a
separate command buffer is required: allocate a backup command buffer and
keep a reference count of it. If a nested call is detected, use the
backup cmd_buf to complete the command submission.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 349 ++++++++++++++++++++++++++++++++++-
 drivers/crypto/ccp/sev-dev.h |  12 ++
 2 files changed, 351 insertions(+), 10 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index bb07c68834a6..16f0d9211739 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -265,12 +265,300 @@ void snp_free_firmware_page(void *addr)
 }
 EXPORT_SYMBOL(snp_free_firmware_page);
 
+static int alloc_snp_host_map(struct sev_device *sev)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < MAX_SNP_HOST_MAP_BUFS; i++) {
+		struct snp_host_map *map = &sev->snp_host_map[i];
+
+		memset(map, 0, sizeof(*map));
+
+		page = __snp_alloc_firmware_pages(GFP_KERNEL_ACCOUNT,
+						  get_order(SEV_FW_BLOB_MAX_SIZE), false);
+		if (!page)
+			return -ENOMEM;
+
+		map->host = page_address(page);
+	}
+
+	return 0;
+}
+
+static void free_snp_host_map(struct sev_device *sev)
+{
+	int i;
+
+	for (i = 0; i < MAX_SNP_HOST_MAP_BUFS; i++) {
+		struct snp_host_map *map = &sev->snp_host_map[i];
+
+		if (map->host) {
+			__snp_free_firmware_pages(virt_to_page(map->host),
+						  get_order(SEV_FW_BLOB_MAX_SIZE),
+						  false);
+			memset(map, 0, sizeof(*map));
+		}
+	}
+}
+
+static int map_firmware_writeable(u64 *paddr, u32 len, bool guest, struct snp_host_map *map)
+{
+	unsigned int npages = PAGE_ALIGN(len) >> PAGE_SHIFT;
+	int ret;
+
+	map->active = false;
+
+	if (!paddr || !len)
+		return 0;
+
+	map->paddr = *paddr;
+	map->len = len;
+
+	/* If paddr points to guest memory then change the page state to firmware. */
+	if (guest) {
+		struct rmpupdate val = {};
+
+		val.immutable = true;
+		val.assigned = true;
+		ret = snp_set_rmptable_state(*paddr, npages, &val, true, false);
+		if (ret)
+			return ret;
+
+		goto done;
+	}
+
+	if (unlikely(!map->host))
+		return -EINVAL;
+
+	/* Check if the pre-allocated buffer can be used to fulfill the request. */
+	if (unlikely(len > SEV_FW_BLOB_MAX_SIZE))
+		return -EINVAL;
+
+	/* Set the paddr to use an intermediate firmware buffer */
+	*paddr = __psp_pa(map->host);
+
+done:
+	map->active = true;
+	return 0;
+}
+
+static int unmap_firmware_writeable(u64 *paddr, u32 len, bool guest, struct snp_host_map *map)
+{
+	unsigned int npages = PAGE_ALIGN(len) >> PAGE_SHIFT;
+	int ret;
+
+	if (!map->active)
+		return 0;
+
+	/* If paddr points to guest memory then restore the page state to hypervisor. */
+	if (guest) {
+		struct rmpupdate val = {};
+
+		ret = snp_set_rmptable_state(*paddr, npages, &val, true, true);
+		if (ret)
+			return ret;
+
+		goto done;
+	}
+
+	/* Copy the response data from the firmware buffer to the caller's buffer. */
+	memcpy(__va(__sme_clr(map->paddr)), map->host, min_t(size_t, len, map->len));
+	*paddr = map->paddr;
+
+done:
+	map->active = false;
+	return 0;
+}
+
+static bool sev_legacy_cmd_buf_writable(int cmd)
+{
+	switch (cmd) {
+	case SEV_CMD_PLATFORM_STATUS:
+	case SEV_CMD_GUEST_STATUS:
+	case SEV_CMD_LAUNCH_START:
+	case SEV_CMD_RECEIVE_START:
+	case SEV_CMD_LAUNCH_MEASURE:
+	case SEV_CMD_SEND_START:
+	case SEV_CMD_SEND_UPDATE_DATA:
+	case SEV_CMD_SEND_UPDATE_VMSA:
+	case SEV_CMD_PEK_CSR:
+	case SEV_CMD_PDH_CERT_EXPORT:
+	case SEV_CMD_GET_ID:
+	case SEV_CMD_ATTESTATION_REPORT:
+		return true;
+	default:
+		return false;
+	}
+}
+
+#define prep_buffer(name, addr, len, guest, map)  \
+	   func(&((typeof(name *))cmd_buf)->addr, ((typeof(name *))cmd_buf)->len, guest, map)
+
+static int __snp_cmd_buf_copy(int cmd, void *cmd_buf, bool to_fw, int fw_err)
+{
+	int (*func)(u64 *paddr, u32 len, bool guest, struct snp_host_map *map);
+	struct sev_device *sev = psp_master->sev_data;
+	struct rmpupdate val = {};
+	bool from_fw = !to_fw;
+	int ret;
+
+	/*
+	 * After the command is completed, change the command buffer memory to
+	 * hypervisor state.
+	 *
+	 * The immutable bit is automatically cleared by the firmware, so
+	 * there is no need to reclaim the page.
+	 */
+	if (from_fw && sev_legacy_cmd_buf_writable(cmd)) {
+		ret = snp_set_rmptable_state(__pa(cmd_buf), 1, &val, true, false);
+		if (ret)
+			return ret;
+
+		/* No need to go further if firmware failed to execute command. */
+		if (fw_err)
+			return 0;
+	}
+
+	if (to_fw)
+		func = map_firmware_writeable;
+	else
+		func = unmap_firmware_writeable;
+
+	/*
+	 * A command buffer may contain a system physical address. If the address
+	 * points to host memory then use an intermediate firmware page, otherwise
+	 * change the page state in the RMP table.
+	 */
+	switch (cmd) {
+	case SEV_CMD_PDH_CERT_EXPORT:
+		if (prep_buffer(struct sev_data_pdh_cert_export, pdh_cert_address,
+				pdh_cert_len, false, &sev->snp_host_map[0]))
+			goto err;
+		if (prep_buffer(struct sev_data_pdh_cert_export, cert_chain_address,
+				cert_chain_len, false, &sev->snp_host_map[1]))
+			goto err;
+		break;
+	case SEV_CMD_GET_ID:
+		if (prep_buffer(struct sev_data_get_id, address, len,
+				false, &sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_PEK_CSR:
+		if (prep_buffer(struct sev_data_pek_csr, address, len,
+				    false, &sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_LAUNCH_UPDATE_DATA:
+		if (prep_buffer(struct sev_data_launch_update_data, address, len,
+				    true, &sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_LAUNCH_UPDATE_VMSA:
+		if (prep_buffer(struct sev_data_launch_update_vmsa, address, len,
+				true, &sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_LAUNCH_MEASURE:
+		if (prep_buffer(struct sev_data_launch_measure, address, len,
+				false, &sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_LAUNCH_UPDATE_SECRET:
+		if (prep_buffer(struct sev_data_launch_secret, guest_address, guest_len,
+				true, &sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_DBG_DECRYPT:
+		if (prep_buffer(struct sev_data_dbg, dst_addr, len, false,
+				&sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_DBG_ENCRYPT:
+		if (prep_buffer(struct sev_data_dbg, dst_addr, len, true,
+				&sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_ATTESTATION_REPORT:
+		if (prep_buffer(struct sev_data_attestation_report, address, len,
+				false, &sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_SEND_START:
+		if (prep_buffer(struct sev_data_send_start, session_address,
+				session_len, false, &sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_SEND_UPDATE_DATA:
+		if (prep_buffer(struct sev_data_send_update_data, hdr_address, hdr_len,
+				false, &sev->snp_host_map[0]))
+			goto err;
+		if (prep_buffer(struct sev_data_send_update_data, trans_address,
+				trans_len, false, &sev->snp_host_map[1]))
+			goto err;
+		break;
+	case SEV_CMD_SEND_UPDATE_VMSA:
+		if (prep_buffer(struct sev_data_send_update_vmsa, hdr_address, hdr_len,
+				false, &sev->snp_host_map[0]))
+			goto err;
+		if (prep_buffer(struct sev_data_send_update_vmsa, trans_address,
+				trans_len, false, &sev->snp_host_map[1]))
+			goto err;
+		break;
+	case SEV_CMD_RECEIVE_UPDATE_DATA:
+		if (prep_buffer(struct sev_data_receive_update_data, guest_address,
+				guest_len, true, &sev->snp_host_map[0]))
+			goto err;
+		break;
+	case SEV_CMD_RECEIVE_UPDATE_VMSA:
+		if (prep_buffer(struct sev_data_receive_update_vmsa, guest_address,
+				guest_len, true, &sev->snp_host_map[0]))
+			goto err;
+		break;
+	default:
+		break;
+	}
+
+	/* The command buffer needs to be in the firmware state. */
+	if (to_fw && sev_legacy_cmd_buf_writable(cmd)) {
+		val.assigned = true;
+		val.immutable = true;
+		ret = snp_set_rmptable_state(__pa(cmd_buf), 1, &val, true, false);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+
+err:
+	return -EINVAL;
+}
+
+static inline bool need_firmware_copy(int cmd)
+{
+	struct sev_device *sev = psp_master->sev_data;
+
+	/* After SNP is INIT'ed, the behavior of the legacy SEV commands is changed. */
+	return (cmd < SEV_CMD_SNP_INIT) && sev->snp_inited;
+}
+
+static int snp_aware_copy_to_firmware(int cmd, void *data)
+{
+	return __snp_cmd_buf_copy(cmd, data, true, 0);
+}
+
+static int snp_aware_copy_from_firmware(int cmd, void *data, int fw_err)
+{
+	return __snp_cmd_buf_copy(cmd, data, false, fw_err);
+}
+
 static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
 {
 	struct psp_device *psp = psp_master;
 	struct sev_device *sev;
 	unsigned int phys_lsb, phys_msb;
 	unsigned int reg, ret = 0;
+	void *cmd_buf;
 	int buf_len;
 
 	if (!psp || !psp->sev_data)
@@ -290,12 +578,26 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
 	 * work for some memory, e.g. vmalloc'd addresses, and @data may not be
 	 * physically contiguous.
 	 */
-	if (data)
-		memcpy(sev->cmd_buf, data, buf_len);
+	if (data) {
+		if (unlikely(sev->cmd_buf_active > 2))
+			return -EBUSY;
+
+		cmd_buf = sev->cmd_buf_active ? sev->cmd_buf_backup : sev->cmd_buf;
+
+		memcpy(cmd_buf, data, buf_len);
+		sev->cmd_buf_active++;
+
+		/*
+		 * The behavior of the SEV-legacy commands is altered when the
+		 * SNP firmware is in the INIT state.
+		 */
+		if (need_firmware_copy(cmd) && snp_aware_copy_to_firmware(cmd, sev->cmd_buf))
+			return -EFAULT;
+	}
 
 	/* Get the physical address of the command buffer */
-	phys_lsb = data ? lower_32_bits(__psp_pa(sev->cmd_buf)) : 0;
-	phys_msb = data ? upper_32_bits(__psp_pa(sev->cmd_buf)) : 0;
+	phys_lsb = data ? lower_32_bits(__psp_pa(cmd_buf)) : 0;
+	phys_msb = data ? upper_32_bits(__psp_pa(cmd_buf)) : 0;
 
 	dev_dbg(sev->dev, "sev command id %#x buffer 0x%08x%08x timeout %us\n",
 		cmd, phys_msb, phys_lsb, psp_timeout);
@@ -336,15 +638,24 @@ static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
 		ret = -EIO;
 	}
 
-	print_hex_dump_debug("(out): ", DUMP_PREFIX_OFFSET, 16, 2, data,
-			     buf_len, false);
-
 	/*
 	 * Copy potential output from the PSP back to data.  Do this even on
 	 * failure in case the caller wants to glean something from the error.
 	 */
-	if (data)
-		memcpy(data, sev->cmd_buf, buf_len);
+	if (data) {
+		/*
+		 * Restore the page state after the command completes.
+		 */
+		if (need_firmware_copy(cmd) &&
+		    snp_aware_copy_from_firmware(cmd, cmd_buf, ret))
+			return -EFAULT;
+
+		memcpy(data, cmd_buf, buf_len);
+		sev->cmd_buf_active--;
+	}
+
+	print_hex_dump_debug("(out): ", DUMP_PREFIX_OFFSET, 16, 2, data,
+			     buf_len, false);
 
 	return ret;
 }
@@ -1219,10 +1530,12 @@ int sev_dev_init(struct psp_device *psp)
 	if (!sev)
 		goto e_err;
 
-	sev->cmd_buf = (void *)devm_get_free_pages(dev, GFP_KERNEL, 0);
+	sev->cmd_buf = (void *)devm_get_free_pages(dev, GFP_KERNEL, 1);
 	if (!sev->cmd_buf)
 		goto e_sev;
 
+	sev->cmd_buf_backup = (uint8_t *)sev->cmd_buf + PAGE_SIZE;
+
 	psp->sev_data = sev;
 
 	sev->dev = dev;
@@ -1276,6 +1589,12 @@ static void sev_firmware_shutdown(struct sev_device *sev)
 		sev_es_tmr = NULL;
 	}
 
+	/*
+	 * The host map needs to clear the immutable bit, so it must be freed
+	 * before the SNP firmware shutdown.
+	 */
+	free_snp_host_map(sev);
+
 	sev_snp_shutdown(NULL);
 }
 
@@ -1335,6 +1654,14 @@ void sev_pci_init(void)
 			 */
 			dev_err(sev->dev, "SEV-SNP: failed to INIT error %#x\n", error);
 		}
+
+		/*
+		 * Allocate the intermediate buffers used for the legacy command handling.
+		 */
+		if (alloc_snp_host_map(sev)) {
+			dev_notice(sev->dev, "Failed to alloc host map (disabling legacy SEV)\n");
+			goto skip_legacy;
+		}
 	}
 
 	/* Obtain the TMR memory area for SEV-ES use */
@@ -1364,12 +1691,14 @@ void sev_pci_init(void)
 	if (rc)
 		dev_err(sev->dev, "SEV: failed to INIT error %#x\n", error);
 
+skip_legacy:
 	dev_info(sev->dev, "SEV%s API:%d.%d build:%d\n", sev->snp_inited ?
 		"-SNP" : "", sev->api_major, sev->api_minor, sev->build);
 
 	return;
 
 err:
+	free_snp_host_map(sev);
 	psp_master->sev_data = NULL;
 }
 
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index 186ad20cbd24..fe5d7a3ebace 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -29,11 +29,20 @@
 #define SEV_CMDRESP_CMD_SHIFT		16
 #define SEV_CMDRESP_IOC			BIT(0)
 
+#define MAX_SNP_HOST_MAP_BUFS		2
+
 struct sev_misc_dev {
 	struct kref refcount;
 	struct miscdevice misc;
 };
 
+struct snp_host_map {
+	u64 paddr;
+	u32 len;
+	void *host;
+	bool active;
+};
+
 struct sev_device {
 	struct device *dev;
 	struct psp_device *psp;
@@ -52,8 +61,11 @@ struct sev_device {
 	u8 build;
 
 	void *cmd_buf;
+	void *cmd_buf_backup;
+	int cmd_buf_active;
 
 	bool snp_inited;
+	struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
 };
 
 int sev_dev_init(struct psp_device *psp);
-- 
2.17.1



* [PATCH Part2 RFC v4 17/40] crypto: ccp: Add the SNP_PLATFORM_STATUS command
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (15 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 16/40] crypto: ccp: Handle the legacy SEV command " Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 18/40] crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command Brijesh Singh
                   ` (23 subsequent siblings)
  40 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The command can be used by userspace to query the SNP platform status
report. See the SEV-SNP spec for more details.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 Documentation/virt/coco/sevguest.rst | 27 ++++++++++++++++++++++++
 drivers/crypto/ccp/sev-dev.c         | 31 ++++++++++++++++++++++++++++
 drivers/crypto/ccp/sev-dev.h         |  1 +
 include/uapi/linux/psp-sev.h         |  1 +
 4 files changed, 60 insertions(+)

diff --git a/Documentation/virt/coco/sevguest.rst b/Documentation/virt/coco/sevguest.rst
index 7acb8696fca4..7c51da010039 100644
--- a/Documentation/virt/coco/sevguest.rst
+++ b/Documentation/virt/coco/sevguest.rst
@@ -52,6 +52,22 @@ to execute due to the firmware error, then fw_err code will be set.
                 __u64 fw_err;
         };
 
+The host ioctl should be issued on the /dev/sev device. The ioctl accepts a
+command ID and a command input structure.
+
+::
+        struct sev_issue_cmd {
+                /* Command ID */
+                __u32 cmd;
+
+                /* Command request structure */
+                __u64 data;
+
+                /* firmware error code on failure (see psp-sev.h) */
+                __u32 error;
+        };
+
+
 2.1 SNP_GET_REPORT
 ------------------
 
@@ -107,3 +123,14 @@ length of the blob is lesser than expected then snp_ext_report_req.certs_len wil
 be updated with the expected value.
 
 See GHCB specification for further detail on how to parse the certificate blob.
+
+2.3 SNP_PLATFORM_STATUS
+-----------------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (in): struct sev_data_snp_platform_status
+:Returns (out): 0 on success, -negative on error
+
+The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
+status includes API major, minor version and more. See the SEV-SNP
+specification for further details.
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 16f0d9211739..65003aba807a 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1056,6 +1056,7 @@ static int __sev_snp_init_locked(int *error)
 	dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
 
 	sev_es_tmr_size = SEV_SNP_ES_TMR_SIZE;
+	sev->snp_plat_status_page = __snp_alloc_firmware_pages(GFP_KERNEL_ACCOUNT, 0, true);
 
 	return rc;
 }
@@ -1083,6 +1084,9 @@ static int __sev_snp_shutdown_locked(int *error)
 	if (!sev->snp_inited)
 		return 0;
 
+	/* Free the status page */
+	__snp_free_firmware_pages(sev->snp_plat_status_page, 0, true);
+
 	/* SHUTDOWN requires the DF_FLUSH */
 	wbinvd_on_all_cpus();
 	__sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, NULL);
@@ -1345,6 +1349,30 @@ static int sev_ioctl_do_pdh_export(struct sev_issue_cmd *argp, bool writable)
 	return ret;
 }
 
+static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
+{
+	struct sev_device *sev = psp_master->sev_data;
+	struct sev_data_snp_platform_status_buf buf;
+	int ret;
+
+	if (!sev->snp_inited || !argp->data)
+		return -EINVAL;
+
+	if (!sev->snp_plat_status_page)
+		return -ENOMEM;
+
+	buf.status_paddr = __psp_pa(page_address(sev->snp_plat_status_page));
+	ret = __sev_do_cmd_locked(SEV_CMD_SNP_PLATFORM_STATUS, &buf, &argp->error);
+	if (ret)
+		return ret;
+
+	if (copy_to_user((void __user *)argp->data, page_address(sev->snp_plat_status_page),
+			sizeof(struct sev_user_data_snp_status)))
+		return -EFAULT;
+
+	return 0;
+}
+
 static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
 {
 	void __user *argp = (void __user *)arg;
@@ -1396,6 +1424,9 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
 	case SEV_GET_ID2:
 		ret = sev_ioctl_do_get_id2(&input);
 		break;
+	case SNP_PLATFORM_STATUS:
+		ret = sev_ioctl_snp_platform_status(&input);
+		break;
 	default:
 		ret = -EINVAL;
 		goto out;
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index fe5d7a3ebace..5efe162ad82d 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -66,6 +66,7 @@ struct sev_device {
 
 	bool snp_inited;
 	struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
+	struct page *snp_plat_status_page;
 };
 
 int sev_dev_init(struct psp_device *psp);
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index 226de6330a18..0c383d322097 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -28,6 +28,7 @@ enum {
 	SEV_PEK_CERT_IMPORT,
 	SEV_GET_ID,	/* This command is deprecated, use SEV_GET_ID2 */
 	SEV_GET_ID2,
+	SNP_PLATFORM_STATUS = 256,
 
 	SEV_MAX,
 };
-- 
2.17.1



* [PATCH Part2 RFC v4 18/40] crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (16 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 17/40] crypto: ccp: Add the SNP_PLATFORM_STATUS command Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 19/40] crypto: ccp: provide APIs to query extended attestation report Brijesh Singh
                   ` (22 subsequent siblings)
  40 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The SEV-SNP firmware provides the SNP_CONFIG command used to set the
system-wide configuration value for SNP guests. The information includes
the TCB version string to be reported in guest attestation reports.

Version 2 of the GHCB specification adds an NAE (SNP extended guest
request) that a guest can use to query the reports that include additional
certificates.

In both cases, userspace-provided additional data is included in the
attestation reports. Userspace will use the SNP_SET_EXT_CONFIG command
to provide the certificate blob and the reported TCB version string at
once. Note that the specification defines the certificate blob with a
specific GUID format; userspace is responsible for building the proper
certificate blob. The ioctl treats it as an opaque blob.

While it is not defined in the spec, add an SNP_GET_EXT_CONFIG command
that can be used to obtain the data programmed through
SNP_SET_EXT_CONFIG.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 Documentation/virt/coco/sevguest.rst |  28 +++++++
 drivers/crypto/ccp/sev-dev.c         | 117 +++++++++++++++++++++++++++
 drivers/crypto/ccp/sev-dev.h         |   3 +
 include/uapi/linux/psp-sev.h         |  16 ++++
 4 files changed, 164 insertions(+)

diff --git a/Documentation/virt/coco/sevguest.rst b/Documentation/virt/coco/sevguest.rst
index 7c51da010039..64a1b5167b33 100644
--- a/Documentation/virt/coco/sevguest.rst
+++ b/Documentation/virt/coco/sevguest.rst
@@ -134,3 +134,31 @@ See GHCB specification for further detail on how to parse the certificate blob.
 The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
 status includes API major, minor version and more. See the SEV-SNP
 specification for further details.
+
+2.4 SNP_SET_EXT_CONFIG
+----------------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (in): struct sev_data_snp_ext_config
+:Returns (out): 0 on success, -negative on error
+
+The SNP_SET_EXT_CONFIG is used to set the system-wide configuration such as
+reported TCB version in the attestation report. The command is similar to
+SNP_CONFIG command defined in the SEV-SNP spec. The main difference is the
+command also accepts an additional certificate blob defined in the GHCB
+specification.
+
+If the certs_address is zero, then the previous certificate blob will be deleted.
+For more information on the certificate blob layout, see the GHCB spec
+(extended guest request message).
+
+
+2.5 SNP_GET_EXT_CONFIG
+----------------------
+:Technology: sev-snp
+:Type: hypervisor ioctl cmd
+:Parameters (in): struct sev_data_snp_ext_config
+:Returns (out): 0 on success, -negative on error
+
+The SNP_GET_EXT_CONFIG is used to query the system-wide configuration set
+through the SNP_SET_EXT_CONFIG.
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 65003aba807a..1984a7b2c4e1 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1087,6 +1087,10 @@ static int __sev_snp_shutdown_locked(int *error)
 	/* Free the status page */
 	__snp_free_firmware_pages(sev->snp_plat_status_page, 0, true);
 
+	/* Free the memory used for caching the certificate data */
+	kfree(sev->snp_certs_data);
+	sev->snp_certs_data = NULL;
+
 	/* SHUTDOWN requires the DF_FLUSH */
 	wbinvd_on_all_cpus();
 	__sev_do_cmd_locked(SEV_CMD_SNP_DF_FLUSH, NULL, NULL);
@@ -1373,6 +1377,113 @@ static int sev_ioctl_snp_platform_status(struct sev_issue_cmd *argp)
 	return 0;
 }
 
+static int sev_ioctl_snp_get_config(struct sev_issue_cmd *argp)
+{
+	struct sev_device *sev = psp_master->sev_data;
+	struct sev_user_data_ext_snp_config input;
+	int ret;
+
+	if (!sev->snp_inited || !argp->data)
+		return -EINVAL;
+
+	if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
+		return -EFAULT;
+
+	/* Copy the TCB version programmed through the SET_CONFIG to userspace */
+	if (input.config_address) {
+		if (copy_to_user((void __user *)input.config_address,
+				&sev->snp_config, sizeof(struct sev_user_data_snp_config)))
+			return -EFAULT;
+	}
+
+	/* Copy the extended certs programmed through the SNP_SET_CONFIG */
+	if (input.certs_address && sev->snp_certs_data) {
+		if (input.certs_len < sev->snp_certs_len) {
+			/* Return the certs length to userspace */
+			input.certs_len = sev->snp_certs_len;
+
+			ret = -ENOSR;
+			goto e_done;
+		}
+
+		if (copy_to_user((void __user *)input.certs_address,
+				sev->snp_certs_data, sev->snp_certs_len))
+			return -EFAULT;
+	}
+
+	ret = 0;
+
+e_done:
+	if (copy_to_user((void __user *)argp->data, &input, sizeof(input)))
+		ret = -EFAULT;
+
+	return ret;
+}
+
+static int sev_ioctl_snp_set_config(struct sev_issue_cmd *argp, bool writable)
+{
+	struct sev_device *sev = psp_master->sev_data;
+	struct sev_user_data_ext_snp_config input;
+	struct sev_user_data_snp_config config;
+	void *certs = NULL;
+	int ret = 0;
+
+	if (!sev->snp_inited || !argp->data)
+		return -EINVAL;
+
+	if (!writable)
+		return -EPERM;
+
+	if (copy_from_user(&input, (void __user *)argp->data, sizeof(input)))
+		return -EFAULT;
+
+	/* Copy the certs from userspace */
+	if (input.certs_address) {
+		if (!input.certs_len || !IS_ALIGNED(input.certs_len, PAGE_SIZE))
+			return -EINVAL;
+
+		certs = psp_copy_user_blob(input.certs_address, input.certs_len);
+		if (IS_ERR(certs))
+			return PTR_ERR(certs);
+
+	}
+
+	/* Issue the PSP command to update the TCB version using the SNP_CONFIG. */
+	if (input.config_address) {
+		if (copy_from_user(&config,
+				   (void __user *)input.config_address, sizeof(config))) {
+			ret = -EFAULT;
+			goto e_free;
+		}
+
+		ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error);
+		if (ret)
+			goto e_free;
+
+		memcpy(&sev->snp_config, &config, sizeof(config));
+	}
+
+	/*
+	 * If new certs are passed then cache them, else free the old certs.
+	 */
+	if (certs) {
+		kfree(sev->snp_certs_data);
+		sev->snp_certs_data = certs;
+		sev->snp_certs_len = input.certs_len;
+	} else {
+		kfree(sev->snp_certs_data);
+		sev->snp_certs_data = NULL;
+		sev->snp_certs_len = 0;
+	}
+
+	return 0;
+
+e_free:
+	kfree(certs);
+	return ret;
+}
+
+
 static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
 {
 	void __user *argp = (void __user *)arg;
@@ -1427,6 +1538,12 @@ static long sev_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
 	case SNP_PLATFORM_STATUS:
 		ret = sev_ioctl_snp_platform_status(&input);
 		break;
+	case SNP_SET_EXT_CONFIG:
+		ret = sev_ioctl_snp_set_config(&input, writable);
+		break;
+	case SNP_GET_EXT_CONFIG:
+		ret = sev_ioctl_snp_get_config(&input);
+		break;
 	default:
 		ret = -EINVAL;
 		goto out;
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index 5efe162ad82d..37dc58c09cb6 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -67,6 +67,9 @@ struct sev_device {
 	bool snp_inited;
 	struct snp_host_map snp_host_map[MAX_SNP_HOST_MAP_BUFS];
 	struct page *snp_plat_status_page;
+	void *snp_certs_data;
+	u32 snp_certs_len;
+	struct sev_user_data_snp_config snp_config;
 };
 
 int sev_dev_init(struct psp_device *psp);
diff --git a/include/uapi/linux/psp-sev.h b/include/uapi/linux/psp-sev.h
index 0c383d322097..12c758b616c2 100644
--- a/include/uapi/linux/psp-sev.h
+++ b/include/uapi/linux/psp-sev.h
@@ -29,6 +29,8 @@ enum {
 	SEV_GET_ID,	/* This command is deprecated, use SEV_GET_ID2 */
 	SEV_GET_ID2,
 	SNP_PLATFORM_STATUS = 256,
+	SNP_SET_EXT_CONFIG,
+	SNP_GET_EXT_CONFIG,
 
 	SEV_MAX,
 };
@@ -190,6 +192,20 @@ struct sev_user_data_snp_config {
 	__u8 rsvd[52];
 } __packed;
 
+/**
+ * struct sev_data_snp_ext_config - system wide configuration value for SNP.
+ *
+ * @config_address: address of the struct sev_user_data_snp_config or 0 when
+ *      	reported_tcb does not need to be updated.
+ * @certs_address: address of extended guest request certificate chain or
+ *              0 when previous certificate should be removed on SNP_SET_EXT_CONFIG.
+ * @certs_len: length of the certs
+ */
+struct sev_user_data_ext_snp_config {
+	__u64 config_address;		/* In */
+	__u64 certs_address;		/* In */
+	__u32 certs_len;		/* In */
+};
 
 /**
  * struct sev_issue_cmd - SEV ioctl parameters
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 19/40] crypto: ccp: provide APIs to query extended attestation report
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (17 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 18/40] crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 20/40] KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe Brijesh Singh
                   ` (21 subsequent siblings)
  40 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

Version 2 of the GHCB specification defines a VMGEXIT that is used to request
the extended attestation report. The extended attestation report includes the
certificate blobs provided through the SNP_SET_EXT_CONFIG command.

The snp_guest_ext_guest_request() helper will be used by the hypervisor to
obtain the extended attestation report. See the GHCB specification for more
details.
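The buffer-size negotiation implemented in snp_guest_ext_guest_request() below
follows the GHCB extended-request protocol: the caller passes a page count, and
if it is too small the required count is returned along with
SNP_GUEST_REQ_INVALID_LEN. A minimal userspace sketch of just the copy path
(the function name and the error value are illustrative, not the kernel API;
the firmware call that sits between the size check and the copy is elided):

```c
#include <assert.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)

/* Illustrative stand-in for the GHCB-defined error code. */
#define SNP_GUEST_REQ_INVALID_LEN 0x100000000ULL

/*
 * Simplified model of the copy path: if the caller's buffer (in pages)
 * is too small, report the required page count and fail; otherwise copy
 * the certificate blob and report the number of pages actually used.
 */
static int copy_cert_blob(const void *certs, unsigned long certs_len,
			  void *dst, unsigned long *npages,
			  unsigned long long *fw_err)
{
	unsigned long expected = certs_len >> PAGE_SHIFT;

	if (*npages < expected) {
		*npages = expected;
		*fw_err = SNP_GUEST_REQ_INVALID_LEN;
		return -1;
	}

	if (certs) {
		*npages = expected;
		memcpy(dst, certs, expected << PAGE_SHIFT);
	} else {
		*npages = 0;	/* no certificates configured */
	}
	return 0;
}
```

Note that, mirroring the patch, the sketch assumes the certificate blob length
is a multiple of the page size, since the length is converted to pages with a
plain right shift.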

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 43 ++++++++++++++++++++++++++++++++++++
 include/linux/psp-sev.h      | 24 ++++++++++++++++++++
 2 files changed, 67 insertions(+)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 1984a7b2c4e1..4cc9c1dff49f 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -22,6 +22,7 @@
 #include <linux/firmware.h>
 #include <linux/gfp.h>
 #include <linux/cpufeature.h>
+#include <linux/sev-guest.h>
 
 #include <asm/smp.h>
 
@@ -1616,6 +1617,48 @@ int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
 }
 EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);
 
+int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
+				unsigned long vaddr, unsigned long *npages, unsigned long *fw_err)
+{
+	unsigned long expected_npages;
+	struct sev_device *sev;
+	int rc;
+
+	if (!psp_master || !psp_master->sev_data)
+		return -ENODEV;
+
+	sev = psp_master->sev_data;
+
+	if (!sev->snp_inited)
+		return -EINVAL;
+
+	/*
+	 * Check if the caller has enough space to copy the certificate chain;
+	 * otherwise return the error code defined in the GHCB specification.
+	 */
+	expected_npages = sev->snp_certs_len >> PAGE_SHIFT;
+	if (*npages < expected_npages) {
+		*npages = expected_npages;
+		*fw_err = SNP_GUEST_REQ_INVALID_LEN;
+		return -EINVAL;
+	}
+
+	rc = sev_do_cmd(SEV_CMD_SNP_GUEST_REQUEST, data, (int *)&fw_err);
+	if (rc)
+		return rc;
+
+	/* Copy the certificate blob */
+	if (sev->snp_certs_data) {
+		*npages = expected_npages;
+		memcpy((void *)vaddr, sev->snp_certs_data, *npages << PAGE_SHIFT);
+	} else {
+		*npages = 0;
+	}
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(snp_guest_ext_guest_request);
+
 static void sev_exit(struct kref *ref)
 {
 	misc_deregister(&misc_dev->misc);
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index b72a74f6a4e9..2345ac6ae431 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -925,6 +925,23 @@ void *psp_copy_user_blob(u64 uaddr, u32 len);
 void *snp_alloc_firmware_page(gfp_t mask);
 void snp_free_firmware_page(void *addr);
 
+/**
+ * snp_guest_ext_guest_request - perform the SNP extended guest request command
+ *  defined in the GHCB specification.
+ *
+ * @data: the input guest request structure
+ * @vaddr: address where the certificate blob needs to be copied.
+ * @npages: number of pages for the certificate blob.
+ *    If the specified page count is less than the certificate blob size, the
+ *    required page count is returned with the error code defined in the GHCB spec.
+ *    If the specified page count is more than the certificate blob size, the
+ *    page count is updated to reflect the amount of valid data copied to
+ *    vaddr.
+ */
+int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
+				unsigned long vaddr, unsigned long *npages,
+				unsigned long *error);
+
 #else	/* !CONFIG_CRYPTO_DEV_SP_PSP */
 
 static inline int
@@ -972,6 +989,13 @@ static inline void *snp_alloc_firmware_page(gfp_t mask)
 
 static inline void snp_free_firmware_page(void *addr) { }
 
+static inline int snp_guest_ext_guest_request(struct sev_data_snp_guest_request *data,
+					      unsigned long vaddr, unsigned long *n,
+					      unsigned long *error)
+{
+	return -ENODEV;
+}
+
 #endif	/* CONFIG_CRYPTO_DEV_SP_PSP */
 
 #endif	/* __PSP_SEV_H__ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 20/40] KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (18 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 19/40] crypto: ccp: provide APIs to query extended attestation report Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-14 13:35   ` Marc Orr
  2021-07-20 18:02   ` Sean Christopherson
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 21/40] KVM: SVM: Add initial SEV-SNP support Brijesh Singh
                   ` (20 subsequent siblings)
  40 siblings, 2 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

When SEV-SNP is globally enabled on a system, the VMRUN instruction
performs additional security checks on the AVIC backing, VMSA, and VMCB
pages. On a successful VMRUN, these pages are marked "in-use" by the
hardware in the RMP entry, and any attempt to modify the RMP entry for
these pages will result in a page fault (RMP violation check).

While performing the RMP check, the hardware will try to create a 2MB TLB
entry for large page accesses. When it does this, it first reads the RMP
entry for the base of the 2MB region and verifies that all of this memory
is safe. If the AVIC backing, VMSA, or VMCB page happens to be the base of
a 2MB region, the RMP check will fail because of the "in-use" marking on
the base entry of that region.

e.g.

1. A VMCB was allocated at a 2MB-aligned address.
2. The VMRUN instruction marks its RMP entry as "in-use".
3. Another process allocated some other page of memory that happened to be
   within the same 2MB region.
4. That process tried to write its page using the physmap.

If the physmap entry in step #4 uses a large (1G/2M) page, the hardware
will attempt to create a 2MB TLB entry. The hardware will find that the
"in-use" bit is set in the RMP entry (because it was a VMCB page) and will
raise an RMP violation check.

See APM2 section 15.36.12 for more information on VMRUN checks when
SEV-SNP is globally active.

A generic allocator can return a page which is 2MB-aligned, and such a
page is not safe to use when SEV-SNP is globally enabled. Add a
snp_safe_alloc_page() helper that can be used for allocating SNP-safe
memory. The helper makes an order-1 allocation, splits it into two
order-0 pages, keeps the page that is not 2MB-aligned, and frees the
other.
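The "allocate two pages, keep the non-2MB-aligned one" trick reduces to a
single alignment test: an order-1 allocation is only two-page (8KB) aligned,
so at most one of its two pages can sit on a 2MB boundary. A small sketch of
the PFN arithmetic (simulated PFNs, not the kernel helper):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PMD_SIZE (1UL << 21)	/* 2MB */

/*
 * Given the first PFN of an order-1 (two-page) allocation, return the
 * PFN of the page that is NOT 2MB-aligned; the other page would be
 * freed back to the allocator.
 */
static unsigned long pick_snp_safe_pfn(unsigned long pfn)
{
	unsigned long phys = pfn << PAGE_SHIFT;

	if ((phys % PMD_SIZE) == 0)
		return pfn + 1;	/* first page is 2MB-aligned: keep second */
	return pfn;		/* first page already safe: keep it */
}
```

Either way the caller ends up with one order-0 page whose physical address is
guaranteed not to be the base of a 2MB region.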

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/lapic.c            |  5 ++++-
 arch/x86/kvm/svm/sev.c          | 27 +++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          | 16 ++++++++++++++--
 arch/x86/kvm/svm/svm.h          |  1 +
 5 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55efbacfc244..188110ab2c02 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1383,6 +1383,7 @@ struct kvm_x86_ops {
 	int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
 
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c0ebef560bd1..d4c77f66d7d5 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2441,7 +2441,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
 
 	vcpu->arch.apic = apic;
 
-	apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+	if (kvm_x86_ops.alloc_apic_backing_page)
+		apic->regs = kvm_x86_ops.alloc_apic_backing_page(vcpu);
+	else
+		apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 	if (!apic->regs) {
 		printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
 		       vcpu->vcpu_id);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index b8505710c36b..411ed72f63af 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2692,3 +2692,30 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
 		break;
 	}
 }
+
+struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
+{
+	unsigned long pfn;
+	struct page *p;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+		return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+
+	p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
+	if (!p)
+		return NULL;
+
+	/* split the page order */
+	split_page(p, 1);
+
+	/* Find a non-2M aligned page */
+	pfn = page_to_pfn(p);
+	if (IS_ALIGNED(__pfn_to_phys(pfn), PMD_SIZE)) {
+		pfn++;
+		__free_page(p);
+	} else {
+		__free_page(pfn_to_page(pfn + 1));
+	}
+
+	return pfn_to_page(pfn);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2acf187a3100..a7adf6ca1713 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1336,7 +1336,7 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
 	svm = to_svm(vcpu);
 
 	err = -ENOMEM;
-	vmcb01_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	vmcb01_page = snp_safe_alloc_page(vcpu);
 	if (!vmcb01_page)
 		goto out;
 
@@ -1345,7 +1345,7 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
 		 * SEV-ES guests require a separate VMSA page used to contain
 		 * the encrypted register state of the guest.
 		 */
-		vmsa_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+		vmsa_page = snp_safe_alloc_page(vcpu);
 		if (!vmsa_page)
 			goto error_free_vmcb_page;
 
@@ -4439,6 +4439,16 @@ static int svm_vm_init(struct kvm *kvm)
 	return 0;
 }
 
+static void *svm_alloc_apic_backing_page(struct kvm_vcpu *vcpu)
+{
+	struct page *page = snp_safe_alloc_page(vcpu);
+
+	if (!page)
+		return NULL;
+
+	return page_address(page);
+}
+
 static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.hardware_unsetup = svm_hardware_teardown,
 	.hardware_enable = svm_hardware_enable,
@@ -4564,6 +4574,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.complete_emulated_msr = svm_complete_emulated_msr,
 
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
+
+	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 5f874168551b..1175edb02d33 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -554,6 +554,7 @@ void sev_es_create_vcpu(struct vcpu_svm *svm);
 void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
 void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu);
 void sev_es_unmap_ghcb(struct vcpu_svm *svm);
+struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
 
 /* vmenter.S */
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 21/40] KVM: SVM: Add initial SEV-SNP support
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (19 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 20/40] KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-16 18:00   ` Sean Christopherson
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 22/40] KVM: SVM: Add KVM_SNP_INIT command Brijesh Singh
                   ` (19 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The next generation of SEV is called SEV-SNP (Secure Nested Paging).
SEV-SNP builds upon existing SEV and SEV-ES functionality while adding new
hardware-based security protection. SEV-SNP adds strong memory integrity
protection to help prevent malicious hypervisor-based attacks such as data
replay, memory re-mapping, and more, creating an isolated execution
environment.

The SNP feature can be enabled in KVM by setting the sev-snp module
parameter.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c | 18 ++++++++++++++++++
 arch/x86/kvm/svm/svm.h | 12 ++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 411ed72f63af..abca2b9dee83 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -52,9 +52,14 @@ module_param_named(sev, sev_enabled, bool, 0444);
 /* enable/disable SEV-ES support */
 static bool sev_es_enabled = true;
 module_param_named(sev_es, sev_es_enabled, bool, 0444);
+
+/* enable/disable SEV-SNP support */
+static bool sev_snp_enabled = true;
+module_param_named(sev_snp, sev_snp_enabled, bool, 0444);
 #else
 #define sev_enabled false
 #define sev_es_enabled false
+#define sev_snp_enabled  false
 #endif /* CONFIG_KVM_AMD_SEV */
 
 #define AP_RESET_HOLD_NONE		0
@@ -1825,6 +1830,7 @@ void __init sev_hardware_setup(void)
 {
 #ifdef CONFIG_KVM_AMD_SEV
 	unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
+	bool sev_snp_supported = false;
 	bool sev_es_supported = false;
 	bool sev_supported = false;
 
@@ -1888,9 +1894,21 @@ void __init sev_hardware_setup(void)
 	pr_info("SEV-ES supported: %u ASIDs\n", sev_es_asid_count);
 	sev_es_supported = true;
 
+	/* SEV-SNP support requested? */
+	if (!sev_snp_enabled)
+		goto out;
+
+	/* Is SEV-SNP enabled? */
+	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+		goto out;
+
+	pr_info("SEV-SNP supported: %u ASIDs\n", min_sev_asid - 1);
+	sev_snp_supported = true;
+
 out:
 	sev_enabled = sev_supported;
 	sev_es_enabled = sev_es_supported;
+	sev_snp_enabled = sev_snp_supported;
 #endif
 }
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 1175edb02d33..b9ea99f8579e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -58,6 +58,7 @@ enum {
 struct kvm_sev_info {
 	bool active;		/* SEV enabled guest */
 	bool es_active;		/* SEV-ES enabled guest */
+	bool snp_active;	/* SEV-SNP enabled guest */
 	unsigned int asid;	/* ASID used for this guest */
 	unsigned int handle;	/* SEV firmware handle */
 	int fd;			/* SEV device fd */
@@ -232,6 +233,17 @@ static inline bool sev_es_guest(struct kvm *kvm)
 #endif
 }
 
+static inline bool sev_snp_guest(struct kvm *kvm)
+{
+#ifdef CONFIG_KVM_AMD_SEV
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+
+	return sev_es_guest(kvm) && sev->snp_active;
+#else
+	return false;
+#endif
+}
+
 static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
 {
 	vmcb->control.clean = 0;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 22/40] KVM: SVM: Add KVM_SNP_INIT command
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (20 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 21/40] KVM: SVM: Add initial SEV-SNP support Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-16 19:33   ` Sean Christopherson
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 23/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command Brijesh Singh
                   ` (18 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The KVM_SNP_INIT command is used by the hypervisor to initialize the
SEV-SNP platform context. In a typical workflow, this command should be the
first command issued. When creating an SEV-SNP guest, the VMM must use this
command instead of KVM_SEV_INIT or KVM_SEV_ES_INIT.

The flags value must be zero; it will be extended in future SNP support to
communicate optional features (such as restricted interrupt injection).
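Requiring flags to be zero is a standard forward-compatibility pattern: the
kernel rejects any bit it does not understand, so newer userspace probing for
a feature on an older kernel gets a clean -EINVAL instead of silent
misbehavior. A hedged sketch of the check (SNP_INIT_KNOWN_FLAGS is
hypothetical; the patch's verify_snp_init_flags() simply tests that flags is
zero):

```c
#include <errno.h>
#include <stdint.h>

struct kvm_snp_init {
	uint64_t flags;		/* must be zero today */
};

/* Bits a (hypothetical future) kernel would understand; none yet. */
#define SNP_INIT_KNOWN_FLAGS 0ULL

/*
 * Reject any flag bit the kernel does not know about. With
 * SNP_INIT_KNOWN_FLAGS == 0 this degenerates to "flags must be zero",
 * matching verify_snp_init_flags() in the patch.
 */
static int verify_snp_init_flags(const struct kvm_snp_init *params)
{
	if (params->flags & ~SNP_INIT_KNOWN_FLAGS)
		return -EINVAL;
	return 0;
}
```

Userspace issues the struct through the usual KVM_MEMORY_ENCRYPT_OP ioctl
path, the same entry point used by the other KVM_SEV_* commands.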

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 .../virt/kvm/amd-memory-encryption.rst        | 16 ++++++++
 arch/x86/kvm/svm/sev.c                        | 37 ++++++++++++++++++-
 include/uapi/linux/kvm.h                      |  7 ++++
 3 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst
index 5c081c8c7164..75ca60b6d40a 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/amd-memory-encryption.rst
@@ -427,6 +427,22 @@ issued by the hypervisor to make the guest ready for execution.
 
 Returns: 0 on success, -negative on error
 
+18. KVM_SNP_INIT
+----------------
+
+The KVM_SNP_INIT command can be used by the hypervisor to initialize SEV-SNP
+context. In a typical workflow, this command should be the first command issued.
+
+Parameters (in): struct kvm_snp_init
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_snp_init {
+                __u64 flags;    /* must be zero */
+        };
+
 References
 ==========
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index abca2b9dee83..be31221f0a47 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -228,10 +228,24 @@ static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
 	sev_guest_decommission(&decommission, NULL);
 }
 
+static int verify_snp_init_flags(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_snp_init params;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+		return -EFAULT;
+
+	if (params.flags)
+		return -EINVAL;
+
+	return 0;
+}
+
 static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
 {
+	bool es_active = (argp->id == KVM_SEV_ES_INIT || argp->id == KVM_SEV_SNP_INIT);
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
-	bool es_active = argp->id == KVM_SEV_ES_INIT;
+	bool snp_active = argp->id == KVM_SEV_SNP_INIT;
 	int asid, ret;
 
 	if (kvm->created_vcpus)
@@ -242,12 +256,22 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
 		return ret;
 
 	sev->es_active = es_active;
+	sev->snp_active = snp_active;
 	asid = sev_asid_new(sev);
 	if (asid < 0)
 		goto e_no_asid;
 	sev->asid = asid;
 
-	ret = sev_platform_init(&argp->error);
+	if (snp_active) {
+		ret = verify_snp_init_flags(kvm, argp);
+		if (ret)
+			goto e_free;
+
+		ret = sev_snp_init(&argp->error);
+	} else {
+		ret = sev_platform_init(&argp->error);
+	}
+
 	if (ret)
 		goto e_free;
 
@@ -591,6 +615,9 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
 	save->pkru = svm->vcpu.arch.pkru;
 	save->xss  = svm->vcpu.arch.ia32_xss;
 
+	if (sev_snp_guest(svm->vcpu.kvm))
+		save->sev_features |= SVM_SEV_FEATURES_SNP_ACTIVE;
+
 	return 0;
 }
 
@@ -1523,6 +1550,12 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 	}
 
 	switch (sev_cmd.id) {
+	case KVM_SEV_SNP_INIT:
+		if (!sev_snp_enabled) {
+			r = -ENOTTY;
+			goto out;
+		}
+		fallthrough;
 	case KVM_SEV_ES_INIT:
 		if (!sev_es_enabled) {
 			r = -ENOTTY;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3fd9a7e9d90c..989a64aa1ae5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1678,6 +1678,9 @@ enum sev_cmd_id {
 	/* Guest Migration Extension */
 	KVM_SEV_SEND_CANCEL,
 
+	/* SNP specific commands */
+	KVM_SEV_SNP_INIT = 256,
+
 	KVM_SEV_NR_MAX,
 };
 
@@ -1774,6 +1777,10 @@ struct kvm_sev_receive_update_data {
 	__u32 trans_len;
 };
 
+struct kvm_snp_init {
+	__u64 flags;
+};
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 23/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (21 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 22/40] KVM: SVM: Add KVM_SNP_INIT command Brijesh Singh
@ 2021-07-07 18:35 ` Brijesh Singh
  2021-07-12 18:45   ` Peter Gonda
  2021-07-16 19:43   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 24/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command Brijesh Singh
                   ` (17 subsequent siblings)
  40 siblings, 2 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:35 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. If the guest is expected to be migrated,
the command also binds a migration agent (MA) to the guest.

For more information see the SEV-SNP specification.
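One detail of this patch's ASID binding: SNP_ACTIVATE is retried exactly once,
after a WBINVD plus DF_FLUSH, when the firmware reports
SEV_RET_DFFLUSH_REQUIRED. The retry-once control flow can be modeled in
isolation (firmware calls are stubbed and the error value is illustrative;
this is not the kernel's snp_bind_asid()):

```c
#include <errno.h>

#define SEV_RET_SUCCESS		 0
#define SEV_RET_DFFLUSH_REQUIRED 15	/* illustrative value */

static int df_flushed;	/* stub firmware state */

/* Stub: fail with DFFLUSH_REQUIRED until a flush has been done. */
static int fw_snp_activate(int *error)
{
	if (!df_flushed) {
		*error = SEV_RET_DFFLUSH_REQUIRED;
		return -EIO;
	}
	*error = SEV_RET_SUCCESS;
	return 0;
}

static int fw_snp_df_flush(void)
{
	df_flushed = 1;
	return 0;
}

/* Mirrors the retry shape: one DF_FLUSH, one retry, never a loop. */
static int bind_asid(int *error)
{
	int ret, retry_count = 0;
again:
	ret = fw_snp_activate(error);
	if (ret && *error == SEV_RET_DFFLUSH_REQUIRED && !retry_count) {
		if (fw_snp_df_flush())
			return -EIO;
		retry_count = 1;
		goto again;
	}
	return ret;
}
```

Bounding the retry to one attempt guarantees the ioctl cannot spin if the
firmware keeps asking for flushes, while still absorbing the common
ASID-recycling case.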

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 .../virt/kvm/amd-memory-encryption.rst        |  25 ++++
 arch/x86/kvm/svm/sev.c                        | 132 +++++++++++++++++-
 arch/x86/kvm/svm/svm.h                        |   1 +
 include/uapi/linux/kvm.h                      |   9 ++
 4 files changed, 166 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst
index 75ca60b6d40a..8620383d405a 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/amd-memory-encryption.rst
@@ -443,6 +443,31 @@ Returns: 0 on success, -negative on error
                 __u64 flags;    /* must be zero */
         };
 
+
+19. KVM_SNP_LAUNCH_START
+------------------------
+
+The KVM_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest. To create the encryption context, the user must
+provide a guest policy, the migration agent (if any), and the guest OS visible
+workarounds value as defined in the SEV-SNP specification.
+
+Parameters (in): struct  kvm_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_sev_snp_launch_start {
+                __u64 policy;           /* Guest policy to use. */
+                __u64 ma_uaddr;         /* userspace address of migration agent */
+                __u8 ma_en;             /* 1 if the migration agent is enabled */
+                __u8 imi_en;            /* set IMI to 1. */
+                __u8 gosvw[16];         /* guest OS visible workarounds */
+        };
+
+See the SEV-SNP specification for further detail on the launch input.
+
 References
 ==========
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index be31221f0a47..f44a657e8912 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -20,6 +20,7 @@
 #include <asm/fpu/internal.h>
 
 #include <asm/trapnr.h>
+#include <asm/sev.h>
 
 #include "x86.h"
 #include "svm.h"
@@ -75,6 +76,8 @@ static unsigned long sev_me_mask;
 static unsigned long *sev_asid_bitmap;
 static unsigned long *sev_reclaim_asid_bitmap;
 
+static int snp_decommission_context(struct kvm *kvm);
+
 struct enc_region {
 	struct list_head list;
 	unsigned long npages;
@@ -1527,6 +1530,100 @@ static int sev_receive_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	return sev_issue_cmd(kvm, SEV_CMD_RECEIVE_FINISH, &data, &argp->error);
 }
 
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct sev_data_snp_gctx_create data = {};
+	void *context;
+	int rc;
+
+	/* Allocate memory for context page */
+	context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+	if (!context)
+		return NULL;
+
+	data.gctx_paddr = __psp_pa(context);
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+	if (rc) {
+		snp_free_firmware_page(context);
+		return NULL;
+	}
+
+	return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_activate data = {};
+	int asid = sev_get_asid(kvm);
+	int ret, retry_count = 0;
+
+	/* Activate ASID on the given context */
+	data.gctx_paddr = __psp_pa(sev->snp_context);
+	data.asid   = asid;
+again:
+	ret = sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+
+	/* Check if the DF_FLUSH is required, and try again */
+	if (ret && (*error == SEV_RET_DFFLUSH_REQUIRED) && (!retry_count)) {
+		/* Guard DEACTIVATE against WBINVD/DF_FLUSH used in ASID recycling */
+		down_read(&sev_deactivate_lock);
+		wbinvd_on_all_cpus();
+		ret = snp_guest_df_flush(error);
+		up_read(&sev_deactivate_lock);
+
+		if (ret)
+			return ret;
+
+		/* only one retry */
+		retry_count = 1;
+
+		goto again;
+	}
+
+	return ret;
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_start start = {};
+	struct kvm_sev_snp_launch_start params;
+	int rc;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+		return -EFAULT;
+
+	/* Initialize the guest context */
+	sev->snp_context = snp_context_create(kvm, argp);
+	if (!sev->snp_context)
+		return -ENOTTY;
+
+	/* Issue the LAUNCH_START command */
+	start.gctx_paddr = __psp_pa(sev->snp_context);
+	start.policy = params.policy;
+	memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+	if (rc)
+		goto e_free_context;
+
+	/* Bind ASID to this guest */
+	sev->fd = argp->sev_fd;
+	rc = snp_bind_asid(kvm, &argp->error);
+	if (rc)
+		goto e_free_context;
+
+	return 0;
+
+e_free_context:
+	snp_decommission_context(kvm);
+
+	return rc;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -1616,6 +1713,9 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_RECEIVE_FINISH:
 		r = sev_receive_finish(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_START:
+		r = snp_launch_start(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -1809,6 +1909,28 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd)
 	return ret;
 }
 
+static int snp_decommission_context(struct kvm *kvm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_decommission data = {};
+	int ret;
+
+	/* If context is not created then do nothing */
+	if (!sev->snp_context)
+		return 0;
+
+	data.gctx_paddr = __sme_pa(sev->snp_context);
+	ret = snp_guest_decommission(&data, NULL);
+	if (ret)
+		return ret;
+
+	/* free the context page now */
+	snp_free_firmware_page(sev->snp_context);
+	sev->snp_context = NULL;
+
+	return 0;
+}
+
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -1847,7 +1969,15 @@ void sev_vm_destroy(struct kvm *kvm)
 
 	mutex_unlock(&kvm->lock);
 
-	sev_unbind_asid(kvm, sev->handle);
+	if (sev_snp_guest(kvm)) {
+		if (snp_decommission_context(kvm)) {
+			pr_err("Failed to free SNP guest context, leaking asid!\n");
+			return;
+		}
+	} else {
+		sev_unbind_asid(kvm, sev->handle);
+	}
+
 	sev_asid_free(sev);
 }
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index b9ea99f8579e..bc5582b44356 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -67,6 +67,7 @@ struct kvm_sev_info {
 	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
 	struct kvm *enc_context_owner; /* Owner of copied encryption context */
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
+	void *snp_context;      /* SNP guest context page */
 };
 
 struct kvm_svm {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 989a64aa1ae5..dbd05179d8fa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1680,6 +1680,7 @@ enum sev_cmd_id {
 
 	/* SNP specific commands */
 	KVM_SEV_SNP_INIT = 256,
+	KVM_SEV_SNP_LAUNCH_START,
 
 	KVM_SEV_NR_MAX,
 };
@@ -1781,6 +1782,14 @@ struct kvm_snp_init {
 	__u64 flags;
 };
 
+struct kvm_sev_snp_launch_start {
+	__u64 policy;
+	__u64 ma_uaddr;
+	__u8 ma_en;
+	__u8 imi_en;
+	__u8 gosvw[16];
+};
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 24/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (22 preceding siblings ...)
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 23/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-16 20:01   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates Brijesh Singh
                   ` (16 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into the
guest's memory. The data is encrypted with the cryptographic context
created with KVM_SEV_SNP_LAUNCH_START.

In addition to inserting data, it can insert two special pages
into the guest's memory: the secrets page and the CPUID page.

For more information see the SEV-SNP specification.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 .../virt/kvm/amd-memory-encryption.rst        |  28 ++++
 arch/x86/kvm/svm/sev.c                        | 142 ++++++++++++++++++
 include/linux/sev.h                           |   2 +
 include/uapi/linux/kvm.h                      |  18 +++
 4 files changed, 190 insertions(+)

diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst
index 8620383d405a..60ace54438c3 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/amd-memory-encryption.rst
@@ -468,6 +468,34 @@ Returns: 0 on success, -negative on error
 
 See the SEV-SNP specification for further detail on the launch input.
 
+20. KVM_SNP_LAUNCH_UPDATE
+-------------------------
+
+The KVM_SNP_LAUNCH_UPDATE is used for encrypting a memory region. It also
+calculates a measurement of the memory contents. The measurement is a signature
+of the memory contents that can be sent to the guest owner as an attestation
+that the memory was encrypted correctly by the firmware.
+
+Parameters (in): struct kvm_sev_snp_launch_update
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_sev_snp_launch_update {
+                __u64 uaddr;            /* userspace address of memory to encrypt */
+                __u32 len;              /* length of memory region */
+                __u8 imi_page;          /* 1 if memory is part of the IMI */
+                __u8 page_type;         /* page type */
+                __u8 vmpl3_perms;       /* VMPL3 permission mask */
+                __u8 vmpl2_perms;       /* VMPL2 permission mask */
+                __u8 vmpl1_perms;       /* VMPL1 permission mask */
+        };
+
+See the SEV-SNP spec for further details on how to build the VMPL permission
+mask and page type.
+
+
 References
 ==========
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index f44a657e8912..1f0635ac9ff9 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -17,6 +17,7 @@
 #include <linux/misc_cgroup.h>
 #include <linux/processor.h>
 #include <linux/trace_events.h>
+#include <linux/sev.h>
 #include <asm/fpu/internal.h>
 
 #include <asm/trapnr.h>
@@ -1624,6 +1625,144 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	return rc;
 }
 
+static struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm, unsigned long hva)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *memslot;
+
+	kvm_for_each_memslot(memslot, slots) {
+		if (hva >= memslot->userspace_addr &&
+		    hva < memslot->userspace_addr + (memslot->npages << PAGE_SHIFT))
+			return memslot;
+	}
+
+	return NULL;
+}
+
+static bool hva_to_gpa(struct kvm *kvm, unsigned long hva, gpa_t *gpa)
+{
+	struct kvm_memory_slot *memslot;
+	gpa_t gpa_offset;
+
+	memslot = hva_to_memslot(kvm, hva);
+	if (!memslot)
+		return false;
+
+	gpa_offset = hva - memslot->userspace_addr;
+	*gpa = ((memslot->base_gfn << PAGE_SHIFT) + gpa_offset);
+
+	return true;
+}
+
+static int snp_page_reclaim(struct page *page, int rmppage_size)
+{
+	struct sev_data_snp_page_reclaim data = {};
+	struct rmpupdate e = {};
+	int rc, err;
+
+	data.paddr = __sme_page_pa(page) | rmppage_size;
+	rc = snp_guest_page_reclaim(&data, &err);
+	if (rc)
+		return rc;
+
+	return rmpupdate(page, &e);
+}
+
+static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	unsigned long npages, vaddr, vaddr_end, i, next_vaddr;
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_update data = {};
+	struct kvm_sev_snp_launch_update params;
+	int *error = &argp->error;
+	struct kvm_vcpu *vcpu;
+	struct page **inpages;
+	struct rmpupdate e;
+	int ret;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (!sev->snp_context)
+		return -EINVAL;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+		return -EFAULT;
+
+	data.gctx_paddr = __psp_pa(sev->snp_context);
+
+	/* Lock the user memory. */
+	inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);
+	if (!inpages)
+		return -ENOMEM;
+
+	vcpu = kvm_get_vcpu(kvm, 0);
+	vaddr = params.uaddr;
+	vaddr_end = vaddr + params.len;
+
+	for (i = 0; vaddr < vaddr_end; vaddr = next_vaddr, i++) {
+		unsigned long psize, pmask;
+		int level = PG_LEVEL_4K;
+		gpa_t gpa;
+
+		if (!hva_to_gpa(kvm, vaddr, &gpa)) {
+			ret = -EINVAL;
+			goto e_unpin;
+		}
+
+		psize = page_level_size(level);
+		pmask = page_level_mask(level);
+		gpa = gpa & pmask;
+
+		/* Transition the page state to pre-guest */
+		memset(&e, 0, sizeof(e));
+		e.assigned = 1;
+		e.gpa = gpa;
+		e.asid = sev_get_asid(kvm);
+		e.immutable = true;
+		e.pagesize = X86_TO_RMP_PG_LEVEL(level);
+		ret = rmpupdate(inpages[i], &e);
+		if (ret) {
+			ret = -EFAULT;
+			goto e_unpin;
+		}
+
+		data.address = __sme_page_pa(inpages[i]);
+		data.page_size = e.pagesize;
+		data.page_type = params.page_type;
+		data.vmpl3_perms = params.vmpl3_perms;
+		data.vmpl2_perms = params.vmpl2_perms;
+		data.vmpl1_perms = params.vmpl1_perms;
+		ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, &data, error);
+		if (ret) {
+			snp_page_reclaim(inpages[i], e.pagesize);
+			goto e_unpin;
+		}
+
+		next_vaddr = (vaddr & pmask) + psize;
+	}
+
+e_unpin:
+	/* Content of memory is updated, mark pages dirty */
+	memset(&e, 0, sizeof(e));
+	for (i = 0; i < npages; i++) {
+		set_page_dirty_lock(inpages[i]);
+		mark_page_accessed(inpages[i]);
+
+		/*
+		 * If there was an error, update the RMP entry to change the
+		 * page ownership back to the hypervisor.
+		 */
+		if (ret)
+			rmpupdate(inpages[i], &e);
+	}
+
+	/* Unlock the user pages */
+	sev_unpin_memory(kvm, inpages, npages);
+
+	return ret;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -1716,6 +1855,9 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_SNP_LAUNCH_START:
 		r = snp_launch_start(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_UPDATE:
+		r = snp_launch_update(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/include/linux/sev.h b/include/linux/sev.h
index bcd4d75d87c8..82e804a2ee0d 100644
--- a/include/linux/sev.h
+++ b/include/linux/sev.h
@@ -36,8 +36,10 @@ struct __packed rmpentry {
 
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
+#define RMP_PG_SIZE_2M			1
 
 #define RMP_TO_X86_PG_LEVEL(level)	(((level) == RMP_PG_SIZE_4K) ? PG_LEVEL_4K : PG_LEVEL_2M)
+#define X86_TO_RMP_PG_LEVEL(level)	(((level) == PG_LEVEL_4K) ? RMP_PG_SIZE_4K : RMP_PG_SIZE_2M)
 
 struct rmpupdate {
 	u64 gpa;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index dbd05179d8fa..c9b453fb31d4 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1681,6 +1681,7 @@ enum sev_cmd_id {
 	/* SNP specific commands */
 	KVM_SEV_SNP_INIT = 256,
 	KVM_SEV_SNP_LAUNCH_START,
+	KVM_SEV_SNP_LAUNCH_UPDATE,
 
 	KVM_SEV_NR_MAX,
 };
@@ -1790,6 +1791,23 @@ struct kvm_sev_snp_launch_start {
 	__u8 gosvw[16];
 };
 
+#define KVM_SEV_SNP_PAGE_TYPE_NORMAL		0x1
+#define KVM_SEV_SNP_PAGE_TYPE_VMSA		0x2
+#define KVM_SEV_SNP_PAGE_TYPE_ZERO		0x3
+#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED	0x4
+#define KVM_SEV_SNP_PAGE_TYPE_SECRETS		0x5
+#define KVM_SEV_SNP_PAGE_TYPE_CPUID		0x6
+
+struct kvm_sev_snp_launch_update {
+	__u64 uaddr;
+	__u32 len;
+	__u8 imi_page;
+	__u8 page_type;
+	__u8 vmpl3_perms;
+	__u8 vmpl2_perms;
+	__u8 vmpl1_perms;
+};
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (23 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 24/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-16 20:09   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command Brijesh Singh
                   ` (15 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The guest pages of an SEV-SNP VM may be added as private pages in the
RMP table (assigned bit is set). The guest private pages must be
transitioned back to the hypervisor state before they are freed.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 1f0635ac9ff9..4468995dd209 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1940,6 +1940,45 @@ find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
 static void __unregister_enc_region_locked(struct kvm *kvm,
 					   struct enc_region *region)
 {
+	struct rmpupdate val = {};
+	unsigned long i, pfn;
+	struct rmpentry *e;
+	int level, rc;
+
+	/*
+	 * The guest memory pages are assigned in the RMP table. Unassign them
+	 * before releasing the memory.
+	 */
+	if (sev_snp_guest(kvm)) {
+		for (i = 0; i < region->npages; i++) {
+			pfn = page_to_pfn(region->pages[i]);
+
+			if (need_resched())
+				schedule();
+
+			e = snp_lookup_page_in_rmptable(region->pages[i], &level);
+			if (unlikely(!e))
+				continue;
+
+			/* If it's not a guest-assigned page then skip it. */
+			if (!rmpentry_assigned(e))
+				continue;
+
+			/* Is the page part of a 2MB RMP entry? */
+			if (level == PG_LEVEL_2M) {
+				val.pagesize = RMP_PG_SIZE_2M;
+				pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+			} else {
+				val.pagesize = RMP_PG_SIZE_4K;
+			}
+
+			/* Transition the page to hypervisor owned. */
+			rc = rmpupdate(pfn_to_page(pfn), &val);
+			if (rc)
+				pr_err("Failed to release pfn 0x%lx ret=%d\n", pfn, rc);
+		}
+	}
+
 	sev_unpin_memory(kvm, region->pages, region->npages);
 	list_del(&region->list);
 	kfree(region);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (24 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-16 20:18   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 27/40] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP Brijesh Singh
                   ` (14 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The KVM_SEV_SNP_LAUNCH_FINISH command finalizes the cryptographic digest
and stores it as the measurement of the guest at launch.

While finalizing the launch flow, it also issues the LAUNCH_UPDATE command
to encrypt the VMSA pages.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 .../virt/kvm/amd-memory-encryption.rst        |  22 +++
 arch/x86/kvm/svm/sev.c                        | 125 ++++++++++++++++++
 include/uapi/linux/kvm.h                      |  13 ++
 3 files changed, 160 insertions(+)

diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst
index 60ace54438c3..a3d863e88869 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/amd-memory-encryption.rst
@@ -495,6 +495,28 @@ Returns: 0 on success, -negative on error
 See the SEV-SNP spec for further details on how to build the VMPL permission
 mask and page type.
 
+21. KVM_SNP_LAUNCH_FINISH
+-------------------------
+
+After completion of the SNP guest launch flow, the KVM_SNP_LAUNCH_FINISH command can be
+issued to make the guest ready for execution.
+
+Parameters (in): struct kvm_sev_snp_launch_finish
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_sev_snp_launch_finish {
+                __u64 id_block_uaddr;
+                __u64 id_auth_uaddr;
+                __u8 id_block_en;
+                __u8 auth_key_en;
+                __u8 host_data[32];
+        };
+
+
+See the SEV-SNP specification for further details on the launch finish input parameters.
 
 References
 ==========
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 4468995dd209..3f8824c9a5dc 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1763,6 +1763,111 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	return ret;
 }
 
+static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_update data = {};
+	int i, ret;
+
+	data.gctx_paddr = __psp_pa(sev->snp_context);
+	data.page_type = SNP_PAGE_TYPE_VMSA;
+
+	for (i = 0; i < kvm->created_vcpus; i++) {
+		struct vcpu_svm *svm = to_svm(kvm->vcpus[i]);
+		struct rmpupdate e = {};
+
+		/* Perform some pre-encryption checks against the VMSA */
+		ret = sev_es_sync_vmsa(svm);
+		if (ret)
+			return ret;
+
+		/* Transition the VMSA page to a firmware state. */
+		e.assigned = 1;
+		e.immutable = 1;
+		e.asid = sev->asid;
+		e.gpa = -1;
+		e.pagesize = RMP_PG_SIZE_4K;
+		ret = rmpupdate(virt_to_page(svm->vmsa), &e);
+		if (ret)
+			return ret;
+
+		/* Issue the SNP command to encrypt the VMSA */
+		data.address = __sme_pa(svm->vmsa);
+		ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
+				      &data, &argp->error);
+		if (ret) {
+			snp_page_reclaim(virt_to_page(svm->vmsa), RMP_PG_SIZE_4K);
+			return ret;
+		}
+
+		svm->vcpu.arch.guest_state_protected = true;
+	}
+
+	return 0;
+}
+
+static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_finish *data;
+	void *id_block = NULL, *id_auth = NULL;
+	struct kvm_sev_snp_launch_finish params;
+	int ret;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (!sev->snp_context)
+		return -EINVAL;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
+		return -EFAULT;
+
+	/* Measure all vCPUs using LAUNCH_UPDATE before we finalize the launch flow. */
+	ret = snp_launch_update_vmsa(kvm, argp);
+	if (ret)
+		return ret;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+	if (!data)
+		return -ENOMEM;
+
+	if (params.id_block_en) {
+		id_block = psp_copy_user_blob(params.id_block_uaddr, KVM_SEV_SNP_ID_BLOCK_SIZE);
+		if (IS_ERR(id_block)) {
+			ret = PTR_ERR(id_block);
+			goto e_free;
+		}
+
+		data->id_block_en = 1;
+		data->id_block_paddr = __sme_pa(id_block);
+	}
+
+	if (params.auth_key_en) {
+		id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
+		if (IS_ERR(id_auth)) {
+			ret = PTR_ERR(id_auth);
+			goto e_free_id_block;
+		}
+
+		data->auth_key_en = 1;
+		data->id_auth_paddr = __sme_pa(id_auth);
+	}
+
+	data->gctx_paddr = __psp_pa(sev->snp_context);
+	ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
+
+	kfree(id_auth);
+
+e_free_id_block:
+	kfree(id_block);
+
+e_free:
+	kfree(data);
+
+	return ret;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -1858,6 +1963,9 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_SNP_LAUNCH_UPDATE:
 		r = snp_launch_update(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_FINISH:
+		r = snp_launch_finish(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -2346,8 +2454,25 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
 
 	if (vcpu->arch.guest_state_protected)
 		sev_flush_guest_memory(svm, svm->vmsa, PAGE_SIZE);
+
+	/*
+	 * If it's an SNP guest, the VMSA was added to the RMP table as a guest-owned
+	 * page. Transition the page to hypervisor state before releasing it back.
+	 */
+	if (sev_snp_guest(vcpu->kvm)) {
+		struct rmpupdate e = {};
+		int rc;
+
+		rc = rmpupdate(virt_to_page(svm->vmsa), &e);
+		if (rc) {
+			pr_err("Failed to release SNP guest VMSA page (rc %d), leaking it\n", rc);
+			goto skip_vmsa_free;
+		}
+	}
+
 	__free_page(virt_to_page(svm->vmsa));
 
+skip_vmsa_free:
 	if (svm->ghcb_sa_free)
 		kfree(svm->ghcb_sa);
 }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c9b453fb31d4..fb3f6e1defd9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1682,6 +1682,7 @@ enum sev_cmd_id {
 	KVM_SEV_SNP_INIT = 256,
 	KVM_SEV_SNP_LAUNCH_START,
 	KVM_SEV_SNP_LAUNCH_UPDATE,
+	KVM_SEV_SNP_LAUNCH_FINISH,
 
 	KVM_SEV_NR_MAX,
 };
@@ -1808,6 +1809,18 @@ struct kvm_sev_snp_launch_update {
 	__u8 vmpl1_perms;
 };
 
+#define KVM_SEV_SNP_ID_BLOCK_SIZE	96
+#define KVM_SEV_SNP_ID_AUTH_SIZE	4096
+#define KVM_SEV_SNP_FINISH_DATA_SIZE	32
+
+struct kvm_sev_snp_launch_finish {
+	__u64 id_block_uaddr;
+	__u64 id_auth_uaddr;
+	__u8 id_block_en;
+	__u8 auth_key_en;
+	__u8 host_data[KVM_SEV_SNP_FINISH_DATA_SIZE];
+};
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 27/40] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (25 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-16 19:19   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 28/40] KVM: X86: Introduce kvm_mmu_map_tdp_page() for use by SEV Brijesh Singh
                   ` (13 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

When running an SEV-SNP VM, the sPA used to index the RMP entry is
obtained through the TDP translation (gva->gpa->spa). The TDP page
level is checked against the page level programmed in the RMP entry.
If the page level does not match, then it will cause a nested page
fault with the RMP bit set to indicate the RMP violation.

To keep the TDP and RMP page levels in sync, the KVM fault handler
kvm_handle_page_fault() will call get_tdp_max_page_level() to get
the maximum allowed page level so that it can limit the TDP level.

In the case of an SEV-SNP guest, get_tdp_max_page_level() will consult
the RMP table to compute the maximum allowed page level for a given
GPA.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu/mmu.c          |  6 ++++--
 arch/x86/kvm/svm/sev.c          | 20 ++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  1 +
 arch/x86/kvm/svm/svm.h          |  1 +
 arch/x86/kvm/vmx/vmx.c          |  8 ++++++++
 6 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 188110ab2c02..cd2e19e1d323 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1384,6 +1384,7 @@ struct kvm_x86_ops {
 
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
+	int (*get_tdp_max_page_level)(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0144c40d09c7..7991ffae7b31 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3781,11 +3781,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
 				u32 error_code, bool prefault)
 {
+	int max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, PG_LEVEL_2M);
+
 	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
 
 	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
 	return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code, prefault,
-				 PG_LEVEL_2M, false);
+				 max_level, false);
 }
 
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
@@ -3826,7 +3828,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 {
 	int max_level;
 
-	for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
+	for (max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, KVM_MAX_HUGEPAGE_LEVEL);
 	     max_level > PG_LEVEL_4K;
 	     max_level--) {
 		int page_num = KVM_PAGES_PER_HPAGE(max_level);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 3f8824c9a5dc..fd2d00ad80b7 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3206,3 +3206,23 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return pfn_to_page(pfn);
 }
+
+int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
+{
+	struct rmpentry *e;
+	kvm_pfn_t pfn;
+	int level;
+
+	if (!sev_snp_guest(vcpu->kvm))
+		return max_level;
+
+	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
+	if (is_error_noslot_pfn(pfn))
+		return max_level;
+
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
+	if (unlikely(!e))
+		return max_level;
+
+	return min_t(uint32_t, level, max_level);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index a7adf6ca1713..2632eae52aa3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4576,6 +4576,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
 
 	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
+	.get_tdp_max_page_level = sev_get_tdp_max_page_level,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index bc5582b44356..32abcbd774d0 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -568,6 +568,7 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
 void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu);
 void sev_es_unmap_ghcb(struct vcpu_svm *svm);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
+int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
 
 /* vmenter.S */
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4bceb5ca3a89..fbc9034edf16 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7612,6 +7612,12 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
 	return supported & BIT(bit);
 }
 
+
+static int vmx_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
+{
+	return max_level;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.hardware_unsetup = hardware_unsetup,
 
@@ -7742,6 +7748,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+	.get_tdp_max_page_level = vmx_get_tdp_max_page_level,
 };
 
 static __init void vmx_setup_user_return_msrs(void)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 28/40] KVM: X86: Introduce kvm_mmu_map_tdp_page() for use by SEV
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (26 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 27/40] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-16 18:15   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 29/40] KVM: X86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use Brijesh Singh
                   ` (12 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

Introduce a helper to directly fault in a TDP page without going through
the full page fault path. This allows SEV-SNP to build the nested page
table while handling the page state change VMGEXIT. A guest may issue a
page state change VMGEXIT before accessing the page. Create a fault so
that the VMGEXIT handler can get the TDP page level and keep the TDP and
RMP page levels in sync.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/mmu.h     |  2 ++
 arch/x86/kvm/mmu/mmu.c | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 88d0ed5225a4..005ce139c97d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -114,6 +114,8 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		       bool prefault);
 
+int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, int max_level);
+
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 					u32 err, bool prefault)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7991ffae7b31..df8923fb664f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3842,6 +3842,26 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 				 max_level, true);
 }
 
+int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, int max_level)
+{
+	int r;
+
+	/*
+	 * Loop on the page fault path to handle the case where an mmu_notifier
+	 * invalidation triggers RET_PF_RETRY.  In the normal page fault path,
+	 * KVM needs to resume the guest in case the invalidation changed any
+	 * of the page fault properties, i.e. the gpa or error code.  For this
+	 * path, the gpa and error code are fixed by the caller, and the caller
+	 * expects failure if and only if the page fault can't be fixed.
+	 */
+	do {
+		r = direct_page_fault(vcpu, gpa, error_code, false, max_level, true);
+	} while (r == RET_PF_RETRY);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
 static void nonpaging_init_context(struct kvm_vcpu *vcpu,
 				   struct kvm_mmu *context)
 {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 29/40] KVM: X86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (27 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 28/40] KVM: X86: Introduce kvm_mmu_map_tdp_page() for use by SEV Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 30/40] KVM: X86: Define new RMP check related #NPF error bits Brijesh Singh
                   ` (11 subsequent siblings)
  40 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The SEV-SNP VMs may call the page state change VMGEXIT to add a GPA
as private or shared in the RMP table. The page state change VMGEXIT
will contain the RMP page level to be used in the RMP entry. If the
page level between the TDP and RMP does not match, then it will result
in a nested page fault (RMP violation).

The SEV-SNP VMGEXIT handler will use kvm_mmu_get_tdp_walk() to get
the current page level in the TDP for the given GPA and calculate a
workable page level. If a GPA is mapped as a 4K page in the TDP, but
the guest requested to add the GPA as 2M in the RMP entry, then the
2M request will be broken into 4K pages to keep the RMP and TDP
page levels in sync.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/mmu.h     |  1 +
 arch/x86/kvm/mmu/mmu.c | 29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 005ce139c97d..147e76ab1536 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -115,6 +115,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		       bool prefault);
 
 int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, int max_level);
+bool kvm_mmu_get_tdp_walk(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level);
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 					u32 err, bool prefault)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index df8923fb664f..4abc0dc49d55 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3862,6 +3862,35 @@ int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, int m
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
 
+bool kvm_mmu_get_tdp_walk(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t *pfn, int *level)
+{
+	u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
+	int leaf, root;
+
+	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+		leaf = kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, &root);
+	else
+		leaf = get_walk(vcpu, gpa, sptes, &root);
+
+	if (unlikely(leaf < 0))
+		return false;
+
+	/* Check if the leaf SPTE is present */
+	if (!is_shadow_present_pte(sptes[leaf]))
+		return false;
+
+	*pfn = spte_to_pfn(sptes[leaf]);
+	if (leaf > PG_LEVEL_4K) {
+		u64 page_mask = KVM_PAGES_PER_HPAGE(leaf) - 1;
+		*pfn |= (gpa_to_gfn(gpa) & page_mask);
+	}
+
+	*level = leaf;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_get_tdp_walk);
+
 static void nonpaging_init_context(struct kvm_vcpu *vcpu,
 				   struct kvm_mmu *context)
 {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 30/40] KVM: X86: Define new RMP check related #NPF error bits
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (28 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 29/40] KVM: X86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-16 20:22   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 31/40] KVM: X86: update page-fault trace to log the 64-bit error code Brijesh Singh
                   ` (10 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

When SEV-SNP is enabled globally, the hardware places restrictions on all
memory accesses based on the RMP entry, whether the hypervisor or a VM
performs the accesses. When hardware encounters an RMP access violation
during a guest access, it causes a #VMEXIT(NPF).

See APM2 section 16.36.10 for more details.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cd2e19e1d323..59185b6bc82a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -239,8 +239,12 @@ enum x86_intercept_stage;
 #define PFERR_FETCH_BIT 4
 #define PFERR_PK_BIT 5
 #define PFERR_SGX_BIT 15
+#define PFERR_GUEST_RMP_BIT 31
 #define PFERR_GUEST_FINAL_BIT 32
 #define PFERR_GUEST_PAGE_BIT 33
+#define PFERR_GUEST_ENC_BIT 34
+#define PFERR_GUEST_SIZEM_BIT 35
+#define PFERR_GUEST_VMPL_BIT 36
 
 #define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
 #define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
@@ -251,6 +255,10 @@ enum x86_intercept_stage;
 #define PFERR_SGX_MASK (1U << PFERR_SGX_BIT)
 #define PFERR_GUEST_FINAL_MASK (1ULL << PFERR_GUEST_FINAL_BIT)
 #define PFERR_GUEST_PAGE_MASK (1ULL << PFERR_GUEST_PAGE_BIT)
+#define PFERR_GUEST_RMP_MASK (1ULL << PFERR_GUEST_RMP_BIT)
+#define PFERR_GUEST_ENC_MASK (1ULL << PFERR_GUEST_ENC_BIT)
+#define PFERR_GUEST_SIZEM_MASK (1ULL << PFERR_GUEST_SIZEM_BIT)
+#define PFERR_GUEST_VMPL_MASK (1ULL << PFERR_GUEST_VMPL_BIT)
 
 #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK |	\
 				 PFERR_WRITE_MASK |		\
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 31/40] KVM: X86: update page-fault trace to log the 64-bit error code
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (29 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 30/40] KVM: X86: Define new RMP check related #NPF error bits Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-16 20:25   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 32/40] KVM: SVM: Add support to handle GHCB GPA register VMGEXIT Brijesh Singh
                   ` (9 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The page-fault error code is a 64-bit value, but the trace prints only
the lower 32 bits. Some of the SEV-SNP RMP fault error-code bits are
carried in the upper 32 bits.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/trace.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index a61c015870e3..78cbf53bf412 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -365,12 +365,12 @@ TRACE_EVENT(kvm_inj_exception,
  * Tracepoint for page fault.
  */
 TRACE_EVENT(kvm_page_fault,
-	TP_PROTO(unsigned long fault_address, unsigned int error_code),
+	TP_PROTO(unsigned long fault_address, u64 error_code),
 	TP_ARGS(fault_address, error_code),
 
 	TP_STRUCT__entry(
 		__field(	unsigned long,	fault_address	)
-		__field(	unsigned int,	error_code	)
+		__field(	u64,		error_code	)
 	),
 
 	TP_fast_assign(
@@ -378,7 +378,7 @@ TRACE_EVENT(kvm_page_fault,
 		__entry->error_code	= error_code;
 	),
 
-	TP_printk("address %lx error_code %x",
+	TP_printk("address %lx error_code %llx",
 		  __entry->fault_address, __entry->error_code)
 );
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 32/40] KVM: SVM: Add support to handle GHCB GPA register VMGEXIT
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (30 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 31/40] KVM: X86: update page-fault trace to log the 64-bit error code Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-16 20:45   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 33/40] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT Brijesh Singh
                   ` (8 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

SEV-SNP guests are required to perform GHCB GPA registration (see
section 2.5.2 in the GHCB specification). Before using a GHCB GPA for a
vCPU for the first time, a guest must register the vCPU GHCB GPA. If the
hypervisor can work with the guest-requested GPA, then it must respond
with the same GPA; otherwise, it returns -1.

On VMGEXIT, verify that the GHCB GPA matches the registered value. If a
mismatch is detected, then abort the guest.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/sev-common.h |  2 ++
 arch/x86/kvm/svm/sev.c            | 25 +++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h            |  7 +++++++
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 466baa9cd0f5..6990d5a9d73c 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -60,8 +60,10 @@
 	GHCB_MSR_GPA_REG_REQ)
 
 #define GHCB_MSR_GPA_REG_RESP		0x013
+#define GHCB_MSR_GPA_REG_ERROR		GENMASK_ULL(51, 0)
 #define GHCB_MSR_GPA_REG_RESP_VAL(v)	((v) >> GHCB_MSR_GPA_REG_VALUE_POS)
 
+
 /* SNP Page State Change */
 #define GHCB_MSR_PSC_REQ		0x014
 #define SNP_PAGE_STATE_PRIVATE		1
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index fd2d00ad80b7..3af5d1ad41bf 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2922,6 +2922,25 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
 		break;
 	}
+	case GHCB_MSR_GPA_REG_REQ: {
+		kvm_pfn_t pfn;
+		u64 gfn;
+
+		gfn = get_ghcb_msr_bits(svm, GHCB_MSR_GPA_REG_GFN_MASK,
+					GHCB_MSR_GPA_REG_VALUE_POS);
+
+		pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);
+		if (is_error_noslot_pfn(pfn))
+			gfn = GHCB_MSR_GPA_REG_ERROR;
+		else
+			svm->ghcb_registered_gpa = gfn_to_gpa(gfn);
+
+		set_ghcb_msr_bits(svm, gfn, GHCB_MSR_GPA_REG_GFN_MASK,
+				  GHCB_MSR_GPA_REG_VALUE_POS);
+		set_ghcb_msr_bits(svm, GHCB_MSR_GPA_REG_RESP, GHCB_MSR_INFO_MASK,
+				  GHCB_MSR_INFO_POS);
+		break;
+	}
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
@@ -2970,6 +2989,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 		return -EINVAL;
 	}
 
+	/* An SEV-SNP guest requires that the GHCB GPA be registered */
+	if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) {
+		vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa);
+		return -EINVAL;
+	}
+
 	svm->ghcb = svm->ghcb_map.hva;
 	ghcb = svm->ghcb_map.hva;
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 32abcbd774d0..af4cce39b30f 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -185,6 +185,8 @@ struct vcpu_svm {
 	bool ghcb_sa_free;
 
 	bool guest_state_loaded;
+
+	u64 ghcb_registered_gpa;
 };
 
 struct svm_cpu_data {
@@ -245,6 +247,11 @@ static inline bool sev_snp_guest(struct kvm *kvm)
 #endif
 }
 
+static inline bool ghcb_gpa_is_registered(struct vcpu_svm *svm, u64 val)
+{
+	return svm->ghcb_registered_gpa == val;
+}
+
 static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
 {
 	vmcb->control.clean = 0;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 33/40] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (31 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 32/40] KVM: SVM: Add support to handle GHCB GPA register VMGEXIT Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-16 21:00   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 34/40] KVM: SVM: Add support to handle " Brijesh Singh
                   ` (7 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change MSR protocol
as defined in the GHCB specification.

Before changing the page state in the RMP entry, we look up the page in
the TDP to make sure that there is a valid mapping for it. If the mapping
exists, then try to find a workable page level between the TDP and the RMP
for the page. If the page is not mapped in the TDP, then create a fault
such that it gets mapped before we change the page state in the RMP entry.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/sev-common.h |   3 +
 arch/x86/kvm/svm/sev.c            | 141 ++++++++++++++++++++++++++++++
 2 files changed, 144 insertions(+)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 6990d5a9d73c..2561413cb316 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -81,6 +81,9 @@
 
 #define GHCB_MSR_PSC_RESP		0x015
 #define GHCB_MSR_PSC_ERROR_POS		32
+#define GHCB_MSR_PSC_ERROR_MASK		GENMASK_ULL(31, 0)
+#define GHCB_MSR_PSC_RSVD_POS		12
+#define GHCB_MSR_PSC_RSVD_MASK		GENMASK_ULL(19, 0)
 #define GHCB_MSR_PSC_RESP_VAL(val)	((val) >> GHCB_MSR_PSC_ERROR_POS)
 
 /* GHCB Hypervisor Feature Request */
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 3af5d1ad41bf..68d275b2a660 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -28,6 +28,7 @@
 #include "svm_ops.h"
 #include "cpuid.h"
 #include "trace.h"
+#include "mmu.h"
 
 #define __ex(x) __kvm_handle_fault_on_reboot(x)
 
@@ -2843,6 +2844,127 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(struct kvm_vcpu *vcpu, kvm_pfn_t pfn)
+{
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	return psmash(pfn_to_page(pfn));
+}
+
+static int snp_make_page_shared(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn, int level)
+{
+	struct rmpupdate val;
+	int rc, rmp_level;
+	struct rmpentry *e;
+
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
+	if (!e)
+		return -EINVAL;
+
+	if (!rmpentry_assigned(e))
+		return 0;
+
+	/* Log if the entry is validated */
+	if (rmpentry_validated(e))
+		pr_warn_ratelimited("Remove RMP entry for a validated gpa 0x%llx\n", gpa);
+
+	/*
+	 * Is the page part of an existing 2M RMP entry? If so, split the 2M entry
+	 * into multiple 4K entries before making the memory shared.
+	 */
+	if ((level == PG_LEVEL_4K) && (rmp_level == PG_LEVEL_2M)) {
+		rc = snp_rmptable_psmash(vcpu, pfn);
+		if (rc)
+			return rc;
+	}
+
+	memset(&val, 0, sizeof(val));
+	val.pagesize = X86_TO_RMP_PG_LEVEL(level);
+	return rmpupdate(pfn_to_page(pfn), &val);
+}
+
+static int snp_make_page_private(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn, int level)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
+	struct rmpupdate val;
+	struct rmpentry *e;
+	int rmp_level;
+
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
+	if (!e)
+		return -EINVAL;
+
+	/* Log if the entry is validated */
+	if (rmpentry_validated(e))
+		pr_warn_ratelimited("Asked to make a pre-validated gpa %llx private\n", gpa);
+
+	memset(&val, 0, sizeof(val));
+	val.gpa = gpa;
+	val.asid = sev->asid;
+	val.pagesize = X86_TO_RMP_PG_LEVEL(level);
+	val.assigned = true;
+
+	return rmpupdate(pfn_to_page(pfn), &val);
+}
+
+static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level)
+{
+	struct kvm *kvm = vcpu->kvm;
+	int rc, tdp_level;
+	kvm_pfn_t pfn;
+	gpa_t gpa_end;
+
+	gpa_end = gpa + page_level_size(level);
+
+	while (gpa < gpa_end) {
+		/*
+		 * Get the pfn and level for the gpa from the nested page table.
+		 *
+		 * If the TDP walk failed, then it's safe to say that we don't have a valid
+		 * mapping for the gpa in the nested page table. Create a fault to map the
+		 * page in the nested page table.
+		 */
+		if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) {
+			pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_USER_MASK, level);
+			if (is_error_noslot_pfn(pfn))
+				goto out;
+
+			if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level))
+				goto out;
+		}
+
+		/* Adjust the level so that we don't go higher than the backing page level */
+		level = min_t(size_t, level, tdp_level);
+
+		write_lock(&kvm->mmu_lock);
+
+		switch (op) {
+		case SNP_PAGE_STATE_SHARED:
+			rc = snp_make_page_shared(vcpu, gpa, pfn, level);
+			break;
+		case SNP_PAGE_STATE_PRIVATE:
+			rc = snp_make_page_private(vcpu, gpa, pfn, level);
+			break;
+		default:
+			rc = -EINVAL;
+			break;
+		}
+
+		write_unlock(&kvm->mmu_lock);
+
+		if (rc) {
+			pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
+					   op, gpa, pfn, level, rc);
+			goto out;
+		}
+
+		gpa = gpa + page_level_size(level);
+	}
+
+out:
+	return rc;
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -2941,6 +3063,25 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				  GHCB_MSR_INFO_POS);
 		break;
 	}
+	case GHCB_MSR_PSC_REQ: {
+		gfn_t gfn;
+		int ret;
+		u8 op;
+
+		gfn = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_GFN_MASK, GHCB_MSR_PSC_GFN_POS);
+		op = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_OP_MASK, GHCB_MSR_PSC_OP_POS);
+
+		ret = __snp_handle_psc(vcpu, op, gfn_to_gpa(gfn), PG_LEVEL_4K);
+
+		/* If we failed to change the state, the spec requires returning all F's */
+		if (ret)
+			ret = -1;
+
+		set_ghcb_msr_bits(svm, ret, GHCB_MSR_PSC_ERROR_MASK, GHCB_MSR_PSC_ERROR_POS);
+		set_ghcb_msr_bits(svm, 0, GHCB_MSR_PSC_RSVD_MASK, GHCB_MSR_PSC_RSVD_POS);
+		set_ghcb_msr_bits(svm, GHCB_MSR_PSC_RESP, GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
+		break;
+	}
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 34/40] KVM: SVM: Add support to handle Page State Change VMGEXIT
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (32 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 33/40] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-16 21:14   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 35/40] KVM: Add arch hooks to track the host write to guest memory Brijesh Singh
                   ` (6 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change NAE event
as defined in the GHCB specification section 4.1.6.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/sev-common.h |  7 +++
 arch/x86/kvm/svm/sev.c            | 80 ++++++++++++++++++++++++++++++-
 include/linux/sev.h               |  3 ++
 3 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 2561413cb316..a02175752f2d 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -101,6 +101,13 @@
 /* SNP Page State Change NAE event */
 #define VMGEXIT_PSC_MAX_ENTRY		253
 
+/* The page state change hdr structure is not valid */
+#define PSC_INVALID_HDR			1
+/* The hdr.cur_entry or hdr.end_entry is not valid */
+#define PSC_INVALID_ENTRY		2
+/* Page state change encountered undefined error */
+#define PSC_UNDEF_ERR			3
+
 struct __packed psc_hdr {
 	u16 cur_entry;
 	u16 end_entry;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 68d275b2a660..0155d9b3127d 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2662,6 +2662,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 	case SVM_VMGEXIT_AP_JUMP_TABLE:
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 	case SVM_VMGEXIT_HV_FT:
+	case SVM_VMGEXIT_PSC:
 		break;
 	default:
 		goto vmgexit_err;
@@ -2910,7 +2911,8 @@ static int snp_make_page_private(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn
 static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level)
 {
 	struct kvm *kvm = vcpu->kvm;
-	int rc, tdp_level;
+	int rc = PSC_UNDEF_ERR;
+	int tdp_level;
 	kvm_pfn_t pfn;
 	gpa_t gpa_end;
 
@@ -2945,8 +2947,11 @@ static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level)
 		case SNP_PAGE_STATE_PRIVATE:
 			rc = snp_make_page_private(vcpu, gpa, pfn, level);
 			break;
+		case SNP_PAGE_STATE_PSMASH:
+		case SNP_PAGE_STATE_UNSMASH:
+			/* TODO: Add support to handle it */
 		default:
-			rc = -EINVAL;
+			rc = PSC_INVALID_ENTRY;
 			break;
 		}
 
@@ -2965,6 +2970,68 @@ static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level)
 	return rc;
 }
 
+static inline unsigned long map_to_psc_vmgexit_code(int rc)
+{
+	switch (rc) {
+	case PSC_INVALID_HDR:
+		return ((1ul << 32) | 1);
+	case PSC_INVALID_ENTRY:
+		return ((1ul << 32) | 2);
+	case RMPUPDATE_FAIL_OVERLAP:
+		return ((3ul << 32) | 2);
+	default: return (4ul << 32);
+	}
+}
+
+static unsigned long snp_handle_psc(struct vcpu_svm *svm, struct ghcb *ghcb)
+{
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	int level, op, rc = PSC_UNDEF_ERR;
+	struct snp_psc_desc *info;
+	struct psc_entry *entry;
+	gpa_t gpa;
+
+	if (!sev_snp_guest(vcpu->kvm))
+		goto out;
+
+	if (!setup_vmgexit_scratch(svm, true, sizeof(ghcb->save.sw_scratch))) {
+		pr_err("vmgexit: scratch area is not setup.\n");
+		rc = PSC_INVALID_HDR;
+		goto out;
+	}
+
+	info = (struct snp_psc_desc *)svm->ghcb_sa;
+	entry = &info->entries[info->hdr.cur_entry];
+
+	if ((info->hdr.cur_entry >= VMGEXIT_PSC_MAX_ENTRY) ||
+	    (info->hdr.end_entry >= VMGEXIT_PSC_MAX_ENTRY) ||
+	    (info->hdr.cur_entry > info->hdr.end_entry)) {
+		rc = PSC_INVALID_ENTRY;
+		goto out;
+	}
+
+	while (info->hdr.cur_entry <= info->hdr.end_entry) {
+		entry = &info->entries[info->hdr.cur_entry];
+		gpa = gfn_to_gpa(entry->gfn);
+		level = RMP_TO_X86_PG_LEVEL(entry->pagesize);
+		op = entry->operation;
+
+		if (!IS_ALIGNED(gpa, page_level_size(level))) {
+			rc = PSC_INVALID_ENTRY;
+			goto out;
+		}
+
+		rc = __snp_handle_psc(vcpu, op, gpa, level);
+		if (rc)
+			goto out;
+
+		info->hdr.cur_entry++;
+	}
+
+out:
+	return rc ? map_to_psc_vmgexit_code(rc) : 0;
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3209,6 +3276,15 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 		ret = 1;
 		break;
 	}
+	case SVM_VMGEXIT_PSC: {
+		unsigned long rc;
+
+		ret = 1;
+
+		rc = snp_handle_psc(svm, ghcb);
+		ghcb_set_sw_exit_info_2(ghcb, rc);
+		break;
+	}
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/include/linux/sev.h b/include/linux/sev.h
index 82e804a2ee0d..d96900b52aa5 100644
--- a/include/linux/sev.h
+++ b/include/linux/sev.h
@@ -57,6 +57,9 @@ struct rmpupdate {
  */
 #define FAIL_INUSE              3
 
+/* RMPUPDATE detected a 4K page and 2MB page overlap. */
+#define RMPUPDATE_FAIL_OVERLAP	7
+
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level);
 int psmash(struct page *page);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 35/40] KVM: Add arch hooks to track the host write to guest memory
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (33 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 34/40] KVM: SVM: Add support to handle " Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-19 23:30   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 36/40] KVM: X86: Export the kvm_zap_gfn_range() for the SNP use Brijesh Singh
                   ` (5 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

The kvm_write_guest{_page} and kvm_vcpu_write_guest{_page} helpers are
used by the hypervisor to write to guest memory. The kvm_vcpu_map() and
kvm_map_gfn() helpers are used by the hypervisor to map guest memory and
access it later.

When SEV-SNP is enabled in the guest VM, the guest memory pages can be
either private or shared. A write from the hypervisor goes through the
RMP checks. If hardware sees that the hypervisor is attempting to write
to a guest private page, then it triggers an RMP violation (i.e., #PF
with the RMP bit set).

Enhance the KVM guest write helpers to invoke architecture-specific
hooks (kvm_arch_write_gfn_{begin,end}) to track write accesses from the
hypervisor.

When SEV-SNP is enabled, the guest uses the Page State Change VMGEXIT to
ask the hypervisor to change the page state from shared to private or
vice versa. While changing the page state to private, use
kvm_host_write_track_is_active() to check whether the page is being
tracked for host write access (i.e., either mapped or kvm_write_guest()
is in progress). If it is tracked, then do not change the page state.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |  6 +++
 arch/x86/kvm/svm/sev.c          | 51 +++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  2 +
 arch/x86/kvm/svm/svm.h          |  1 +
 arch/x86/kvm/x86.c              | 78 +++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h        |  3 ++
 virt/kvm/kvm_main.c             | 21 +++++++--
 7 files changed, 159 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 59185b6bc82a..678992e9966a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -865,10 +865,13 @@ struct kvm_lpage_info {
 	int disallow_lpage;
 };
 
+bool kvm_host_write_track_is_active(struct kvm *kvm, gfn_t gfn);
+
 struct kvm_arch_memory_slot {
 	struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];
 	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 	unsigned short *gfn_track[KVM_PAGE_TRACK_MAX];
+	unsigned short *host_write_track[KVM_PAGE_TRACK_MAX];
 };
 
 /*
@@ -1393,6 +1396,9 @@ struct kvm_x86_ops {
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
 	int (*get_tdp_max_page_level)(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
+
+	void (*write_page_begin)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
+	void (*write_page_end)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 0155d9b3127d..839cf321c6dd 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2884,6 +2884,19 @@ static int snp_make_page_shared(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
 	return rmpupdate(pfn_to_page(pfn), &val);
 }
 
+static inline bool kvm_host_write_track_gpa_range_is_active(struct kvm *kvm,
+							    gpa_t start, gpa_t end)
+{
+	while (start < end) {
+		if (kvm_host_write_track_is_active(kvm, gpa_to_gfn(start)))
+			return true;
+
+		start += PAGE_SIZE;
+	}
+
+	return false;
+}
+
 static int snp_make_page_private(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn, int level)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
@@ -2895,6 +2908,14 @@ static int snp_make_page_private(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn
 	if (!e)
 		return -EINVAL;
 
+	/*
+	 * If the GPA is tracked for the write access then do not change the
+	 * page state from shared to private.
+	 */
+	if (kvm_host_write_track_gpa_range_is_active(vcpu->kvm,
+		gpa, gpa + page_level_size(level)))
+		return -EBUSY;
+
 	/* Log if the entry is validated */
 	if (rmpentry_validated(e))
 		pr_warn_ratelimited("Asked to make a pre-validated gpa %llx private\n", gpa);
@@ -3468,3 +3489,33 @@ int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
 
 	return min_t(uint32_t, level, max_level);
 }
+
+void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	struct rmpentry *e;
+	int level, rc;
+	kvm_pfn_t pfn;
+
+	if (!sev_snp_guest(kvm))
+		return;
+
+	pfn = gfn_to_pfn(kvm, gfn);
+	if (is_error_noslot_pfn(pfn))
+		return;
+
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
+	if (unlikely(!e))
+		return;
+
+	/*
+	 * A hypervisor should never write to a guest private page. A write to a
+	 * guest private page will cause an RMP violation. If the guest page is
+	 * private, then make it shared.
+	 */
+	if (rmpentry_assigned(e)) {
+		pr_err("SEV-SNP: write to guest private gfn %llx\n", gfn);
+		rc = snp_make_page_shared(kvm_get_vcpu(kvm, 0),
+				gfn << PAGE_SHIFT, pfn, PG_LEVEL_4K);
+		BUG_ON(rc != 0);
+	}
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2632eae52aa3..4ff6fc86dd18 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4577,6 +4577,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 
 	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
 	.get_tdp_max_page_level = sev_get_tdp_max_page_level,
+
+	.write_page_begin = sev_snp_write_page_begin,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index af4cce39b30f..e0276ad8a1ae 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -576,6 +576,7 @@ void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu);
 void sev_es_unmap_ghcb(struct vcpu_svm *svm);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
 int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
+void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
 
 /* vmenter.S */
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bbc4e04e67ad..1398b8021982 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9076,6 +9076,48 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
 }
 
+static void update_gfn_track(struct kvm_memory_slot *slot, gfn_t gfn,
+			     enum kvm_page_track_mode mode, short count)
+{
+	int index, val;
+
+	index = gfn_to_index(gfn, slot->base_gfn, PG_LEVEL_4K);
+
+	val = slot->arch.host_write_track[mode][index];
+
+	if (WARN_ON(val + count < 0 || val + count > USHRT_MAX))
+		return;
+
+	slot->arch.host_write_track[mode][index] += count;
+}
+
+bool kvm_host_write_track_is_active(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *slot;
+	int index;
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!slot)
+		return false;
+
+	index = gfn_to_index(gfn, slot->base_gfn, PG_LEVEL_4K);
+	return !!READ_ONCE(slot->arch.host_write_track[KVM_PAGE_TRACK_WRITE][index]);
+}
+EXPORT_SYMBOL_GPL(kvm_host_write_track_is_active);
+
+void kvm_arch_write_gfn_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	update_gfn_track(slot, gfn, KVM_PAGE_TRACK_WRITE, 1);
+
+	if (kvm_x86_ops.write_page_begin)
+		kvm_x86_ops.write_page_begin(kvm, slot, gfn);
+}
+
+void kvm_arch_write_gfn_end(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	update_gfn_track(slot, gfn, KVM_PAGE_TRACK_WRITE, -1);
+}
+
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
 	if (!lapic_in_kernel(vcpu))
@@ -10896,6 +10938,36 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_hv_destroy_vm(kvm);
 }
 
+static void kvm_write_page_track_free_memslot(struct kvm_memory_slot *slot)
+{
+	int i;
+
+	for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
+		kvfree(slot->arch.host_write_track[i]);
+		slot->arch.host_write_track[i] = NULL;
+	}
+}
+
+static int kvm_write_page_track_create_memslot(struct kvm_memory_slot *slot,
+					       unsigned long npages)
+{
+	int  i;
+
+	for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
+		slot->arch.host_write_track[i] =
+			kvcalloc(npages, sizeof(*slot->arch.host_write_track[i]),
+				 GFP_KERNEL_ACCOUNT);
+		if (!slot->arch.host_write_track[i])
+			goto track_free;
+	}
+
+	return 0;
+
+track_free:
+	kvm_write_page_track_free_memslot(slot);
+	return -ENOMEM;
+}
+
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 	int i;
@@ -10969,8 +11041,14 @@ static int kvm_alloc_memslot_metadata(struct kvm_memory_slot *slot,
 	if (kvm_page_track_create_memslot(slot, npages))
 		goto out_free;
 
+	if (kvm_write_page_track_create_memslot(slot, npages))
+		goto e_free_page_track;
+
 	return 0;
 
+e_free_page_track:
+	kvm_page_track_free_memslot(slot);
+
 out_free:
 	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
 		kvfree(slot->arch.rmap[i]);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2f34487e21f2..f22e22cd2179 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1550,6 +1550,9 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 					    unsigned long start, unsigned long end);
 
+void kvm_arch_write_gfn_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
+void kvm_arch_write_gfn_end(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
+
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
 int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
 #else
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6b4feb92dc79..bc805c15d0de 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -160,6 +160,14 @@ __weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 {
 }
 
+__weak void kvm_arch_write_gfn_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn)
+{
+}
+
+__weak void kvm_arch_write_gfn_end(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn)
+{
+}
+
 bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
 {
 	/*
@@ -2309,7 +2317,8 @@ static void kvm_cache_gfn_to_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
 	cache->generation = gen;
 }
 
-static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
+static int __kvm_map_gfn(struct kvm *kvm,
+			 struct kvm_memslots *slots, gfn_t gfn,
 			 struct kvm_host_map *map,
 			 struct gfn_to_pfn_cache *cache,
 			 bool atomic)
@@ -2361,20 +2370,22 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
 	map->pfn = pfn;
 	map->gfn = gfn;
 
+	kvm_arch_write_gfn_begin(kvm, slot, map->gfn);
+
 	return 0;
 }
 
 int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map,
 		struct gfn_to_pfn_cache *cache, bool atomic)
 {
-	return __kvm_map_gfn(kvm_memslots(vcpu->kvm), gfn, map,
+	return __kvm_map_gfn(vcpu->kvm, kvm_memslots(vcpu->kvm), gfn, map,
 			cache, atomic);
 }
 EXPORT_SYMBOL_GPL(kvm_map_gfn);
 
 int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
 {
-	return __kvm_map_gfn(kvm_vcpu_memslots(vcpu), gfn, map,
+	return __kvm_map_gfn(vcpu->kvm, kvm_vcpu_memslots(vcpu), gfn, map,
 		NULL, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_map);
@@ -2412,6 +2423,8 @@ static void __kvm_unmap_gfn(struct kvm *kvm,
 	else
 		kvm_release_pfn(map->pfn, dirty, NULL);
 
+	kvm_arch_write_gfn_end(kvm, memslot, map->gfn);
+
 	map->hva = NULL;
 	map->page = NULL;
 }
@@ -2612,7 +2625,9 @@ static int __kvm_write_guest_page(struct kvm *kvm,
 	addr = gfn_to_hva_memslot(memslot, gfn);
 	if (kvm_is_error_hva(addr))
 		return -EFAULT;
+	kvm_arch_write_gfn_begin(kvm, memslot, gfn);
 	r = __copy_to_user((void __user *)addr + offset, data, len);
+	kvm_arch_write_gfn_end(kvm, memslot, gfn);
 	if (r)
 		return -EFAULT;
 	mark_page_dirty_in_slot(kvm, memslot, gfn);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* [PATCH Part2 RFC v4 36/40] KVM: X86: Export the kvm_zap_gfn_range() for the SNP use
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (34 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 35/40] KVM: Add arch hooks to track the host write to guest memory Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 37/40] KVM: SVM: Add support to handle the RMP nested page fault Brijesh Singh
                   ` (4 subsequent siblings)
  40 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)
  To: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Brijesh Singh

While resolving an RMP page fault, we may run into cases where the page
level in the RMP entry and the TDP entry does not match: either a 2M RMP
entry must be split into 4K RMP entries, or a 2M TDP page needs to be
broken into multiple 4K pages.

To keep the RMP and TDP page levels in sync, zap the gfn range after
splitting the pages in the RMP entry. The zap forces the TDP to be
rebuilt with the new page level.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/mmu.h              | 2 --
 arch/x86/kvm/mmu/mmu.c          | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 678992e9966a..46323af09995 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1490,6 +1490,8 @@ void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+
 
 int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
 bool pdptrs_changed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 147e76ab1536..eec62011bb2e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -228,8 +228,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return -(u32)fault & errcode;
 }
 
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4abc0dc49d55..e60f54455cdc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5657,6 +5657,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 
 	return need_tlb_flush;
 }
+EXPORT_SYMBOL_GPL(kvm_zap_gfn_range);
 
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot)
-- 
2.17.1



* [PATCH Part2 RFC v4 37/40] KVM: SVM: Add support to handle the RMP nested page fault
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (35 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 36/40] KVM: X86: Export the kvm_zap_gfn_range() for the SNP use Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-20  0:10   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 38/40] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event Brijesh Singh
                   ` (3 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)

Follow the recommendations in APM2 sections 15.36.10 and 15.36.11 to
resolve an RMP violation encountered during the NPT table walk.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++
 arch/x86/kvm/mmu/mmu.c          | 20 ++++++++++++
 arch/x86/kvm/svm/sev.c          | 57 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  2 ++
 arch/x86/kvm/svm/svm.h          |  2 ++
 5 files changed, 84 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 46323af09995..117e2e08d7ed 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1399,6 +1399,9 @@ struct kvm_x86_ops {
 
 	void (*write_page_begin)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
 	void (*write_page_end)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
+
+	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
+			int level, u64 error_code);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e60f54455cdc..b6a676ba1862 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5096,6 +5096,18 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
+static int handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	kvm_pfn_t pfn;
+	int level;
+
+	if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &level)))
+		return RET_PF_RETRY;
+
+	kvm_x86_ops.handle_rmp_page_fault(vcpu, gpa, pfn, level, error_code);
+	return RET_PF_RETRY;
+}
+
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 		       void *insn, int insn_len)
 {
@@ -5112,6 +5124,14 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 			goto emulate;
 	}
 
+	if (unlikely(error_code & PFERR_GUEST_RMP_MASK)) {
+		r = handle_rmp_page_fault(vcpu, cr2_or_gpa, error_code);
+		if (r == RET_PF_RETRY)
+			return 1;
+		else
+			return r;
+	}
+
 	if (r == RET_PF_INVALID) {
 		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
 					  lower_32_bits(error_code), false);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 839cf321c6dd..53a60edc810e 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3519,3 +3519,60 @@ void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn
 		BUG_ON(rc != 0);
 	}
 }
+
+int snp_handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
+			      int level, u64 error_code)
+{
+	struct rmpentry *e;
+	int rlevel, rc = 0;
+	bool private;
+	gfn_t gfn;
+
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rlevel);
+	if (!e)
+		return 1;
+
+	private = !!(error_code & PFERR_GUEST_ENC_MASK);
+
+	/*
+	 * See APM section 15.36.11 on how to handle the RMP fault for the large pages.
+	 *
+	 *  npt	     rmp    access      action
+	 *  --------------------------------------------------
+	 *  4k       2M     C=1       psmash
+	 *  x        x      C=1       if page is not private then add a new RMP entry
+	 *  x        x      C=0       if page is private then make it shared
+	 *  2M       4k     C=x       zap
+	 */
+	if ((error_code & PFERR_GUEST_SIZEM_MASK) ||
+	    ((level == PG_LEVEL_4K) && (rlevel == PG_LEVEL_2M) && private)) {
+		rc = snp_rmptable_psmash(vcpu, pfn);
+		goto zap_gfn;
+	}
+
+	/*
+	 * If it's a private access, and the page is not assigned in the RMP table, create a
+	 * new private RMP entry.
+	 */
+	if (!rmpentry_assigned(e) && private) {
+		rc = snp_make_page_private(vcpu, gpa, pfn, PG_LEVEL_4K);
+		goto zap_gfn;
+	}
+
+	/*
+	 * If it's a shared access, then make the page shared in the RMP table.
+	 */
+	if (rmpentry_assigned(e) && !private)
+		rc = snp_make_page_shared(vcpu, gpa, pfn, PG_LEVEL_4K);
+
+zap_gfn:
+	/*
+	 * Now that we have updated the RMP pagesize, zap the existing rmaps for
+	 * large entry ranges so that nested page table gets rebuilt with the updated RMP
+	 * pagesize.
+	 */
+	gfn = gpa_to_gfn(gpa) & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+	kvm_zap_gfn_range(vcpu->kvm, gfn, gfn + 512);
+
+	return 0;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 4ff6fc86dd18..32e35d396508 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4579,6 +4579,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.get_tdp_max_page_level = sev_get_tdp_max_page_level,
 
 	.write_page_begin = sev_snp_write_page_begin,
+
+	.handle_rmp_page_fault = snp_handle_rmp_page_fault,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e0276ad8a1ae..ccdaaa4e1fb1 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -577,6 +577,8 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
 int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
 void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
+int snp_handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
+			      int level, u64 error_code);
 
 /* vmenter.S */
 
-- 
2.17.1



* [PATCH Part2 RFC v4 38/40] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (36 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 37/40] KVM: SVM: Add support to handle the RMP nested page fault Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-19 22:50   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 39/40] KVM: SVM: Use a VMSA physical address variable for populating VMCB Brijesh Singh
                   ` (2 subsequent siblings)
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)

Version 2 of the GHCB specification added support for two SNP Guest
Request Message NAE events. The events allow an SEV-SNP guest to make
requests to the SEV-SNP firmware through the hypervisor using the
SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.

SNP_GUEST_REQUEST requires two unique pages, one page for the request
and one page for the response. The GHCB specification says that both
pages need to be in the hypervisor state, but before executing the
SEV-SNP command the response page needs to be transitioned to the
firmware state.

SNP_EXT_GUEST_REQUEST is similar to SNP_GUEST_REQUEST, with the addition
of a certificate blob that can be passed through the SNP_SET_CONFIG
ioctl defined in the CCP driver. The CCP driver exposes
snp_guest_ext_guest_request(), which KVM uses to get both the report and
the additional data at once.

In order to minimize page state transitions during command handling,
pre-allocate a firmware page at guest creation. Use the pre-allocated
firmware page to complete the command execution and copy the result into
the guest response page.

Rate-limit the handling of the SNP_GUEST_REQUEST NAE to prevent a guest
from mounting a denial-of-service attack against the SNP firmware.

Now that KVM supports all the VMGEXIT NAEs required for the base SEV-SNP
feature, set the hypervisor feature flag to advertise it.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c | 223 ++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/svm/svm.h |   6 +-
 2 files changed, 225 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 53a60edc810e..4cb4c1d7e444 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -18,6 +18,8 @@
 #include <linux/processor.h>
 #include <linux/trace_events.h>
 #include <linux/sev.h>
+#include <linux/kvm_host.h>
+#include <linux/sev-guest.h>
 #include <asm/fpu/internal.h>
 
 #include <asm/trapnr.h>
@@ -1534,6 +1536,7 @@ static int sev_receive_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
 
 static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
 {
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
 	struct sev_data_snp_gctx_create data = {};
 	void *context;
 	int rc;
@@ -1543,14 +1546,24 @@ static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	if (!context)
 		return NULL;
 
-	data.gctx_paddr = __psp_pa(context);
-	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
-	if (rc) {
+	/* Allocate a firmware buffer used during the guest command handling. */
+	sev->snp_resp_page = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+	if (!sev->snp_resp_page) {
 		snp_free_firmware_page(context);
 		return NULL;
 	}
 
+	data.gctx_paddr = __psp_pa(context);
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+	if (rc)
+		goto e_free;
+
 	return context;
+
+e_free:
+	snp_free_firmware_page(context);
+	snp_free_firmware_page(sev->snp_resp_page);
+	return NULL;
 }
 
 static int snp_bind_asid(struct kvm *kvm, int *error)
@@ -1618,6 +1631,12 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	if (rc)
 		goto e_free_context;
 
+	/* Used for rate limiting SNP guest message request, use the default settings */
+	ratelimit_default_init(&sev->snp_guest_msg_rs);
+
+	/* Allocate memory used for the certs data in SNP guest request */
+	sev->snp_certs_data = kmalloc(SEV_FW_BLOB_MAX_SIZE, GFP_KERNEL_ACCOUNT);
+
 	return 0;
 
 e_free_context:
@@ -2218,6 +2237,9 @@ static int snp_decommission_context(struct kvm *kvm)
 	snp_free_firmware_page(sev->snp_context);
 	sev->snp_context = NULL;
 
+	/* Free the response page. */
+	snp_free_firmware_page(sev->snp_resp_page);
+
 	return 0;
 }
 
@@ -2268,6 +2290,9 @@ void sev_vm_destroy(struct kvm *kvm)
 		sev_unbind_asid(kvm, sev->handle);
 	}
 
+
+	kfree(sev->snp_certs_data);
+
 	sev_asid_free(sev);
 }
 
@@ -2663,6 +2688,8 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 	case SVM_VMGEXIT_HV_FT:
 	case SVM_VMGEXIT_PSC:
+	case SVM_VMGEXIT_GUEST_REQUEST:
+	case SVM_VMGEXIT_EXT_GUEST_REQUEST:
 		break;
 	default:
 		goto vmgexit_err;
@@ -3053,6 +3080,181 @@ static unsigned long snp_handle_psc(struct vcpu_svm *svm, struct ghcb *ghcb)
 	return rc ? map_to_psc_vmgexit_code(rc) : 0;
 }
 
+static int snp_build_guest_buf(struct vcpu_svm *svm, struct sev_data_snp_guest_request *data,
+			       gpa_t req_gpa, gpa_t resp_gpa)
+{
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct kvm *kvm = vcpu->kvm;
+	kvm_pfn_t req_pfn, resp_pfn;
+	struct kvm_sev_info *sev;
+
+	if (!IS_ALIGNED(req_gpa, PAGE_SIZE) || !IS_ALIGNED(resp_gpa, PAGE_SIZE)) {
+		pr_err_ratelimited("svm: guest request (%#llx) or response (%#llx) is not page aligned\n",
+			req_gpa, resp_gpa);
+		return -EINVAL;
+	}
+
+	req_pfn = gfn_to_pfn(kvm, gpa_to_gfn(req_gpa));
+	if (is_error_noslot_pfn(req_pfn)) {
+		pr_err_ratelimited("svm: guest request invalid gpa=%#llx\n", req_gpa);
+		return -EINVAL;
+	}
+
+	resp_pfn = gfn_to_pfn(kvm, gpa_to_gfn(resp_gpa));
+	if (is_error_noslot_pfn(resp_pfn)) {
+		pr_err_ratelimited("svm: guest response invalid gpa=%#llx\n", resp_gpa);
+		return -EINVAL;
+	}
+
+	sev = &to_kvm_svm(kvm)->sev_info;
+
+	data->gctx_paddr = __psp_pa(sev->snp_context);
+	data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT);
+	data->res_paddr = __psp_pa(sev->snp_resp_page);
+
+	return 0;
+}
+
+static void snp_handle_guest_request(struct vcpu_svm *svm, struct ghcb *ghcb,
+				     gpa_t req_gpa, gpa_t resp_gpa)
+{
+	struct sev_data_snp_guest_request data = {};
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_sev_info *sev;
+	int rc, err = 0;
+
+	if (!sev_snp_guest(vcpu->kvm)) {
+		rc = -ENODEV;
+		goto e_fail;
+	}
+
+	sev = &to_kvm_svm(kvm)->sev_info;
+
+	if (!__ratelimit(&sev->snp_guest_msg_rs)) {
+		pr_info_ratelimited("svm: too many guest message requests\n");
+		rc = -EAGAIN;
+		goto e_fail;
+	}
+
+	rc = snp_build_guest_buf(svm, &data, req_gpa, resp_gpa);
+	if (rc)
+		goto e_fail;
+
+	sev = &to_kvm_svm(kvm)->sev_info;
+
+	mutex_lock(&kvm->lock);
+
+	rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
+	if (rc) {
+		mutex_unlock(&kvm->lock);
+
+		/* If we have a firmware error code then use it. */
+		if (err)
+			rc = err;
+
+		goto e_fail;
+	}
+
+	/* Copy the response after the firmware returns success. */
+	rc = kvm_write_guest(kvm, resp_gpa, sev->snp_resp_page, PAGE_SIZE);
+
+	mutex_unlock(&kvm->lock);
+
+e_fail:
+	ghcb_set_sw_exit_info_2(ghcb, rc);
+}
+
+static void snp_handle_ext_guest_request(struct vcpu_svm *svm, struct ghcb *ghcb,
+					 gpa_t req_gpa, gpa_t resp_gpa)
+{
+	struct sev_data_snp_guest_request req = {};
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct kvm *kvm = vcpu->kvm;
+	unsigned long data_npages;
+	struct kvm_sev_info *sev;
+	unsigned long err;
+	u64 data_gpa;
+	int rc;
+
+	if (!sev_snp_guest(vcpu->kvm)) {
+		rc = -ENODEV;
+		goto e_fail;
+	}
+
+	sev = &to_kvm_svm(kvm)->sev_info;
+
+	if (!__ratelimit(&sev->snp_guest_msg_rs)) {
+		pr_info_ratelimited("svm: too many guest message requests\n");
+		rc = -EAGAIN;
+		goto e_fail;
+	}
+
+	if (!sev->snp_certs_data) {
+		pr_err("svm: certs data memory is not allocated\n");
+		rc = -EFAULT;
+		goto e_fail;
+	}
+
+	data_gpa = ghcb_get_rax(ghcb);
+	data_npages = ghcb_get_rbx(ghcb);
+
+	if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) {
+		pr_err_ratelimited("svm: certs data GPA is not page aligned (%#llx)\n", data_gpa);
+		rc = -EINVAL;
+		goto e_fail;
+	}
+
+	/* Verify that requested blob will fit in our intermediate buffer */
+	if ((data_npages << PAGE_SHIFT) > SEV_FW_BLOB_MAX_SIZE) {
+		rc = -EINVAL;
+		goto e_fail;
+	}
+
+	rc = snp_build_guest_buf(svm, &req, req_gpa, resp_gpa);
+	if (rc)
+		goto e_fail;
+
+	mutex_lock(&kvm->lock);
+	rc = snp_guest_ext_guest_request(&req, (unsigned long)sev->snp_certs_data,
+					 &data_npages, &err);
+	if (rc) {
+		mutex_unlock(&kvm->lock);
+
+		/*
+		 * If buffer length is small then return the expected
+		 * length in rbx.
+		 */
+		if (err == SNP_GUEST_REQ_INVALID_LEN) {
+			vcpu->arch.regs[VCPU_REGS_RBX] = data_npages;
+			ghcb_set_sw_exit_info_2(ghcb, err);
+			return;
+		}
+
+		/* If we have a firmware error code then use it. */
+		if (err)
+			rc = (int)err;
+
+		goto e_fail;
+	}
+
+	/* Copy the response after the firmware returns success. */
+	rc = kvm_write_guest(kvm, resp_gpa, sev->snp_resp_page, PAGE_SIZE);
+
+	mutex_unlock(&kvm->lock);
+
+	if (rc)
+		goto e_fail;
+
+	/* Copy the certificate blob in the guest memory */
+	if (data_npages &&
+	    kvm_write_guest(kvm, data_gpa, sev->snp_certs_data, data_npages << PAGE_SHIFT))
+		rc = -EFAULT;
+
+e_fail:
+	ghcb_set_sw_exit_info_2(ghcb, rc);
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3306,6 +3508,21 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 		ghcb_set_sw_exit_info_2(ghcb, rc);
 		break;
 	}
+	case SVM_VMGEXIT_GUEST_REQUEST: {
+		snp_handle_guest_request(svm, ghcb, control->exit_info_1, control->exit_info_2);
+
+		ret = 1;
+		break;
+	}
+	case SVM_VMGEXIT_EXT_GUEST_REQUEST: {
+		snp_handle_ext_guest_request(svm,
+					     ghcb,
+					     control->exit_info_1,
+					     control->exit_info_2);
+
+		ret = 1;
+		break;
+	}
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index ccdaaa4e1fb1..9fcfc0a51737 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -18,6 +18,7 @@
 #include <linux/kvm_types.h>
 #include <linux/kvm_host.h>
 #include <linux/bits.h>
+#include <linux/ratelimit.h>
 
 #include <asm/svm.h>
 #include <asm/sev-common.h>
@@ -68,6 +69,9 @@ struct kvm_sev_info {
 	struct kvm *enc_context_owner; /* Owner of copied encryption context */
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
 	void *snp_context;      /* SNP guest context page */
+	void *snp_resp_page;	/* SNP guest response page */
+	struct ratelimit_state snp_guest_msg_rs; /* Rate limit the SNP guest message */
+	void *snp_certs_data;
 };
 
 struct kvm_svm {
@@ -550,7 +554,7 @@ void svm_vcpu_unblocking(struct kvm_vcpu *vcpu);
 #define GHCB_VERSION_MAX	2ULL
 #define GHCB_VERSION_MIN	1ULL
 
-#define GHCB_HV_FT_SUPPORTED	0
+#define GHCB_HV_FT_SUPPORTED	GHCB_HV_FT_SNP
 
 extern unsigned int max_sev_asid;
 
-- 
2.17.1



* [PATCH Part2 RFC v4 39/40] KVM: SVM: Use a VMSA physical address variable for populating VMCB
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (37 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 38/40] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-21  0:20   ` Sean Christopherson
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 40/40] KVM: SVM: Support SEV-SNP AP Creation NAE event Brijesh Singh
  2021-07-08 15:40 ` [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Dave Hansen
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)

From: Tom Lendacky <thomas.lendacky@amd.com>

In preparation for supporting SEV-SNP AP Creation, use a variable that
holds the VMSA physical address rather than converting the virtual
address. This will allow SEV-SNP AP Creation to set the new physical
address that will be used should the vCPU reset path be taken.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kvm/svm/sev.c | 5 ++---
 arch/x86/kvm/svm/svm.c | 9 ++++++++-
 arch/x86/kvm/svm/svm.h | 1 +
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 4cb4c1d7e444..d8ad6dd58c87 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3553,10 +3553,9 @@ void sev_es_init_vmcb(struct vcpu_svm *svm)
 
 	/*
 	 * An SEV-ES guest requires a VMSA area that is a separate from the
-	 * VMCB page. Do not include the encryption mask on the VMSA physical
-	 * address since hardware will access it using the guest key.
+	 * VMCB page.
 	 */
-	svm->vmcb->control.vmsa_pa = __pa(svm->vmsa);
+	svm->vmcb->control.vmsa_pa = svm->vmsa_pa;
 
 	/* Can't intercept CR register access, HV can't modify CR registers */
 	svm_clr_intercept(svm, INTERCEPT_CR0_READ);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 32e35d396508..74bc635c9608 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1379,9 +1379,16 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
 	svm->vmcb01.ptr = page_address(vmcb01_page);
 	svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
 
-	if (vmsa_page)
+	if (vmsa_page) {
 		svm->vmsa = page_address(vmsa_page);
 
+		/*
+		 * Do not include the encryption mask on the VMSA physical
+		 * address since hardware will access it using the guest key.
+		 */
+		svm->vmsa_pa = __pa(svm->vmsa);
+	}
+
 	svm->guest_state_loaded = false;
 
 	svm_switch_vmcb(svm, &svm->vmcb01);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9fcfc0a51737..285d9b97b4d2 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -177,6 +177,7 @@ struct vcpu_svm {
 
 	/* SEV-ES support */
 	struct sev_es_save_area *vmsa;
+	hpa_t vmsa_pa;
 	struct ghcb *ghcb;
 	struct kvm_host_map ghcb_map;
 	bool received_first_sipi;
-- 
2.17.1



* [PATCH Part2 RFC v4 40/40] KVM: SVM: Support SEV-SNP AP Creation NAE event
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (38 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 39/40] KVM: SVM: Use a VMSA physical address variable for populating VMCB Brijesh Singh
@ 2021-07-07 18:36 ` Brijesh Singh
  2021-07-21  0:01   ` Sean Christopherson
  2021-07-08 15:40 ` [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Dave Hansen
  40 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-07 18:36 UTC (permalink / raw)

From: Tom Lendacky <thomas.lendacky@amd.com>

Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
guests to create and start APs on their own.

A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
to avoid updating the VMSA pointer while the vCPU is running.

For CREATE:
  The guest supplies the GPA of the VMSA to be used for the vCPU with the
  specified APIC ID. The GPA is saved in the svm struct of the target
  vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the
  vCPU and then the vCPU is kicked.

For CREATE_ON_INIT:
  The guest supplies the GPA of the VMSA to be used for the vCPU with the
  specified APIC ID the next time an INIT is performed. The GPA is saved
  in the svm struct of the target vCPU.

For DESTROY:
  The guest indicates it wishes to stop the vCPU. The GPA is cleared from
  the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
  to the vCPU and then the vCPU is kicked.


The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked as
a result of the event or as a result of an INIT. The handler sets the vCPU
to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will leave the
vCPU as not runnable. Any previous VMSA pages that were installed as
part of an SEV-SNP AP Creation NAE event are un-pinned. If a new VMSA is
to be installed, the VMSA guest page is pinned and set as the VMSA in the
vCPU VMCB and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new
VMSA is not to be installed, the VMSA is cleared in the vCPU VMCB and the
vCPU state is left as KVM_MP_STATE_UNINITIALIZED to prevent it from being
run.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/include/asm/svm.h      |   3 +
 arch/x86/kvm/svm/sev.c          | 133 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |   7 +-
 arch/x86/kvm/svm/svm.h          |  16 +++-
 arch/x86/kvm/x86.c              |  11 ++-
 6 files changed, 170 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 117e2e08d7ed..881e05b3f74e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -91,6 +91,7 @@
 #define KVM_REQ_MSR_FILTER_CHANGED	KVM_ARCH_REQ(29)
 #define KVM_REQ_UPDATE_CPU_DIRTY_LOGGING \
 	KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(31)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -1402,6 +1403,8 @@ struct kvm_x86_ops {
 
 	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
 			int level, u64 error_code);
+
+	void (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 5e72faa00cf2..6634a952563e 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -220,6 +220,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define SVM_SEV_FEATURES_DEBUG_SWAP		BIT(5)
 #define SVM_SEV_FEATURES_PREVENT_HOST_IBS	BIT(6)
 #define SVM_SEV_FEATURES_BTB_ISOLATION		BIT(7)
+#define SVM_SEV_FEATURES_INT_INJ_MODES			\
+	(SVM_SEV_FEATURES_RESTRICTED_INJECTION |	\
+	 SVM_SEV_FEATURES_ALTERNATE_INJECTION)
 
 struct vmcb_seg {
 	u16 selector;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index d8ad6dd58c87..95f5d25b4f08 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -582,6 +582,7 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 
 static int sev_es_sync_vmsa(struct vcpu_svm *svm)
 {
+	struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
 	struct sev_es_save_area *save = svm->vmsa;
 
 	/* Check some debug related fields before encrypting the VMSA */
@@ -625,6 +626,12 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
 	if (sev_snp_guest(svm->vcpu.kvm))
 		save->sev_features |= SVM_SEV_FEATURES_SNP_ACTIVE;
 
+	/*
+	 * Save the VMSA synced SEV features. For now, they are the same for
+	 * all vCPUs, so just save each time.
+	 */
+	sev->sev_features = save->sev_features;
+
 	return 0;
 }
 
@@ -2682,6 +2689,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 		if (!ghcb_sw_scratch_is_valid(ghcb))
 			goto vmgexit_err;
 		break;
+	case SVM_VMGEXIT_AP_CREATION:
+		if (!ghcb_rax_is_valid(ghcb))
+			goto vmgexit_err;
+		break;
 	case SVM_VMGEXIT_NMI_COMPLETE:
 	case SVM_VMGEXIT_AP_HLT_LOOP:
 	case SVM_VMGEXIT_AP_JUMP_TABLE:
@@ -3395,6 +3406,121 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 	return ret;
 }
 
+void sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	kvm_pfn_t pfn;
+
+	mutex_lock(&svm->snp_vmsa_mutex);
+
+	vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
+
+	/* Clear use of the VMSA in the sev_es_init_vmcb() path */
+	svm->vmsa_pa = 0;
+
+	/* Clear use of the VMSA from the VMCB */
+	svm->vmcb->control.vmsa_pa = 0;
+
+	/* Un-pin previous VMSA */
+	if (svm->snp_vmsa_pfn) {
+		kvm_release_pfn_dirty(svm->snp_vmsa_pfn);
+		svm->snp_vmsa_pfn = 0;
+	}
+
+	if (svm->snp_vmsa_gpa) {
+		/* Validate that the GPA is page aligned */
+		if (!PAGE_ALIGNED(svm->snp_vmsa_gpa))
+			goto e_unlock;
+
+		/*
+		 * The VMSA is referenced by the hypervisor physical address,
+		 * so retrieve the PFN and pin it.
+		 */
+		pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(svm->snp_vmsa_gpa));
+		if (is_error_pfn(pfn))
+			goto e_unlock;
+
+		svm->snp_vmsa_pfn = pfn;
+
+		/* Use the new VMSA in the sev_es_init_vmcb() path */
+		svm->vmsa_pa = pfn_to_hpa(pfn);
+		svm->vmcb->control.vmsa_pa = svm->vmsa_pa;
+
+		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+	} else {
+		vcpu->arch.pv.pv_unhalted = false;
+		vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
+	}
+
+e_unlock:
+	mutex_unlock(&svm->snp_vmsa_mutex);
+}
+
+static void sev_snp_ap_creation(struct vcpu_svm *svm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct kvm_vcpu *target_vcpu;
+	struct vcpu_svm *target_svm;
+	unsigned int request;
+	unsigned int apic_id;
+	bool kick;
+
+	request = lower_32_bits(svm->vmcb->control.exit_info_1);
+	apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
+
+	/* Validate the APIC ID */
+	target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
+	if (!target_vcpu)
+		return;
+
+	target_svm = to_svm(target_vcpu);
+
+	kick = true;
+
+	mutex_lock(&target_svm->snp_vmsa_mutex);
+
+	target_svm->snp_vmsa_gpa = 0;
+	target_svm->snp_vmsa_update_on_init = false;
+
+	/* Interrupt injection mode shouldn't change for AP creation */
+	if (request < SVM_VMGEXIT_AP_DESTROY) {
+		u64 sev_features;
+
+		sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
+		sev_features ^= sev->sev_features;
+		if (sev_features & SVM_SEV_FEATURES_INT_INJ_MODES) {
+			vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
+				    vcpu->arch.regs[VCPU_REGS_RAX]);
+			goto out;
+		}
+	}
+
+	switch (request) {
+	case SVM_VMGEXIT_AP_CREATE_ON_INIT:
+		kick = false;
+		target_svm->snp_vmsa_update_on_init = true;
+		fallthrough;
+	case SVM_VMGEXIT_AP_CREATE:
+		target_svm->snp_vmsa_gpa = svm->vmcb->control.exit_info_2;
+		break;
+	case SVM_VMGEXIT_AP_DESTROY:
+		break;
+	default:
+		vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
+			    request);
+		break;
+	}
+
+out:
+	mutex_unlock(&target_svm->snp_vmsa_mutex);
+
+	if (kick) {
+		kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);
+		kvm_vcpu_kick(target_vcpu);
+	}
+}
+
 int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -3523,6 +3649,11 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 		ret = 1;
 		break;
 	}
+	case SVM_VMGEXIT_AP_CREATION:
+		sev_snp_ap_creation(svm);
+
+		ret = 1;
+		break;
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
@@ -3597,6 +3728,8 @@ void sev_es_create_vcpu(struct vcpu_svm *svm)
 	set_ghcb_msr(svm, GHCB_MSR_SEV_INFO(GHCB_VERSION_MAX,
 					    GHCB_VERSION_MIN,
 					    sev_enc_bit));
+
+	mutex_init(&svm->snp_vmsa_mutex);
 }
 
 void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 74bc635c9608..078a569c85a8 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1304,7 +1304,10 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	svm->spec_ctrl = 0;
 	svm->virt_spec_ctrl = 0;
 
-	if (!init_event) {
+	if (init_event && svm->snp_vmsa_update_on_init) {
+		svm->snp_vmsa_update_on_init = false;
+		sev_snp_update_protected_guest_state(vcpu);
+	} else {
 		vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
 				       MSR_IA32_APICBASE_ENABLE;
 		if (kvm_vcpu_is_reset_bsp(vcpu))
@@ -4588,6 +4591,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.write_page_begin = sev_snp_write_page_begin,
 
 	.handle_rmp_page_fault = snp_handle_rmp_page_fault,
+
+	.update_protected_guest_state = sev_snp_update_protected_guest_state,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 285d9b97b4d2..f9d25d944f26 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -60,18 +60,26 @@ struct kvm_sev_info {
 	bool active;		/* SEV enabled guest */
 	bool es_active;		/* SEV-ES enabled guest */
 	bool snp_active;	/* SEV-SNP enabled guest */
+
 	unsigned int asid;	/* ASID used for this guest */
 	unsigned int handle;	/* SEV firmware handle */
 	int fd;			/* SEV device fd */
+
 	unsigned long pages_locked; /* Number of pages locked */
 	struct list_head regions_list;  /* List of registered regions */
+
 	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
+
 	struct kvm *enc_context_owner; /* Owner of copied encryption context */
+
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
+
 	void *snp_context;      /* SNP guest context page */
 	void *snp_resp_page;	/* SNP guest response page */
 	struct ratelimit_state snp_guest_msg_rs; /* Rate limit the SNP guest message */
 	void *snp_certs_data;
+
+	u64 sev_features;	/* Features set at VMSA creation */
 };
 
 struct kvm_svm {
@@ -192,6 +200,11 @@ struct vcpu_svm {
 	bool guest_state_loaded;
 
 	u64 ghcb_registered_gpa;
+
+	struct mutex snp_vmsa_mutex;
+	gpa_t snp_vmsa_gpa;
+	kvm_pfn_t snp_vmsa_pfn;
+	bool snp_vmsa_update_on_init;	/* SEV-SNP AP Creation on INIT-SIPI */
 };
 
 struct svm_cpu_data {
@@ -555,7 +568,7 @@ void svm_vcpu_unblocking(struct kvm_vcpu *vcpu);
 #define GHCB_VERSION_MAX	2ULL
 #define GHCB_VERSION_MIN	1ULL
 
-#define GHCB_HV_FT_SUPPORTED	GHCB_HV_FT_SNP
+#define GHCB_HV_FT_SUPPORTED	(GHCB_HV_FT_SNP | GHCB_HV_FT_SNP_AP_CREATION)
 
 extern unsigned int max_sev_asid;
 
@@ -584,6 +597,7 @@ int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
 void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
 int snp_handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
 			      int level, u64 error_code);
+void sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu);
 
 /* vmenter.S */
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1398b8021982..e9fd59913bc2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9279,6 +9279,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
 			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+		if (kvm_check_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, vcpu)) {
+			kvm_x86_ops.update_protected_guest_state(vcpu);
+			if (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE) {
+				r = 1;
+				goto out;
+			}
+		}
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
@@ -11236,7 +11244,8 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 	if (!list_empty_careful(&vcpu->async_pf.done))
 		return true;
 
-	if (kvm_apic_has_events(vcpu))
+	if (kvm_apic_has_events(vcpu) ||
+	    kvm_test_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, vcpu))
 		return true;
 
 	if (vcpu->arch.pv.pv_unhalted)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault Brijesh Singh
@ 2021-07-07 19:21   ` Dave Hansen
  2021-07-08 15:02     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Dave Hansen @ 2021-07-07 19:21 UTC (permalink / raw)
  To: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

> @@ -502,6 +503,81 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index)
>  		 name, index, addr, (desc.limit0 | (desc.limit1 << 16)));
>  }
>  
> +static void dump_rmpentry(unsigned long address)
> +{

A comment on this sucker would be nice.  I *think* this must be a kernel
virtual address.  Reflecting that into the naming or a comment would be
nice.

> +	struct rmpentry *e;
> +	unsigned long pfn;
> +	pgd_t *pgd;
> +	pte_t *pte;
> +	int level;
> +
> +	pgd = __va(read_cr3_pa());
> +	pgd += pgd_index(address);
> +
> +	pte = lookup_address_in_pgd(pgd, address, &level);
> +	if (unlikely(!pte))
> +		return;

It's a little annoying this is doing *another* separate page walk.
Don't we already do this for dumping the page tables themselves at oops
time?

Also, please get rid of all of the likely/unlikely()s in your patches.
They are pure noise unless you have specific knowledge of the compiler
getting something so horribly wrong that it affects real-world performance.

> +	switch (level) {
> +	case PG_LEVEL_4K: {
> +		pfn = pte_pfn(*pte);
> +		break;
> +	}

These superfluous brackets are really strange looking.  Could you remove
them, please?

> +	case PG_LEVEL_2M: {
> +		pfn = pmd_pfn(*(pmd_t *)pte);
> +		break;
> +	}
> +	case PG_LEVEL_1G: {
> +		pfn = pud_pfn(*(pud_t *)pte);
> +		break;
> +	}
> +	case PG_LEVEL_512G: {
> +		pfn = p4d_pfn(*(p4d_t *)pte);
> +		break;
> +	}
> +	default:
> +		return;
> +	}
> +
> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);

So, lookup_address_in_pgd() looks to me like it will return pretty
random page table entries as long as the entry isn't
p{gd,4d,ud,md,te}_none().  It can certainly return !p*_present()
entries.  Those are *NOT* safe to call pfn_to_page() on.

> +	if (unlikely(!e))
> +		return;
> +
> +	/*
> +	 * If the RMP entry at the faulting address was not assigned, then
> +	 * dump may not provide any useful debug information. Iterate
> +	 * through the entire 2MB region, and dump the RMP entries if one
> +	 * of the bit in the RMP entry is set.
> +	 */

Some of this comment should be moved down to the loop itself.

> +	if (rmpentry_assigned(e)) {
> +		pr_alert("RMPEntry paddr 0x%lx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx"
> +			" asid=%d vmsa=%d validated=%d]\n", pfn << PAGE_SHIFT,
> +			rmpentry_assigned(e), rmpentry_immutable(e), rmpentry_pagesize(e),
> +			rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
> +			rmpentry_validated(e));
> +
> +		pr_alert("RMPEntry paddr 0x%lx %016llx %016llx\n", pfn << PAGE_SHIFT,
> +			e->high, e->low);

Could you please include an entire oops in the changelog that also
includes this information?  It would be really nice if this was at least
consistent in style to the stuff around it.

Also, how much of this stuff like rmpentry_asid() is duplicated in the
"raw" dump of e->high and e->low?

> +	} else {
> +		unsigned long pfn_end;
> +
> +		pfn = pfn & ~0x1ff;

There's a nice magic number.  Why not:

	pfn = pfn & ~(PTRS_PER_PMD-1);

?
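A quick standalone check (compilable outside the kernel) that the suggested spelling is equivalent to the magic number; PTRS_PER_PMD is assumed to be 512, the x86-64 value with 4K pages:

```c
#include <stdint.h>

/* Assumed value: PTRS_PER_PMD is 512 on x86-64 with 4K pages. */
#define PTRS_PER_PMD 512

/* Round a pfn down to the first pfn of its 2MB (512-page) region. */
static inline unsigned long pmd_aligned_pfn(unsigned long pfn)
{
	return pfn & ~((unsigned long)PTRS_PER_PMD - 1);
}
```

For the paddr in the oops example elsewhere in this thread, pfn 0x20cee05 rounds down to 0x20cee00, exactly what `& ~0x1ff` produces.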

This also needs a comment about *WHY* this case is looking at a 2MB region.

> +		pfn_end = pfn + PTRS_PER_PMD;
> +
> +		while (pfn < pfn_end) {
> +			e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
> +
> +			if (unlikely(!e))
> +				return;
> +
> +			if (e->low || e->high)
> +				pr_alert("RMPEntry paddr 0x%lx: %016llx %016llx\n",
> +					pfn << PAGE_SHIFT, e->high, e->low);

Why does this dump "raw" RMP entries while the above stuff filters them
through a bunch of helper macros?

> +			pfn++;
> +		}
> +	}
> +}
> +
>  static void
>  show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address)
>  {
> @@ -578,6 +654,9 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
>  	}
>  
>  	dump_pagetable(address);
> +
> +	if (error_code & X86_PF_RMP)
> +		dump_rmpentry(address);
>  }
>  
>  static noinline void
> 


^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault
  2021-07-07 19:21   ` Dave Hansen
@ 2021-07-08 15:02     ` Brijesh Singh
  2021-07-08 15:30       ` Dave Hansen
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-08 15:02 UTC (permalink / raw)
  To: Dave Hansen, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: brijesh.singh, Thomas Gleixner, Ingo Molnar, Joerg Roedel,
	Tom Lendacky, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

Hi Dave,


On 7/7/21 2:21 PM, Dave Hansen wrote:
>> @@ -502,6 +503,81 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index)
>>   		 name, index, addr, (desc.limit0 | (desc.limit1 << 16)));
>>   }
>>   
>> +static void dump_rmpentry(unsigned long address)
>> +{
> 
> A comment on this sucker would be nice.  I *think* this must be a kernel
> virtual address.  Reflecting that into the naming or a comment would be
> nice.

Ack, I will add some comment.

> 
>> +	struct rmpentry *e;
>> +	unsigned long pfn;
>> +	pgd_t *pgd;
>> +	pte_t *pte;
>> +	int level;
>> +
>> +	pgd = __va(read_cr3_pa());
>> +	pgd += pgd_index(address);
>> +
>> +	pte = lookup_address_in_pgd(pgd, address, &level);
>> +	if (unlikely(!pte))
>> +		return;
> 
> It's a little annoying this is doing *another* separate page walk.
> Don't we already do this for dumping the page tables themselves at oops
> time?
> 

Yes, we already do the walk in the oops function. I'll extend
dump_rmpentry() to use the level from the oops to avoid the duplicate walk.


> Also, please get rid of all of the likely/unlikely()s in your patches.
> They are pure noise unless you have specific knowledge of the compiler
> getting something so horribly wrong that it affects real-world performance.
> 
>> +	switch (level) {
>> +	case PG_LEVEL_4K: {
>> +		pfn = pte_pfn(*pte);
>> +		break;
>> +	}
> 
> These superfluous brackets are really strange looking.  Could you remove
> them, please?

Noted.

> 
>> +	case PG_LEVEL_2M: {
>> +		pfn = pmd_pfn(*(pmd_t *)pte);
>> +		break;
>> +	}
>> +	case PG_LEVEL_1G: {
>> +		pfn = pud_pfn(*(pud_t *)pte);
>> +		break;
>> +	}
>> +	case PG_LEVEL_512G: {
>> +		pfn = p4d_pfn(*(p4d_t *)pte);
>> +		break;
>> +	}
>> +	default:
>> +		return;
>> +	}
>> +
>> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
> 
> So, lookup_address_in_pgd() looks to me like it will return pretty
> random page table entries as long as the entry isn't
> p{gd,4d,ud,md,te}_none().  It can certainly return !p*_present()
> entries.  Those are *NOT* safe to call pfn_to_page() on.
> 

I will add some checks to make sure that we are accessing only safe PFNs.

>> +	if (unlikely(!e))
>> +		return;
>> +
>> +	/*
>> +	 * If the RMP entry at the faulting address was not assigned, then
>> +	 * dump may not provide any useful debug information. Iterate
>> +	 * through the entire 2MB region, and dump the RMP entries if one
>> +	 * of the bit in the RMP entry is set.
>> +	 */
> 
> Some of this comment should be moved down to the loop itself.

Noted.

> 
>> +	if (rmpentry_assigned(e)) {
>> +		pr_alert("RMPEntry paddr 0x%lx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx"
>> +			" asid=%d vmsa=%d validated=%d]\n", pfn << PAGE_SHIFT,
>> +			rmpentry_assigned(e), rmpentry_immutable(e), rmpentry_pagesize(e),
>> +			rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
>> +			rmpentry_validated(e));
>> +
>> +		pr_alert("RMPEntry paddr 0x%lx %016llx %016llx\n", pfn << PAGE_SHIFT,
>> +			e->high, e->low);
> 
> Could you please include an entire oops in the changelog that also
> includes this information?  It would be really nice if this was at least
> consistent in style to the stuff around it.

Here is one example (in this case, the page was immutable and the HV
attempted to write to it):

BUG: unable to handle page fault for address: ffff98c78ee00000
#PF: supervisor write access in kernel mode
#PF: error_code(0x80000003) - rmp violation
PGD 304b201067 P4D 304b201067 PUD 20c7f06063 PMD 20c8976063 PTE 
80000020cee00163
RMPEntry paddr 0x20cee00000 [assigned=1 immutable=1 pagesize=0 gpa=0x0 
asid=0 vmsa=0 validated=0]
RMPEntry paddr 0x20cee00000 000000000000000f 8000000000000ffd


> 
> Also, how much of this stuff like rmpentry_asid() is duplicated in the
> "raw" dump of e->high and e->low?
> 

Most of the rmpentry_xxx accessors read e->low. The RMP entry is
16 bytes. The AMD APM defines only a few bits and keeps everything else
reserved. We are in the process of updating the APM to document a few
more bits. I am not adding accessors for the undocumented fields. Until
then, we dump the entire 16 bytes.

I agree that we are duplicating the information. I can live with just a
raw dump. That means anyone who is debugging the crash will have to
consult the APM to decode the fields.
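To make that concrete, accessors of this shape could look like the following sketch; the struct name and bit positions below are invented for illustration and are not the APM-defined layout:

```c
#include <stdint.h>

/* Sketch only: a 16-byte entry split into two 64-bit words, as in the patch. */
struct rmpentry_example {
	uint64_t low;	/* the documented fields (assigned, pagesize, ...) live here */
	uint64_t high;
};

/*
 * Accessors in the style of the patch's rmpentry_xxx() helpers; the bit
 * positions used here are placeholders, not the real layout.
 */
static inline int rmpentry_example_assigned(const struct rmpentry_example *e)
{
	return e->low & 0x1;
}

static inline int rmpentry_example_pagesize(const struct rmpentry_example *e)
{
	return (e->low >> 1) & 0x1;
}
```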


>> +	} else {
>> +		unsigned long pfn_end;
>> +
>> +		pfn = pfn & ~0x1ff;
> 
> There's a nice magic number.  Why not:
> 
> 	pfn = pfn & ~(PTRS_PER_PMD-1);
> 
> ?

Noted.

> 
> This also needs a comment about *WHY* this case is looking at a 2MB region.
> 

Actually, the comment above says why we are looking at the 2MB region.
Let me rearrange the comment block so that it's clearer.

The reason for iterating through the 2MB region is: if the faulting
address is not assigned in the RMP table and the page table walk level
is 2MB, then one of the entries within the large page is the root cause
of the fault. Since we don't know which entry, I dump all the non-zero
entries.


>> +		pfn_end = pfn + PTRS_PER_PMD;
>> +
>> +		while (pfn < pfn_end) {
>> +			e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
>> +
>> +			if (unlikely(!e))
>> +				return;
>> +
>> +			if (e->low || e->high)
>> +				pr_alert("RMPEntry paddr 0x%lx: %016llx %016llx\n",
>> +					pfn << PAGE_SHIFT, e->high, e->low);
> 
> Why does this dump "raw" RMP entries while the above stuff filters them
> through a bunch of helper macros?
> 

There are two cases that we need to consider:

1) the faulting page is a guest private page (aka assigned)
2) the faulting page is a hypervisor page (aka shared)
We will primarily be seeing #1. In this case, we know it's an assigned
page, and we can decode the fields.

#2 will happen only in rare conditions; if it does, one of the
undocumented bits in the RMP entry can provide us some useful
information, hence we dump the raw values.

-Brijesh

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault
  2021-07-08 15:02     ` Brijesh Singh
@ 2021-07-08 15:30       ` Dave Hansen
  2021-07-08 16:48         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Dave Hansen @ 2021-07-08 15:30 UTC (permalink / raw)
  To: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/8/21 8:02 AM, Brijesh Singh wrote:
...
>>> +    pgd = __va(read_cr3_pa());
>>> +    pgd += pgd_index(address);
>>> +
>>> +    pte = lookup_address_in_pgd(pgd, address, &level);
>>> +    if (unlikely(!pte))
>>> +        return;
>>
>> It's a little annoying this is doing *another* separate page walk.
>> Don't we already do this for dumping the page tables themselves at oops
>> time?
> 
> Yes, we already do the walk in oops function, I'll extend the
> dump_rmpentry() to use the level from the oops to avoid the duplicate walk.

I was even thinking that you could use the pmd/pte entries that come
from the walk in dump_pagetable().

BTW, I think the snp_lookup_page_in_rmptable() interface is probably
wrong.  It takes a 'struct page':

+struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level)

but then immediately converts it to a paddr:

> +	unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;

If you just had it take a paddr, you wouldn't have to mess with all of
this pfn_valid() and phys_to_page() error checking.

>>> +    case PG_LEVEL_2M: {
>>> +        pfn = pmd_pfn(*(pmd_t *)pte);
>>> +        break;
>>> +    }
>>> +    case PG_LEVEL_1G: {
>>> +        pfn = pud_pfn(*(pud_t *)pte);
>>> +        break;
>>> +    }
>>> +    case PG_LEVEL_512G: {
>>> +        pfn = p4d_pfn(*(p4d_t *)pte);
>>> +        break;
>>> +    }
>>> +    default:
>>> +        return;
>>> +    }
>>> +
>>> +    e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
>>
>> So, lookup_address_in_pgd() looks to me like it will return pretty
>> random page table entries as long as the entry isn't
>> p{gd,4d,ud,md,te}_none().  It can certainly return !p*_present()
>> entries.  Those are *NOT* safe to call pfn_to_page() on.
>>
> 
> I will add some checks to make sure that we are accessing only safe PFNs.

Or fix the snp_lookup_page_in_rmptable() interface, please.

>>> +    if (rmpentry_assigned(e)) {
>>> +        pr_alert("RMPEntry paddr 0x%lx [assigned=%d immutable=%d
>>> pagesize=%d gpa=0x%lx"
>>> +            " asid=%d vmsa=%d validated=%d]\n", pfn << PAGE_SHIFT,
>>> +            rmpentry_assigned(e), rmpentry_immutable(e),
>>> rmpentry_pagesize(e),
>>> +            rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
>>> +            rmpentry_validated(e));
>>> +
>>> +        pr_alert("RMPEntry paddr 0x%lx %016llx %016llx\n", pfn <<
>>> PAGE_SHIFT,
>>> +            e->high, e->low);
>>
>> Could you please include an entire oops in the changelog that also
>> includes this information?  It would be really nice if this was at least
>> consistent in style to the stuff around it.
> 
> Here is one example: (in this case page was immutable and HV attempted
> to write to it).
> 
> BUG: unable to handle page fault for address: ffff98c78ee00000
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x80000003) - rmp violation

Let's capitalize "RMP" here, please.

> PGD 304b201067 P4D 304b201067 PUD 20c7f06063 PMD 20c8976063 PTE
> 80000020cee00163
> RMPEntry paddr 0x20cee00000 [assigned=1 immutable=1 pagesize=0 gpa=0x0
> asid=0 vmsa=0 validated=0]
> RMPEntry paddr 0x20cee00000 000000000000000f 8000000000000ffd

That's a good example, thanks!

But, it does make me think that we shouldn't be spitting out
"immutable".  Should we call it "readonly" or something so that folks
have a better chance of figuring out what's wrong?  Even better, should
we be looking specifically for X86_PF_RMP *and* immutable=1 and spitting
out something in English about it?
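A minimal sketch of that suggestion (the helper names and bit encodings below are invented for illustration; only the X86_PF_RMP bit matches the 0x80000003 error code shown in the oops example above):

```c
#include <stdio.h>

/* Assumed encoding: bit 31 of the #PF error code flags an RMP violation. */
#define X86_PF_RMP	0x80000000UL

/* Stand-in for the patch's rmpentry_immutable() accessor; bit position invented. */
static int example_rmpentry_immutable(unsigned long rmp_low)
{
	return (rmp_low >> 1) & 0x1;
}

/* The suggested check: RMP fault *and* immutable page. */
static int rmp_fault_on_immutable_page(unsigned long error_code,
				       unsigned long rmp_low)
{
	return (error_code & X86_PF_RMP) && example_rmpentry_immutable(rmp_low);
}

/* Spell the situation out in English instead of dumping "immutable=1". */
static void print_rmp_fault_hint(unsigned long error_code, unsigned long rmp_low)
{
	if (rmp_fault_on_immutable_page(error_code, rmp_low))
		printf("RMP violation: write to a read-only (immutable) page\n");
}
```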

This also *looks* to be spitting out the same "RMPEntry paddr
0x20cee00000" more than once.  Maybe we should just indent the extra
entries instead of repeating things.  The high/low are missing a "0x"
prefix, they also don't have any kind of text label.

>> Also, how much of this stuff like rmpentry_asid() is duplicated in the
>> "raw" dump of e->high and e->low?
> 
> Most of the rmpentry_xxx accessors read e->low. The RMP entry is
> 16 bytes. The AMD APM defines only a few bits and keeps everything else
> reserved. We are in the process of updating the APM to document a few
> more bits. I am not adding accessors for the undocumented fields. Until
> then, we dump the entire 16 bytes.
> 
> I agree that we are duplicating the information. I can live with just a
> raw dump. That means anyone who is debugging the crash will have to
> consult the APM to decode the fields.

I actually really like processing the fields.  I think it's a good
investment to make the error messages as self-documenting as possible
and not require the poor souls who are decoding oopses to also keep each
vendor's architecture manuals at hand.

>> This also needs a comment about *WHY* this case is looking at a 2MB
>> region.
>>
> 
> Actually the comment above says why we are looking for the 2MB region.
> Let me rearrange the comment block so that its more clear.
> 
> The reason for iterating through the 2MB region is: if the faulting
> address is not assigned in the RMP table and the page table walk level
> is 2MB, then one of the entries within the large page is the root cause
> of the fault. Since we don't know which entry, I dump all the non-zero
> entries.

Logically you can figure this out though, right?  Why throw 511 entries
at the console when we *know* they're useless?

>>> +        pfn_end = pfn + PTRS_PER_PMD;
>>> +
>>> +        while (pfn < pfn_end) {
>>> +            e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
>>> +
>>> +            if (unlikely(!e))
>>> +                return;
>>> +
>>> +            if (e->low || e->high)
>>> +                pr_alert("RMPEntry paddr 0x%lx: %016llx %016llx\n",
>>> +                    pfn << PAGE_SHIFT, e->high, e->low);
>>
>> Why does this dump "raw" RMP entries while the above stuff filters them
>> through a bunch of helper macros?
> 
> There are two cases that we need to consider:
> 
> 1) the faulting page is a guest private page (aka assigned)
> 2) the faulting page is a hypervisor page (aka shared)
> 
> We will primarily be seeing #1. In this case, we know it's an assigned
> page, and we can decode the fields.
> 
> #2 will happen only in rare conditions;

What rare conditions?

> if it does, one of the undocumented bits in the RMP entry can
> provide us some useful information, hence we dump the raw values.

You're saying that there are things that can cause RMP faults that
aren't documented?  That's rather nasty for your users, don't you think?

I'd be fine if you want to define a mask of unknown bits and spit out to
the users that some unknown bits are set.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support
  2021-07-07 18:35 [PATCH Part2 RFC v4 00/40] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Brijesh Singh
                   ` (39 preceding siblings ...)
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 40/40] KVM: SVM: Support SEV-SNP AP Creation NAE event Brijesh Singh
@ 2021-07-08 15:40 ` Dave Hansen
  40 siblings, 0 replies; 176+ messages in thread
From: Dave Hansen @ 2021-07-08 15:40 UTC (permalink / raw)
  To: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/7/21 11:35 AM, Brijesh Singh wrote:
> Changes since v3:
>  * Add support for extended guest message request.
>  * Add ioctl to query the SNP Platform status.
>  * Add ioctl to get and set the SNP config.
>  * Add check to verify that memory reserved for the RMP covers the full system RAM.
>  * Start the SNP specific commands from 256 instead of 255.
>  * Multiple cleanup and fixes based on the review feedback.
> 
> Changes since v2:
>  * Add AP creation support.
>  * Drop the patch to handle the RMP fault for the kernel address.
>  * Add functions to track the write access from the hypervisor.
>  * Do not enable the SNP feature when IOMMU is disabled or is in passthrough mode.
>  * Dump the RMP entry on RMP violation for the debug.
>  * Shorten the GHCB macro names.
>  * Start the SNP_INIT command id from 255 to give some gap for the legacy SEV.
>  * Sync the header with the latest 0.9 SNP spec.

What happened to the THP splitting on RMP violations?


* Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address Brijesh Singh
@ 2021-07-08 16:16   ` Dave Hansen
  2021-07-12 15:43     ` Brijesh Singh
  2021-07-30 16:00   ` Vlastimil Babka
  1 sibling, 1 reply; 176+ messages in thread
From: Dave Hansen @ 2021-07-08 16:16 UTC (permalink / raw)
  To: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

Oh, here's the THP code.  The subject just changed.

On 7/7/21 11:35 AM, Brijesh Singh wrote:
> When SEV-SNP is enabled globally, a write from the host goes through the
> RMP check. When the host writes to pages, hardware checks the following
> conditions at the end of page walk:
> 
> 1. Assigned bit in the RMP table is zero (i.e. page is shared).
> 2. If the page table entry that gives the sPA indicates that the target
>    page size is a large page, then all RMP entries for the 4KB
>    constituting pages of the target must have the assigned bit 0.
> 3. Immutable bit in the RMP table is not zero.
> 
> The hardware will raise page fault if one of the above conditions is not
> met. Try resolving the fault instead of taking fault again and again. If
> the host attempts to write to the guest private memory then send the
> SIGBUG signal to kill the process. If the page level between the host and

"SIGBUG"?

> RMP entry does not match, then split the address to keep the RMP and host
> page levels in sync.


> ---
>  arch/x86/mm/fault.c | 69 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h  |  6 +++-
>  mm/memory.c         | 13 +++++++++
>  3 files changed, 87 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 195149eae9b6..cdf48019c1a7 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1281,6 +1281,58 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>  }
>  NOKPROBE_SYMBOL(do_kern_addr_fault);
>  
> +#define RMP_FAULT_RETRY		0
> +#define RMP_FAULT_KILL		1
> +#define RMP_FAULT_PAGE_SPLIT	2
> +
> +static inline size_t pages_per_hpage(int level)
> +{
> +	return page_level_size(level) / PAGE_SIZE;
> +}
> +
> +static int handle_user_rmp_page_fault(unsigned long hw_error_code, unsigned long address)
> +{
> +	unsigned long pfn, mask;
> +	int rmp_level, level;
> +	struct rmpentry *e;
> +	pte_t *pte;
> +
> +	if (unlikely(!cpu_feature_enabled(X86_FEATURE_SEV_SNP)))
> +		return RMP_FAULT_KILL;

Shouldn't this be a WARN_ON_ONCE()?  How can we get RMP faults without
SEV-SNP?

> +	/* Get the native page level */
> +	pte = lookup_address_in_mm(current->mm, address, &level);
> +	if (unlikely(!pte))
> +		return RMP_FAULT_KILL;

What would this mean?  There was an RMP fault on a non-present page?
How could that happen?  What if there was a race between an unmapping
event and the RMP fault delivery?

> +	pfn = pte_pfn(*pte);
> +	if (level > PG_LEVEL_4K) {
> +		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> +		pfn |= (address >> PAGE_SHIFT) & mask;
> +	}

This looks inherently racy.  What happens if there are two parallel RMP
faults on the same 2M page.  One of them splits the page tables, the
other gets a fault for an already-split page table.

Is that handled here somehow?

> +	/* Get the page level from the RMP entry. */
> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
> +	if (!e)
> +		return RMP_FAULT_KILL;

The snp_lookup_page_in_rmptable() failure cases look WARN-worthy.
Either you're doing a lookup for something not *IN* the RMP table, or
you don't support SEV-SNP, in which case you shouldn't be in this code
in the first place.

> +	/*
> +	 * Check if the RMP violation is due to the guest private page access.
> +	 * We can not resolve this RMP fault, ask to kill the guest.
> +	 */
> +	if (rmpentry_assigned(e))
> +		return RMP_FAULT_KILL;

No "We's", please.  Speak in imperative voice.

> +	/*
> +	 * The backing page level is higher than the RMP page level, request
> +	 * to split the page.
> +	 */
> +	if (level > rmp_level)
> +		return RMP_FAULT_PAGE_SPLIT;

This can theoretically trigger on a hugetlbfs page.  Right?

I thought I asked about this before... more below...

> +	return RMP_FAULT_RETRY;
> +}
> +
>  /*
>   * Handle faults in the user portion of the address space.  Nothing in here
>   * should check X86_PF_USER without a specific justification: for almost
> @@ -1298,6 +1350,7 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	struct task_struct *tsk;
>  	struct mm_struct *mm;
>  	vm_fault_t fault;
> +	int ret;
>  	unsigned int flags = FAULT_FLAG_DEFAULT;
>  
>  	tsk = current;
> @@ -1378,6 +1431,22 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	if (error_code & X86_PF_INSTR)
>  		flags |= FAULT_FLAG_INSTRUCTION;
>  
> +	/*
> +	 * If its an RMP violation, try resolving it.
> +	 */
> +	if (error_code & X86_PF_RMP) {
> +		ret = handle_user_rmp_page_fault(error_code, address);
> +		if (ret == RMP_FAULT_PAGE_SPLIT) {
> +			flags |= FAULT_FLAG_PAGE_SPLIT;
> +		} else if (ret == RMP_FAULT_KILL) {
> +			fault |= VM_FAULT_SIGBUS;
> +			do_sigbus(regs, error_code, address, fault);
> +			return;
> +		} else {
> +			return;
> +		}
> +	}

Why not just have handle_user_rmp_page_fault() return a VM_FAULT_* code
directly?

I also suspect you can just set VM_FAULT_SIGBUS and let the do_sigbus()
call later on in the function do its work.

>  	 * Faults in the vsyscall page might need emulation.  The
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 322ec61d0da7..211dfe5d3b1d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -450,6 +450,8 @@ extern pgprot_t protection_map[16];
>   * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
>   * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
>   * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
> + * @FAULT_FLAG_PAGE_SPLIT: The fault was due to a page size mismatch; split the
> + *  region to a smaller page size and retry.
>   *
>   * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
>   * whether we would allow page faults to retry by specifying these two
> @@ -481,6 +483,7 @@ enum fault_flag {
>  	FAULT_FLAG_REMOTE =		1 << 7,
>  	FAULT_FLAG_INSTRUCTION =	1 << 8,
>  	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
> +	FAULT_FLAG_PAGE_SPLIT =		1 << 10,
>  };
>  
>  /*
> @@ -520,7 +523,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
>  	{ FAULT_FLAG_USER,		"USER" }, \
>  	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
>  	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
> -	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
> +	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
> +	{ FAULT_FLAG_PAGE_SPLIT,	"PAGESPLIT" }
>  
>  /*
>   * vm_fault is filled by the pagefault handler and passed to the vma's
> diff --git a/mm/memory.c b/mm/memory.c
> index 730daa00952b..aef261d94e33 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4407,6 +4407,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +static int handle_split_page_fault(struct vm_fault *vmf)
> +{
> +	if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
> +		return VM_FAULT_SIGBUS;
> +
> +	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> +	return 0;
> +}

What will this do when you hand it a hugetlbfs page?


* Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault
  2021-07-08 15:30       ` Dave Hansen
@ 2021-07-08 16:48         ` Brijesh Singh
  2021-07-08 16:58           ` Dave Hansen
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-08 16:48 UTC (permalink / raw)
  To: Dave Hansen, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: brijesh.singh, Thomas Gleixner, Ingo Molnar, Joerg Roedel,
	Tom Lendacky, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh



On 7/8/21 10:30 AM, Dave Hansen wrote:
> I was even thinking that you could use the pmd/pte entries that come
> from the walk in dump_pagetable().
> 
> BTW, I think the snp_lookup_page_in_rmptable() interface is probably
> wrong.  It takes a 'struct page':
> 

In some cases the caller already has a 'struct page', so it was easier on 
them. I can change it to snp_lookup_pfn_in_rmptable() to simplify 
things. Callers that already have a 'struct page' will 
simply do page_to_pfn().


> +struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level)
> 
> but then immediately converts it to a paddr:
> 
>> +	unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
> 
> If you just had it take a paddr, you wouldn't have to mess with all of
> this pfn_valid() and phys_to_page() error checking.

Noted.

> 
> Or fix the snp_lookup_page_in_rmptable() interface, please.

Yes.

> 
> Let's capitalize "RMP" here, please.

Noted.

> 
>> PGD 304b201067 P4D 304b201067 PUD 20c7f06063 PMD 20c8976063 PTE
>> 80000020cee00163
>> RMPEntry paddr 0x20cee00000 [assigned=1 immutable=1 pagesize=0 gpa=0x0
                                 ^^^^^^^^^^

>> asid=0 vmsa=0 validated=0]
>> RMPEntry paddr 0x20cee00000 000000000000000f 8000000000000ffd
> 
> That's a good example, thanks!
> 
> But, it does make me think that we shouldn't be spitting out
> "immutable".  Should we call it "readonly" or something so that folks
> have a better chance of figuring out what's wrong?  Even better, should
> we be looking specifically for X86_PF_RMP *and* immutable=1 and spitting
> out something in english about it?
> 

A write to an assigned page will cause the RMP violation. In this case, 
the page happened to be a firmware page, hence the immutable bit was also 
set. I am trying to use the field names as documented in the APM and 
SEV-SNP firmware spec.


> This also *looks* to be spitting out the same "RMPEntry paddr
> 0x20cee00000" more than once.  Maybe we should just indent the extra
> entries instead of repeating things.  The high/low are missing a "0x"
> prefix, they also don't have any kind of text label.
> 
Noted, I will fix it.

> 
> I actually really like processing the fields.  I think it's a good
> investment to make the error messages as self-documenting as possible
> and not require the poor souls who are decoding oopses to also keep each
> vendor's architecture manuals at hand.
> 
Sounds good, I will keep it as-is.


>>
>> The reason for iterating through the 2MB region is: if the faulting address
>> is not assigned in the RMP table, and the page table walk level is 2MB, then
>> one of the entries within the large page is the root cause of the fault. Since
>> we don't know which entry, I dump all the non-zero entries.
> 
> Logically you can figure this out though, right?  Why throw 511 entries
> at the console when we *know* they're useless?

Logically it's going to be tricky to figure out which exact entry caused 
the fault, hence I dump any non-zero entry. I understand it may dump 
some useless entries.


>> There are two cases which we need to consider:
>>
>> 1) the faulting page is a guest private (aka assigned)
>> 2) the faulting page is a hypervisor (aka shared)
>>
>> We will be primarily seeing #1. In this case, we know it's an assigned
>> page, and we can decode the fields.
>>
>> The #2 will happen in rare conditions,
> 
> What rare conditions?
> 

One such condition is when the RMP "in-use" bit is set; see patch 20/40. 
After applying the patch we should not see the "in-use" bit set. If we run 
into similar issues, a full RMP dump will greatly help with debugging.


>> if it happens, one of the undocumented bits in the RMP entry can
>> provide us some useful information, hence we dump the raw values.
> You're saying that there are things that can cause RMP faults that
> aren't documented?  That's rather nasty for your users, don't you think?
> 

The "in-use" bit in the RMP entry caught me off guard. The AMD APM does 
say that hardware sets the in-use bit, but it *never* explains in 
detail how to check whether the fault was due to the in-use bit in the RMP 
table. As I said, the documentation folks will be updating the RMP entry 
description to document the in-use bit. I hope we will not see any other 
undocumented surprises; I am keeping my fingers crossed :)


> I'd be fine if you want to define a mask of unknown bits and spit out to
> the users that some unknown bits are set.
> 


* Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault
  2021-07-08 16:48         ` Brijesh Singh
@ 2021-07-08 16:58           ` Dave Hansen
  2021-07-08 17:11             ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Dave Hansen @ 2021-07-08 16:58 UTC (permalink / raw)
  To: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/8/21 9:48 AM, Brijesh Singh wrote:
> On 7/8/21 10:30 AM, Dave Hansen wrote:
>>> The reason for iterating through the 2MB region is: if the faulting address
>>> is not assigned in the RMP table, and the page table walk level is 2MB, then
>>> one of the entries within the large page is the root cause of the fault. Since
>>> we don't know which entry, I dump all the non-zero entries.
>>
>> Logically you can figure this out though, right?  Why throw 511 entries
>> at the console when we *know* they're useless?
> 
> Logically it's going to be tricky to figure out which exact entry caused
> the fault, hence I dump any non-zero entry. I understand it may dump
> some useless entries.

What's tricky about it?

Sure, there's a possibility that more than one entry could contribute to
a fault.  But, you always know *IF* an entry could contribute to a fault.

I'm fine if you run through the logic, don't find a known reason
(specific RMP entry) for the fault, and dump the whole table in that
case.  But, unconditionally polluting the kernel log with noise isn't
very nice for debugging.

>>> There are two cases which we need to consider:
>>>
>>> 1) the faulting page is a guest private (aka assigned)
>>> 2) the faulting page is a hypervisor (aka shared)
>>>
>>> We will be primarily seeing #1. In this case, we know it's an assigned
>>> page, and we can decode the fields.
>>>
>>> The #2 will happen in rare conditions,
>>
>> What rare conditions?
> 
> One such condition is RMP "in-use" bit is set; see the patch 20/40.
> After applying the patch we should not see "in-use" bit set. If we run
> into similar issues, a full RMP dump will greatly help debug.

OK... so dump the "in-use" bit here if you see it.

>>> if it happens, one of the undocumented bits in the RMP entry can
>>> provide us some useful information, hence we dump the raw values.
>> You're saying that there are things that can cause RMP faults that
>> aren't documented?  That's rather nasty for your users, don't you think?
> 
> The "in-use" bit in the RMP entry caught me off guard. The AMD APM does
> say that hardware sets the in-use bit, but it *never* explains in
> detail how to check whether the fault was due to the in-use bit in the RMP
> table. As I said, the documentation folks will be updating the RMP entry
> description to document the in-use bit. I hope we will not see any other
> undocumented surprises; I am keeping my fingers crossed :)

Oh, ok.  That sounds fine.  Documentation is out of date all the time.


* Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault
  2021-07-08 16:58           ` Dave Hansen
@ 2021-07-08 17:11             ` Brijesh Singh
  2021-07-08 17:15               ` Dave Hansen
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-08 17:11 UTC (permalink / raw)
  To: Dave Hansen, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: brijesh.singh, Thomas Gleixner, Ingo Molnar, Joerg Roedel,
	Tom Lendacky, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh



On 7/8/21 11:58 AM, Dave Hansen wrote:
>> Logically it's going to be tricky to figure out which exact entry caused
>> the fault, hence I dump any non-zero entry. I understand it may dump
>> some useless entries.
> 
> What's tricky about it?
> 
> Sure, there's a possibility that more than one entry could contribute to
> a fault.  But, you always know *IF* an entry could contribute to a fault.
> 
> I'm fine if you run through the logic, don't find a known reason
> (specific RMP entry) for the fault, and dump the whole table in that
> case.  But, unconditionally polluting the kernel log with noise isn't
> very nice for debugging.

The tricky part is determining which undocumented bit to check to know 
that we should stop the dump. I can go with your suggestion: first try 
the known reasons, and fall back to dumping the whole table only for 
unknown reasons.

thanks


* Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault
  2021-07-08 17:11             ` Brijesh Singh
@ 2021-07-08 17:15               ` Dave Hansen
  0 siblings, 0 replies; 176+ messages in thread
From: Dave Hansen @ 2021-07-08 17:15 UTC (permalink / raw)
  To: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/8/21 10:11 AM, Brijesh Singh wrote:
> On 7/8/21 11:58 AM, Dave Hansen wrote:
>>> Logically it's going to be tricky to figure out which exact entry caused
>>> the fault, hence I dump any non-zero entry. I understand it may dump
>>> some useless entries.
>>
>> What's tricky about it?
>>
>> Sure, there's a possibility that more than one entry could contribute to
>> a fault.  But, you always know *IF* an entry could contribute to a fault.
>>
>> I'm fine if you run through the logic, don't find a known reason
>> (specific RMP entry) for the fault, and dump the whole table in that
>> case.  But, unconditionally polluting the kernel log with noise isn't
>> very nice for debugging.
> 
> The tricky part is determining which undocumented bit to check to know
> that we should stop the dump. I can go with your suggestion: first try
> the known reasons, and fall back to dumping the whole table only for
> unknown reasons.

You *can't* stop because of undocumented bits.  Fundamentally.  You
literally don't know if the bit means "this caused a fault" versus "this
definitely couldn't cause a fault".

Basically, if we get to the point of dumping the whole table, we should
also spit out an error message saying that the kernel is dazed and
confused and can't figure out why the hardware caused a fault.  Then,
dump out the whole table so that the "hardware" folks can have a look.


* Re: [PATCH Part2 RFC v4 14/40] crypto:ccp: Provide APIs to issue SEV-SNP commands
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 14/40] crypto:ccp: Provide APIs to issue SEV-SNP commands Brijesh Singh
@ 2021-07-08 18:56   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 176+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-08 18:56 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Andy Lutomirski, Dave Hansen, Sergio Lopez,
	Peter Gonda, Peter Zijlstra, Srinivas Pandruvada, David Rientjes,
	Dov Murik, Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

* Brijesh Singh (brijesh.singh@amd.com) wrote:
> Provide the APIs for the hypervisor to manage an SEV-SNP guest. The
> commands for SEV-SNP are defined in the SEV-SNP firmware specification.
> 
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  drivers/crypto/ccp/sev-dev.c | 24 ++++++++++++
>  include/linux/psp-sev.h      | 74 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 98 insertions(+)
> 
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 84c91bab00bd..ad9a0c8111e0 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -1017,6 +1017,30 @@ int sev_guest_df_flush(int *error)
>  }
>  EXPORT_SYMBOL_GPL(sev_guest_df_flush);
>  
> +int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error)
> +{
> +	return sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, data, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_decommission);
> +
> +int snp_guest_df_flush(int *error)
> +{
> +	return sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_df_flush);
> +
> +int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
> +{
> +	return sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, data, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_page_reclaim);
> +
> +int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
> +{
> +	return sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, data, error);
> +}
> +EXPORT_SYMBOL_GPL(snp_guest_dbg_decrypt);
> +
>  static void sev_exit(struct kref *ref)
>  {
>  	misc_deregister(&misc_dev->misc);
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index 1b53e8782250..63ef766cbd7a 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -860,6 +860,65 @@ int sev_guest_df_flush(int *error);
>   */
>  int sev_guest_decommission(struct sev_data_decommission *data, int *error);
>  
> +/**
> + * snp_guest_df_flush - perform SNP DF_FLUSH command
> + *
> + * @sev_ret: sev command return code
> + *
> + * Returns:
> + * 0 if the sev successfully processed the command
> + * -%ENODEV    if the sev device is not available
> + * -%ENOTSUPP  if the sev does not support SEV

Weird wording.

> + * -%ETIMEDOUT if the sev command timed out
> + * -%EIO       if the sev returned a non-zero return code
> + */
> +int snp_guest_df_flush(int *error);
> +
> +/**
> + * snp_guest_decommission - perform SNP_DECOMMISSION command
> + *
> + * @decommission: sev_data_decommission structure to be processed
> + * @sev_ret: sev command return code
> + *
> + * Returns:
> + * 0 if the sev successfully processed the command
> + * -%ENODEV    if the sev device is not available
> + * -%ENOTSUPP  if the sev does not support SEV
> + * -%ETIMEDOUT if the sev command timed out
> + * -%EIO       if the sev returned a non-zero return code
> + */
> +int snp_guest_decommission(struct sev_data_snp_decommission *data, int *error);
> +
> +/**
> + * snp_guest_page_reclaim - perform SNP_PAGE_RECLAIM command
> + *
> + * @decommission: sev_snp_page_reclaim structure to be processed
> + * @sev_ret: sev command return code
> + *
> + * Returns:
> + * 0 if the sev successfully processed the command
> + * -%ENODEV    if the sev device is not available
> + * -%ENOTSUPP  if the sev does not support SEV
> + * -%ETIMEDOUT if the sev command timed out
> + * -%EIO       if the sev returned a non-zero return code
> + */
> +int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error);
> +
> +/**
> + * snp_guest_dbg_decrypt - perform SEV SNP_DBG_DECRYPT command
> + *
> + * @sev_ret: sev command return code
> + *
> + * Returns:
> + * 0 if the sev successfully processed the command
> + * -%ENODEV    if the sev device is not available
> + * -%ENOTSUPP  if the sev does not support SEV
> + * -%ETIMEDOUT if the sev command timed out
> + * -%EIO       if the sev returned a non-zero return code
> + */
> +int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);
> +
> +
>  void *psp_copy_user_blob(u64 uaddr, u32 len);
>  
>  #else	/* !CONFIG_CRYPTO_DEV_SP_PSP */
> @@ -887,6 +946,21 @@ sev_issue_cmd_external_user(struct file *filep, unsigned int id, void *data, int
>  
>  static inline void *psp_copy_user_blob(u64 __user uaddr, u32 len) { return ERR_PTR(-EINVAL); }
>  
> +static inline int
> +snp_guest_decommission(struct sev_data_snp_decommission *data, int *error) { return -ENODEV; }
> +
> +static inline int snp_guest_df_flush(int *error) { return -ENODEV; }
> +
> +static inline int snp_guest_page_reclaim(struct sev_data_snp_page_reclaim *data, int *error)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error)
> +{
> +	return -ENODEV;
> +}
> +
>  #endif	/* CONFIG_CRYPTO_DEV_SP_PSP */
>  
>  #endif	/* __PSP_SEV_H__ */
> -- 
> 2.17.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
  2021-07-08 16:16   ` Dave Hansen
@ 2021-07-12 15:43     ` Brijesh Singh
  2021-07-12 16:00       ` Dave Hansen
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-12 15:43 UTC (permalink / raw)
  To: Dave Hansen, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: brijesh.singh, Thomas Gleixner, Ingo Molnar, Joerg Roedel,
	Tom Lendacky, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

Hi Dave,


On 7/8/21 11:16 AM, Dave Hansen wrote:
> 
> "SIGBUG"?

It's a typo, it should be SIGBUS.

>> +
>> +	if (unlikely(!cpu_feature_enabled(X86_FEATURE_SEV_SNP)))
>> +		return RMP_FAULT_KILL;
> 
> Shouldn't this be a WARN_ON_ONCE()?  How can we get RMP faults without
> SEV-SNP?

Yes, we should *not* get an RMP fault if SEV-SNP is not enabled. I can use 
WARN_ON_ONCE().


> 
>> +	/* Get the native page level */
>> +	pte = lookup_address_in_mm(current->mm, address, &level);
>> +	if (unlikely(!pte))
>> +		return RMP_FAULT_KILL;
> 
> What would this mean?  There was an RMP fault on a non-present page?
> How could that happen?  What if there was a race between an unmapping
> event and the RMP fault delivery?

We should not have an RMP fault for non-present pages. But you have a good 
point that there may be a race between the unmap event and the RMP fault 
delivery. Instead of terminating the process, we should simply retry.


> 
>> +	pfn = pte_pfn(*pte);
>> +	if (level > PG_LEVEL_4K) {
>> +		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
>> +		pfn |= (address >> PAGE_SHIFT) & mask;
>> +	}
> 
> This looks inherently racy.  What happens if there are two parallel RMP
> faults on the same 2M page.  One of them splits the page tables, the
> other gets a fault for an already-split page table.
> 
> Is that handled here somehow?

Yes, in this particular case we simply retry, and hardware should 
re-evaluate the page level and take the corrective action.


> 
>> +	/* Get the page level from the RMP entry. */
>> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
>> +	if (!e)
>> +		return RMP_FAULT_KILL;
> 
> The snp_lookup_page_in_rmptable() failure cases look WARN-worthy.
> Either you're doing a lookup for something not *IN* the RMP table, or
> you don't support SEV-SNP, in which case you shouldn't be in this code
> in the first place.

Noted.

> 
>> +	/*
>> +	 * Check if the RMP violation is due to the guest private page access.
>> +	 * We can not resolve this RMP fault, ask to kill the guest.
>> +	 */
>> +	if (rmpentry_assigned(e))
>> +		return RMP_FAULT_KILL;
> 
> No "We's", please.  Speak in imperative voice.

Noted.

> 
>> +	/*
>> +	 * The backing page level is higher than the RMP page level, request
>> +	 * to split the page.
>> +	 */
>> +	if (level > rmp_level)
>> +		return RMP_FAULT_PAGE_SPLIT;
> 
> This can theoretically trigger on a hugetlbfs page.  Right?
> 

Yes, theoretically.

In the current implementation, the VMM is enlightened not to use 
hugetlbfs for backing pages when creating SEV-SNP guests.


> I thought I asked about this before... more below...
> 
>> +	return RMP_FAULT_RETRY;
>> +}
>> +
>>   /*
>>    * Handle faults in the user portion of the address space.  Nothing in here
>>    * should check X86_PF_USER without a specific justification: for almost
>> @@ -1298,6 +1350,7 @@ void do_user_addr_fault(struct pt_regs *regs,
>>   	struct task_struct *tsk;
>>   	struct mm_struct *mm;
>>   	vm_fault_t fault;
>> +	int ret;
>>   	unsigned int flags = FAULT_FLAG_DEFAULT;
>>   
>>   	tsk = current;
>> @@ -1378,6 +1431,22 @@ void
> (struct pt_regs *regs,
>>   	if (error_code & X86_PF_INSTR)
>>   		flags |= FAULT_FLAG_INSTRUCTION;
>>   
>> +	/*
>> +	 * If its an RMP violation, try resolving it.
>> +	 */
>> +	if (error_code & X86_PF_RMP) {
>> +		ret = handle_user_rmp_page_fault(error_code, address);
>> +		if (ret == RMP_FAULT_PAGE_SPLIT) {
>> +			flags |= FAULT_FLAG_PAGE_SPLIT;
>> +		} else if (ret == RMP_FAULT_KILL) {
>> +			fault |= VM_FAULT_SIGBUS;
>> +			do_sigbus(regs, error_code, address, fault);
>> +			return;
>> +		} else {
>> +			return;
>> +		}
>> +	}
> 
> Why not just have handle_user_rmp_page_fault() return a VM_FAULT_* code
> directly?
> 

I don't have any strong reason against it. In the next rev, I can update 
it to use the VM_FAULT_* codes and call do_sigbus(), etc.

> I also suspect you can just set VM_FAULT_SIGBUS and let the do_sigbus()
> call later on in the function do its work.
>>   
>> +static int handle_split_page_fault(struct vm_fault *vmf)
>> +{
>> +	if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
>> +		return VM_FAULT_SIGBUS;
>> +
>> +	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>> +	return 0;
>> +}
> 
> What will this do when you hand it a hugetlbfs page?
> 

VMM is updated to not use the hugetlbfs when creating SEV-SNP guests. 
So, we should not run into it.

-Brijesh

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
  2021-07-12 15:43     ` Brijesh Singh
@ 2021-07-12 16:00       ` Dave Hansen
  2021-07-12 16:11         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Dave Hansen @ 2021-07-12 16:00 UTC (permalink / raw)
  To: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/12/21 8:43 AM, Brijesh Singh wrote:
>>> +    /*
>>> +     * The backing page level is higher than the RMP page level,
>>> request
>>> +     * to split the page.
>>> +     */
>>> +    if (level > rmp_level)
>>> +        return RMP_FAULT_PAGE_SPLIT;
>>
>> This can theoretically trigger on a hugetlbfs page.  Right?
> 
> Yes, theoretically.
> 
> In the current implementation, the VMM is enlightened to not use the
> hugetlbfs for backing page when creating the SEV-SNP guests.

"The VMM"?

We try to write kernel code so that it "works" and doesn't do unexpected
things with whatever userspace might throw at it.  This seems to be
written with an assumption that no VMM will ever use hugetlbfs with SEV-SNP.

That worries me.  Not only because someone is sure to try it, but it's
the kind of assumption that an attacker or a fuzzer might try.

Could you please test this kernel code in practice with hugetlbfs?

>> I also suspect you can just set VM_FAULT_SIGBUS and let the do_sigbus()
>> call later on in the function do its work.
>>>   +static int handle_split_page_fault(struct vm_fault *vmf)
>>> +{
>>> +    if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
>>> +        return VM_FAULT_SIGBUS;
>>> +
>>> +    __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>>> +    return 0;
>>> +}
>>
>> What will this do when you hand it a hugetlbfs page?
> 
> VMM is updated to not use the hugetlbfs when creating SEV-SNP guests.
> So, we should not run into it.

Please fix this code to handle hugetlbfs along with any other non-THP
source of level>0 mappings.  DAX comes to mind.  "Handle" can mean
rejecting these.  You don't have to find some way to split them and make
the VM work, just fail safely, ideally as early as possible.

To me, this is a fundamental requirement before this code can be accepted.

How many more parts of this series are predicated on the behavior of the
VMM like this?


* Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
  2021-07-12 16:00       ` Dave Hansen
@ 2021-07-12 16:11         ` Brijesh Singh
  2021-07-12 16:15           ` Dave Hansen
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-12 16:11 UTC (permalink / raw)
  To: Dave Hansen, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: brijesh.singh, Thomas Gleixner, Ingo Molnar, Joerg Roedel,
	Tom Lendacky, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh



On 7/12/21 11:00 AM, Dave Hansen wrote:
> On 7/12/21 8:43 AM, Brijesh Singh wrote:
>>>> +    /*
>>>> +     * The backing page level is higher than the RMP page level,
>>>> request
>>>> +     * to split the page.
>>>> +     */
>>>> +    if (level > rmp_level)
>>>> +        return RMP_FAULT_PAGE_SPLIT;
>>>
>>> This can theoretically trigger on a hugetlbfs page.  Right?
>>
>> Yes, theoretically.
>>
>> In the current implementation, the VMM is enlightened to not use the
>> hugetlbfs for backing page when creating the SEV-SNP guests.
> 
> "The VMM"?

I meant userspace QEMU.

> 
> We try to write kernel code so that it "works" and doesn't do unexpected
> things with whatever userspace might throw at it.  This seems to be
> written with an assumption that no VMM will ever use hugetlbfs with SEV-SNP.
> 
> That worries me.  Not only because someone is sure to try it, but it's
> the kind of assumption that an attacker or a fuzzer might try.
> 
> Could you please test this kernel code in practice with hugetlbfs?

Yes, I will make sure that the hugetlbfs path is tested in the non-RFC version.


> 
>>> I also suspect you can just set VM_FAULT_SIGBUS and let the do_sigbus()
>>> call later on in the function do its work.
>>>>    +static int handle_split_page_fault(struct vm_fault *vmf)
>>>> +{
>>>> +    if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
>>>> +        return VM_FAULT_SIGBUS;
>>>> +
>>>> +    __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>>>> +    return 0;
>>>> +}
>>>
>>> What will this do when you hand it a hugetlbfs page?
>>
>> VMM is updated to not use the hugetlbfs when creating SEV-SNP guests.
>> So, we should not run into it.
> 
> Please fix this code to handle hugetlbfs along with any other non-THP
> source of level>0 mappings.  DAX comes to mind.  "Handle" can mean
> rejecting these.  You don't have to find some way to split them and make
> the VM work, just fail safely, ideally as early as possible.
> 
> To me, this is a fundamental requirement before this code can be accepted.

Understood. If userspace decides to use hugetlbfs backing pages, then I 
believe the earliest we can detect it is when we go about adding the 
pages to the RMP table. I'll add a check and fail the page state change.

-Brijesh


* Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
  2021-07-12 16:11         ` Brijesh Singh
@ 2021-07-12 16:15           ` Dave Hansen
  2021-07-12 16:24             ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Dave Hansen @ 2021-07-12 16:15 UTC (permalink / raw)
  To: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/12/21 9:11 AM, Brijesh Singh wrote:
>> Please fix this code to handle hugetlbfs along with any other non-THP
>> source of level>0 mappings.  DAX comes to mind.  "Handle" can mean
>> rejecting these.  You don't have to find some way to split them and make
>> the VM work, just fail safely, ideally as early as possible.
>>
>> To me, this is a fundamental requirement before this code can be
>> accepted.
> 
> Understood. If userspace decides to use hugetlbfs backing pages, then
> I believe the earliest we can detect it is when we go about adding the
> pages to the RMP table. I'll add a check and fail the page state change.

Really?  You had to feed the RMP entries from *some* mapping in the
first place.  Is there a reason the originating mapping can't be checked
at that point instead of waiting for the fault?


* Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
  2021-07-12 16:15           ` Dave Hansen
@ 2021-07-12 16:24             ` Brijesh Singh
  2021-07-12 16:29               ` Dave Hansen
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-12 16:24 UTC (permalink / raw)
  To: Dave Hansen, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: brijesh.singh, Thomas Gleixner, Ingo Molnar, Joerg Roedel,
	Tom Lendacky, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh



On 7/12/21 11:15 AM, Dave Hansen wrote:
> On 7/12/21 9:11 AM, Brijesh Singh wrote:
>>> Please fix this code to handle hugetlbfs along with any other non-THP
>>> source of level>0 mappings.  DAX comes to mind.  "Handle" can mean
>>> rejecting these.  You don't have to find some way to split them and make
>>> the VM work, just fail safely, ideally as early as possible.
>>>
>>> To me, this is a fundamental requirement before this code can be
>>> accepted.
>>
>> Understood. If userspace decides to use hugetlbfs backing pages, then
>> I believe the earliest we can detect it is when we go about adding the
>> pages to the RMP table. I'll add a check and fail the page state change.
> 
> Really?  You had to feed the RMP entries from *some* mapping in the
> first place.  Is there a reason the originating mapping can't be checked
> at that point instead of waiting for the fault?
> 

Apologies if I was not clear in the messaging; that's exactly what I 
meant: we don't feed RMP entries during the page state change.

The sequence of the operation is:

1. Guest issues a VMGEXIT (page state change) to add a page in the RMP
2. Hypervisor adds the page to the RMP table.

The check will be inside the hypervisor (#2): query the backing page 
type, and if the backing page is from hugetlbfs, don't add the page to 
the RMP and fail the page state change VMGEXIT.

-Brijesh


* Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
  2021-07-12 16:24             ` Brijesh Singh
@ 2021-07-12 16:29               ` Dave Hansen
  2021-07-12 16:49                 ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Dave Hansen @ 2021-07-12 16:29 UTC (permalink / raw)
  To: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/12/21 9:24 AM, Brijesh Singh wrote:
> Apologies if I was not clear in the messaging; that's exactly what I
> meant: we don't feed RMP entries during the page state change.
> 
> The sequence of the operation is:
> 
> 1. Guest issues a VMGEXIT (page state change) to add a page in the RMP
> 2. Hypervisor adds the page to the RMP table.
> 
> The check will be inside the hypervisor (#2): query the backing page
> type, and if the backing page is from hugetlbfs, don't add the page to
> the RMP and fail the page state change VMGEXIT.

Right, but *LOOOOOONG* before that, something walked the page tables and
stuffed the PFN into the NPT (that's the AMD equivalent of EPT, right?).
You could also avoid this whole mess by refusing to allow hugetlbfs to
be mapped into the guest in the first place.


* Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
  2021-07-12 16:29               ` Dave Hansen
@ 2021-07-12 16:49                 ` Brijesh Singh
  2021-07-15 21:53                   ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-12 16:49 UTC (permalink / raw)
  To: Dave Hansen, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto
  Cc: brijesh.singh, Thomas Gleixner, Ingo Molnar, Joerg Roedel,
	Tom Lendacky, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh



On 7/12/21 11:29 AM, Dave Hansen wrote:
> On 7/12/21 9:24 AM, Brijesh Singh wrote:
>> Apologies if I was not clear in the messaging; that's exactly what I
>> meant: we don't feed RMP entries during the page state change.
>>
>> The sequence of the operation is:
>>
>> 1. Guest issues a VMGEXIT (page state change) to add a page in the RMP
>> 2. Hypervisor adds the page to the RMP table.
>>
>> The check will be inside the hypervisor (#2): query the backing page
>> type, and if the backing page is from hugetlbfs, don't add the page to
>> the RMP and fail the page state change VMGEXIT.
> 
> Right, but *LOOOOOONG* before that, something walked the page tables and
> stuffed the PFN into the NPT (that's the AMD equivalent of EPT, right?).
>   You could also avoid this whole mess by refusing to allow hugetlbfs to
> be mapped into the guest in the first place.
> 

Ah, that should be doable. For SEV stuff, we require the VMM to register 
the memory regions with the hypervisor at VM creation time. I can check 
for hugetlbfs while registering the memory region and fail much earlier.

thanks


* Re: [PATCH Part2 RFC v4 06/40] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 06/40] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction Brijesh Singh
@ 2021-07-12 18:44   ` Peter Gonda
  2021-07-12 19:00     ` Dave Hansen
  0 siblings, 1 reply; 176+ messages in thread
From: Peter Gonda @ 2021-07-12 18:44 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm list, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Andy Lutomirski, Dave Hansen, Sergio Lopez,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, Nathaniel McCallum, brijesh.ksingh

> +int psmash(struct page *page)
> +{
> +       unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
> +       int ret;
> +
> +       if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> +               return -ENXIO;
> +
> +       /* Retry if another processor is modifying the RMP entry. */
> +       do {
> +               /* Binutils version 2.36 supports the PSMASH mnemonic. */
> +               asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
> +                             : "=a"(ret)
> +                             : "a"(spa)
> +                             : "memory", "cc");
> +       } while (ret == FAIL_INUSE);

Should there be some retry limit here for safety? Or do we know that
we'll never be stuck in this loop? Ditto for the loop in rmpupdate.

> +
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(psmash);
>


* Re: [PATCH Part2 RFC v4 23/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 23/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command Brijesh Singh
@ 2021-07-12 18:45   ` Peter Gonda
  2021-07-16 19:43   ` Sean Christopherson
  1 sibling, 0 replies; 176+ messages in thread
From: Peter Gonda @ 2021-07-12 18:45 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm list, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Andy Lutomirski, Dave Hansen, Sergio Lopez,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, Nathaniel McCallum, brijesh.ksingh

>
> +static int snp_decommission_context(struct kvm *kvm)
> +{
> +       struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +       struct sev_data_snp_decommission data = {};
> +       int ret;
> +
> +       /* If context is not created then do nothing */
> +       if (!sev->snp_context)
> +               return 0;
> +
> +       data.gctx_paddr = __sme_pa(sev->snp_context);
> +       ret = snp_guest_decommission(&data, NULL);
> +       if (ret)
> +               return ret;

Should we WARN or pr_err here? I see that in the case of
snp_launch_start's e_free_context we do not warn the user that they have
leaked a firmware page.

>
> +
> +       /* free the context page now */
> +       snp_free_firmware_page(sev->snp_context);
> +       sev->snp_context = NULL;
> +
> +       return 0;
> +}
> +
>  void sev_vm_destroy(struct kvm *kvm)
>  {
>         struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> @@ -1847,7 +1969,15 @@ void sev_vm_destroy(struct kvm *kvm)
>
>         mutex_unlock(&kvm->lock);
>
> -       sev_unbind_asid(kvm, sev->handle);
> +       if (sev_snp_guest(kvm)) {
> +               if (snp_decommission_context(kvm)) {
> +                       pr_err("Failed to free SNP guest context, leaking asid!\n");

Should these errors be a WARN since we are leaking some state?


> +                       return;
> +               }
> +       } else {
> +               sev_unbind_asid(kvm, sev->handle);
> +       }
> +
>         sev_asid_free(sev);
>  }
>


* Re: [PATCH Part2 RFC v4 06/40] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
  2021-07-12 18:44   ` Peter Gonda
@ 2021-07-12 19:00     ` Dave Hansen
  2021-07-15 18:56       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Dave Hansen @ 2021-07-12 19:00 UTC (permalink / raw)
  To: Peter Gonda, Brijesh Singh
  Cc: x86, linux-kernel, kvm list, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Andy Lutomirski, Dave Hansen, Sergio Lopez,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, Nathaniel McCallum, brijesh.ksingh

On 7/12/21 11:44 AM, Peter Gonda wrote:
>> +int psmash(struct page *page)
>> +{
>> +       unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
>> +       int ret;
>> +
>> +       if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>> +               return -ENXIO;
>> +
>> +       /* Retry if another processor is modifying the RMP entry. */
>> +       do {
>> +               /* Binutils version 2.36 supports the PSMASH mnemonic. */
>> +               asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
>> +                             : "=a"(ret)
>> +                             : "a"(spa)
>> +                             : "memory", "cc");
>> +       } while (ret == FAIL_INUSE);
> Should there be some retry limit here for safety? Or do we know that
> we'll never be stuck in this loop? Ditto for the loop in rmpupdate.

It's probably fine to just leave this.  While you could *theoretically*
lose this race forever, it's unlikely to happen in practice.  If it
does, you'll get an easy-to-understand softlockup backtrace which should
point here pretty quickly.

I think TDX has a few of these as well.  Most of the "SEAMCALL"s from
host to the firmware doing the security enforcement have something like
an -EBUSY as well.  I believe they just retry forever too.


* Re: [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled Brijesh Singh
@ 2021-07-14 13:22   ` Marc Orr
  2021-07-14 16:45     ` Brijesh Singh
  2021-07-15 23:48   ` Sean Christopherson
  1 sibling, 1 reply; 176+ messages in thread
From: Marc Orr @ 2021-07-14 13:22 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm list, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Andy Lutomirski, Dave Hansen, Sergio Lopez,
	Peter Gonda, Peter Zijlstra, Srinivas Pandruvada, David Rientjes,
	Dov Murik, Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Alper Gun

On Wed, Jul 7, 2021 at 11:37 AM Brijesh Singh <brijesh.singh@amd.com> wrote:
>
> The behavior and requirement for the SEV-legacy command is altered when
> the SNP firmware is in the INIT state. See SEV-SNP firmware specification
> for more details.
>
> When SNP is INIT state, all the SEV-legacy commands that cause the
> firmware to write memory must be in the firmware state. The TMR memory
> is allocated by the host but updated by the firmware, so, it must be
> in the firmware state.  Additionally, the TMR memory must be a 2MB aligned
> instead of the 1MB, and the TMR length need to be 2MB instead of 1MB.
> The helper __snp_{alloc,free}_firmware_pages() can be used for allocating
> and freeing the memory used by the firmware.
>
> While at it, provide API that can be used by others to allocate a page
> that can be used by the firmware. The immediate user for this API will
> be the KVM driver. The KVM driver to need to allocate a firmware context
> page during the guest creation. The context page need to be updated
> by the firmware. See the SEV-SNP specification for further details.
>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  drivers/crypto/ccp/sev-dev.c | 144 +++++++++++++++++++++++++++++++----
>  include/linux/psp-sev.h      |  11 +++
>  2 files changed, 142 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index ad9a0c8111e0..bb07c68834a6 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -54,6 +54,14 @@ static int psp_timeout;
>  #define SEV_ES_TMR_SIZE                (1024 * 1024)
>  static void *sev_es_tmr;
>
> +/* When SEV-SNP is enabled the TMR need to be 2MB aligned and 2MB size. */

nit: "the TMR need" -> "the TMR needs"

> +#define SEV_SNP_ES_TMR_SIZE    (2 * 1024 * 1024)
> +
> +static size_t sev_es_tmr_size = SEV_ES_TMR_SIZE;
> +
> +static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret);
> +static int sev_do_cmd(int cmd, void *data, int *psp_ret);
> +
>  static inline bool sev_version_greater_or_equal(u8 maj, u8 min)
>  {
>         struct sev_device *sev = psp_master->sev_data;
> @@ -151,6 +159,112 @@ static int sev_cmd_buffer_len(int cmd)
>         return 0;
>  }
>
> +static int snp_reclaim_page(struct page *page, bool locked)
> +{
> +       struct sev_data_snp_page_reclaim data = {};

Hmmm.. according to some things I read online, an empty initializer
list is not legal in C. For example:
https://stackoverflow.com/questions/17589533/is-an-empty-initializer-list-valid-c-code
I'm sure this is compiling. Should we change this to `{0}`, which I
believe will initialize all fields in this struct to zero, according
to: https://stackoverflow.com/questions/11152160/initializing-a-struct-to-0?

> +       int ret, err;
> +
> +       data.paddr = page_to_pfn(page) << PAGE_SHIFT;
> +
> +       if (locked)
> +               ret = __sev_do_cmd_locked(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> +       else
> +               ret = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &err);
> +
> +       return ret;
> +}
> +
> +static int snp_set_rmptable_state(unsigned long paddr, int npages,
> +                                 struct rmpupdate *val, bool locked, bool need_reclaim)
> +{
> +       unsigned long pfn = __sme_clr(paddr) >> PAGE_SHIFT;
> +       unsigned long pfn_end = pfn + npages;
> +       struct psp_device *psp = psp_master;
> +       struct sev_device *sev;
> +       int rc;
> +
> +       if (!psp || !psp->sev_data)
> +               return 0;

Should this return a non-zero value -- maybe `-ENODEV`? Otherwise, the
`snp_alloc_firmware_page()` API will return a page that the caller
believes is suitable to use with FW. My concern is that someone
decides to use this API to stash a page very early on during kernel
boot and that page becomes a time bomb.

If we initialize `rc` to `-ENODEV` (or something similar), then every
return in this function can be `return rc`.

> +
> +       /* If SEV-SNP is initialized then add the page in RMP table. */
> +       sev = psp->sev_data;
> +       if (!sev->snp_inited)
> +               return 0;

Ditto. Should this turn a non-zero value?

> +
> +       while (pfn < pfn_end) {
> +               if (need_reclaim)
> +                       if (snp_reclaim_page(pfn_to_page(pfn), locked))
> +                               return -EFAULT;
> +
> +               rc = rmpupdate(pfn_to_page(pfn), val);
> +               if (rc)
> +                       return rc;
> +
> +               pfn++;
> +       }
> +
> +       return 0;
> +}
> +
> +static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int order, bool locked)
> +{
> +       struct rmpupdate val = {};

`{}` -> `{0}`? (Not sure, see my previous comment.)

> +       unsigned long paddr;
> +       struct page *page;
> +
> +       page = alloc_pages(gfp_mask, order);
> +       if (!page)
> +               return NULL;
> +
> +       val.assigned = 1;
> +       val.immutable = 1;
> +       paddr = __pa((unsigned long)page_address(page));
> +
> +       if (snp_set_rmptable_state(paddr, 1 << order, &val, locked, false)) {
> +               pr_warn("Failed to set page state (leaking it)\n");

Maybe `WARN_ONCE` instead of `pr_warn`? It's both a big attention
grabber and also rate limited.

> +               return NULL;
> +       }
> +
> +       return page;
> +}
> +
> +void *snp_alloc_firmware_page(gfp_t gfp_mask)
> +{
> +       struct page *page;
> +
> +       page = __snp_alloc_firmware_pages(gfp_mask, 0, false);
> +
> +       return page ? page_address(page) : NULL;
> +}
> +EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);
>
> +static void __snp_free_firmware_pages(struct page *page, int order, bool locked)
> +{
> +       struct rmpupdate val = {};

`{}` -> `{0}`? (Not sure, see my previous comment.)

> +       unsigned long paddr;
> +
> +       if (!page)
> +               return;
> +
> +       paddr = __pa((unsigned long)page_address(page));
> +
> +       if (snp_set_rmptable_state(paddr, 1 << order, &val, locked, true)) {
> +               pr_warn("Failed to set page state (leaking it)\n");

WARN_ONCE?

> +               return;
> +       }
> +
> +       __free_pages(page, order);
> +}
> +
> +void snp_free_firmware_page(void *addr)
> +{
> +       if (!addr)
> +               return;
> +
> +       __snp_free_firmware_pages(virt_to_page(addr), 0, false);
> +}
> +EXPORT_SYMBOL(snp_free_firmware_page);
> +
>  static int __sev_do_cmd_locked(int cmd, void *data, int *psp_ret)
>  {
>         struct psp_device *psp = psp_master;
> @@ -273,7 +387,7 @@ static int __sev_platform_init_locked(int *error)
>
>                 data.flags |= SEV_INIT_FLAGS_SEV_ES;
>                 data.tmr_address = tmr_pa;
> -               data.tmr_len = SEV_ES_TMR_SIZE;
> +               data.tmr_len = sev_es_tmr_size;
>         }
>
>         rc = __sev_do_cmd_locked(SEV_CMD_INIT, &data, error);
> @@ -630,6 +744,8 @@ static int __sev_snp_init_locked(int *error)
>         sev->snp_inited = true;
>         dev_dbg(sev->dev, "SEV-SNP firmware initialized\n");
>
> +       sev_es_tmr_size = SEV_SNP_ES_TMR_SIZE;
> +
>         return rc;
>  }
>
> @@ -1153,8 +1269,10 @@ static void sev_firmware_shutdown(struct sev_device *sev)
>                 /* The TMR area was encrypted, flush it from the cache */
>                 wbinvd_on_all_cpus();
>
> -               free_pages((unsigned long)sev_es_tmr,
> -                          get_order(SEV_ES_TMR_SIZE));
> +
> +               __snp_free_firmware_pages(virt_to_page(sev_es_tmr),
> +                                         get_order(sev_es_tmr_size),
> +                                         false);
>                 sev_es_tmr = NULL;
>         }
>
> @@ -1204,16 +1322,6 @@ void sev_pci_init(void)
>             sev_update_firmware(sev->dev) == 0)
>                 sev_get_api_version();
>
> -       /* Obtain the TMR memory area for SEV-ES use */
> -       tmr_page = alloc_pages(GFP_KERNEL, get_order(SEV_ES_TMR_SIZE));
> -       if (tmr_page) {
> -               sev_es_tmr = page_address(tmr_page);
> -       } else {
> -               sev_es_tmr = NULL;
> -               dev_warn(sev->dev,
> -                        "SEV: TMR allocation failed, SEV-ES support unavailable\n");
> -       }
> -
>         /*
>          * If boot CPU supports the SNP, then first attempt to initialize
>          * the SNP firmware.
> @@ -1229,6 +1337,16 @@ void sev_pci_init(void)
>                 }
>         }
>
> +       /* Obtain the TMR memory area for SEV-ES use */
> +       tmr_page = __snp_alloc_firmware_pages(GFP_KERNEL, get_order(sev_es_tmr_size), false);
> +       if (tmr_page) {
> +               sev_es_tmr = page_address(tmr_page);
> +       } else {
> +               sev_es_tmr = NULL;
> +               dev_warn(sev->dev,
> +                        "SEV: TMR allocation failed, SEV-ES support unavailable\n");
> +       }
> +
>         /* Initialize the platform */
>         rc = sev_platform_init(&error);
>         if (rc && (error == SEV_RET_SECURE_DATA_INVALID)) {
> diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
> index 63ef766cbd7a..b72a74f6a4e9 100644
> --- a/include/linux/psp-sev.h
> +++ b/include/linux/psp-sev.h
> @@ -12,6 +12,8 @@
>  #ifndef __PSP_SEV_H__
>  #define __PSP_SEV_H__
>
> +#include <linux/sev.h>
> +
>  #include <uapi/linux/psp-sev.h>
>
>  #ifdef CONFIG_X86
> @@ -920,6 +922,8 @@ int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *error);
>
>
>  void *psp_copy_user_blob(u64 uaddr, u32 len);
> +void *snp_alloc_firmware_page(gfp_t mask);
> +void snp_free_firmware_page(void *addr);
>
>  #else  /* !CONFIG_CRYPTO_DEV_SP_PSP */
>
> @@ -961,6 +965,13 @@ static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *erro
>         return -ENODEV;
>  }
>
> +static inline void *snp_alloc_firmware_page(gfp_t mask)
> +{
> +       return NULL;
> +}
> +
> +static inline void snp_free_firmware_page(void *addr) { }
> +
>  #endif /* CONFIG_CRYPTO_DEV_SP_PSP */
>
>  #endif /* __PSP_SEV_H__ */
> --
> 2.17.1
>
>

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 20/40] KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 20/40] KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe Brijesh Singh
@ 2021-07-14 13:35   ` Marc Orr
  2021-07-14 16:47     ` Brijesh Singh
  2021-07-20 18:02   ` Sean Christopherson
  1 sibling, 1 reply; 176+ messages in thread
From: Marc Orr @ 2021-07-14 13:35 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm list, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Andy Lutomirski, Dave Hansen, Sergio Lopez,
	Peter Gonda, Peter Zijlstra, Srinivas Pandruvada, David Rientjes,
	Dov Murik, Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 7, 2021 at 11:38 AM Brijesh Singh <brijesh.singh@amd.com> wrote:
>
> When SEV-SNP is globally enabled on a system, the VMRUN instruction
> performs additional security checks on the AVIC backing, VMSA, and VMCB
> pages. On a successful VMRUN, these pages are marked "in-use" by the
> hardware in the RMP entry, and any attempt to modify the RMP entry for
> these pages will result in a page fault (RMP violation check).
>
> While performing the RMP check, hardware will try to create a 2MB TLB
> entry for large page accesses. When it does this, it first reads the
> RMP entry for the base of the 2MB region and verifies that all of this
> memory is safe. If the AVIC backing, VMSA, or VMCB memory happens to be
> at the base of the 2MB region, the RMP check will fail because of the
> "in-use" marking on the base entry of this 2MB region.
>
> e.g.
>
> 1. A VMCB was allocated on 2MB-aligned address.
> 2. The VMRUN instruction marks this RMP entry as "in-use".
> 3. Another process allocated some other page of memory that happened to be
>    within the same 2MB region.
> 4. That process tried to write its page using physmap.
>
> If the physmap entry in step #4 uses a large (1G/2M) page, then the
> hardware will attempt to create a 2M TLB entry. The hardware will find
> that the "in-use" bit is set in the RMP entry (because it was a
> VMCB page) and will cause an RMP violation check.
>
> See APM2 section 15.36.12 for more information on VMRUN checks when
> SEV-SNP is globally active.
>
> A generic allocator can return a page which is 2MB-aligned and therefore
> not safe to use when SEV-SNP is globally enabled. Add a
> snp_safe_alloc_page() helper that can be used for allocating SNP-safe
> memory. The helper allocates an order-1 block and splits it into two
> order-0 pages. It frees the page that is 2MB-aligned and keeps the one
> that is not.
>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>

Co-developed-by: Marc Orr <marcorr@google.com>

The original version of this patch had this tag. I think it got
dropped by accident.

> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/lapic.c            |  5 ++++-
>  arch/x86/kvm/svm/sev.c          | 27 +++++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.c          | 16 ++++++++++++++--
>  arch/x86/kvm/svm/svm.h          |  1 +
>  5 files changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 55efbacfc244..188110ab2c02 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1383,6 +1383,7 @@ struct kvm_x86_ops {
>         int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
>
>         void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
> +       void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
>  };
>
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index c0ebef560bd1..d4c77f66d7d5 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2441,7 +2441,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
>
>         vcpu->arch.apic = apic;
>
> -       apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> +       if (kvm_x86_ops.alloc_apic_backing_page)
> +               apic->regs = kvm_x86_ops.alloc_apic_backing_page(vcpu);
> +       else
> +               apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
>         if (!apic->regs) {
>                 printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
>                        vcpu->vcpu_id);
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index b8505710c36b..411ed72f63af 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2692,3 +2692,30 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>                 break;
>         }
>  }
> +
> +struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
> +{
> +       unsigned long pfn;
> +       struct page *p;
> +
> +       if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> +               return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> +
> +       p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
> +       if (!p)
> +               return NULL;
> +
> +       /* split the page order */
> +       split_page(p, 1);
> +
> +       /* Find a non-2M aligned page */
> +       pfn = page_to_pfn(p);
> +       if (IS_ALIGNED(__pfn_to_phys(pfn), PMD_SIZE)) {
> +               pfn++;
> +               __free_page(p);
> +       } else {
> +               __free_page(pfn_to_page(pfn + 1));
> +       }
> +
> +       return pfn_to_page(pfn);
> +}
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 2acf187a3100..a7adf6ca1713 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1336,7 +1336,7 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
>         svm = to_svm(vcpu);
>
>         err = -ENOMEM;
> -       vmcb01_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> +       vmcb01_page = snp_safe_alloc_page(vcpu);
>         if (!vmcb01_page)
>                 goto out;
>
> @@ -1345,7 +1345,7 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
>                  * SEV-ES guests require a separate VMSA page used to contain
>                  * the encrypted register state of the guest.
>                  */
> -               vmsa_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> +               vmsa_page = snp_safe_alloc_page(vcpu);
>                 if (!vmsa_page)
>                         goto error_free_vmcb_page;
>
> @@ -4439,6 +4439,16 @@ static int svm_vm_init(struct kvm *kvm)
>         return 0;
>  }
>
> +static void *svm_alloc_apic_backing_page(struct kvm_vcpu *vcpu)
> +{
> +       struct page *page = snp_safe_alloc_page(vcpu);
> +
> +       if (!page)
> +               return NULL;
> +
> +       return page_address(page);
> +}
> +
>  static struct kvm_x86_ops svm_x86_ops __initdata = {
>         .hardware_unsetup = svm_hardware_teardown,
>         .hardware_enable = svm_hardware_enable,
> @@ -4564,6 +4574,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>         .complete_emulated_msr = svm_complete_emulated_msr,
>
>         .vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
> +
> +       .alloc_apic_backing_page = svm_alloc_apic_backing_page,
>  };
>
>  static struct kvm_x86_init_ops svm_init_ops __initdata = {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 5f874168551b..1175edb02d33 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -554,6 +554,7 @@ void sev_es_create_vcpu(struct vcpu_svm *svm);
>  void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
>  void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu);
>  void sev_es_unmap_ghcb(struct vcpu_svm *svm);
> +struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
>
>  /* vmenter.S */
>
> --
> 2.17.1
>
>


* Re: [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  2021-07-14 13:22   ` Marc Orr
@ 2021-07-14 16:45     ` Brijesh Singh
  2021-07-14 18:14       ` Marc Orr
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-14 16:45 UTC (permalink / raw)
  To: Marc Orr
  Cc: brijesh.singh, x86, linux-kernel, kvm list, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Alper Gun



On 7/14/21 8:22 AM, Marc Orr wrote:
>>
>> +static int snp_reclaim_page(struct page *page, bool locked)
>> +{
>> +       struct sev_data_snp_page_reclaim data = {};
> 
> Hmmm.. according to some things I read online, an empty initializer
> list is not legal in C. For example:
> https://stackoverflow.com/questions/17589533/is-an-empty-initializer-list-valid-c-code
> I'm sure this compiles. Should we change this to `{0}`, which I
> believe will initialize all fields in this struct to zero, according
> to: https://stackoverflow.com/questions/11152160/initializing-a-struct-to-0?
> 

Ah, good point. I will fix in next version.


> 
> Should this return a non-zero value -- maybe `-ENODEV`? Otherwise, the
> `snp_alloc_firmware_page()` API will return a page that the caller
> believes is suitable to use with FW. My concern is that someone
> decides to use this API to stash a page very early on during kernel
> boot and that page becomes a time bomb.

But that means the caller now needs to know that SNP is enabled before 
calling the APIs. The idea behind the API was that the caller does not 
need to know whether the firmware is in the INIT state. If the firmware 
has initialized SNP, then it will transparently set the immutable bit in 
the RMP table.

> 
> If we initialize `rc` to `-ENODEV` (or something similar), then every
> return in this function can be `return rc`.
> 
>> +
>> +       /* If SEV-SNP is initialized then add the page in RMP table. */
>> +       sev = psp->sev_data;
>> +       if (!sev->snp_inited)
>> +               return 0;
> 
> > Ditto. Should this return a non-zero value?
> 
>> +
>> +       while (pfn < pfn_end) {
>> +               if (need_reclaim)
>> +                       if (snp_reclaim_page(pfn_to_page(pfn), locked))
>> +                               return -EFAULT;
>> +
>> +               rc = rmpupdate(pfn_to_page(pfn), val);
>> +               if (rc)
>> +                       return rc;
>> +
>> +               pfn++;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static struct page *__snp_alloc_firmware_pages(gfp_t gfp_mask, int order, bool locked)
>> +{
>> +       struct rmpupdate val = {};
> 
> `{}` -> `{0}`? (Not sure, see my previous comment.)
> 
>> +       unsigned long paddr;
>> +       struct page *page;
>> +
>> +       page = alloc_pages(gfp_mask, order);
>> +       if (!page)
>> +               return NULL;
>> +
>> +       val.assigned = 1;
>> +       val.immutable = 1;
>> +       paddr = __pa((unsigned long)page_address(page));
>> +
>> +       if (snp_set_rmptable_state(paddr, 1 << order, &val, locked, false)) {
>> +               pr_warn("Failed to set page state (leaking it)\n");
> 
> Maybe `WARN_ONCE` instead of `pr_warn`? It's both a big attention
> grabber and also rate limited.

Noted.

> 
>> +               return NULL;
>> +       }
>> +
>> +       return page;
>> +}
>> +
>> +void *snp_alloc_firmware_page(gfp_t gfp_mask)
>> +{
>> +       struct page *page;
>> +
>> +       page = __snp_alloc_firmware_pages(gfp_mask, 0, false);
>> +
>> +       return page ? page_address(page) : NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(snp_alloc_firmware_page);
>>
>> +static void __snp_free_firmware_pages(struct page *page, int order, bool locked)
>> +{
>> +       struct rmpupdate val = {};
> 
> `{}` -> `{0}`? (Not sure, see my previous comment.)
> 
>> +       unsigned long paddr;
>> +
>> +       if (!page)
>> +               return;
>> +
>> +       paddr = __pa((unsigned long)page_address(page));
>> +
>> +       if (snp_set_rmptable_state(paddr, 1 << order, &val, locked, true)) {
>> +               pr_warn("Failed to set page state (leaking it)\n");
> 
> WARN_ONCE?

Noted.

thanks


* Re: [PATCH Part2 RFC v4 20/40] KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
  2021-07-14 13:35   ` Marc Orr
@ 2021-07-14 16:47     ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-14 16:47 UTC (permalink / raw)
  To: Marc Orr
  Cc: brijesh.singh, x86, linux-kernel, kvm list, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh



On 7/14/21 8:35 AM, Marc Orr wrote:
>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> 
> Co-developed-by: Marc Orr <marcorr@google.com>
> 
Thank you for adding it; it was dropped accidentally.

-Brijesh


* Re: [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  2021-07-14 16:45     ` Brijesh Singh
@ 2021-07-14 18:14       ` Marc Orr
  0 siblings, 0 replies; 176+ messages in thread
From: Marc Orr @ 2021-07-14 18:14 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm list, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Andy Lutomirski, Dave Hansen, Sergio Lopez,
	Peter Gonda, Peter Zijlstra, Srinivas Pandruvada, David Rientjes,
	Dov Murik, Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh,
	Alper Gun

> > Should this return a non-zero value -- maybe `-ENODEV`? Otherwise, the
> > `snp_alloc_firmware_page()` API will return a page that the caller
> > believes is suitable to use with FW. My concern is that someone
> > decides to use this API to stash a page very early on during kernel
> > boot and that page becomes a time bomb.
>
> But that means the caller now needs to know that SNP is enabled before
> calling the APIs. The idea behind the API was that the caller does not
> need to know whether the firmware is in the INIT state. If the firmware
> has initialized SNP, then it will transparently set the immutable bit in
> the RMP table.

For SNP, isn't that already the case? There are three scenarios:

#1: The PSP driver is loaded and `snp_inited` is `true`: These returns
are never hit.

#2: The PSP driver is not loaded. The first return, `!psp ||
!psp->sev_data` fires. As written, it returns `0`, indicating success.
However, we never called RMPUPDATE on the page. Thus, later, when the
PSP driver is loaded, the page that was previously returned as usable
with FW is in fact not usable with FW. Unless SNP is disabled (e.g.,
SEV, SEV-ES only). In which case I guess the page is OK.

#3 The PSP driver is loaded but the SNP_INIT command has not been
issued. Looking at this again, I guess `return 0` is OK. Because if we
got this far, then `sev_pci_init()` has been called, and the SNP_INIT
command has been issued if we're supporting SNP VMs.

So in summary, I think we should change the first return to return an
error and leave the 2nd return as is.

> > If we initialize `rc` to `-ENODEV` (or something similar), then every
> > return in this function can be `return rc`.
> >
> >> +
> >> +       /* If SEV-SNP is initialized then add the page in RMP table. */
> >> +       sev = psp->sev_data;
> >> +       if (!sev->snp_inited)
> >> +               return 0;
> >
> > Ditto. Should this turn a non-zero value?
> >
> >> +
> >> +       while (pfn < pfn_end) {
> >> +               if (need_reclaim)
> >> +                       if (snp_reclaim_page(pfn_to_page(pfn), locked))
> >> +                               return -EFAULT;
> >> +
> >> +               rc = rmpupdate(pfn_to_page(pfn), val);
> >> +               if (rc)
> >> +                       return rc;
> >> +
> >> +               pfn++;
> >> +       }
> >> +
> >> +       return 0;
> >> +}


* Re: [PATCH Part2 RFC v4 01/40] KVM: SVM: Add support to handle AP reset MSR protocol
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 01/40] KVM: SVM: Add support to handle AP reset MSR protocol Brijesh Singh
@ 2021-07-14 20:17   ` Sean Christopherson
  2021-07-15  7:39     ` Joerg Roedel
  2021-07-15 13:42     ` Tom Lendacky
  0 siblings, 2 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-14 20:17 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> From: Tom Lendacky <thomas.lendacky@amd.com>
> 
> Add support for AP Reset Hold being invoked using the GHCB MSR protocol,
> available in version 2 of the GHCB specification.

Please provide a brief overview of the protocol, and why it's needed.  I assume
it's to allow AP wakeup without a shared GHCB?

> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---

...

>  static u8 sev_enc_bit;
>  static DECLARE_RWSEM(sev_deactivate_lock);
>  static DEFINE_MUTEX(sev_bitmap_lock);
> @@ -2199,6 +2203,9 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
>  
>  void sev_es_unmap_ghcb(struct vcpu_svm *svm)
>  {
> +	/* Clear any indication that the vCPU is in a type of AP Reset Hold */
> +	svm->ap_reset_hold_type = AP_RESET_HOLD_NONE;
> +
>  	if (!svm->ghcb)
>  		return;
>  
> @@ -2404,6 +2411,22 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>  				  GHCB_MSR_INFO_POS);
>  		break;
>  	}
> +	case GHCB_MSR_AP_RESET_HOLD_REQ:
> +		svm->ap_reset_hold_type = AP_RESET_HOLD_MSR_PROTO;
> +		ret = kvm_emulate_ap_reset_hold(&svm->vcpu);

The hold type feels like it should be a param to kvm_emulate_ap_reset_hold().

> +
> +		/*
> +		 * Preset the result to a non-SIPI return and then only set
> +		 * the result to non-zero when delivering a SIPI.
> +		 */
> +		set_ghcb_msr_bits(svm, 0,
> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
> +
> +		set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
> +				  GHCB_MSR_INFO_MASK,
> +				  GHCB_MSR_INFO_POS);

It looks like all uses set an arbitrary value and then the response.  I think
folding the response into the helper would improve both readability and robustness.
I also suspect the helper needs to do WRITE_ONCE() to guarantee the guest sees
what it's supposed to see, though memory ordering is not my strong suit.

Might even be able to squeeze in a build-time assertion.

Also, do the guest-provided contents actually need to be preserved?  That seems
somewhat odd.

E.g. can it be

static void set_ghcb_msr_response(struct vcpu_svm *svm, u64 response, u64 value,
				  u64 mask, unsigned int pos)
{
	u64 val = (response << GHCB_MSR_INFO_POS) | ((value & mask) << pos);

	WRITE_ONCE(svm->vmcb->control.ghcb_gpa, val);
}

and

		set_ghcb_msr_response(svm, GHCB_MSR_AP_RESET_HOLD_RESP, 0,
				      GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
				      GHCB_MSR_AP_RESET_HOLD_RESULT_POS);

> +		break;
>  	case GHCB_MSR_TERM_REQ: {
>  		u64 reason_set, reason_code;
>  
> @@ -2491,6 +2514,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
>  		ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_IRET);
>  		break;
>  	case SVM_VMGEXIT_AP_HLT_LOOP:
> +		svm->ap_reset_hold_type = AP_RESET_HOLD_NAE_EVENT;
>  		ret = kvm_emulate_ap_reset_hold(vcpu);
>  		break;
>  	case SVM_VMGEXIT_AP_JUMP_TABLE: {
> @@ -2628,13 +2652,29 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>  		return;
>  	}
>  
> -	/*
> -	 * Subsequent SIPI: Return from an AP Reset Hold VMGEXIT, where
> -	 * the guest will set the CS and RIP. Set SW_EXIT_INFO_2 to a
> -	 * non-zero value.
> -	 */
> -	if (!svm->ghcb)
> -		return;
> +	/* Subsequent SIPI */
> +	switch (svm->ap_reset_hold_type) {
> +	case AP_RESET_HOLD_NAE_EVENT:
> +		/*
> +		 * Return from an AP Reset Hold VMGEXIT, where the guest will
> +		 * set the CS and RIP. Set SW_EXIT_INFO_2 to a non-zero value.
> +		 */
> +		ghcb_set_sw_exit_info_2(svm->ghcb, 1);
> +		break;
> +	case AP_RESET_HOLD_MSR_PROTO:
> +		/*
> +		 * Return from an AP Reset Hold VMGEXIT, where the guest will
> +		 * set the CS and RIP. Set GHCB data field to a non-zero value.
> +		 */
> +		set_ghcb_msr_bits(svm, 1,
> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
>  
> -	ghcb_set_sw_exit_info_2(svm->ghcb, 1);
> +		set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
> +				  GHCB_MSR_INFO_MASK,
> +				  GHCB_MSR_INFO_POS);
> +		break;
> +	default:
> +		break;
> +	}
>  }
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 0b89aee51b74..ad12ca26b2d8 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -174,6 +174,7 @@ struct vcpu_svm {
>  	struct ghcb *ghcb;
>  	struct kvm_host_map ghcb_map;
>  	bool received_first_sipi;
> +	unsigned int ap_reset_hold_type;

Can't this be a u8?

>  
>  	/* SEV-ES scratch area support */
>  	void *ghcb_sa;
> -- 
> 2.17.1
> 


* Re: [PATCH Part2 RFC v4 02/40] KVM: SVM: Provide the Hypervisor Feature support VMGEXIT
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 02/40] KVM: SVM: Provide the Hypervisor Feature support VMGEXIT Brijesh Singh
@ 2021-07-14 20:37   ` Sean Christopherson
  2021-07-14 21:00     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-14 20:37 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> Version 2 of the GHCB specification introduced advertisement of features
> that are supported by the Hypervisor.
> 
> Now that KVM supports version 2 of the GHCB specification, bump the
> maximum supported protocol version.

Heh, the changelog doesn't actually state that it's adding support for said
advertisement of features.  It took me a few seconds to figure out what the
patch was doing, even though it's quite trivial in the end.

> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/include/uapi/asm/svm.h |  4 ++--
>  arch/x86/kvm/svm/sev.c          | 14 ++++++++++++++
>  arch/x86/kvm/svm/svm.h          |  3 ++-
>  3 files changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
> index 9aaf0ab386ef..ba4137abf012 100644
> --- a/arch/x86/include/uapi/asm/svm.h
> +++ b/arch/x86/include/uapi/asm/svm.h
> @@ -115,7 +115,7 @@
>  #define SVM_VMGEXIT_AP_CREATE_ON_INIT		0
>  #define SVM_VMGEXIT_AP_CREATE			1
>  #define SVM_VMGEXIT_AP_DESTROY			2
> -#define SVM_VMGEXIT_HYPERVISOR_FEATURES		0x8000fffd
> +#define SVM_VMGEXIT_HV_FT			0x8000fffd

This is fixing up commit 3 from Part1, though I think it can and should be
omitted from that patch entirely since it's not relevant to the guest, only to
KVM.

And FWIW, I like the verbose name, though it looks like Boris requested the
shorter names for the guest.  Can we keep the verbose form for the KVM-only
VMGEXIT name?  Hyper-V has mostly laid claim to "HV", and "feature" is not the
first thing that comes to mind for "FT".

>  #define SVM_VMGEXIT_UNSUPPORTED_EVENT		0x8000ffff


* Re: [PATCH Part2 RFC v4 02/40] KVM: SVM: Provide the Hypervisor Feature support VMGEXIT
  2021-07-14 20:37   ` Sean Christopherson
@ 2021-07-14 21:00     ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-14 21:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/14/21 3:37 PM, Sean Christopherson wrote:
>> +#define SVM_VMGEXIT_HV_FT			0x8000fffd
> 
> This is fixing up commit 3 from Part1, though I think it can and should be
> omitted from that patch entirely since it's not relevant to the guest, only to
> KVM.

Yes, one of the things I was struggling with is that the header files 
between kvm/queue and tip/master were not in sync. I had to do some 
cherry-picks to keep my part2 building. I hope this will get addressed 
in the next rebase.

> 
> And FWIW, I like the verbose name, though it looks like Boris requested the
> shorter names for the guest.  Can we keep the verbose form for the KVM-only
> VMGEXIT name?  Hyper-V has mostly laid claim to "HV", and "feature" is not
> the first thing that comes to mind for "FT".
> 

For the uapi/asm/svm.h, I can stick with the verbose name.

thanks


* Re: [PATCH Part2 RFC v4 04/40] x86/sev: Add the host SEV-SNP initialization support
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 04/40] x86/sev: Add the host SEV-SNP initialization support Brijesh Singh
@ 2021-07-14 21:07   ` Sean Christopherson
  2021-07-14 22:02     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-14 21:07 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index aa7e37631447..f9d813d498fa 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -24,6 +24,8 @@
>  #include <linux/sev-guest.h>
>  #include <linux/platform_device.h>
>  #include <linux/io.h>
> +#include <linux/io.h>
> +#include <linux/iommu.h>
>  
>  #include <asm/cpu_entry_area.h>
>  #include <asm/stacktrace.h>
> @@ -40,11 +42,14 @@
>  #include <asm/efi.h>
>  #include <asm/cpuid-indexed.h>
>  #include <asm/setup.h>
> +#include <asm/iommu.h>
>  
>  #include "sev-internal.h"
>  
>  #define DR7_RESET_VALUE        0x400
>  
> +#define RMPTABLE_ENTRIES_OFFSET        0x4000

A comment and/or blurb in the changelog describing this magic number would be
quite helpful.  And maybe call out that this is for the bookkeeping, e.g.

  #define RMPTABLE_CPU_BOOKKEEPING_SIZE	0x4000

Also, the APM doesn't actually state the exact location of the bookkeeping
region, it only states that it's somewhere between RMP_BASE and RMP_END.  This
seems to imply that the bookkeeping region is always at RMP_BASE?

  The region of memory between RMP_BASE and RMP_END contains a 16KB region used
  for processor bookkeeping followed by the RMP entries, which are each 16B in
  size. The size of the RMP determines the range of physical memory that the
  hypervisor can assign to SNP-active virtual machines at runtime. The RMP covers
  the system physical address space from address 0h to the address calculated by:

  ((RMP_END + 1 – RMP_BASE – 16KB) / 16B) x 4KB

>  /* For early boot hypervisor communication in SEV-ES enabled guests */
>  static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>  
> @@ -56,6 +61,9 @@ static struct ghcb __initdata *boot_ghcb;
>  
>  static u64 snp_secrets_phys;
>  
> +static unsigned long rmptable_start __ro_after_init;
> +static unsigned long rmptable_end __ro_after_init;
> +
>  /* #VC handler runtime per-CPU data */
>  struct sev_es_runtime_data {
>  	struct ghcb ghcb_page;
> @@ -2176,3 +2184,138 @@ static int __init add_snp_guest_request(void)
>  	return 0;
>  }
>  device_initcall(add_snp_guest_request);
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt)	"SEV-SNP: " fmt
> +
> +static int __snp_enable(unsigned int cpu)
> +{
> +	u64 val;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> +		return 0;
> +
> +	rdmsrl(MSR_AMD64_SYSCFG, val);
> +
> +	val |= MSR_AMD64_SYSCFG_SNP_EN;
> +	val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;

Is VMPL required?  Do we plan on using VMPL out of the gate?

> +
> +	wrmsrl(MSR_AMD64_SYSCFG, val);
> +
> +	return 0;
> +}
> +
> +static __init void snp_enable(void *arg)
> +{
> +	__snp_enable(smp_processor_id());
> +}
> +
> +static bool get_rmptable_info(u64 *start, u64 *len)
> +{
> +	u64 calc_rmp_sz, rmp_sz, rmp_base, rmp_end, nr_pages;
> +
> +	rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
> +	rdmsrl(MSR_AMD64_RMP_END, rmp_end);
> +
> +	if (!rmp_base || !rmp_end) {

Can BIOS put the RMP at PA=0?

Also, why is it a BIOS decision?  AFAICT, the MSRs aren't locked until SNP_EN
is set in SYSCFG, and that appears to be a kernel decision (ignoring kexec),
i.e. nothing would prevent the kernel from configuring its own RMP.

> +		pr_info("Memory for the RMP table has not been reserved by BIOS\n");
> +		return false;
> +	}
> +
> +	rmp_sz = rmp_end - rmp_base + 1;
> +
> +	/*
> +	 * Calculate the amount the memory that must be reserved by the BIOS to
> +	 * address the full system RAM. The reserved memory should also cover the
> +	 * RMP table itself.
> +	 *
> +	 * See PPR section 2.1.5.2 for more information on memory requirement.
> +	 */
> +	nr_pages = totalram_pages();
> +	calc_rmp_sz = (((rmp_sz >> PAGE_SHIFT) + nr_pages) << 4) + RMPTABLE_ENTRIES_OFFSET;
> +
> +	if (calc_rmp_sz > rmp_sz) {
> +		pr_info("Memory reserved for the RMP table does not cover the full system "
> +			"RAM (expected 0x%llx got 0x%llx)\n", calc_rmp_sz, rmp_sz);

Is BIOS expected to provide exact coverage, e.g. should this be s/expected/need?

Should the kernel also sanity check other requirements, e.g. the 8kb alignment,
or does the CPU enforce those things at WRMSR?

> +		return false;
> +	}
> +
> +	*start = rmp_base;
> +	*len = rmp_sz;
> +
> +	pr_info("RMP table physical address 0x%016llx - 0x%016llx\n", rmp_base, rmp_end);
> +
> +	return true;
> +}
> +
> +static __init int __snp_rmptable_init(void)
> +{
> +	u64 rmp_base, sz;
> +	void *start;
> +	u64 val;
> +
> +	if (!get_rmptable_info(&rmp_base, &sz))
> +		return 1;
> +
> +	start = memremap(rmp_base, sz, MEMREMAP_WB);
> +	if (!start) {
> +		pr_err("Failed to map RMP table 0x%llx+0x%llx\n", rmp_base, sz);
> +		return 1;
> +	}
> +
> +	/*
> +	 * Check if SEV-SNP is already enabled, this can happen if we are coming from
> +	 * kexec boot.
> +	 */
> +	rdmsrl(MSR_AMD64_SYSCFG, val);
> +	if (val & MSR_AMD64_SYSCFG_SNP_EN)

Hmm, it kinda feels like there should be a sanity check for the case where SNP is
already enabled but get_rmptable_info() fails, e.g. due to insufficient RMP size.

> +		goto skip_enable;
> +
> +	/* Initialize the RMP table to zero */
> +	memset(start, 0, sz);
> +
> +	/* Flush the caches to ensure that data is written before SNP is enabled. */
> +	wbinvd_on_all_cpus();
> +
> +	/* Enable SNP on all CPUs. */
> +	on_each_cpu(snp_enable, NULL, 1);
> +
> +skip_enable:
> +	rmptable_start = (unsigned long)start;

Mostly out of curiosity, why store start/end as unsigned longs?  This is all 64-bit
only so it doesn't actually affect the code generation, but it feels odd to store
things that absolutely have to be 64-bit values as unsigned long.

Similar question for why asm/sev-common.h casts to unsigned long instead of u64.
E.g. the below in particular looks wrong because we're shifting an unsigned long
by 32 bits, i.e. the value _must_ be a 64-bit value, why obfuscate that?

	#define GHCB_CPUID_REQ(fn, reg)		\
		(GHCB_MSR_CPUID_REQ | \
		(((unsigned long)reg & GHCB_MSR_CPUID_REG_MASK) << GHCB_MSR_CPUID_REG_POS) | \
		(((unsigned long)fn) << GHCB_MSR_CPUID_FUNC_POS))

> +	rmptable_end = rmptable_start + sz;
> +
> +	return 0;
> +}
> +
> +static int __init snp_rmptable_init(void)
> +{
> +	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
> +		return 0;
> +
> +	/*
> +	 * The SEV-SNP support requires that IOMMU must be enabled, and is not
> +	 * configured in the passthrough mode.
> +	 */
> +	if (no_iommu || iommu_default_passthrough()) {

Similar comment regarding the sanity check, kexec'ing into a kernel with SNP
already enabled should probably fail explicitly if the new kernel is booted with
incompatible params.

> +		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
> +		pr_err("IOMMU is either disabled or configured in passthrough mode.\n");
> +		return 0;
> +	}
> +
> +	if (__snp_rmptable_init()) {
> +		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
> +		return 1;
> +	}
> +
> +	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
> +
> +	return 0;
> +}
> +
> +/*
> + * This must be called after the PCI subsystem. This is because before enabling
> + * the SNP feature we need to ensure that IOMMU is not configured in the
> + * passthrough mode. The iommu_default_passthrough() is used for checking the
> + * passthrough state, and it is available after subsys_initcall().
> + */
> +fs_initcall(snp_rmptable_init);
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 04/40] x86/sev: Add the host SEV-SNP initialization support
  2021-07-14 21:07   ` Sean Christopherson
@ 2021-07-14 22:02     ` Brijesh Singh
  2021-07-14 22:06       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-14 22:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/14/21 4:07 PM, Sean Christopherson wrote:
>>   
>> +#define RMPTABLE_ENTRIES_OFFSET        0x4000
> 
> A comment and/or blurb in the changelog describing this magic number would be
> quite helpful.  And maybe call out that this is for the bookkeeping, e.g.
> 
>    #define RMPTABLE_CPU_BOOKKEEPING_SIZE	0x4000

Noted.

> 
> Also, the APM doesn't actually state the exact location of the bookkeeping
> region, it only states that it's somewhere between RMP_BASE and RMP_END.  This
> seems to imply that the bookkeeping region is always at RMP_BASE?
> 
>    The region of memory between RMP_BASE and RMP_END contains a 16KB region used
>    for processor bookkeeping followed by the RMP entries, which are each 16B in
>    size. The size of the RMP determines the range of physical memory that the
>    hypervisor can assign to SNP-active virtual machines at runtime. The RMP covers
>    the system physical address space from address 0h to the address calculated by:
> 
>    ((RMP_END + 1 – RMP_BASE – 16KB) / 16B) x 4KB
> 

The bookkeeping region is at the start of RMP_BASE. If we look at the 
PPR, it provides a formula which we should use to read the RMP entry 
location, and in it the bookkeeping size is added to RMP_BASE:

       RMP Entry Address = RMP_BASE + 0x4000 + x>>8


>> +
>> +	val |= MSR_AMD64_SYSCFG_SNP_EN;
>> +	val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;
> 
> Is VMPL required?  Do we plan on using VMPL out of the gate?
> 

The SEV-SNP firmware requires that VMPL be enabled, otherwise it will 
fail to initialize. However, the current SEV-SNP support is limited to 
VMPL0.

> 
> Can BIOS put the RMP at PA=0?

No, they should not. Per the PPR, 0h is the reset value (meaning the 
MSR has not been programmed).

> 
> Also, why is it a BIOS decision?  AFAICT, the MSRs aren't locked until SNP_EN
> is set in SYSCFG, and that appears to be a kernel decision (ignoring kexec),
> i.e. nothing would prevent the kernel from configuring its own RMP.

In the current patch set, we assume that the user is configuring the 
BIOS to reserve memory for the RMP table. From the hardware point of 
view, it does not matter who reserves the memory (BIOS or kernel). In 
the future, we could look into reserving the memory from the kernel 
through memblock, etc.

> 
>> +		pr_info("Memory for the RMP table has not been reserved by BIOS\n");
>> +		return false;
>> +	}
>> +
>> +	rmp_sz = rmp_end - rmp_base + 1;
>> +
>> +	/*
>> +	 * Calculate the amount the memory that must be reserved by the BIOS to
>> +	 * address the full system RAM. The reserved memory should also cover the
>> +	 * RMP table itself.
>> +	 *
>> +	 * See PPR section 2.1.5.2 for more information on memory requirement.
>> +	 */
>> +	nr_pages = totalram_pages();
>> +	calc_rmp_sz = (((rmp_sz >> PAGE_SHIFT) + nr_pages) << 4) + RMPTABLE_ENTRIES_OFFSET;
>> +
>> +	if (calc_rmp_sz > rmp_sz) {
>> +		pr_info("Memory reserved for the RMP table does not cover the full system "
>> +			"RAM (expected 0x%llx got 0x%llx)\n", calc_rmp_sz, rmp_sz);
> 
> Is BIOS expected to provide exact coverage, e.g. should this be s/expected/need?
> 

The BIOS provides an option to reserve the required memory. If it 
doesn't cover the entire system RAM then it's a BIOS bug.

Yes, I will fix the wording s/expected/need.

To make things interesting, the BIOS also has an option where the user 
can specify the amount of memory to be reserved. If the user does not 
cover the full system RAM then we need to warn and not enable SNP. We 
cannot work with a partially reserved RMP table.


> Should the kernel also sanity check other requirements, e.g. the 8kb alignment,
> or does the CPU enforce those things at WRMSR?
> 

The SNP firmware enforces those requirements. This is documented in the 
SNP firmware specification (SNP_INIT).



>> +
>> +	/*
>> +	 * Check if SEV-SNP is already enabled, this can happen if we are coming from
>> +	 * kexec boot.
>> +	 */
>> +	rdmsrl(MSR_AMD64_SYSCFG, val);
>> +	if (val & MSR_AMD64_SYSCFG_SNP_EN)
> 
> Hmm, it kinda feels like there should be a sanity check for the case where SNP is
> already enabled but get_rmptable_info() fails, e.g. due to insufficient RMP size.
> 

Hmm, I am not sure we need to do this. We enable SNP only after all the 
sanity checks have completed, so get_rmptable_info() will not fail 
after SNP is enabled. The RMP MSRs are locked after SNP is enabled, so 
we should not see a different size.


>> +		goto skip_enable;
>> +
>> +	/* Initialize the RMP table to zero */
>> +	memset(start, 0, sz);
>> +
>> +	/* Flush the caches to ensure that data is written before SNP is enabled. */
>> +	wbinvd_on_all_cpus();
>> +
>> +	/* Enable SNP on all CPUs. */
>> +	on_each_cpu(snp_enable, NULL, 1);
>> +
>> +skip_enable:
>> +	rmptable_start = (unsigned long)start;
> 
> Mostly out of curiosity, why store start/end as unsigned longs?  This is all 64-bit
> only so it doesn't actually affect the code generation, but it feels odd to store
> things that absolutely have to be 64-bit values as unsigned long.
> 

The AMD memory encryption support is only compiled when 64-bit is 
enabled in Kconfig. Having said that, I am okay with using u64.


> Similar question for why asm/sev-common.h casts to unsigned long instead of u64.
> E.g. the below in particular looks wrong because we're shifting an unsigned long
> by 32 bits, i.e. the value _must_ be a 64-bit value, why obfuscate that?
> 
> 	#define GHCB_CPUID_REQ(fn, reg)		\
> 		(GHCB_MSR_CPUID_REQ | \
> 		(((unsigned long)reg & GHCB_MSR_CPUID_REG_MASK) << GHCB_MSR_CPUID_REG_POS) | \
> 		(((unsigned long)fn) << GHCB_MSR_CPUID_FUNC_POS))
> 
>> +	rmptable_end = rmptable_start + sz;
>> +
>> +	return 0;
>> +}
>> +
>> +static int __init snp_rmptable_init(void)
>> +{
>> +	if (!boot_cpu_has(X86_FEATURE_SEV_SNP))
>> +		return 0;
>> +
>> +	/*
>> +	 * The SEV-SNP support requires that IOMMU must be enabled, and is not
>> +	 * configured in the passthrough mode.
>> +	 */
>> +	if (no_iommu || iommu_default_passthrough()) {
> 
> Similar comment regarding the sanity check, kexec'ing into a kernel with SNP
> already enabled should probably fail explicitly if the new kernel is booted with
> incompatible params.

Good point on the kexec, I'll look to cover it.

thanks


* Re: [PATCH Part2 RFC v4 04/40] x86/sev: Add the host SEV-SNP initialization support
  2021-07-14 22:02     ` Brijesh Singh
@ 2021-07-14 22:06       ` Sean Christopherson
  2021-07-14 22:11         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-14 22:06 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 14, 2021, Brijesh Singh wrote:
> The bookkeeping region is at the start of RMP_BASE. If we look at the
> PPR, it provides a formula which we should use to read the RMP entry

What's the PPR?  I get the feeling I'm missing a spec :-)


* Re: [PATCH Part2 RFC v4 04/40] x86/sev: Add the host SEV-SNP initialization support
  2021-07-14 22:06       ` Sean Christopherson
@ 2021-07-14 22:11         ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-14 22:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/14/21 5:06 PM, Sean Christopherson wrote:
> What's the PPR?  I get the feeling I'm missing a spec :-)

My bad, I should have provided the link in my previous response

Processor Programming Reference (PPR) for AMD Family 19h Model 01h, 
Revision B1 Processors

https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip

look for the PPR_B1_PUB_1.pdf for RMP entry details.

SEV-SNP firmware spec is at developer.amd.com/sev
https://www.amd.com/system/files/TechDocs/56860.pdf

thanks


* Re: [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table Brijesh Singh
@ 2021-07-14 22:25   ` Sean Christopherson
  2021-07-15 17:05     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-14 22:25 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> The integrity guarantee of SEV-SNP is enforced through the RMP table.
> The RMP is used in conjunction with standard x86 and IOMMU page
> tables to enforce memory restrictions and page access rights. The
> RMP is indexed by system physical address, and is checked at the end
> of CPU and IOMMU table walks. The RMP check is enforced as soon as
> SEV-SNP is enabled globally in the system. Not every memory access
> requires an RMP check. In particular, the read accesses from the
> hypervisor do not require RMP checks because the data confidentiality
> is already protected via memory encryption. When hardware encounters
> an RMP check failure, it raises a page-fault exception. The RMP bit in
> the fault error code can be used to determine if the fault was due to an
> RMP check failure.
> 
> A write from the hypervisor goes through the RMP checks. When the
> hypervisor writes to pages, hardware checks to ensure that the assigned
> bit in the RMP is zero (i.e. the page is shared). If the page table entry that
> gives the sPA indicates that the target page size is a large page, then
> all RMP entries for the 4KB constituent pages of the target must have the
> assigned bit 0. If one of the entries does not have assigned bit 0, then hardware
> will raise an RMP violation. To resolve it, split the page table entry
> leading to target page into 4K.

Isn't the above just saying:

  All RMP entries covered by a large page must match the shared vs. encrypted
  state of the page, e.g. host large pages must have assigned=0 for all relevant
  RMP entries.

> This poses a challenge in the Linux memory model. The Linux kernel
> creates a direct mapping of all the physical memory -- referred to as
> the physmap. The physmap may contain a valid mapping of guest owned pages.
> During the page table walk, the host access may get into the situation
> where one of the pages within the large page is owned by the guest (i.e.
> the assigned bit is set in the RMP). A write to a non-guest page within the
> large page will raise an RMP violation. Call set_memory_4k() to split the physmap
> before adding the page in the RMP table. This ensures that the pages
> added in the RMP table are used as 4K in the physmap.
> 
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/kernel/sev.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 949efe530319..a482e01f880a 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -2375,6 +2375,12 @@ int rmpupdate(struct page *page, struct rmpupdate *val)
>  	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>  		return -ENXIO;
>  
> +	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);

IIUC, this shatters the direct map for page that's assigned to an SNP guest, and
the large pages are never recovered?

I believe a better approach would be to do something similar to memfd_secret[*],
which encountered a similar problem with the direct map.  Instead of forcing the
direct map to be forever 4k, unmap the direct map when making a page guest private,
and restore the direct map when it's made shared (or freed).

I thought memfd_secret had also solved the problem of restoring large pages in
the direct map, but at a glance I can't tell if that's actually implemented
anywhere.  But, even if it's not currently implemented, I think it makes sense
to mimic the memfd_secret approach so that both features can benefit if large
page preservation/restoration is ever added.

[*] https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org

> +	if (ret) {
> +		pr_err("Failed to split physical address 0x%lx (%d)\n", spa, ret);
> +		return ret;
> +	}
> +
>  	/* Retry if another processor is modifying the RMP entry. */
>  	do {
>  		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
> -- 
> 2.17.1
> 


* Re: [PATCH Part2 RFC v4 01/40] KVM: SVM: Add support to handle AP reset MSR protocol
  2021-07-14 20:17   ` Sean Christopherson
@ 2021-07-15  7:39     ` Joerg Roedel
  2021-07-15 13:42     ` Tom Lendacky
  1 sibling, 0 replies; 176+ messages in thread
From: Joerg Roedel @ 2021-07-15  7:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Tom Lendacky, H. Peter Anvin,
	Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Andy Lutomirski, Dave Hansen, Sergio Lopez,
	Peter Gonda, Peter Zijlstra, Srinivas Pandruvada, David Rientjes,
	Dov Murik, Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 14, 2021 at 08:17:29PM +0000, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
> > From: Tom Lendacky <thomas.lendacky@amd.com>
> > 
> > Add support for AP Reset Hold being invoked using the GHCB MSR protocol,
> > available in version 2 of the GHCB specification.
> 
> Please provide a brief overview of the protocol, and why it's needed.  I assume
> it's to allow AP wakeup without a shared GHCB?

Yes, this is needed for SEV-ES kexec support to park APs without the
need for memory that will be owned by the new kernel when APs are woken
up.

You can have a look into my SEV-ES kexec/kdump patch-set for details:

	https://lore.kernel.org/lkml/20210705082443.14721-1-joro@8bytes.org/

I also sent this patch separately earlier this week to enable GHCB
protocol version 2 support in KVM.

Regards,

	Joerg


* Re: [PATCH Part2 RFC v4 01/40] KVM: SVM: Add support to handle AP reset MSR protocol
  2021-07-14 20:17   ` Sean Christopherson
  2021-07-15  7:39     ` Joerg Roedel
@ 2021-07-15 13:42     ` Tom Lendacky
  2021-07-15 15:45       ` Sean Christopherson
  1 sibling, 1 reply; 176+ messages in thread
From: Tom Lendacky @ 2021-07-15 13:42 UTC (permalink / raw)
  To: Sean Christopherson, Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Andy Lutomirski,
	Dave Hansen, Sergio Lopez, Peter Gonda, Peter Zijlstra,
	Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/14/21 3:17 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> From: Tom Lendacky <thomas.lendacky@amd.com>
>>
>> Add support for AP Reset Hold being invoked using the GHCB MSR protocol,
>> available in version 2 of the GHCB specification.
> 
> Please provide a brief overview of the protocol, and why it's needed.  I assume
> it's to allow AP wakeup without a shared GHCB?

Right, mainly the ability to issue an AP reset hold from a mode that
cannot access the GHCB as a shared page, e.g. 32-bit mode without paging
enabled, where reads/writes are always encrypted for an SEV guest.

> 
>> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> ---
> 
> ...
> 
>>  static u8 sev_enc_bit;
>>  static DECLARE_RWSEM(sev_deactivate_lock);
>>  static DEFINE_MUTEX(sev_bitmap_lock);
>> @@ -2199,6 +2203,9 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
>>  
>>  void sev_es_unmap_ghcb(struct vcpu_svm *svm)
>>  {
>> +	/* Clear any indication that the vCPU is in a type of AP Reset Hold */
>> +	svm->ap_reset_hold_type = AP_RESET_HOLD_NONE;
>> +
>>  	if (!svm->ghcb)
>>  		return;
>>  
>> @@ -2404,6 +2411,22 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>>  				  GHCB_MSR_INFO_POS);
>>  		break;
>>  	}
>> +	case GHCB_MSR_AP_RESET_HOLD_REQ:
>> +		svm->ap_reset_hold_type = AP_RESET_HOLD_MSR_PROTO;
>> +		ret = kvm_emulate_ap_reset_hold(&svm->vcpu);
> 
> The hold type feels like it should be a param to kvm_emulate_ap_reset_hold().

I suppose it could be, but then the type would have to be tracked in the
kvm_vcpu_arch struct instead of the vcpu_svm struct, so I opted for the
latter. Maybe a helper function, sev_ap_reset_hold(), that sets the type
and then calls kvm_emulate_ap_reset_hold(), but I'm not seeing a big need
for it.

> 
>> +
>> +		/*
>> +		 * Preset the result to a non-SIPI return and then only set
>> +		 * the result to non-zero when delivering a SIPI.
>> +		 */
>> +		set_ghcb_msr_bits(svm, 0,
>> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
>> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
>> +
>> +		set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
>> +				  GHCB_MSR_INFO_MASK,
>> +				  GHCB_MSR_INFO_POS);
> 
> It looks like all uses set an arbitrary value and then the response.  I think
> folding the response into the helper would improve both readability and robustness.

Joerg pulled this patch out and submitted it as part of a small, three
patch series, so it might be best to address this in general in the
SEV-SNP patches or as a follow-on series specifically for this re-work.

> I also suspect the helper needs to do WRITE_ONCE() to guarantee the guest sees
> what it's supposed to see, though memory ordering is not my strong suit.

This is writing to the VMCB that is then used to set the value of the
guest MSR. I don't see anything done in general for writes to the VMCB, so
I wouldn't think this should be any different.

> 
> Might even be able to squeeze in a build-time assertion.
> 
> Also, do the guest-provided contents actually need to be preserved?  That seems
> somewhat odd.

Hmmm... not sure I see where the guest contents are being preserved.

> 
> E.g. can it be
> 
> static void set_ghcb_msr_response(struct vcpu_svm *svm, u64 response, u64 value,
> 				  u64 mask, unsigned int pos)
> {
> 	u64 val = (response << GHCB_MSR_INFO_POS) | ((value & mask) << pos);
> 
> 	WRITE_ONCE(svm->vmcb->control.ghcb_gpa, val);
> }
> 
> and
> 
> 		set_ghcb_msr_response(svm, GHCB_MSR_AP_RESET_HOLD_RESP, 0,
> 				      GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
> 				      GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
> 
>> +		break;
>>  	case GHCB_MSR_TERM_REQ: {
>>  		u64 reason_set, reason_code;
>>  
>> @@ -2491,6 +2514,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
>>  		ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_IRET);
>>  		break;
>>  	case SVM_VMGEXIT_AP_HLT_LOOP:
>> +		svm->ap_reset_hold_type = AP_RESET_HOLD_NAE_EVENT;
>>  		ret = kvm_emulate_ap_reset_hold(vcpu);
>>  		break;
>>  	case SVM_VMGEXIT_AP_JUMP_TABLE: {
>> @@ -2628,13 +2652,29 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>>  		return;
>>  	}
>>  
>> -	/*
>> -	 * Subsequent SIPI: Return from an AP Reset Hold VMGEXIT, where
>> -	 * the guest will set the CS and RIP. Set SW_EXIT_INFO_2 to a
>> -	 * non-zero value.
>> -	 */
>> -	if (!svm->ghcb)
>> -		return;
>> +	/* Subsequent SIPI */
>> +	switch (svm->ap_reset_hold_type) {
>> +	case AP_RESET_HOLD_NAE_EVENT:
>> +		/*
>> +		 * Return from an AP Reset Hold VMGEXIT, where the guest will
>> +		 * set the CS and RIP. Set SW_EXIT_INFO_2 to a non-zero value.
>> +		 */
>> +		ghcb_set_sw_exit_info_2(svm->ghcb, 1);
>> +		break;
>> +	case AP_RESET_HOLD_MSR_PROTO:
>> +		/*
>> +		 * Return from an AP Reset Hold VMGEXIT, where the guest will
>> +		 * set the CS and RIP. Set GHCB data field to a non-zero value.
>> +		 */
>> +		set_ghcb_msr_bits(svm, 1,
>> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
>> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
>>  
>> -	ghcb_set_sw_exit_info_2(svm->ghcb, 1);
>> +		set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
>> +				  GHCB_MSR_INFO_MASK,
>> +				  GHCB_MSR_INFO_POS);
>> +		break;
>> +	default:
>> +		break;
>> +	}
>>  }
>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>> index 0b89aee51b74..ad12ca26b2d8 100644
>> --- a/arch/x86/kvm/svm/svm.h
>> +++ b/arch/x86/kvm/svm/svm.h
>> @@ -174,6 +174,7 @@ struct vcpu_svm {
>>  	struct ghcb *ghcb;
>>  	struct kvm_host_map ghcb_map;
>>  	bool received_first_sipi;
>> +	unsigned int ap_reset_hold_type;
> 
> Can't this be a u8?

Yes, it could be, maybe even an enum and let the compiler decide on the size.

Thanks,
Tom

> 
>>  
>>  	/* SEV-ES scratch area support */
>>  	void *ghcb_sa;
>> -- 
>> 2.17.1
>>


* Re: [PATCH Part2 RFC v4 01/40] KVM: SVM: Add support to handle AP reset MSR protocol
  2021-07-15 13:42     ` Tom Lendacky
@ 2021-07-15 15:45       ` Sean Christopherson
  2021-07-15 17:05         ` Tom Lendacky
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-15 15:45 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Andy Lutomirski, Dave Hansen, Sergio Lopez,
	Peter Gonda, Peter Zijlstra, Srinivas Pandruvada, David Rientjes,
	Dov Murik, Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Thu, Jul 15, 2021, Tom Lendacky wrote:
> On 7/14/21 3:17 PM, Sean Christopherson wrote:
> >> +	case GHCB_MSR_AP_RESET_HOLD_REQ:
> >> +		svm->ap_reset_hold_type = AP_RESET_HOLD_MSR_PROTO;
> >> +		ret = kvm_emulate_ap_reset_hold(&svm->vcpu);
> > 
> > The hold type feels like it should be a param to kvm_emulate_ap_reset_hold().
> 
> I suppose it could be, but then the type would have to be tracked in the
> kvm_vcpu_arch struct instead of the vcpu_svm struct, so I opted for the
> latter. Maybe a helper function, sev_ap_reset_hold(), that sets the type
> and then calls kvm_emulate_ap_reset_hold(), but I'm not seeing a big need
> for it.

Huh.  Why is kvm_emulate_ap_reset_hold() in x86.c?  That entire concept is very
much SEV specific.  And if anyone argues its not SEV specific, then the hold type
should also be considered generic, i.e. put in kvm_vcpu_arch.

> >> +
> >> +		/*
> >> +		 * Preset the result to a non-SIPI return and then only set
> >> +		 * the result to non-zero when delivering a SIPI.
> >> +		 */
> >> +		set_ghcb_msr_bits(svm, 0,
> >> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
> >> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
> >> +
> >> +		set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
> >> +				  GHCB_MSR_INFO_MASK,
> >> +				  GHCB_MSR_INFO_POS);
> > 
> > It looks like all uses set an arbitrary value and then the response.  I think
> > folding the response into the helper would improve both readability and robustness.
> 
> Joerg pulled this patch out and submitted it as part of a small, three
> patch series, so it might be best to address this in general in the
> SEV-SNP patches or as a follow-on series specifically for this re-work.
> 
> > I also suspect the helper needs to do WRITE_ONCE() to guarantee the guest sees
> > what it's supposed to see, though memory ordering is not my strong suit.
> 
> This is writing to the VMCB that is then used to set the value of the
> guest MSR. I don't see anything done in general for writes to the VMCB, so
> I wouldn't think this should be any different.

Ooooh, right.  I was thinking this was writing memory that's shared with the
guest, but this is KVM's copy of the GCHB MSR, not the GHCB itself.  Thanks!

> > Might even be able to squeeze in a build-time assertion.
> > 
> > Also, do the guest-provided contents actually need to be preserved?  That seems
> > somewhat odd.
> 
> Hmmm... not sure I see where the guest contents are being preserved.

The fact that set_ghcb_msr_bits() is a RMW flow implies _something_ is being
preserved.  And unless KVM explicitly zeros/initializes control.ghcb_gpa, the
value being preserved is the value last written by the guest.  E.g. for CPUID
emulation, KVM reads the guest-requested function and register from ghcb_gpa,
then writes back the result.  But set_ghcb_msr_bits() is a RMW on a subset of
bits, and thus it's preserving the guest's value for the bits not being written.

Unless there is an explicit need to preserve the guest value, the whole RMW thing
is unnecessary and confusing.

	case GHCB_MSR_CPUID_REQ: {
		u64 cpuid_fn, cpuid_reg, cpuid_value;

		cpuid_fn = get_ghcb_msr_bits(svm,
					     GHCB_MSR_CPUID_FUNC_MASK,
					     GHCB_MSR_CPUID_FUNC_POS);

		/* Initialize the registers needed by the CPUID intercept */
		vcpu->arch.regs[VCPU_REGS_RAX] = cpuid_fn;
		vcpu->arch.regs[VCPU_REGS_RCX] = 0;

		ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_CPUID);
		if (!ret) {
			ret = -EINVAL;
			break;
		}

		cpuid_reg = get_ghcb_msr_bits(svm,
					      GHCB_MSR_CPUID_REG_MASK,
					      GHCB_MSR_CPUID_REG_POS);
		if (cpuid_reg == 0)
			cpuid_value = vcpu->arch.regs[VCPU_REGS_RAX];
		else if (cpuid_reg == 1)
			cpuid_value = vcpu->arch.regs[VCPU_REGS_RBX];
		else if (cpuid_reg == 2)
			cpuid_value = vcpu->arch.regs[VCPU_REGS_RCX];
		else
			cpuid_value = vcpu->arch.regs[VCPU_REGS_RDX];

		set_ghcb_msr_bits(svm, cpuid_value,
				  GHCB_MSR_CPUID_VALUE_MASK,
				  GHCB_MSR_CPUID_VALUE_POS);

		set_ghcb_msr_bits(svm, GHCB_MSR_CPUID_RESP,
				  GHCB_MSR_INFO_MASK,
				  GHCB_MSR_INFO_POS);
		break;
	}

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 01/40] KVM: SVM: Add support to handle AP reset MSR protocol
  2021-07-15 15:45       ` Sean Christopherson
@ 2021-07-15 17:05         ` Tom Lendacky
  0 siblings, 0 replies; 176+ messages in thread
From: Tom Lendacky @ 2021-07-15 17:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Brijesh Singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Andy Lutomirski, Dave Hansen, Sergio Lopez,
	Peter Gonda, Peter Zijlstra, Srinivas Pandruvada, David Rientjes,
	Dov Murik, Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/15/21 10:45 AM, Sean Christopherson wrote:
> On Thu, Jul 15, 2021, Tom Lendacky wrote:
>> On 7/14/21 3:17 PM, Sean Christopherson wrote:
>>>> +	case GHCB_MSR_AP_RESET_HOLD_REQ:
>>>> +		svm->ap_reset_hold_type = AP_RESET_HOLD_MSR_PROTO;
>>>> +		ret = kvm_emulate_ap_reset_hold(&svm->vcpu);
>>>
>>> The hold type feels like it should be a param to kvm_emulate_ap_reset_hold().
>>
>> I suppose it could be, but then the type would have to be tracked in the
>> kvm_vcpu_arch struct instead of the vcpu_svm struct, so I opted for the
>> latter. Maybe a helper function, sev_ap_reset_hold(), that sets the type
>> and then calls kvm_emulate_ap_reset_hold(), but I'm not seeing a big need
>> for it.
> 
> Huh.  Why is kvm_emulate_ap_reset_hold() in x86.c?  That entire concept is very
much SEV specific.  And if anyone argues it's not SEV specific, then the hold type
> should also be considered generic, i.e. put in kvm_vcpu_arch.

That was based on review comments where it was desired that the halt be
identified as specifically from the AP reset hold vs a normal halt. So
kvm_emulate_ap_reset_hold() was created using KVM_MP_STATE_AP_RESET_HOLD
and KVM_EXIT_AP_RESET_HOLD instead of exporting a version of
kvm_vcpu_halt() with the state and reason as arguments.

If there's no objection, then I don't have any issues with moving the hold
type to kvm_vcpu_arch and adding a param to kvm_emulate_ap_reset_hold().

> 
>>>> +
>>>> +		/*
>>>> +		 * Preset the result to a non-SIPI return and then only set
>>>> +		 * the result to non-zero when delivering a SIPI.
>>>> +		 */
>>>> +		set_ghcb_msr_bits(svm, 0,
>>>> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
>>>> +				  GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
>>>> +
>>>> +		set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
>>>> +				  GHCB_MSR_INFO_MASK,
>>>> +				  GHCB_MSR_INFO_POS);
>>>
>>> It looks like all uses set an arbitrary value and then the response.  I think
>>> folding the response into the helper would improve both readability and robustness.
>>
>> Joerg pulled this patch out and submitted it as part of a small, three
>> patch series, so it might be best to address this in general in the
>> SEV-SNP patches or as a follow-on series specifically for this re-work.
>>
>>> I also suspect the helper needs to do WRITE_ONCE() to guarantee the guest sees
>>> what it's supposed to see, though memory ordering is not my strong suit.
>>
>> This is writing to the VMCB that is then used to set the value of the
>> guest MSR. I don't see anything done in general for writes to the VMCB, so
>> I wouldn't think this should be any different.
> 
> Ooooh, right.  I was thinking this was writing memory that's shared with the
> guest, but this is KVM's copy of the GHCB MSR, not the GHCB itself.  Thanks!
> 
>>> Might even be able to squeeze in a build-time assertion.
>>>
>>> Also, do the guest-provided contents actually need to be preserved?  That seems
>>> somewhat odd.
>>
>> Hmmm... not sure I see where the guest contents are being preserved.
> 
> The fact that set_ghcb_msr_bits() is a RMW flow implies _something_ is being
> preserved.  And unless KVM explicitly zeros/initializes control.ghcb_gpa, the
> value being preserved is the value last written by the guest.  E.g. for CPUID
> emulation, KVM reads the guest-requested function and register from ghcb_gpa,
> then writes back the result.  But set_ghcb_msr_bits() is a RMW on a subset of
> bits, and thus it's preserving the guest's value for the bits not being written.

Yes, set_ghcb_msr_bits() is a RMW helper, but the intent was to set every
bit. So for CPUID, I missed setting the reserved area to 0. There wouldn't
be an issue initializing the whole field to zero once everything has been
pulled out for the MSR protocol function being invoked.

> 
> Unless there is an explicit need to preserve the guest value, the whole RMW thing
> is unnecessary and confusing.

I guess it depends on who's reading the code. I don't find it confusing,
which is probably why I implemented it that way :) But, yes, it certainly
can be changed to create the result and then have a single function that
combines the result and response code and sets the ghcb_gpa, which would
have eliminated the missed setting of the reserved area.

Thanks,
Tom

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table
  2021-07-14 22:25   ` Sean Christopherson
@ 2021-07-15 17:05     ` Brijesh Singh
  2021-07-15 17:51       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-15 17:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/14/21 5:25 PM, Sean Christopherson wrote:
>> A write from the hypervisor goes through the RMP checks. When the
>> hypervisor writes to pages, hardware checks to ensure that the assigned
>> bit in the RMP is zero (i.e. the page is shared). If the page table entry
>> that gives the sPA indicates that the target page size is a large page,
>> then all RMP entries for the constituent 4KB pages of the target must
>> have the assigned bit clear. If one of the entries does not, hardware
>> will raise an RMP violation. To resolve it, split the page table entry
>> leading to the target page into 4K.
> 
> Isn't the above just saying:
> 
>    All RMP entries covered by a large page must match the shared vs. encrypted
>    state of the page, e.g. host large pages must have assigned=0 for all relevant
>    RMP entries.
> 

Yes.


>> @@ -2375,6 +2375,12 @@ int rmpupdate(struct page *page, struct rmpupdate *val)
>>   	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>>   		return -ENXIO;
>>   
>> +	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
> 
> IIUC, this shatters the direct map for a page that's assigned to an SNP guest, and
> the large pages are never recovered?
> 
> I believe a better approach would be to do something similar to memfd_secret[*],
> which encountered a similar problem with the direct map.  Instead of forcing the
> direct map to be forever 4k, unmap the direct map when making a page guest private,
> and restore the direct map when it's made shared (or freed).
> 
> I thought memfd_secret had also solved the problem of restoring large pages in
> the direct map, but at a glance I can't tell if that's actually implemented
> anywhere.  But, even if it's not currently implemented, I think it makes sense
> to mimic the memfd_secret approach so that both features can benefit if large
> page preservation/restoration is ever added.
> 

Thanks for the memfd_secret pointer. At the lowest level it shares the
same logic to split the physmap: we both end up calling
change_page_attrs_set_clr(), which splits the page and updates the page
table attributes.

Given this, I believe that if change_page_attrs_set_clr() is enhanced in
the future to track the splitting of the pages and restore them later,
then it should work transparently.

thanks

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table
  2021-07-15 17:05     ` Brijesh Singh
@ 2021-07-15 17:51       ` Sean Christopherson
  2021-07-15 18:14         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-15 17:51 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Thu, Jul 15, 2021, Brijesh Singh wrote:
> 
> On 7/14/21 5:25 PM, Sean Christopherson wrote:
> > > @@ -2375,6 +2375,12 @@ int rmpupdate(struct page *page, struct rmpupdate *val)
> > >   	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> > >   		return -ENXIO;
> > > +	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
> > 
> > IIUC, this shatters the direct map for a page that's assigned to an SNP guest, and
> > the large pages are never recovered?
> > 
> > I believe a better approach would be to do something similar to memfd_secret[*],
> > which encountered a similar problem with the direct map.  Instead of forcing the
> > direct map to be forever 4k, unmap the direct map when making a page guest private,
> > and restore the direct map when it's made shared (or freed).
> > 
> > I thought memfd_secret had also solved the problem of restoring large pages in
> > the direct map, but at a glance I can't tell if that's actually implemented
> > anywhere.  But, even if it's not currently implemented, I think it makes sense
> > to mimic the memfd_secret approach so that both features can benefit if large
> > page preservation/restoration is ever added.
> > 
> 
> Thanks for the memfd_secret pointer. At the lowest level it shares the
> same logic to split the physmap: we both end up calling
> change_page_attrs_set_clr(), which splits the page and updates the page
> table attributes.
> 
> Given this, I believe that if change_page_attrs_set_clr() is enhanced in
> the future to track the splitting of the pages and restore them later,
> then it should work transparently.

But something actually needs to initiate the restore.  If the RMPUPDATE path just
forces 4k pages then there will never be a restore.  And zapping the direct map
for private pages is a good thing, e.g. prevents the kernel from reading garbage,
which IIUC isn't enforced by the RMP?

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table
  2021-07-15 17:51       ` Sean Christopherson
@ 2021-07-15 18:14         ` Brijesh Singh
  2021-07-15 18:39           ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-15 18:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/15/21 12:51 PM, Sean Christopherson wrote:
> On Thu, Jul 15, 2021, Brijesh Singh wrote:
>>
>> On 7/14/21 5:25 PM, Sean Christopherson wrote:
>>>> @@ -2375,6 +2375,12 @@ int rmpupdate(struct page *page, struct rmpupdate *val)
>>>>    	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>>>>    		return -ENXIO;
>>>> +	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
>>>
>>> IIUC, this shatters the direct map for a page that's assigned to an SNP guest, and
>>> the large pages are never recovered?
>>>
>>> I believe a better approach would be to do something similar to memfd_secret[*],
>>> which encountered a similar problem with the direct map.  Instead of forcing the
>>> direct map to be forever 4k, unmap the direct map when making a page guest private,
>>> and restore the direct map when it's made shared (or freed).
>>>
>>> I thought memfd_secret had also solved the problem of restoring large pages in
>>> the direct map, but at a glance I can't tell if that's actually implemented
>>> anywhere.  But, even if it's not currently implemented, I think it makes sense
>>> to mimic the memfd_secret approach so that both features can benefit if large
>>> page preservation/restoration is ever added.
>>>
>>
>> Thanks for the memfd_secret pointer. At the lowest level it shares the
>> same logic to split the physmap: we both end up calling
>> change_page_attrs_set_clr(), which splits the page and updates the page
>> table attributes.
>>
>> Given this, I believe that if change_page_attrs_set_clr() is enhanced in
>> the future to track the splitting of the pages and restore them later,
>> then it should work transparently.
> 
> But something actually needs to initiate the restore.  If the RMPUPDATE path just
> forces 4k pages then there will never be a restore.  And zapping the direct map
> for private pages is a good thing, e.g. prevents the kernel from reading garbage,
> which IIUC isn't enforced by the RMP?
> 

Yes, something needs to initiate the restore. Since the restore support
is not present today, it's difficult to say how it will look. I am
thinking that the restore thread may use some kind of notifier to check
with the caller whether it's safe to restore the page ranges. In the
case of SEV-SNP, the SNP-registered notifier will reject the restore if
the guest is running.

memfd_secret uses set_direct_map_{invalid,default}_noflush(), which are
designed to remove/add the present bit in the direct map. We can't use
them, because in our case the page may still be accessed by KVM
(e.g. kvm_guest_write, kvm_guest_map, etc.).

thanks

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 05/40] x86/sev: Add RMP entry lookup helpers
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 05/40] x86/sev: Add RMP entry lookup helpers Brijesh Singh
@ 2021-07-15 18:37   ` Sean Christopherson
  2021-07-15 19:28     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-15 18:37 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> The snp_lookup_page_in_rmptable() can be used by the host to read the RMP
> entry for a given page. The RMP entry format is documented in AMD PPR, see
> https://bugzilla.kernel.org/attachment.cgi?id=296015.

Ewwwwww, the RMP format isn't architectural!?

  Architecturally the format of RMP entries are not specified in APM. In order
  to assist software, the following table specifies select portions of the RMP
  entry format for this specific product.

I know we generally don't want to add infrastructure without good reason, but on
the other hand exposing a microarchitectural data structure to the kernel at large
is going to be a disaster if the format does change on a future processor.

Looking at the future patches, dump_rmpentry() is the only power user, e.g.
everything else mostly looks at "assigned" and "level" (and one ratelimited warn
on "validated" in snp_make_page_shared(), but I suspect that particular check
can and should be dropped).

So, what about hiding "struct rmpentry" and possibly renaming it to something
scary/microarchitectural, e.g. something like

/*
 * Returns 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
 * and -errno if there is no corresponding RMP entry.
 */
int snp_lookup_rmpentry(struct page *page, int *level)
{
	unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
	struct rmpentry *entry, *large_entry;
	unsigned long vaddr;

	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
		return -ENXIO;

	vaddr = rmptable_start + rmptable_page_offset(phys);
	if (unlikely(vaddr > rmptable_end))
		return -ENXIO;

	entry = (struct rmpentry *)vaddr;

	/* Read a large RMP entry to get the correct page level used in RMP entry. */
	vaddr = rmptable_start + rmptable_page_offset(phys & PMD_MASK);
	large_entry = (struct rmpentry *)vaddr;
	*level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));

	return !!entry->assigned;
}


And then move dump_rmpentry() (or add a helper) in sev.c so that "struct rmpentry"
can be declared in sev.c.

> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/include/asm/sev.h |  4 +--
>  arch/x86/kernel/sev.c      | 26 +++++++++++++++++++
>  include/linux/sev.h        | 51 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 78 insertions(+), 3 deletions(-)
>  create mode 100644 include/linux/sev.h
> 
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index 6c23e694a109..9e7e7e737f55 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -9,6 +9,7 @@
>  #define __ASM_ENCRYPTED_STATE_H
>  
>  #include <linux/types.h>
> +#include <linux/sev.h>

Why move things to linux/sev.h?  AFAICT, even at the end of the series, the only
users of anything in this file all reside somewhere in arch/x86.

>  #include <asm/insn.h>
>  #include <asm/sev-common.h>
>  #include <asm/bootparam.h>
> @@ -75,9 +76,6 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
>  /* Software defined (when rFlags.CF = 1) */
>  #define PVALIDATE_FAIL_NOUPDATE		255
>  
> -/* RMP page size */
> -#define RMP_PG_SIZE_4K			0
> -
>  #define RMPADJUST_VMSA_PAGE_BIT		BIT(16)
>  
>  #ifdef CONFIG_AMD_MEM_ENCRYPT
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index f9d813d498fa..1aed3d53f59f 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -49,6 +49,8 @@
>  #define DR7_RESET_VALUE        0x400
>  
>  #define RMPTABLE_ENTRIES_OFFSET        0x4000
> +#define RMPENTRY_SHIFT			8
> +#define rmptable_page_offset(x)	(RMPTABLE_ENTRIES_OFFSET + (((unsigned long)x) >> RMPENTRY_SHIFT))
>  
>  /* For early boot hypervisor communication in SEV-ES enabled guests */
>  static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
> @@ -2319,3 +2321,27 @@ static int __init snp_rmptable_init(void)
>   * passthrough state, and it is available after subsys_initcall().
>   */
>  fs_initcall(snp_rmptable_init);
> +
> +struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level)

Maybe just snp_get_rmpentry?  Or snp_lookup_rmpentry?  I'm guessing the name was
chosen to align with e.g. lookup_address_in_mm, but IMO the lookup_address helpers
are oddly named.

> +{
> +	unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
> +	struct rmpentry *entry, *large_entry;
> +	unsigned long vaddr;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> +		return NULL;
> +
> +	vaddr = rmptable_start + rmptable_page_offset(phys);
> +	if (unlikely(vaddr > rmptable_end))
> +		return NULL;
> +
> +	entry = (struct rmpentry *)vaddr;
> +
> +	/* Read a large RMP entry to get the correct page level used in RMP entry. */
> +	vaddr = rmptable_start + rmptable_page_offset(phys & PMD_MASK);
> +	large_entry = (struct rmpentry *)vaddr;
> +	*level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));
> +
> +	return entry;
> +}

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table
  2021-07-15 18:14         ` Brijesh Singh
@ 2021-07-15 18:39           ` Sean Christopherson
  2021-07-15 19:38             ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-15 18:39 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Thu, Jul 15, 2021, Brijesh Singh wrote:
> memfd_secret uses set_direct_map_{invalid,default}_noflush(), which are
> designed to remove/add the present bit in the direct map. We can't use
> them, because in our case the page may still be accessed by KVM
> (e.g. kvm_guest_write, kvm_guest_map, etc.).

But KVM should never access a guest private page, i.e. the direct map should
always be restored to PRESENT before KVM attempts to access the page.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 06/40] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
  2021-07-12 19:00     ` Dave Hansen
@ 2021-07-15 18:56       ` Sean Christopherson
  2021-07-15 19:08         ` Dave Hansen
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-15 18:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Gonda, Brijesh Singh, x86, linux-kernel, kvm list,
	linux-efi, platform-driver-x86, linux-coco, linux-mm,
	linux-crypto, Thomas Gleixner, Ingo Molnar, Joerg Roedel,
	Tom Lendacky, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Andy Lutomirski,
	Dave Hansen, Sergio Lopez, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	Nathaniel McCallum, brijesh.ksingh

On Mon, Jul 12, 2021, Dave Hansen wrote:
> On 7/12/21 11:44 AM, Peter Gonda wrote:
> >> +int psmash(struct page *page)
> >> +{
> >> +       unsigned long spa = page_to_pfn(page) << PAGE_SHIFT;
> >> +       int ret;
> >> +
> >> +       if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> >> +               return -ENXIO;
> >> +
> >> +       /* Retry if another processor is modifying the RMP entry. */
> >> +       do {
> >> +               /* Binutils version 2.36 supports the PSMASH mnemonic. */
> >> +               asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
> >> +                             : "=a"(ret)
> >> +                             : "a"(spa)
> >> +                             : "memory", "cc");
> >> +       } while (ret == FAIL_INUSE);
> > Should there be some retry limit here for safety? Or do we know that
> > we'll never be stuck in this loop? Ditto for the loop in rmpupdate.
> 
> It's probably fine to just leave this.  While you could *theoretically*
> lose this race forever, it's unlikely to happen in practice.  If it
> does, you'll get an easy-to-understand softlockup backtrace which should
> point here pretty quickly.

But should failure here even be tolerated?  The TDX cases spin on flows that are
_not_ due to (direct) contention, e.g. a pending interrupt while flushing the
cache or lack of randomness when generating a key.  In this case, there are two
CPUs racing to modify the RMP entry, which implies that the final state of the
RMP entry is not deterministic.

> I think TDX has a few of these as well.  Most of the "SEAMCALL"s from
> host to the firmware doing the security enforcement have something like
> an -EBUSY as well.  I believe they just retry forever too.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 08/40] x86/traps: Define RMP violation #PF error code
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 08/40] x86/traps: Define RMP violation #PF error code Brijesh Singh
@ 2021-07-15 19:02   ` Sean Christopherson
  2021-07-15 19:16     ` Dave Hansen
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-15 19:02 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> Bit 31 in the page fault error code will be set when the processor
> encounters an RMP violation.
> 
> While at it, use the BIT() macro.
> 
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/include/asm/trap_pf.h | 18 +++++++++++-------
>  arch/x86/mm/fault.c            |  1 +
>  2 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
> index 10b1de500ab1..29f678701753 100644
> --- a/arch/x86/include/asm/trap_pf.h
> +++ b/arch/x86/include/asm/trap_pf.h
> @@ -2,6 +2,8 @@
>  #ifndef _ASM_X86_TRAP_PF_H
>  #define _ASM_X86_TRAP_PF_H
>  
> +#include <vdso/bits.h>  /* BIT() macro */

What are people's thoughts on using linux/bits.h instead of vdso/bits.h, even
though the vDSO version is technically sufficient?  Seeing the "vdso" reference
definitely made me blink slowly a few times.

> +
>  /*
>   * Page fault error code bits:
>   *
> @@ -12,15 +14,17 @@
>   *   bit 4 ==				1: fault was an instruction fetch
>   *   bit 5 ==				1: protection keys block access
>   *   bit 15 ==				1: SGX MMU page-fault
> + *   bit 31 ==				1: fault was an RMP violation
>   */
>  enum x86_pf_error_code {
> -	X86_PF_PROT	=		1 << 0,
> -	X86_PF_WRITE	=		1 << 1,
> -	X86_PF_USER	=		1 << 2,
> -	X86_PF_RSVD	=		1 << 3,
> -	X86_PF_INSTR	=		1 << 4,
> -	X86_PF_PK	=		1 << 5,
> -	X86_PF_SGX	=		1 << 15,
> +	X86_PF_PROT	=		BIT(0),
> +	X86_PF_WRITE	=		BIT(1),
> +	X86_PF_USER	=		BIT(2),
> +	X86_PF_RSVD	=		BIT(3),
> +	X86_PF_INSTR	=		BIT(4),
> +	X86_PF_PK	=		BIT(5),
> +	X86_PF_SGX	=		BIT(15),
> +	X86_PF_RMP	=		BIT(31),
>  };
>  
>  #endif /* _ASM_X86_TRAP_PF_H */
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 1c548ad00752..2715240c757e 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -545,6 +545,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
>  		 !(error_code & X86_PF_PROT) ? "not-present page" :
>  		 (error_code & X86_PF_RSVD)  ? "reserved bit violation" :
>  		 (error_code & X86_PF_PK)    ? "protection keys violation" :
> +		 (error_code & X86_PF_RMP)   ? "rmp violation" :
>  					       "permissions violation");
>  
>  	if (!(error_code & X86_PF_USER) && user_mode(regs)) {
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 06/40] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
  2021-07-15 18:56       ` Sean Christopherson
@ 2021-07-15 19:08         ` Dave Hansen
  2021-07-15 19:18           ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Dave Hansen @ 2021-07-15 19:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Peter Gonda, Brijesh Singh, x86, linux-kernel, kvm list,
	linux-efi, platform-driver-x86, linux-coco, linux-mm,
	linux-crypto, Thomas Gleixner, Ingo Molnar, Joerg Roedel,
	Tom Lendacky, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Andy Lutomirski,
	Dave Hansen, Sergio Lopez, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	Nathaniel McCallum, brijesh.ksingh

On 7/15/21 11:56 AM, Sean Christopherson wrote:
>>>> +       /* Retry if another processor is modifying the RMP entry. */
>>>> +       do {
>>>> +               /* Binutils version 2.36 supports the PSMASH mnemonic. */
>>>> +               asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
>>>> +                             : "=a"(ret)
>>>> +                             : "a"(spa)
>>>> +                             : "memory", "cc");
>>>> +       } while (ret == FAIL_INUSE);
>>> Should there be some retry limit here for safety? Or do we know that
>>> we'll never be stuck in this loop? Ditto for the loop in rmpupdate.
>> It's probably fine to just leave this.  While you could *theoretically*
>> lose this race forever, it's unlikely to happen in practice.  If it
>> does, you'll get an easy-to-understand softlockup backtrace which should
>> point here pretty quickly.
> But should failure here even be tolerated?  The TDX cases spin on flows that are
> _not_ due to (direct) contention, e.g. a pending interrupt while flushing the
> cache or lack of randomness when generating a key.  In this case, there are two
> CPUs racing to modify the RMP entry, which implies that the final state of the
> RMP entry is not deterministic.

I was envisioning that two different CPUs could try to smash two
*different* 4k physical pages, but collide since they share
a 2M page.

But, in patch 33, this is called via:

> +		write_lock(&kvm->mmu_lock);
> +
> +		switch (op) {
> +		case SNP_PAGE_STATE_SHARED:
> +			rc = snp_make_page_shared(vcpu, gpa, pfn, level);
...

Which should make collisions impossible.  Did I miss another call-site?

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 08/40] x86/traps: Define RMP violation #PF error code
  2021-07-15 19:02   ` Sean Christopherson
@ 2021-07-15 19:16     ` Dave Hansen
  0 siblings, 0 replies; 176+ messages in thread
From: Dave Hansen @ 2021-07-15 19:16 UTC (permalink / raw)
  To: Sean Christopherson, Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/15/21 12:02 PM, Sean Christopherson wrote:
>>  #ifndef _ASM_X86_TRAP_PF_H
>>  #define _ASM_X86_TRAP_PF_H
>>  
>> +#include <vdso/bits.h>  /* BIT() macro */
> What are people's thoughts on using linux/bits.h instead of vdso/bits.h, even
> though the vDSO version is technically sufficient?  Seeing the "vdso" reference
> definitely made me blink slowly a few times.

Ugh, missed that.  Yes, that does look very weird.

I don't see any reason to use that vdso/ version instead of BIT_ULL().
I suspect I said to use BIT() when I commented on this in a previous
round.  If so, that was wrong.


* Re: [PATCH Part2 RFC v4 06/40] x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
  2021-07-15 19:08         ` Dave Hansen
@ 2021-07-15 19:18           ` Sean Christopherson
  0 siblings, 0 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-15 19:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Gonda, Brijesh Singh, x86, linux-kernel, kvm list,
	linux-efi, platform-driver-x86, linux-coco, linux-mm,
	linux-crypto, Thomas Gleixner, Ingo Molnar, Joerg Roedel,
	Tom Lendacky, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Andy Lutomirski,
	Dave Hansen, Sergio Lopez, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	Nathaniel McCallum, brijesh.ksingh

On Thu, Jul 15, 2021, Dave Hansen wrote:
> On 7/15/21 11:56 AM, Sean Christopherson wrote:
> >>>> +       /* Retry if another processor is modifying the RMP entry. */
> >>>> +       do {
> >>>> +               /* Binutils version 2.36 supports the PSMASH mnemonic. */
> >>>> +               asm volatile(".byte 0xF3, 0x0F, 0x01, 0xFF"
> >>>> +                             : "=a"(ret)
> >>>> +                             : "a"(spa)
> >>>> +                             : "memory", "cc");
> >>>> +       } while (ret == FAIL_INUSE);
> >>> Should there be some retry limit here for safety? Or do we know that
> >>> we'll never be stuck in this loop? Ditto for the loop in rmpupdate.
> >> It's probably fine to just leave this.  While you could *theoretically*
> >> lose this race forever, it's unlikely to happen in practice.  If it
> >> does, you'll get an easy-to-understand softlockup backtrace which should
> >> point here pretty quickly.
> > But should failure here even be tolerated?  The TDX cases spin on flows that are
> > _not_ due to (direct) contention, e.g. a pending interrupt while flushing the
> > cache or lack of randomness when generating a key.  In this case, there are two
> > CPUs racing to modify the RMP entry, which implies that the final state of the
> > RMP entry is not deterministic.
> 
> I was envisioning that two different CPUs could try to smash two
> *different* 4k physical pages, but collide since they share
> a 2M page.
> 
> But, in patch 33, this is called via:
> 
> > +		write_lock(&kvm->mmu_lock);
> > +
> > +		switch (op) {
> > +		case SNP_PAGE_STATE_SHARED:
> > +			rc = snp_make_page_shared(vcpu, gpa, pfn, level);
> ...
> 
> Which should make collisions impossible.  Did I miss another call-site?

Ya, there's more, e.g. sev_snp_write_page_begin() and snp_handle_rmp_page_fault(),
both of which run without holding mmu_lock.  The PSMASH operation isn't too
concerning, but the associated RMPUPDATE is most definitely a concern, e.g. if two
vCPUs are trying to access different variants of a page.  It's ok if KVM's
"response" in such a situation does weird things to the guest, but one of the
two operations should "win", which I don't think is guaranteed if multiple RMP
violations are racing.

I'll circle back to this patch after I've gone through the KVM MMU changes.
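
For reference, a bounded-retry variant along the lines Peter asks about could look like the sketch below. This is a userspace illustration with the PSMASH instruction mocked out; `FAIL_INUSE`, `PSMASH_MAX_RETRIES`, and the `-EBUSY` fallback are assumptions for the example, not part of the posted patch:

```c
#include <errno.h>

#define FAIL_INUSE		3	/* hypothetical PSMASH "in use" status */
#define PSMASH_MAX_RETRIES	1000	/* hypothetical retry cap */

/* Mock of the PSMASH instruction: fails with FAIL_INUSE a few times. */
static int psmash_contended_attempts = 5;

static int mock_psmash(unsigned long spa)
{
	(void)spa;
	return psmash_contended_attempts-- > 0 ? FAIL_INUSE : 0;
}

/*
 * Bounded-retry variant of the helper: give up with -EBUSY instead of
 * spinning forever if another processor keeps modifying the RMP entry.
 */
static int psmash_bounded(unsigned long spa)
{
	int ret, retries = 0;

	do {
		ret = mock_psmash(spa);
		if (ret != FAIL_INUSE)
			return ret;
	} while (++retries < PSMASH_MAX_RETRIES);

	return -EBUSY;
}
```

Whether a hard cap is the right policy is exactly the open question above; the sketch only shows that bounding the loop converts a potential softlockup into an error the caller must handle.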


* Re: [PATCH Part2 RFC v4 05/40] x86/sev: Add RMP entry lookup helpers
  2021-07-15 18:37   ` Sean Christopherson
@ 2021-07-15 19:28     ` Brijesh Singh
  2021-07-16 17:22       ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-15 19:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/15/21 1:37 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> The snp_lookup_page_in_rmptable() can be used by the host to read the RMP
>> entry for a given page. The RMP entry format is documented in AMD PPR, see
>> https://bugzilla.kernel.org/attachment.cgi?id=296015.
> 
> Ewwwwww, the RMP format isn't architectural!?
> 
>    Architecturally the format of RMP entries are not specified in APM. In order
>    to assist software, the following table specifies select portions of the RMP
>    entry format for this specific product.
> 

Unfortunately yes.

But the documented fields in the RMP entry are architectural. The entry 
fields are documented in APM section 15.36, so in the future we are 
guaranteed to have those fields available. If, in the future, we cannot 
read the RMP table directly, then the architecture should provide some 
other means to get to the fields of the RMP entry.


> I know we generally don't want to add infrastructure without good reason, but on
> the other hand exposing a microarchitectural data structure to the kernel at large
> is going to be a disaster if the format does change on a future processor.
> 
> Looking at the future patches, dump_rmpentry() is the only power user, e.g.
> everything else mostly looks at "assigned" and "level" (and one ratelimited warn
> on "validated" in snp_make_page_shared(), but I suspect that particular check
> can and should be dropped).
> 

Yes, we need "assigned" and "level"; the other fields are mainly for 
debug purposes.

> So, what about hiding "struct rmpentry" and possibly renaming it to something
> scary/microarchitectural, e.g. something like
> 

Yes, it will work fine.

> /*
>   * Returns 1 if the RMP entry is assigned, 0 if it exists but is not assigned,
>   * and -errno if there is no corresponding RMP entry.
>   */
> int snp_lookup_rmpentry(struct page *page, int *level)
> {
> 	unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
> 	struct rmpentry *entry, *large_entry;
> 	unsigned long vaddr;
> 
> 	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> 		return -ENXIO;
> 
> 	vaddr = rmptable_start + rmptable_page_offset(phys);
> 	if (unlikely(vaddr > rmptable_end))
> 		return -ENXIO;
> 
> 	entry = (struct rmpentry *)vaddr;
> 
> 	/* Read a large RMP entry to get the correct page level used in RMP entry. */
> 	vaddr = rmptable_start + rmptable_page_offset(phys & PMD_MASK);
> 	large_entry = (struct rmpentry *)vaddr;
> 	*level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));
> 
> 	return !!entry->assigned;
> }
> 
> 
> And then move dump_rmpentry() (or add a helper) in sev.c so that "struct rmpentry"
> can be declared in sev.c.
> 

Ack.


>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> ---
>>   arch/x86/include/asm/sev.h |  4 +--
>>   arch/x86/kernel/sev.c      | 26 +++++++++++++++++++
>>   include/linux/sev.h        | 51 ++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 78 insertions(+), 3 deletions(-)
>>   create mode 100644 include/linux/sev.h
>>
>> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
>> index 6c23e694a109..9e7e7e737f55 100644
>> --- a/arch/x86/include/asm/sev.h
>> +++ b/arch/x86/include/asm/sev.h
>> @@ -9,6 +9,7 @@
>>   #define __ASM_ENCRYPTED_STATE_H
>>   
>>   #include <linux/types.h>
>> +#include <linux/sev.h>
> 
> Why move things to linux/sev.h?  AFAICT, even at the end of the series, the only
> users of anything in this file all reside somewhere in arch/x86.
> 


If we go with approach where the 'struct rmpentry' is not visible 
outside the arch/x86/kernel/sev.c then there is no need to define all 
these bit fields in linux/sev.h. I kept in linux/sev.h because driver 
(KVM, and PSP) uses the rmpentry_xxx() to read the fields.


>>   #include <asm/insn.h>
>>   #include <asm/sev-common.h>
>>   #include <asm/bootparam.h>
>> @@ -75,9 +76,6 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
>>   /* Software defined (when rFlags.CF = 1) */
>>   #define PVALIDATE_FAIL_NOUPDATE		255
>>   
>> -/* RMP page size */
>> -#define RMP_PG_SIZE_4K			0
>> -
>>   #define RMPADJUST_VMSA_PAGE_BIT		BIT(16)
>>   
>>   #ifdef CONFIG_AMD_MEM_ENCRYPT
>> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
>> index f9d813d498fa..1aed3d53f59f 100644
>> --- a/arch/x86/kernel/sev.c
>> +++ b/arch/x86/kernel/sev.c
>> @@ -49,6 +49,8 @@
>>   #define DR7_RESET_VALUE        0x400
>>   
>>   #define RMPTABLE_ENTRIES_OFFSET        0x4000
>> +#define RMPENTRY_SHIFT			8
>> +#define rmptable_page_offset(x)	(RMPTABLE_ENTRIES_OFFSET + (((unsigned long)x) >> RMPENTRY_SHIFT))
>>   
>>   /* For early boot hypervisor communication in SEV-ES enabled guests */
>>   static struct ghcb boot_ghcb_page __bss_decrypted __aligned(PAGE_SIZE);
>> @@ -2319,3 +2321,27 @@ static int __init snp_rmptable_init(void)
>>    * passthough state, and it is available after subsys_initcall().
>>    */
>>   fs_initcall(snp_rmptable_init);
>> +
>> +struct rmpentry *snp_lookup_page_in_rmptable(struct page *page, int *level)
> 
> Maybe just snp_get_rmpentry?  Or snp_lookup_rmpentry?  I'm guessing the name was
> chosen to align with e.g. lookup_address_in_mm, but IMO the lookup_address helpers
> are oddly named.
> 

Yes, it was mostly chosen to align with it. Dave recommended dropping 
the 'struct page *' arg and accepting the pfn directly. Based on your 
feedback, I am going to add

int snp_lookup_rmpentry(unsigned long pfn, int *level);
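
For illustration, the pfn-based interface could behave as in the sketch below. The RMP table is mocked as a plain array and the entry layout is invented for the example (the real format is product-specific and would stay hidden in sev.c, as Sean suggests):

```c
#include <errno.h>

#define PG_LEVEL_4K	1
#define PG_LEVEL_2M	2
#define RMP_MOCK_ENTRIES 1024	/* mock table covering two 2MB regions */

/* Illustrative layout only: the real entry format is product-specific. */
struct rmpentry_mock {
	unsigned int assigned : 1;
	unsigned int pagesize : 1;	/* 0 = 4K, 1 = 2M */
};

static struct rmpentry_mock rmptable_mock[RMP_MOCK_ENTRIES];

/*
 * Proposed contract: returns 1 if the RMP entry is assigned, 0 if it
 * exists but is not assigned, and -errno if there is no entry.
 */
static int snp_lookup_rmpentry(unsigned long pfn, int *level)
{
	struct rmpentry_mock *entry, *large_entry;

	if (pfn >= RMP_MOCK_ENTRIES)
		return -ENXIO;

	entry = &rmptable_mock[pfn];

	/* Page size is tracked in the entry of the containing 2MB region. */
	large_entry = &rmptable_mock[pfn & ~0x1ffUL];
	*level = large_entry->pagesize ? PG_LEVEL_2M : PG_LEVEL_4K;

	return entry->assigned;
}
```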

thanks


* Re: [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table
  2021-07-15 18:39           ` Sean Christopherson
@ 2021-07-15 19:38             ` Brijesh Singh
  2021-07-15 22:01               ` Sean Christopherson
  2021-07-30 11:31               ` Vlastimil Babka
  0 siblings, 2 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-15 19:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/15/21 1:39 PM, Sean Christopherson wrote:
> On Thu, Jul 15, 2021, Brijesh Singh wrote:
>> The memfd_secrets uses the set_direct_map_{invalid,default}_noflush() and it
>> is designed to remove/add the present bit in the direct map. We can't use
>> them, because in our case the page may get accessed by the KVM (e.g
>> kvm_guest_write, kvm_guest_map etc).
> 
> But KVM should never access a guest private page, i.e. the direct map should
> always be restored to PRESENT before KVM attempts to access the page.
> 

Yes, KVM should *never* access the guest private pages. So, we could 
potentially enhance the RMPUPDATE() to check for the assigned and act 
accordingly.

Are you thinking something along the line of this:

int rmpupdate(struct page *page, struct rmpupdate *val)
{
	...
	
	/*
	 * If page is getting assigned in the RMP entry then unmap
	 * it from the direct map before its added in the RMP table.
	 */
	if (val.assigned)
		set_direct_map_invalid_noflush(page_to_virt(page), 1);

	...

	/*
	 * If the page is getting unassigned then restore the mapping
	 * in the direct map after its removed from the RMP table.
	 */
	if (!val.assigned)
		set_direct_map_default_noflush(page_to_virt(page), 1);
	
	...
}

thanks


* Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
  2021-07-12 16:49                 ` Brijesh Singh
@ 2021-07-15 21:53                   ` Sean Christopherson
  0 siblings, 0 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-15 21:53 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: Dave Hansen, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh

On Mon, Jul 12, 2021, Brijesh Singh wrote:
> 
> 
> On 7/12/21 11:29 AM, Dave Hansen wrote:
> > On 7/12/21 9:24 AM, Brijesh Singh wrote:
> > > Apologies if I was not clear in the messaging, that's exactly what I
> > > mean that we don't feed RMP entries during the page state change.
> > > 
> > > The sequence of the operation is:
> > > 
> > > 1. Guest issues a VMGEXIT (page state change) to add a page in the RMP
> > > 2. Hyperivosr adds the page in the RMP table.
> > > 
> > > The check will be inside the hypervisor (#2), to query the backing page
> > > type, if the backing page is from the hugetlbfs, then don't add the page
> > > in the RMP, and fail the page state change VMGEXIT.
> > 
> > Right, but *LOOOOOONG* before that, something walked the page tables and
> > stuffed the PFN into the NPT (that's the AMD equivalent of EPT, right?).
> >   You could also avoid this whole mess by refusing to allow hugetblfs to
> > be mapped into the guest in the first place.
> > 
> 
> Ah, that should be doable. For SEV stuff, we require the VMM to register the
> memory region to the hypervisor during the VM creation time. I can check the
> hugetlbfs while registering the memory region and fail much earlier.

That's technically unnecessary, because this patch is working on the wrong set of
page tables when handling faults from KVM.

The host page tables constrain KVM's NPT, but the two are not mirrors of each
other.  Specifically, KVM cannot exceed the size of the host page tables because
that would give the guest access to memory it does not own, but KVM isn't required
to use the same size as the host.  E.g. a 1gb page in the host can be 1gb, 2mb, or
4kb in the NPT.

The code "works" because the size constraints mean it can't get false negatives,
only false positives, and false positives will never be fatal, e.g. the fault handler
may unnecessarily demote a 1gb page, and demoting a host page will further constrain
KVM's NPT.

The distinction matters because it changes our options.  For RMP violations on
NPT due to page size mismatches, KVM can and should handle the fault without
consulting the primary MMU, i.e. by demoting the NPT entry.  That means KVM does
not need to care about hugetlbfs or any other backing type that cannot be split
since KVM will never initiate a host page split in response to a #NPT RMP violation.

That doesn't mean that hugetlbfs will magically work since e.g. get/put_user()
will fault and fail, but that's a generic non-KVM problem since nothing prevents
remapping and/or accessing the page(s) outside of KVM context.

The other reason to not disallow hugetlbfs and co. is that a guest that's
enlightened to operate at 2mb granularity, e.g. always do page state changes on
2mb chunks, can play nice with hugetlbfs without ever hitting an RMP violation.

Last thought, have we taken care in the guest side of things to work at 2mb
granularity when possible?  AFAICT, PSMASH is effectively a one-way street since
RMPUPDATE to restore a 2mb RMP is destructive, i.e. requires PVALIDATE on the
entire 2mb chunk, and the guest can't safely do that without reinitializing the
whole page, e.g. would either lose data or have to save/init/restore.


* Re: [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table
  2021-07-15 19:38             ` Brijesh Singh
@ 2021-07-15 22:01               ` Sean Christopherson
  2021-07-15 22:11                 ` Brijesh Singh
  2021-07-30 11:31               ` Vlastimil Babka
  1 sibling, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-15 22:01 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Thu, Jul 15, 2021, Brijesh Singh wrote:
> 
> 
> On 7/15/21 1:39 PM, Sean Christopherson wrote:
> > On Thu, Jul 15, 2021, Brijesh Singh wrote:
> > > The memfd_secrets uses the set_direct_map_{invalid,default}_noflush() and it
> > > is designed to remove/add the present bit in the direct map. We can't use
> > > them, because in our case the page may get accessed by the KVM (e.g
> > > kvm_guest_write, kvm_guest_map etc).
> > 
> > But KVM should never access a guest private page, i.e. the direct map should
> > always be restored to PRESENT before KVM attempts to access the page.
> > 
> 
> Yes, KVM should *never* access the guest private pages. So, we could
> potentially enhance the RMPUPDATE() to check for the assigned and act
> accordingly.
> 
> Are you thinking something along the line of this:
> 
> int rmpupdate(struct page *page, struct rmpupdate *val)
> {
> 	...
> 	
> 	/*
> 	 * If page is getting assigned in the RMP entry then unmap
> 	 * it from the direct map before its added in the RMP table.
> 	 */
> 	if (val.assigned)
> 		set_direct_map_invalid_noflush(page_to_virt(page), 1);
> 
> 	...
> 
> 	/*
> 	 * If the page is getting unassigned then restore the mapping
> 	 * in the direct map after its removed from the RMP table.
> 	 */
> 	if (!val.assigned)
> 		set_direct_map_default_noflush(page_to_virt(page), 1);
> 	
> 	...
> }

Yep.

However, looking at the KVM usage, rmpupdate() appears to be broken.  When
handling a page state change, the guest can specify a 2mb page.  In that case,
rmpupdate() will be called once for a 2mb page, but this flow assumes a single
4kb page.  The current code works because set_memory_4k() will cause the entire
2mb page to be shattered, but it's technically wrong and switching to the above
would cause problems.


* Re: [PATCH Part2 RFC v4 07/40] x86/sev: Split the physmap when adding the page in RMP table
  2021-07-15 22:01               ` Sean Christopherson
@ 2021-07-15 22:11                 ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-15 22:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/15/21 5:01 PM, Sean Christopherson wrote:
> On Thu, Jul 15, 2021, Brijesh Singh wrote:
>>
>> On 7/15/21 1:39 PM, Sean Christopherson wrote:
>>> On Thu, Jul 15, 2021, Brijesh Singh wrote:
>>>> The memfd_secrets uses the set_direct_map_{invalid,default}_noflush() and it
>>>> is designed to remove/add the present bit in the direct map. We can't use
>>>> them, because in our case the page may get accessed by the KVM (e.g
>>>> kvm_guest_write, kvm_guest_map etc).
>>> But KVM should never access a guest private page, i.e. the direct map should
>>> always be restored to PRESENT before KVM attempts to access the page.
>>>
>> Yes, KVM should *never* access the guest private pages. So, we could
>> potentially enhance the RMPUPDATE() to check for the assigned and act
>> accordingly.
>>
>> Are you thinking something along the line of this:
>>
>> int rmpupdate(struct page *page, struct rmpupdate *val)
>> {
>> 	...
>> 	
>> 	/*
>> 	 * If page is getting assigned in the RMP entry then unmap
>> 	 * it from the direct map before its added in the RMP table.
>> 	 */
>> 	if (val.assigned)
>> 		set_direct_map_invalid_noflush(page_to_virt(page), 1);
>>
>> 	...
>>
>> 	/*
>> 	 * If the page is getting unassigned then restore the mapping
>> 	 * in the direct map after its removed from the RMP table.
>> 	 */
>> 	if (!val.assigned)
>> 		set_direct_map_default_noflush(page_to_virt(page), 1);
>> 	
>> 	...
>> }
> Yep.
>
> However, looking at the KVM usage, rmpupdate() appears to be broken.  When
> handling a page state change, the guest can specify a 2mb page.  In that case,
> rmpupdate() will be called once for a 2mb page, but this flow assumes a single
> 4kb page.  The current code works because set_memory_4k() will cause the entire
> 2mb page to be shattered, but it's technically wrong and switching to the above
> would cause problems.


Yep, this was just an example to make sure I am following you
correctly. In the actual patch I am going to read the page size from the
RMPUPDATE structure and calculate npages for
set_direct_map_default(...). As you said, it was not needed in the case
of set_memory_4k() because that function forcibly splits the large page.
Whereas set_direct_map_default() first checks whether a split is
required and, if not, skips it and just updates the attributes.
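
A minimal userspace sketch of that fix, with the direct-map helper mocked out and the npages derivation made explicit (the function and constant names here are illustrative, not the actual patch):

```c
#include <stdbool.h>

#define RMP_PG_SIZE_4K	0
#define RMP_PG_SIZE_2M	1
#define PTRS_PER_PMD	512	/* 4K pages per 2MB region */

/* Mocked direct-map helper: records how many pages it was asked to unmap. */
static int last_unmap_npages;

static int set_direct_map_invalid_noflush_mock(void *addr, int numpages)
{
	(void)addr;
	last_unmap_npages = numpages;
	return 0;
}

/*
 * Sketch of the fix: derive npages from the RMPUPDATE page size so a
 * 2MB update adjusts the direct map for the whole 2MB range, not just
 * the first 4K page.
 */
static int rmpupdate_adjust_direct_map(void *vaddr, int pagesize, bool assigned)
{
	int npages = (pagesize == RMP_PG_SIZE_2M) ? PTRS_PER_PMD : 1;

	if (assigned)
		return set_direct_map_invalid_noflush_mock(vaddr, npages);

	/* The unassigned path would restore the mapping symmetrically. */
	return 0;
}
```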

-Brijesh



* Re: [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled Brijesh Singh
  2021-07-14 13:22   ` Marc Orr
@ 2021-07-15 23:48   ` Sean Christopherson
  2021-07-16 12:55     ` Brijesh Singh
  1 sibling, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-15 23:48 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> The behavior and requirement for the SEV-legacy command is altered when
> the SNP firmware is in the INIT state. See SEV-SNP firmware specification
> for more details.
> 
> When SNP is INIT state, all the SEV-legacy commands that cause the
> firmware to write memory must be in the firmware state. The TMR memory

It'd be helpful to spell out Trusted Memory Region, I hadn't seen that
term before and for some reason my brain immediately thought "xAPIC register!".

> is allocated by the host but updated by the firmware, so, it must be
> in the firmware state.  Additionally, the TMR memory must be a 2MB aligned
> instead of the 1MB, and the TMR length need to be 2MB instead of 1MB.
> The helper __snp_{alloc,free}_firmware_pages() can be used for allocating
> and freeing the memory used by the firmware.

None of this actually states what the patch does, e.g. it's not clear whether
all allocations are being converted to 2mb or just the SNP.  Looks like it's
just SNP.  Something like this?

  Allocate the Trusted Memory Region (TMR) as a 2mb sized/aligned region when
  SNP is enabled to satisfy new requirements for SNP.  Continue allocating a
  1mb region for !SNP configuration.

> While at it, provide API that can be used by others to allocate a page
> that can be used by the firmware. The immediate user for this API will
> be the KVM driver. The KVM driver to need to allocate a firmware context
> page during the guest creation. The context page need to be updated
> by the firmware. See the SEV-SNP specification for further details.

...

> @@ -1153,8 +1269,10 @@ static void sev_firmware_shutdown(struct sev_device *sev)
>  		/* The TMR area was encrypted, flush it from the cache */
>  		wbinvd_on_all_cpus();
>  
> -		free_pages((unsigned long)sev_es_tmr,
> -			   get_order(SEV_ES_TMR_SIZE));
> +
> +		__snp_free_firmware_pages(virt_to_page(sev_es_tmr),
> +					  get_order(sev_es_tmr_size),
> +					  false);
>  		sev_es_tmr = NULL;
>  	}
>  
> @@ -1204,16 +1322,6 @@ void sev_pci_init(void)
>  	    sev_update_firmware(sev->dev) == 0)
>  		sev_get_api_version();
>  
> -	/* Obtain the TMR memory area for SEV-ES use */
> -	tmr_page = alloc_pages(GFP_KERNEL, get_order(SEV_ES_TMR_SIZE));
> -	if (tmr_page) {
> -		sev_es_tmr = page_address(tmr_page);
> -	} else {
> -		sev_es_tmr = NULL;
> -		dev_warn(sev->dev,
> -			 "SEV: TMR allocation failed, SEV-ES support unavailable\n");
> -	}
> -
>  	/*
>  	 * If boot CPU supports the SNP, then first attempt to initialize
>  	 * the SNP firmware.
> @@ -1229,6 +1337,16 @@ void sev_pci_init(void)
>  		}
>  	}
>  
> +	/* Obtain the TMR memory area for SEV-ES use */
> +	tmr_page = __snp_alloc_firmware_pages(GFP_KERNEL, get_order(sev_es_tmr_size), false);
> +	if (tmr_page) {
> +		sev_es_tmr = page_address(tmr_page);
> +	} else {
> +		sev_es_tmr = NULL;
> +		dev_warn(sev->dev,
> +			 "SEV: TMR allocation failed, SEV-ES support unavailable\n");
> +	}

I think your patch ordering got a bit wonky.  AFAICT, the chunk that added
sev_snp_init() and friends in the previous patch 14 should have landed above
the TMR allocation, i.e. the code movement here should be unnecessary.

>  	/* Initialize the platform */
>  	rc = sev_platform_init(&error);
>  	if (rc && (error == SEV_RET_SECURE_DATA_INVALID)) {

...

> @@ -961,6 +965,13 @@ static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *erro
>  	return -ENODEV;
>  }
>  
> +static inline void *snp_alloc_firmware_page(gfp_t mask)
> +{
> +	return NULL;
> +}
> +
> +static inline void snp_free_firmware_page(void *addr) { }

Hmm, I think we should probably bite the bullet and #ifdef and/or stub out large
swaths of svm/sev.c before adding SNP support.  sev.c is getting quite massive,
and we're accumulating more and more stubs outside of KVM because its SEV code
is compiled unconditionally.


* Re: [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  2021-07-15 23:48   ` Sean Christopherson
@ 2021-07-16 12:55     ` Brijesh Singh
  2021-07-16 15:35       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 12:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/15/21 6:48 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> The behavior and requirement for the SEV-legacy command is altered when
>> the SNP firmware is in the INIT state. See SEV-SNP firmware specification
>> for more details.
>>
>> When SNP is INIT state, all the SEV-legacy commands that cause the
>> firmware to write memory must be in the firmware state. The TMR memory
> It'd be helpful to spell out Trusted Memory Region, I hadn't seen that
> term before and for some reason my brain immediately thought "xAPIC register!".

Noted.


>
>> is allocated by the host but updated by the firmware, so, it must be
>> in the firmware state.  Additionally, the TMR memory must be a 2MB aligned
>> instead of the 1MB, and the TMR length need to be 2MB instead of 1MB.
>> The helper __snp_{alloc,free}_firmware_pages() can be used for allocating
>> and freeing the memory used by the firmware.
> None of this actually states what the patch does, e.g. it's not clear whether
> all allocations are being converted to 2mb or just the SNP.  Looks like it's
> just SNP.  Something like this?
>
>   Allocate the Trusted Memory Region (TMR) as a 2mb sized/aligned region when
>   SNP is enabled to satisfy new requirements for SNP.  Continue allocating a
>   1mb region for !SNP configuration.
>
Only the TMR allocation is converted to use 2mb when SNP is enabled.


>> While at it, provide API that can be used by others to allocate a page
>> that can be used by the firmware. The immediate user for this API will
>> be the KVM driver. The KVM driver to need to allocate a firmware context
>> page during the guest creation. The context page need to be updated
>> by the firmware. See the SEV-SNP specification for further details.
> ...
>
>> @@ -1153,8 +1269,10 @@ static void sev_firmware_shutdown(struct sev_device *sev)
>>  		/* The TMR area was encrypted, flush it from the cache */
>>  		wbinvd_on_all_cpus();
>>  
>> -		free_pages((unsigned long)sev_es_tmr,
>> -			   get_order(SEV_ES_TMR_SIZE));
>> +
>> +		__snp_free_firmware_pages(virt_to_page(sev_es_tmr),
>> +					  get_order(sev_es_tmr_size),
>> +					  false);
>>  		sev_es_tmr = NULL;
>>  	}
>>  
>> @@ -1204,16 +1322,6 @@ void sev_pci_init(void)
>>  	    sev_update_firmware(sev->dev) == 0)
>>  		sev_get_api_version();
>>  
>> -	/* Obtain the TMR memory area for SEV-ES use */
>> -	tmr_page = alloc_pages(GFP_KERNEL, get_order(SEV_ES_TMR_SIZE));
>> -	if (tmr_page) {
>> -		sev_es_tmr = page_address(tmr_page);
>> -	} else {
>> -		sev_es_tmr = NULL;
>> -		dev_warn(sev->dev,
>> -			 "SEV: TMR allocation failed, SEV-ES support unavailable\n");
>> -	}
>> -
>>  	/*
>>  	 * If boot CPU supports the SNP, then first attempt to initialize
>>  	 * the SNP firmware.
>> @@ -1229,6 +1337,16 @@ void sev_pci_init(void)
>>  		}
>>  	}
>>  
>> +	/* Obtain the TMR memory area for SEV-ES use */
>> +	tmr_page = __snp_alloc_firmware_pages(GFP_KERNEL, get_order(sev_es_tmr_size), false);
>> +	if (tmr_page) {
>> +		sev_es_tmr = page_address(tmr_page);
>> +	} else {
>> +		sev_es_tmr = NULL;
>> +		dev_warn(sev->dev,
>> +			 "SEV: TMR allocation failed, SEV-ES support unavailable\n");
>> +	}
> I think your patch ordering got a bit wonky.  AFAICT, the chunk that added
> sev_snp_init() and friends in the previous patch 14 should have landed above
> the TMR allocation, i.e. the code movement here should be unnecessary.

I was debating whether to include all the SNP support in one
patch or divide it up. If I had included all the new legacy-support
requirements in the same patch that adds SNP, it would have been a big
patch, and I had a feeling others would ask me to split it. So my
approach is:

* The first patch adds SNP support only.

* Then improve legacy SEV/ES for the requirements that apply when SNP
is enabled. Once SNP is enabled, there are two new requirements for
legacy SEV/ES guests:

  1) The TMR must be 2mb.

  2) A buffer given to the firmware for a write must be in the
firmware state.

I also divided the two new requirements into separate patches so that
they are easy to review.


>
>>  	/* Initialize the platform */
>>  	rc = sev_platform_init(&error);
>>  	if (rc && (error == SEV_RET_SECURE_DATA_INVALID)) {
> ...
>
>> @@ -961,6 +965,13 @@ static inline int snp_guest_dbg_decrypt(struct sev_data_snp_dbg *data, int *erro
>>  	return -ENODEV;
>>  }
>>  
>> +static inline void *snp_alloc_firmware_page(gfp_t mask)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static inline void snp_free_firmware_page(void *addr) { }
> Hmm, I think we should probably bite the bullet and #ifdef and/or stub out large
> swaths of svm/sev.c before adding SNP support.  sev.c is getting quite massive,
> and we're accumulating more and more stubs outside of KVM because its SEV code
> is compiled unconditionally.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  2021-07-16 12:55     ` Brijesh Singh
@ 2021-07-16 15:35       ` Sean Christopherson
  2021-07-16 15:47         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 15:35 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Fri, Jul 16, 2021, Brijesh Singh wrote:
> 
> On 7/15/21 6:48 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> >> @@ -1204,16 +1322,6 @@ void sev_pci_init(void)
> >>  	    sev_update_firmware(sev->dev) == 0)
> >>  		sev_get_api_version();
> >>  
> >> -	/* Obtain the TMR memory area for SEV-ES use */
> >> -	tmr_page = alloc_pages(GFP_KERNEL, get_order(SEV_ES_TMR_SIZE));
> >> -	if (tmr_page) {
> >> -		sev_es_tmr = page_address(tmr_page);
> >> -	} else {
> >> -		sev_es_tmr = NULL;
> >> -		dev_warn(sev->dev,
> >> -			 "SEV: TMR allocation failed, SEV-ES support unavailable\n");
> >> -	}
> >> -
> >>  	/*
> >>  	 * If boot CPU supports the SNP, then first attempt to initialize
> >>  	 * the SNP firmware.
> >> @@ -1229,6 +1337,16 @@ void sev_pci_init(void)
> >>  		}
> >>  	}
> >>  
> >> +	/* Obtain the TMR memory area for SEV-ES use */
> >> +	tmr_page = __snp_alloc_firmware_pages(GFP_KERNEL, get_order(sev_es_tmr_size), false);
> >> +	if (tmr_page) {
> >> +		sev_es_tmr = page_address(tmr_page);
> >> +	} else {
> >> +		sev_es_tmr = NULL;
> >> +		dev_warn(sev->dev,
> >> +			 "SEV: TMR allocation failed, SEV-ES support unavailable\n");
> >> +	}
> > I think your patch ordering got a bit wonky.  AFAICT, the chunk that added
> > sev_snp_init() and friends in the previous patch 14 should have landed above
> > the TMR allocation, i.e. the code movement here should be unnecessary.
> 
> I was debating whether to include all the SNP support in one patch or
> divide it up. If I had included the new legacy-support requirements in
> the same patch that adds SNP, it would have been a big patch, and I
> suspected others would ask me to split it.

It wasn't a comment on the patch organization; rather, the code added in patch 14
appears to have landed in the wrong location within the file.  The above diff shows
that the TMR allocation is being moved around the SNP initialization code that was
added in patch 14 (the immediately prior patch).  Presumably the required order
doesn't magically change just because the TMR is now being allocated as a 2MB blob,
so either the code movement is unnecessary churn or the original location was wrong.
In either case, landing the SNP initialization code above the TMR allocation in
patch 14 would eliminate the above code movement.


* Re: [PATCH Part2 RFC v4 15/40] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  2021-07-16 15:35       ` Sean Christopherson
@ 2021-07-16 15:47         ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 15:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 10:35 AM, Sean Christopherson wrote:
> It wasn't a comment on the patch organization; rather, the code added in patch 14
> appears to have landed in the wrong location within the file.  The above diff shows
> that the TMR allocation is being moved around the SNP initialization code that was
> added in patch 14 (the immediately prior patch).  Presumably the required order
> doesn't magically change just because the TMR is now being allocated as a 2MB blob,
> so either the code movement is unnecessary churn or the original location was wrong.
> In either case, landing the SNP initialization code above the TMR allocation in
> patch 14 would eliminate the above code movement.

Got it, I'll rearrange things in the previous patch to avoid this hunk.

thanks




* Re: [PATCH Part2 RFC v4 05/40] x86/sev: Add RMP entry lookup helpers
  2021-07-15 19:28     ` Brijesh Singh
@ 2021-07-16 17:22       ` Brijesh Singh
  2021-07-20 22:06         ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 17:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/15/21 2:28 PM, Brijesh Singh wrote:
>
>
> On 7/15/21 1:37 PM, Sean Christopherson wrote:
>> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>>> The snp_lookup_page_in_rmptable() can be used by the host to read
>>> the RMP
>>> entry for a given page. The RMP entry format is documented in AMD
>>> PPR, see
>>> https://bugzilla.kernel.org/attachment.cgi?id=296015.
>>>
>>
>> Ewwwwww, the RMP format isn't architectural!?
>>
>>    Architecturally the format of RMP entries are not specified in
>> APM. In order
>>    to assist software, the following table specifies select portions
>> of the RMP
>>    entry format for this specific product.
>>
>
> Unfortunately yes.
>
> But the documented fields in the RMP entry are architectural. The entry
> fields are documented in APM section 15.36, so in the future we are
> guaranteed to have those fields available. If we can no longer read the
> RMP table directly, then the architecture should provide some other
> means to get at the fields of an RMP entry.
>
>
>> I know we generally don't want to add infrastructure without good
>> reason, but on
>> the other hand exposing a microarchitectural data structure to the
>> kernel at large
>> is going to be a disaster if the format does change on a future
>> processor.
>>
>> Looking at the future patches, dump_rmpentry() is the only power
>> user, e.g.
>> everything else mostly looks at "assigned" and "level" (and one
>> ratelimited warn
>> on "validated" in snp_make_page_shared(), but I suspect that
>> particular check
>> can and should be dropped).
>>
>
> Yes, we need "assigned" and "level"; the other entries are mainly for
> debugging purposes.
>
For debugging purposes, we would like to dump additional RMP entry
fields. If we go with your proposed function, how do we get that
information in dump_rmpentry()? How about providing two functions: the
first provides the architectural format and the second provides the raw
values, which can be used by the dump_rmpentry() helper.

struct rmpentry *snp_lookup_rmpentry(unsigned long paddr, int *level);

The 'struct rmpentry' uses the format defined in APM Table 15-36.

struct _rmpentry *_snp_lookup_rmpentry(unsigned long paddr, int *level);

The 'struct _rmpentry' will use the PPR definition (basically
what we have today in this patch).

Thoughts?


>> So, what about hiding "struct rmpentry" and possibly renaming it to
>> something
>> scary/microarchitectural, e.g. something like
>>
>
> Yes, it will work fine.
>
>> /*
>>   * Returns 1 if the RMP entry is assigned, 0 if it exists but is not
>> assigned,
>>   * and -errno if there is no corresponding RMP entry.
>>   */
>> int snp_lookup_rmpentry(struct page *page, int *level)
>> {
>>     unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
>>     struct rmpentry *entry, *large_entry;
>>     unsigned long vaddr;
>>
>>     if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>>         return -ENXIO;
>>
>>     vaddr = rmptable_start + rmptable_page_offset(phys);
>>     if (unlikely(vaddr > rmptable_end))
>>         return -ENXIO;
>>
>>     entry = (struct rmpentry *)vaddr;
>>
>>     /* Read a large RMP entry to get the correct page level used in
>> RMP entry. */
>>     vaddr = rmptable_start + rmptable_page_offset(phys & PMD_MASK);
>>     large_entry = (struct rmpentry *)vaddr;
>>     *level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));
>>
>>     return !!entry->assigned;
>> }
>>
>>
>> And then move dump_rmpentry() (or add a helper) in sev.c so that
>> "struct rmpentry"
>> can be declared in sev.c.
>>
>


* Re: [PATCH Part2 RFC v4 21/40] KVM: SVM: Add initial SEV-SNP support
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 21/40] KVM: SVM: Add initial SEV-SNP support Brijesh Singh
@ 2021-07-16 18:00   ` Sean Christopherson
  2021-07-16 18:46     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 18:00 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 411ed72f63af..abca2b9dee83 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -52,9 +52,14 @@ module_param_named(sev, sev_enabled, bool, 0444);
>  /* enable/disable SEV-ES support */
>  static bool sev_es_enabled = true;
>  module_param_named(sev_es, sev_es_enabled, bool, 0444);
> +
> +/* enable/disable SEV-SNP support */
> +static bool sev_snp_enabled = true;

Is it safe to incrementally introduce SNP support?  Or should the module param
be hidden until all support is in place?  E.g. what will happen when KVM allows
userspace to create SNP guests but doesn't yet have the RMP management added?

> +module_param_named(sev_snp, sev_snp_enabled, bool, 0444);
>  #else
>  #define sev_enabled false
>  #define sev_es_enabled false
> +#define sev_snp_enabled  false
>  #endif /* CONFIG_KVM_AMD_SEV */
>  
>  #define AP_RESET_HOLD_NONE		0
> @@ -1825,6 +1830,7 @@ void __init sev_hardware_setup(void)
>  {
>  #ifdef CONFIG_KVM_AMD_SEV
>  	unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
> +	bool sev_snp_supported = false;
>  	bool sev_es_supported = false;
>  	bool sev_supported = false;
>  
> @@ -1888,9 +1894,21 @@ void __init sev_hardware_setup(void)
>  	pr_info("SEV-ES supported: %u ASIDs\n", sev_es_asid_count);
>  	sev_es_supported = true;
>  
> +	/* SEV-SNP support requested? */
> +	if (!sev_snp_enabled)
> +		goto out;
> +
> +	/* Is SEV-SNP enabled? */
> +	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))

Random question, why use cpu_feature_enabled?  Did something change in cpufeatures
that prevents using boot_cpu_has() here?

> +		goto out;
> +
> +	pr_info("SEV-SNP supported: %u ASIDs\n", min_sev_asid - 1);

Use sev_es_asid_count instead of manually recomputing the same value; the latter
obfuscates the fact that ES and SNP share the same ASID pool.

Even better would be to report ES+SNP together, otherwise the user could easily
interpret ES and SNP as having separate ASID pools.  And IMO the gotos for SNP are
overkill, e.g.

	sev_es_supported = true;
	sev_snp_supported = sev_snp_enabled &&
			    cpu_feature_enabled(X86_FEATURE_SEV_SNP);

	pr_info("SEV-ES %ssupported: %u ASIDs\n",
		sev_snp_supported ? "and SEV-SNP " : "", sev_es_asid_count);


> +	sev_snp_supported = true;
> +
>  out:
>  	sev_enabled = sev_supported;
>  	sev_es_enabled = sev_es_supported;
> +	sev_snp_enabled = sev_snp_supported;
>  #endif
>  }
>  
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 1175edb02d33..b9ea99f8579e 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -58,6 +58,7 @@ enum {
>  struct kvm_sev_info {
>  	bool active;		/* SEV enabled guest */
>  	bool es_active;		/* SEV-ES enabled guest */
> +	bool snp_active;	/* SEV-SNP enabled guest */
>  	unsigned int asid;	/* ASID used for this guest */
>  	unsigned int handle;	/* SEV firmware handle */
>  	int fd;			/* SEV device fd */
> @@ -232,6 +233,17 @@ static inline bool sev_es_guest(struct kvm *kvm)
>  #endif
>  }
>  
> +static inline bool sev_snp_guest(struct kvm *kvm)
> +{
> +#ifdef CONFIG_KVM_AMD_SEV
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +
> +	return sev_es_guest(kvm) && sev->snp_active;

Can't this be reduced to:

	return to_kvm_svm(kvm)->sev_info.snp_active;

KVM should never set snp_active without also setting es_active.

Side topic, I think it would also be worthwhile to add to_sev (or maybe to_kvm_sev)
given the frequency of the "&to_kvm_svm(kvm)->sev_info" pattern.

> +#else
> +	return false;
> +#endif
> +}
> +
>  static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
>  {
>  	vmcb->control.clean = 0;
> -- 
> 2.17.1
> 


* Re: [PATCH Part2 RFC v4 28/40] KVM: X86: Introduce kvm_mmu_map_tdp_page() for use by SEV
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 28/40] KVM: X86: Introduce kvm_mmu_map_tdp_page() for use by SEV Brijesh Singh
@ 2021-07-16 18:15   ` Sean Christopherson
  0 siblings, 0 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 18:15 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> +int kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, int max_level)
> +{
> +	int r;
> +
> +	/*
> +	 * Loop on the page fault path to handle the case where an mmu_notifier
> +	 * invalidation triggers RET_PF_RETRY.  In the normal page fault path,
> +	 * KVM needs to resume the guest in case the invalidation changed any
> +	 * of the page fault properties, i.e. the gpa or error code.  For this
> +	 * path, the gpa and error code are fixed by the caller, and the caller
> +	 * expects failure if and only if the page fault can't be fixed.
> +	 */
> +	do {
> +		r = direct_page_fault(vcpu, gpa, error_code, false, max_level, true);
> +	} while (r == RET_PF_RETRY);
> +
> +	return r;

This implementation is completely broken, which in turn means that the page state
change code is not well tested.  The mess is likely masked to some extent because
the call is bookended by calls to kvm_mmu_get_tdp_walk(), i.e. most of the time
it's not called, and when it is called, the bugs are hidden by the second walk
detecting that the mapping was not installed.

  1. direct_page_fault() does not return a pfn, it returns the action that should
     be taken by the caller.
  2. The while() can be optimized to bail on no_slot PFNs.
  3. mmu_topup_memory_caches() needs to be called here, otherwise @pfn will be
     uninitialized.  The alternative would be to set @pfn when that fails in
     direct_page_fault().
  4. The 'int' return value is wrong, it needs to be kvm_pfn_t.

A correct implementation can be found in the TDX series, the easiest thing would
be to suck in those patches.

https://lore.kernel.org/kvm/ceffc7ef0746c6064330ef5c30bc0bb5994a1928.1625186503.git.isaku.yamahata@intel.com/
https://lore.kernel.org/kvm/a7e7602375e1f63b32eda19cb8011f11794ebe28.1625186503.git.isaku.yamahata@intel.com/

> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
> +
>  static void nonpaging_init_context(struct kvm_vcpu *vcpu,
>  				   struct kvm_mmu *context)
>  {
> -- 
> 2.17.1
> 


* Re: [PATCH Part2 RFC v4 21/40] KVM: SVM: Add initial SEV-SNP support
  2021-07-16 18:00   ` Sean Christopherson
@ 2021-07-16 18:46     ` Brijesh Singh
  2021-07-16 19:31       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 18:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 1:00 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 411ed72f63af..abca2b9dee83 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -52,9 +52,14 @@ module_param_named(sev, sev_enabled, bool, 0444);
>>  /* enable/disable SEV-ES support */
>>  static bool sev_es_enabled = true;
>>  module_param_named(sev_es, sev_es_enabled, bool, 0444);
>> +
>> +/* enable/disable SEV-SNP support */
>> +static bool sev_snp_enabled = true;
> Is it safe to incrementally introduce SNP support?  Or should the module param
> be hidden until all support is in place?  E.g. what will happen when KVM allows
> userspace to create SNP guests but doesn't yet have the RMP management added?

SNP support depends on RMP management. The patch ordering in this
series adds the RMP management first and then updates drivers to use
the RMP-specific APIs. If the RMP is not initialized because someone
did not pick up the commits in order, then SNP guest creation will
fail. This is mainly because the first thing guest creation does is
call SNP_INIT, and the SNP_INIT firmware command verifies that the RMP
is initialized before creating the guest context etc.


>> +module_param_named(sev_snp, sev_snp_enabled, bool, 0444);
>>  #else
>>  #define sev_enabled false
>>  #define sev_es_enabled false
>> +#define sev_snp_enabled  false
>>  #endif /* CONFIG_KVM_AMD_SEV */
>>  
>>  #define AP_RESET_HOLD_NONE		0
>> @@ -1825,6 +1830,7 @@ void __init sev_hardware_setup(void)
>>  {
>>  #ifdef CONFIG_KVM_AMD_SEV
>>  	unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
>> +	bool sev_snp_supported = false;
>>  	bool sev_es_supported = false;
>>  	bool sev_supported = false;
>>  
>> @@ -1888,9 +1894,21 @@ void __init sev_hardware_setup(void)
>>  	pr_info("SEV-ES supported: %u ASIDs\n", sev_es_asid_count);
>>  	sev_es_supported = true;
>>  
>> +	/* SEV-SNP support requested? */
>> +	if (!sev_snp_enabled)
>> +		goto out;
>> +
>> +	/* Is SEV-SNP enabled? */
>> +	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> Random question, why use cpu_feature_enabled?  Did something change in cpufeatures
> that prevents using boot_cpu_has() here?


During boot the kernel initializes the RMP table. If RMP table
initialization fails, then X86_FEATURE_SEV_SNP is cleared and
cpu_feature_enabled() returns false. The idea is that
cpu_feature_enabled() reports true only when the RMP table has been
successfully initialized and SYSCFG.SNP is set.


>> +		goto out;
>> +
>> +	pr_info("SEV-SNP supported: %u ASIDs\n", min_sev_asid - 1);
> Use sev_es_asid_count instead of manually recomputing the same; the latter
> obfuscates the fact that ES and SNP share the same ASID pool.
>
> Even better would be to report ES+SNP together, otherwise the user could easily
> interpret ES and SNP having separate ASID pools.  And IMO the gotos for SNP are
> overkill, e.g.
>
> 	sev_es_supported = true;
> 	sev_snp_supported = sev_snp_enabled &&
> 			    cpu_feature_enabled(X86_FEATURE_SEV_SNP);
>
> 	pr_info("SEV-ES %ssupported: %u ASIDs\n",
> 		sev_snp_supported ? "and SEV-SNP " : "", sev_es_asid_count);
>
Noted.


>> +	sev_snp_supported = true;
>> +
>>  out:
>>  	sev_enabled = sev_supported;
>>  	sev_es_enabled = sev_es_supported;
>> +	sev_snp_enabled = sev_snp_supported;
>>  #endif
>>  }
>>  
>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>> index 1175edb02d33..b9ea99f8579e 100644
>> --- a/arch/x86/kvm/svm/svm.h
>> +++ b/arch/x86/kvm/svm/svm.h
>> @@ -58,6 +58,7 @@ enum {
>>  struct kvm_sev_info {
>>  	bool active;		/* SEV enabled guest */
>>  	bool es_active;		/* SEV-ES enabled guest */
>> +	bool snp_active;	/* SEV-SNP enabled guest */
>>  	unsigned int asid;	/* ASID used for this guest */
>>  	unsigned int handle;	/* SEV firmware handle */
>>  	int fd;			/* SEV device fd */
>> @@ -232,6 +233,17 @@ static inline bool sev_es_guest(struct kvm *kvm)
>>  #endif
>>  }
>>  
>> +static inline bool sev_snp_guest(struct kvm *kvm)
>> +{
>> +#ifdef CONFIG_KVM_AMD_SEV
>> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> +
>> +	return sev_es_guest(kvm) && sev->snp_active;
> Can't this be reduced to:
>
> 	return to_kvm_svm(kvm)->sev_info.snp_active;
>
> KVM should never set snp_active without also setting es_active.


The approach here is similar to SEV/ES. IIRC, it was done mainly to
avoid adding dead code when CONFIG_KVM_AMD_SEV is disabled. Most of the
functions related to SEV/ES/SNP call
sev_guest()/sev_es_guest()/sev_snp_guest() on entry. Instead of
wrapping all those functions in #ifdef, we can #ifdef sev_snp_guest();
the compiler will see that the if() statement always evaluates to
false, so it will not include the remaining body of the function.


>
> Side topic, I think it would also be worthwhile to add to_sev (or maybe to_kvm_sev)
> given the frequency of the "&to_kvm_svm(kvm)->sev_info" pattern.
>
>> +#else
>> +	return false;
>> +#endif
>> +}
>> +
>>  static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
>>  {
>>  	vmcb->control.clean = 0;
>> -- 
>> 2.17.1
>>


* Re: [PATCH Part2 RFC v4 27/40] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 27/40] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP Brijesh Singh
@ 2021-07-16 19:19   ` Sean Christopherson
  2021-07-16 20:41     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 19:19 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> When running an SEV-SNP VM, the sPA used to index the RMP entry is
> obtained through the TDP translation (gva->gpa->spa). The TDP page
> level is checked against the page level programmed in the RMP entry.
> If the page level does not match, then it will cause a nested page
> fault with the RMP bit set to indicate the RMP violation.
> 
> To keep the TDP and RMP page levels in sync, the KVM fault handler
> kvm_handle_page_fault() will call get_tdp_max_page_level() to get
> the maximum allowed page level so that it can limit the TDP level.
> 
> In the case of SEV-SNP guest, the get_tdp_max_page_level() will consult
> the RMP table to compute the maximum allowed page level for a given
> GPA.
> 
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/mmu/mmu.c          |  6 ++++--
>  arch/x86/kvm/svm/sev.c          | 20 ++++++++++++++++++++
>  arch/x86/kvm/svm/svm.c          |  1 +
>  arch/x86/kvm/svm/svm.h          |  1 +
>  arch/x86/kvm/vmx/vmx.c          |  8 ++++++++
>  6 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 188110ab2c02..cd2e19e1d323 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1384,6 +1384,7 @@ struct kvm_x86_ops {
>  
>  	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
>  	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> +	int (*get_tdp_max_page_level)(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);

This is a poor name.  The constraint comes from the RMP, not TDP, and technically
speaking applies to all forms of paging.  It just happens to be relevant only to
TDP because NPT is required for SNP.  And KVM already incorporates the max TDP
level in kvm_configure_mmu().

Regarding the params, I'd much prefer to have this take "struct kvm *kvm" instead
of the vCPU.  It obviously doesn't change the functionality in any way, but I'd
like it to be clear to readers that the adjustment is tied to the VM, not the vCPU.

I think I'd also vote to drop @max_level and make this a pure constraint input as
opposed to an adjuster.

Another option would be to drop the kvm_x86_ops hooks entirely and call
snp_lookup_page_in_rmptable() directly from MMU code.  That would require tracking
that a VM is SNP-enabled in arch code, but I'm pretty sure info has already bled
into common KVM in one form or another.

>  };
>  
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0144c40d09c7..7991ffae7b31 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3781,11 +3781,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
>  				u32 error_code, bool prefault)
>  {
> +	int max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, PG_LEVEL_2M);

This is completely bogus, nonpaging_page_fault() is used iff TDP is disabled.

> +
>  	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
>  
>  	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
>  	return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code, prefault,
> -				 PG_LEVEL_2M, false);
> +				 max_level, false);
>  }
>  
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> @@ -3826,7 +3828,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  {
>  	int max_level;
>  
> -	for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
> +	for (max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, KVM_MAX_HUGEPAGE_LEVEL);

This is unnecessary.  The max mapping level is computed by factoring in all
constraints, of which there are many.  In this case, KVM is consulting the guest's
MTRR configuration to avoid creating a page that spans different memtypes (because
the guest MTRRs are effectively represented in the TDP PTE).  SNP's RMP constraints
have no relevance to the MTRR constraint, or any other constraint for that matter.

TL;DR: the RMP constraint belong in kvm_mmu_max_mapping_level() and nowhere else.
I would go so far as to argue it belong in host_pfn_mapping_level(), after the
call to lookup_address_in_mm().

>  	     max_level > PG_LEVEL_4K;
>  	     max_level--) {
>  		int page_num = KVM_PAGES_PER_HPAGE(max_level);
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 3f8824c9a5dc..fd2d00ad80b7 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3206,3 +3206,23 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
>  
>  	return pfn_to_page(pfn);
>  }
> +
> +int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
> +{
> +	struct rmpentry *e;
> +	kvm_pfn_t pfn;
> +	int level;
> +
> +	if (!sev_snp_guest(vcpu->kvm))

I can't tell if this check is correct.  Per the APM:

  When SEV-SNP is enabled globally, the processor places restrictions on all memory
  accesses based on the contents of the RMP, whether the accesses are performed by
  the hypervisor, a legacy guest VM, a non-SNP guest VM or an SNP-active guest VM.
  The processor may perform one or more of the following checks depending on the
  context of the access:

  ...

  Page-Size: Checks that the following conditions are met:
    - If the nested page table indicates a 2MB or 1GB page size, the Page_Size field
      of the RMP entry of the target page is 1.
    - If the nested page table indicates a 4KB page size, the Page_Size field of the
      RMP entry of the target page is 0.

The Page-Size bullet does not have any qualifiers about the NPT checks applying
only to SNP guests.  The Hypervisor-Owned bullet implies that unassigned pages
do not need to have identical sizes, but it's not clear whether or not so called
"Hypervisor-Owned" pages override the nested page tables.

Table 15.36 is similarly vague:

  Assigned Flag indicating that the system physical page is assigned to a guest
  or to the AMD-SP.
    0: Owned by the hypervisor
    1: Owned by a guest or the AMD-SP

My assumption is that all of the "guest owned" stuff really means "SNP guest owned",
e.g. section 15.36.5 says "The hypervisor manages the SEV-SNP security attributes of
pages assigned to SNP-active guests by altering the RMP entries of those pages", but
that's not at all clear throughout most of the RMP documentation.

Regardless of the actual behavior, the APM needs serious cleanup on the aforementioned
sections.  E.g. as written, the "processor may perform one or more of the following
checks depending on the context of the access" verbiage basically gives the CPU carte
blanche to do whatever the hell it wants.

> +		return max_level;
> +
> +	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> +	if (is_error_noslot_pfn(pfn))
> +		return max_level;
> +
> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);

Assuming pfn is backed by struct page is broken, at least given the existing
call sites.  It might hold true that only struct page pfns are covered by the
RMP, but assuming pfn_to_page() will return a valid pointer here is completely
wrong.  Unless I'm missing something, taking a struct page anywhere in the RMP
helpers is at best sketchy and at worst broken in and of itself.  IMO, the RMP
code should always take a raw PFN and do the necessary checks before assuming
anything about the PFN.  At a glance, the only case that needs additional checks
is the page_to_virt() logic in rmpupdate().

> +	if (unlikely(!e))
> +		return max_level;
> +
> +	return min_t(uint32_t, level, max_level);

As the APM is currently worded, this is wrong, and the whole "tdp_max_page_level"
name is wrong.  As noted above, the Page-Size bullet point states that 2mb/1gb
pages in the NPT _must_ have RMP.page_size=1, and 4kb pages in the NPT _must_
have RMP.page_size=0.  That means that the RMP adjustment is not a constraint,
it's an exact requirement.  Specifically, if the RMP is a 2mb page then KVM must
install a 2mb (or 1gb) page.  Maybe it works because KVM will PSMASH the RMP
after installing a bogus 4kb NPT and taking an RMP violation, but that's a very
convoluted and sub-optimal solution.

The other obvious bug is that this doesn't play nice with 1gb pages.  A 2mb RMP
entry should _not_ force KVM to use a 2mb page instead of a 1gb page.

> +}

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 21/40] KVM: SVM: Add initial SEV-SNP support
  2021-07-16 18:46     ` Brijesh Singh
@ 2021-07-16 19:31       ` Sean Christopherson
  2021-07-16 21:03         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 19:31 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Fri, Jul 16, 2021, Brijesh Singh wrote:
> 
> On 7/16/21 1:00 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> >> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> >> index 411ed72f63af..abca2b9dee83 100644
> >> --- a/arch/x86/kvm/svm/sev.c
> >> +++ b/arch/x86/kvm/svm/sev.c
> >> @@ -52,9 +52,14 @@ module_param_named(sev, sev_enabled, bool, 0444);
> >>  /* enable/disable SEV-ES support */
> >>  static bool sev_es_enabled = true;
> >>  module_param_named(sev_es, sev_es_enabled, bool, 0444);
> >> +
> >> +/* enable/disable SEV-SNP support */
> >> +static bool sev_snp_enabled = true;
> > Is it safe to incrementally introduce SNP support?  Or should the module param
> > be hidden until all support is in place?  E.g. what will happen when KVM allows
> > userspace to create SNP guests but doesn't yet have the RMP management added?
> 
> The SNP support depends on the RMP management. At least the patch
> ordering in this series adds the RMP management first then updates
> drivers to use the RMP specific APIs.

Yep, got that.

> If the RMP is not initialized because the commits weren't picked in
> order, then SNP guest creation will fail.

That's not what I was asking.  My question is if KVM will break/fail if someone
runs a KVM build with SNP enabled halfway through the series.  E.g. if I make a
KVM build at patch 22, "KVM: SVM: Add KVM_SNP_INIT command", what will happen if
I attempt to launch an SNP guest?  Obviously it won't fully succeed, but will KVM
fail gracefully and do all the proper cleanup?  Repeat the question for all patches
between this one and the final patch of the series.

SNP simply not working is ok, but if KVM explodes or does weird things without
"full" SNP support, then at minimum the module param should be off by default
until it's safe to enable.  E.g. for the TDP MMU, I believe the approach was to
put all the machinery in place but not actually let userspace flip on the module
param until the full implementation was ready.  Bisecting and testing the
individual commits is a bit painful because it requires modifying KVM code, but
on the plus side unrelated bisects won't stumble into a half-baked state.

> >> +module_param_named(sev_snp, sev_snp_enabled, bool, 0444);
> >>  #else
> >>  #define sev_enabled false
> >>  #define sev_es_enabled false
> >> +#define sev_snp_enabled  false
> >>  #endif /* CONFIG_KVM_AMD_SEV */
> >>  
> >>  #define AP_RESET_HOLD_NONE		0
> >> @@ -1825,6 +1830,7 @@ void __init sev_hardware_setup(void)
> >>  {
> >>  #ifdef CONFIG_KVM_AMD_SEV
> >>  	unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
> >> +	bool sev_snp_supported = false;
> >>  	bool sev_es_supported = false;
> >>  	bool sev_supported = false;
> >>  
> >> @@ -1888,9 +1894,21 @@ void __init sev_hardware_setup(void)
> >>  	pr_info("SEV-ES supported: %u ASIDs\n", sev_es_asid_count);
> >>  	sev_es_supported = true;
> >>  
> >> +	/* SEV-SNP support requested? */
> >> +	if (!sev_snp_enabled)
> >> +		goto out;
> >> +
> >> +	/* Is SEV-SNP enabled? */
> >> +	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> > Random question, why use cpu_feature_enabled?  Did something change in cpufeatures
> > that prevents using boot_cpu_has() here?
> 
> 
> During boot the kernel initializes the RMP table. If RMP table
> initialization fails, then X86_FEATURE_SEV_SNP is cleared, and
> cpu_feature_enabled() returns false. The idea is that
> cpu_feature_enabled() reports true only when the RMP table is
> successfully initialized and SYSCFG.SNP is set.

Ya, got that, but again not what I was asking :-)  Why use cpu_feature_enabled()
instead of boot_cpu_has()?  As a random developer, I would fully expect that
boot_cpu_has(X86_FEATURE_SEV_SNP) is true iff SNP is fully enabled by the kernel.

> >> +		goto out;
> >> +
> >> +	pr_info("SEV-SNP supported: %u ASIDs\n", min_sev_asid - 1);
> > Use sev_es_asid_count instead of manually recomputing the same; the latter
> > obfuscates the fact that ES and SNP share the same ASID pool.
> >
> > Even better would be to report ES+SNP together, otherwise the user could easily
> > interpret ES and SNP having separate ASID pools.  And IMO the gotos for SNP are
> > overkill, e.g.
> >
> > 	sev_es_supported = true;
> > 	sev_snp_supported = sev_snp_enabled &&
> > 			    cpu_feature_enabled(X86_FEATURE_SEV_SNP);
> >
> > 	pr_info("SEV-ES %ssupported: %u ASIDs\n",
> > 		sev_snp_supported ? "and SEV-SNP " : "", sev_es_asid_count);
> >
> >> +static inline bool sev_snp_guest(struct kvm *kvm)
> >> +{
> >> +#ifdef CONFIG_KVM_AMD_SEV
> >> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >> +
> >> +	return sev_es_guest(kvm) && sev->snp_active;
> > Can't this be reduced to:
> >
> > 	return to_kvm_svm(kvm)->sev_info.snp_active;
> >
> > KVM should never set snp_active without also setting es_active.
> 
> 
> The approach here is similar to SEV/ES. IIRC, it was done mainly to
> avoid adding dead code when CONFIG_KVM_AMD_SEV is disabled.

But this is already in an #ifdef, checking sev_es_guest() is pointless.


* Re: [PATCH Part2 RFC v4 22/40] KVM: SVM: Add KVM_SNP_INIT command
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 22/40] KVM: SVM: Add KVM_SNP_INIT command Brijesh Singh
@ 2021-07-16 19:33   ` Sean Christopherson
  2021-07-16 21:25     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 19:33 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 3fd9a7e9d90c..989a64aa1ae5 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1678,6 +1678,9 @@ enum sev_cmd_id {
>  	/* Guest Migration Extension */
>  	KVM_SEV_SEND_CANCEL,
>  
> +	/* SNP specific commands */
> +	KVM_SEV_SNP_INIT = 256,

Is there any meaning behind '256'?  If not, why skip a big chunk?  I wouldn't be
concerned if it weren't for KVM_SEV_NR_MAX, whose existence arguably implies that
0-KVM_SEV_NR_MAX-1 are all valid SEV commands.

> +
>  	KVM_SEV_NR_MAX,
>  };


* Re: [PATCH Part2 RFC v4 23/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 23/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command Brijesh Singh
  2021-07-12 18:45   ` Peter Gonda
@ 2021-07-16 19:43   ` Sean Christopherson
  2021-07-16 21:42     ` Brijesh Singh
  1 sibling, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 19:43 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> @@ -1527,6 +1530,100 @@ static int sev_receive_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  	return sev_issue_cmd(kvm, SEV_CMD_RECEIVE_FINISH, &data, &argp->error);
>  }
>  
> +static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> +	struct sev_data_snp_gctx_create data = {};
> +	void *context;
> +	int rc;
> +
> +	/* Allocate memory for context page */

Eh, I'd drop this comment.  It's quite obvious that a page is being allocated
and that it's being assigned to the context.

> +	context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
> +	if (!context)
> +		return NULL;
> +
> +	data.gctx_paddr = __psp_pa(context);
> +	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
> +	if (rc) {
> +		snp_free_firmware_page(context);
> +		return NULL;
> +	}
> +
> +	return context;
> +}
> +
> +static int snp_bind_asid(struct kvm *kvm, int *error)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	struct sev_data_snp_activate data = {};
> +	int asid = sev_get_asid(kvm);
> +	int ret, retry_count = 0;
> +
> +	/* Activate ASID on the given context */
> +	data.gctx_paddr = __psp_pa(sev->snp_context);
> +	data.asid   = asid;
> +again:
> +	ret = sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
> +
> +	/* Check if the DF_FLUSH is required, and try again */

Please provide more info on why this may be necessary.  I can see from the code
that it does a flush and retries, but I have no idea why a flush would be required
in the first place, e.g. why can't KVM guarantee that everything is in the proper
state before attempting to bind an ASID?

> +	if (ret && (*error == SEV_RET_DFFLUSH_REQUIRED) && (!retry_count)) {
> +		/* Guard DEACTIVATE against WBINVD/DF_FLUSH used in ASID recycling */
> +		down_read(&sev_deactivate_lock);
> +		wbinvd_on_all_cpus();
> +		ret = snp_guest_df_flush(error);
> +		up_read(&sev_deactivate_lock);
> +
> +		if (ret)
> +			return ret;
> +
> +		/* only one retry */

Again, please explain why.  Is this arbitrary?  Is retrying more than once
guaranteed to be useless?

> +		retry_count = 1;
> +
> +		goto again;
> +	}
> +
> +	return ret;
> +}

...

>  void sev_vm_destroy(struct kvm *kvm)
>  {
>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> @@ -1847,7 +1969,15 @@ void sev_vm_destroy(struct kvm *kvm)
>  
>  	mutex_unlock(&kvm->lock);
>  
> -	sev_unbind_asid(kvm, sev->handle);
> +	if (sev_snp_guest(kvm)) {
> +		if (snp_decommission_context(kvm)) {
> +			pr_err("Failed to free SNP guest context, leaking asid!\n");

I agree with Peter that this likely warrants a WARN.  If a WARN isn't justified,
e.g. this can happen without a KVM/CPU bug, then there absolutely needs to be a
massive comment explaining why we have code that results in memory leaks.

> +			return;
> +		}
> +	} else {
> +		sev_unbind_asid(kvm, sev->handle);
> +	}
> +
>  	sev_asid_free(sev);
>  }
>  
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index b9ea99f8579e..bc5582b44356 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -67,6 +67,7 @@ struct kvm_sev_info {
>  	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
>  	struct kvm *enc_context_owner; /* Owner of copied encryption context */
>  	struct misc_cg *misc_cg; /* For misc cgroup accounting */
> +	void *snp_context;      /* SNP guest context page */
>  };
>  
>  struct kvm_svm {
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 989a64aa1ae5..dbd05179d8fa 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1680,6 +1680,7 @@ enum sev_cmd_id {
>  
>  	/* SNP specific commands */
>  	KVM_SEV_SNP_INIT = 256,
> +	KVM_SEV_SNP_LAUNCH_START,
>  
>  	KVM_SEV_NR_MAX,
>  };
> @@ -1781,6 +1782,14 @@ struct kvm_snp_init {
>  	__u64 flags;
>  };
>  
> +struct kvm_sev_snp_launch_start {
> +	__u64 policy;
> +	__u64 ma_uaddr;
> +	__u8 ma_en;
> +	__u8 imi_en;
> +	__u8 gosvw[16];

Hmm, I'd prefer to pad this out to be 8-byte sized.

> +};
> +
>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
>  #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
>  #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
> -- 
> 2.17.1
> 


* Re: [PATCH Part2 RFC v4 24/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 24/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command Brijesh Singh
@ 2021-07-16 20:01   ` Sean Christopherson
  2021-07-16 22:00     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 20:01 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> +static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> +{
> +	unsigned long npages, vaddr, vaddr_end, i, next_vaddr;
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	struct sev_data_snp_launch_update data = {};
> +	struct kvm_sev_snp_launch_update params;
> +	int *error = &argp->error;
> +	struct kvm_vcpu *vcpu;
> +	struct page **inpages;
> +	struct rmpupdate e;
> +	int ret;
> +
> +	if (!sev_snp_guest(kvm))
> +		return -ENOTTY;
> +
> +	if (!sev->snp_context)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
> +		return -EFAULT;
> +
> +	data.gctx_paddr = __psp_pa(sev->snp_context);
> +
> +	/* Lock the user memory. */
> +	inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);

params.uaddr needs to be checked for validity, e.g. proper alignment.
sev_pin_memory() does some checks, but not all checks.

> +	if (!inpages)
> +		return -ENOMEM;
> +
> +	vcpu = kvm_get_vcpu(kvm, 0);
> +	vaddr = params.uaddr;
> +	vaddr_end = vaddr + params.len;
> +
> +	for (i = 0; vaddr < vaddr_end; vaddr = next_vaddr, i++) {
> +		unsigned long psize, pmask;
> +		int level = PG_LEVEL_4K;
> +		gpa_t gpa;
> +
> +		if (!hva_to_gpa(kvm, vaddr, &gpa)) {

I'm having a bit of deja vu...  This flow needs to hold kvm->srcu to do a memslot
lookup.

That said, IMO having KVM do the hva->gpa is not a great ABI.  The memslots are
completely arbitrary (from a certain point of view) and have no impact on the
validity of the memory pinning or PSP command.  E.g. a memslot update while this
code is in-flight would be all kinds of weird.

In other words, make userspace provide both the hva (because it's sadly needed
to pin memory) as well as the target gpa.  That prevents KVM from having to deal
with memslot lookups and also means that userspace can issue the command before
configuring the memslots (though I've no idea if that's actually feasible for
any userspace VMM).

> +			ret = -EINVAL;
> +			goto e_unpin;
> +		}
> +
> +		psize = page_level_size(level);
> +		pmask = page_level_mask(level);

Is there any hope of this path supporting 2mb/1gb pages in the not-too-distant
future?  If not, then I vote to do away with the indirection and just hardcode
4kb sizes in the flow.  I.e. if this works on 4kb chunks, make that obvious.

> +		gpa = gpa & pmask;
> +
> +		/* Transition the page state to pre-guest */
> +		memset(&e, 0, sizeof(e));
> +		e.assigned = 1;
> +		e.gpa = gpa;
> +		e.asid = sev_get_asid(kvm);
> +		e.immutable = true;
> +		e.pagesize = X86_TO_RMP_PG_LEVEL(level);
> +		ret = rmpupdate(inpages[i], &e);

What happens if userspace pulls a stupid and assigns the same page to multiple
SNP guests?  Does RMPUPDATE fail?  Can one RMPUPDATE overwrite another?

> +		if (ret) {
> +			ret = -EFAULT;
> +			goto e_unpin;
> +		}
> +
> +		data.address = __sme_page_pa(inpages[i]);
> +		data.page_size = e.pagesize;
> +		data.page_type = params.page_type;
> +		data.vmpl3_perms = params.vmpl3_perms;
> +		data.vmpl2_perms = params.vmpl2_perms;
> +		data.vmpl1_perms = params.vmpl1_perms;
> +		ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, &data, error);
> +		if (ret) {
> +			snp_page_reclaim(inpages[i], e.pagesize);
> +			goto e_unpin;
> +		}
> +
> +		next_vaddr = (vaddr & pmask) + psize;
> +	}
> +
> +e_unpin:
> +	/* Content of memory is updated, mark pages dirty */
> +	memset(&e, 0, sizeof(e));
> +	for (i = 0; i < npages; i++) {
> +		set_page_dirty_lock(inpages[i]);
> +		mark_page_accessed(inpages[i]);
> +
> +		/*
> +		 * If its an error, then update RMP entry to change page ownership
> +		 * to the hypervisor.
> +		 */
> +		if (ret)
> +			rmpupdate(inpages[i], &e);

This feels wrong since it's purging _all_ RMP entries, not just those that were
successfully modified.  And maybe add an RMP "reset" helper, e.g. why is zeroing
the RMP entry the correct behavior?

> +	}
> +
> +	/* Unlock the user pages */
> +	sev_unpin_memory(kvm, inpages, npages);
> +
> +	return ret;
> +}
> +



* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates Brijesh Singh
@ 2021-07-16 20:09   ` Sean Christopherson
  2021-07-16 22:16     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 20:09 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> The guest pages of the SEV-SNP VM may be added as private pages in the
> RMP table (assigned bit is set). The guest private pages must be
> transitioned to the hypervisor state before they are freed.

Isn't this patch needed much earlier in the series, i.e. when the first RMPUPDATE
usage goes in?

> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/kvm/svm/sev.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 1f0635ac9ff9..4468995dd209 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -1940,6 +1940,45 @@ find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
>  static void __unregister_enc_region_locked(struct kvm *kvm,
>  					   struct enc_region *region)
>  {
> +	struct rmpupdate val = {};
> +	unsigned long i, pfn;
> +	struct rmpentry *e;
> +	int level, rc;
> +
> +	/*
> +	 * The guest memory pages are assigned in the RMP table. Unassign it
> +	 * before releasing the memory.
> +	 */
> +	if (sev_snp_guest(kvm)) {
> +		for (i = 0; i < region->npages; i++) {
> +			pfn = page_to_pfn(region->pages[i]);
> +
> +			if (need_resched())
> +				schedule();

This can simply be "cond_resched();"

> +
> +			e = snp_lookup_page_in_rmptable(region->pages[i], &level);
> +			if (unlikely(!e))
> +				continue;
> +
> +			/* If its not a guest assigned page then skip it. */
> +			if (!rmpentry_assigned(e))
> +				continue;
> +
> +			/* Is the page part of a 2MB RMP entry? */
> +			if (level == PG_LEVEL_2M) {
> +				val.pagesize = RMP_PG_SIZE_2M;
> +				pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
> +			} else {
> +				val.pagesize = RMP_PG_SIZE_4K;

This raises yet more questions (for me) as to the interaction between the Page-Size
and Hypervisor-Owned flags in the RMP.  It also raises questions on the correctness
of zeroing the RMP entry if KVM_SEV_SNP_LAUNCH_UPDATE fails (in the previous patch).

> +			}
> +
> +			/* Transition the page to hypervisor owned. */
> +			rc = rmpupdate(pfn_to_page(pfn), &val);
> +			if (rc)
> +				pr_err("Failed to release pfn 0x%lx ret=%d\n", pfn, rc);

This is not robust, e.g. KVM will unpin the memory and release it back to the
kernel with a stale RMP entry.  Shouldn't this be a WARN+leak situation?

> +		}
> +	}
> +
>  	sev_unpin_memory(kvm, region->pages, region->npages);
>  	list_del(&region->list);
>  	kfree(region);
> -- 
> 2.17.1
> 


* Re: [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command Brijesh Singh
@ 2021-07-16 20:18   ` Sean Christopherson
  2021-07-16 22:48     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 20:18 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> +        struct kvm_sev_snp_launch_finish {
> +                __u64 id_block_uaddr;
> +                __u64 id_auth_uaddr;
> +                __u8 id_block_en;
> +                __u8 auth_key_en;
> +                __u8 host_data[32];

Pad this one too?

> +        };
> +
> +
> +See SEV-SNP specification for further details on launch finish input parameters.

...

> +	data->gctx_paddr = __psp_pa(sev->snp_context);
> +	ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);

Shouldn't KVM unwind everything it did if LAUNCH_FINISH fails?  And if that's
not possible, take steps to make the VM unusable?

> +
> +	kfree(id_auth);
> +
> +e_free_id_block:
> +	kfree(id_block);
> +
> +e_free:
> +	kfree(data);
> +
> +	return ret;
> +}
> +

...

> @@ -2346,8 +2454,25 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
>  
>  	if (vcpu->arch.guest_state_protected)
>  		sev_flush_guest_memory(svm, svm->vmsa, PAGE_SIZE);
> +
> +	/*
> +	 * If its an SNP guest, then VMSA was added in the RMP entry as a guest owned page.
> +	 * Transition the page to hyperivosr state before releasing it back to the system.

"hyperivosr" typo.  And please wrap at 80 chars.

> +	 */
> +	if (sev_snp_guest(vcpu->kvm)) {
> +		struct rmpupdate e = {};
> +		int rc;
> +
> +		rc = rmpupdate(virt_to_page(svm->vmsa), &e);

So why does this not need to go through snp_page_reclaim()?

> +		if (rc) {
> +			pr_err("Failed to release SNP guest VMSA page (rc %d), leaking it\n", rc);

Seems like a WARN would be simpler.  But the more I see the rmpupdate(..., {0})
pattern, the more I believe that nuking an RMP entry needs a dedicated helper.

> +			goto skip_vmsa_free;



* Re: [PATCH Part2 RFC v4 30/40] KVM: X86: Define new RMP check related #NPF error bits
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 30/40] KVM: X86: Define new RMP check related #NPF error bits Brijesh Singh
@ 2021-07-16 20:22   ` Sean Christopherson
  2021-07-17  0:34     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 20:22 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

Nit, please use "KVM: x86:" for the shortlogs.  And ubernit, the "new" part is
redundant and/or misleading, e.g. implies that more error code bits are being
added to existing SNP/RMP checks.  E.g.

  KVM: x86: Define RMP page fault error code bits for #NPF

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> When SEV-SNP is enabled globally, the hardware places restrictions on all
> memory accesses based on the RMP entry, whether the hyperviso or a VM,

Another typo.

> performs the accesses. When hardware encounters an RMP access violation
> during a guest access, it will cause a #VMEXIT(NPF).


* Re: [PATCH Part2 RFC v4 31/40] KVM: X86: update page-fault trace to log the 64-bit error code
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 31/40] KVM: X86: update page-fault trace to log the 64-bit error code Brijesh Singh
@ 2021-07-16 20:25   ` Sean Christopherson
  2021-07-17  0:35     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 20:25 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> The page-fault error code is a 64-bit value, but the trace prints only

It's worth clarifying that #NPF has a 64-bit error code, and so KVM also passes
around a 64-bit PFEC.  E.g. the above statement is wrong for legacy #PF.

> the lower 32-bits. Some of the SEV-SNP RMP fault error codes are
> available in the upper 32-bits.

Can you send this separately with Cc: stable@?  And I guess tweak the changelog
to replace "SEV-SNP RMP" with a reference to e.g. PFERR_GUEST_FINAL_MASK.  KVM
already has error codes that can set the upper bits.


* Re: [PATCH Part2 RFC v4 27/40] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP
  2021-07-16 19:19   ` Sean Christopherson
@ 2021-07-16 20:41     ` Brijesh Singh
  2021-07-20 19:38       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 20:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 2:19 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> When running an SEV-SNP VM, the sPA used to index the RMP entry is
>> obtained through the TDP translation (gva->gpa->spa). The TDP page
>> level is checked against the page level programmed in the RMP entry.
>> If the page level does not match, then it will cause a nested page
>> fault with the RMP bit set to indicate the RMP violation.
>>
>> To keep the TDP and RMP page levels in sync, the KVM fault handler
>> kvm_handle_page_fault() will call get_tdp_max_page_level() to get
>> the maximum allowed page level so that it can limit the TDP level.
>>
>> In the case of SEV-SNP guest, the get_tdp_max_page_level() will consult
>> the RMP table to compute the maximum allowed page level for a given
>> GPA.
>>
>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |  1 +
>>  arch/x86/kvm/mmu/mmu.c          |  6 ++++--
>>  arch/x86/kvm/svm/sev.c          | 20 ++++++++++++++++++++
>>  arch/x86/kvm/svm/svm.c          |  1 +
>>  arch/x86/kvm/svm/svm.h          |  1 +
>>  arch/x86/kvm/vmx/vmx.c          |  8 ++++++++
>>  6 files changed, 35 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 188110ab2c02..cd2e19e1d323 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1384,6 +1384,7 @@ struct kvm_x86_ops {
>>  
>>  	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
>>  	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
>> +	int (*get_tdp_max_page_level)(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
> This is a poor name.  The constraint comes from the RMP, not TDP, and technically
> speaking applies to all forms of paging.  It just happens to be relevant only to
> TDP because NPT is required for SNP.  And KVM already incorporates the max TDP
> level in kvm_configure_mmu().

Noted.


>
> Regarding the params, I'd much prefer to have this take "struct kvm *kvm" instead
> of the vCPU.  It obviously doesn't change the functionality in any way, but I'd
> like it to be clear to readers that the adjustment is tied to the VM, not the vCPU.

Noted.


> I think I'd also vote to drop @max_level and make this a pure constraint input as
> opposed to an adjuster.


Noted.

> Another option would be to drop the kvm_x86_ops hooks entirely and call
> snp_lookup_page_in_rmptable() directly from MMU code.  That would require tracking
> that a VM is SNP-enabled in arch code, but I'm pretty sure info has already bled
> into common KVM in one form or another.

I would prefer this as it eliminates some of the other unnecessary call
sites. Unfortunately, there is currently no generic way to know whether a
guest is an SEV guest outside of svm/*. So far there was no need for it,
but with SNP having such information would help. Should we extend
'struct kvm' to include a new field that can be used to determine the
guest type? Something like

enum kvm_guest_enc_type {
	GUEST_TYPE_SEV		= BIT(0),
	GUEST_TYPE_SEV_ES	= BIT(1),
	GUEST_TYPE_SEV_SNP	= BIT(2),
};

struct kvm {
	...
	u64 enc_type;
};

bool kvm_guest_enc_type(struct kvm *kvm, enum kvm_guest_enc_type type)
{
	return !!(kvm->enc_type & type);
}

The mmu.c code can then call kvm_guest_enc_type() to check whether it is an
SNP guest and use the SNP lookup directly to determine the page size.


>
>>  };
>>  
>>  struct kvm_x86_nested_ops {
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 0144c40d09c7..7991ffae7b31 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3781,11 +3781,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>  static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
>>  				u32 error_code, bool prefault)
>>  {
>> +	int max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, PG_LEVEL_2M);
> This is completely bogus, nonpaging_page_fault() is used iff TDP is disabled.

Ah, I totally missed it.


>> +
>>  	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
>>  
>>  	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
>>  	return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code, prefault,
>> -				 PG_LEVEL_2M, false);
>> +				 max_level, false);
>>  }
>>  
>>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>> @@ -3826,7 +3828,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>  {
>>  	int max_level;
>>  
>> -	for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
>> +	for (max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, KVM_MAX_HUGEPAGE_LEVEL);
> This is unnecessary.  The max mapping level is computed by factoring in all
> constraints, of which there are many.  In this case, KVM is consulting the guest's
> MTRR configuration to avoid creating a page that spans different memtypes (because
> the guest MTRRs are effectively represented in the TDP PTE).  SNP's RMP constraints
> have no relevance to the MTRR constraint, or any other constraint for that matter.
>
> TL;DR: the RMP constraint belong in kvm_mmu_max_mapping_level() and nowhere else.
> I would go so far as to argue it belong in host_pfn_mapping_level(), after the
> call to lookup_address_in_mm().


I agree with you. One of the cases I was trying to cover is a pre-fault
where, while generating the prefault, we can tell the handler our max
page level. For example: "the guest issues a page state transition
request to add the page as 2MB". We execute the steps below to fulfill
the request:

* Create a prefault with max_level set to 2MB.

* The fault handler may find that it cannot use the large page in the
NPT, and it may fall back to 4KB.

* Read the page size from the NPT; use the NPT page size in the RMP table
instead of the guest-requested page size.

This keeps the NPT and RMP in sync after the page state change is
completed and avoids any extra RMP faults due to a size mismatch.
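The steps above boil down to taking the minimum of the requested level and whatever the NPT could actually map, then programming the RMP with that result. A small user-space model of that reduction (the PG_LEVEL_* values mirror KVM's constants; the function names are hypothetical):

```c
#include <assert.h>

/* Page levels as KVM numbers them: 1 = 4KB, 2 = 2MB, 3 = 1GB. */
enum pg_level { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2, PG_LEVEL_1G = 3 };

/*
 * Model of the NPT fault handler: it may back the mapping with a smaller
 * page than requested (e.g. the host backing page is fragmented).
 */
static enum pg_level npt_map(enum pg_level requested, enum pg_level host_backing)
{
	return requested < host_backing ? requested : host_backing;
}

/*
 * The RMP entry is then programmed with whatever size the NPT actually
 * used, keeping the two structures in sync and avoiding size-mismatch
 * RMP faults later.
 */
static enum pg_level rmp_size_for_psc(enum pg_level guest_requested,
				      enum pg_level host_backing)
{
	return npt_map(guest_requested, host_backing);
}
```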


>>  	     max_level > PG_LEVEL_4K;
>>  	     max_level--) {
>>  		int page_num = KVM_PAGES_PER_HPAGE(max_level);
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 3f8824c9a5dc..fd2d00ad80b7 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -3206,3 +3206,23 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
>>  
>>  	return pfn_to_page(pfn);
>>  }
>> +
>> +int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
>> +{
>> +	struct rmpentry *e;
>> +	kvm_pfn_t pfn;
>> +	int level;
>> +
>> +	if (!sev_snp_guest(vcpu->kvm))
> I can't tell if this check is correct.  Per the APM:
>
>   When SEV-SNP is enabled globally, the processor places restrictions on all memory
>   accesses based on the contents of the RMP, whether the accesses are performed by
>   the hypervisor, a legacy guest VM, a non-SNP guest VM or an SNP-active guest VM.
>   The processor may perform one or more of the following checks depending on the
>   context of the access:
>
>   ...
>
>   Page-Size: Checks that the following conditions are met:
>     - If the nested page table indicates a 2MB or 1GB page size, the Page_Size field
>       of the RMP entry of the target page is 1.
>     - If the nested page table indicates a 4KB page size, the Page_Size field of the
>       RMP entry of the target page is 0.
>
> The Page-Size bullet does not have any qualifiers about the NPT checks applying
> only to SNP guests.  The Hypervisor-Owned bullet implies that unassigned pages
> do not need to have identical sizes, but it's not clear whether or not so called
> "Hypervisor-Owned" pages override the nested page tables.
>
> Table 15.36 is similarly vague:
>
>   Assigned Flag indicating that the system physical page is assigned to a guest
>   or to the AMD-SP.
>     0: Owned by the hypervisor
>     1: Owned by a guest or the AMD-SP
>
> My assumption is that all of the "guest owned" stuff really means "SNP guest owned",
> e.g. section 15.36.5 says "The hypervisor manages the SEV-SNP security attributes of
> pages assigned to SNP-active guests by altering the RMP entries of those pages", but
> that's not at all clear throughout most of the RMP documentation.
>
> Regardless of the actual behavior, the APM needs serious cleanup on the aforementioned
> sections.  E.g. as written, the "processor may perform one or more of the following
> checks depending on the context of the access" verbiage basically gives the CPU carte
> blanche to do whatever the hell it wants.

I'll raise your concern with the documentation folks so that they clarify
that the page-size check is applicable to SNP-active guests only.


>> +		return max_level;
>> +
>> +	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
>> +	if (is_error_noslot_pfn(pfn))
>> +		return max_level;
>> +
>> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
> Assuming pfn is backed by struct page is broken, at least given the existing
> call sites.  It might hold true that only struct page pfns are covered by the
> RMP, but assuming pfn_to_page() will return a valid pointer here is completely
> wrong.  Unless I'm missing something, taking a struct page anywhere in the RMP
> helpers is at best sketchy and at worst broken in and of itself.  IMO, the RMP
> code should always take a raw PFN and do the necessary checks before assuming
> anything about the PFN.  At a glance, the only case that needs additional checks
> is the page_to_virt() logic in rmpupdate().

I agree. Dave also gave similar feedback. In the next version of the
patch I will stick to using the PFN, and the SNP lookup will do the
required checking.
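For illustration, a user-space sketch of a PFN-first lookup that validates the PFN before assuming anything about its backing (pfn_valid() here is a stand-in for the kernel helper, and the table layout is entirely hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PFN 0x1000	/* toy limit standing in for the real memory map */

/* Stand-in for the kernel's pfn_valid(): is this PFN backed at all? */
static bool pfn_valid(uint64_t pfn)
{
	return pfn < MAX_PFN;
}

struct rmpentry {
	int level;
	int assigned;
};

static struct rmpentry rmp_table[MAX_PFN];

/*
 * Take a raw PFN and do the validity check inside the helper, instead
 * of letting callers convert pfn -> struct page first and hope it is
 * valid.  Returns NULL for PFNs outside the covered range.
 */
static struct rmpentry *snp_lookup_pfn_in_rmptable(uint64_t pfn, int *level)
{
	if (!pfn_valid(pfn))
		return NULL;

	*level = rmp_table[pfn].level;
	return &rmp_table[pfn];
}
```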


>> +	if (unlikely(!e))
>> +		return max_level;
>> +
>> +	return min_t(uint32_t, level, max_level);
> As the APM is currently worded, this is wrong, and the whole "tdp_max_page_level"
> name is wrong.  As noted above, the Page-Size bullet points states that 2mb/1gb
> pages in the NPT _must_ have RMP.page_size=1, and 4kb pages in the NPT _must_
> have RMP.page_size=0.  That means that the RMP adjustment is not a constraint,
> it's an exact requirement.  Specifically, if the RMP is a 2mb page then KVM must
> install a 2mb (or 1gb) page.  Maybe it works because KVM will PSMASH the RMP
> after installing a bogus 4kb NPT and taking an RMP violation, but that's a very
> convoluted and sub-optimal solution.

This is why I was passing the preferred max_level to the pre-fault
handler and then querying the NPT level afterwards; the NPT level is used
in the RMP to make sure they stay in sync.

There is yet another reason why we can't avoid the PSMASH even after
doing everything to ensure that the NPT and RMP are in sync: e.g., if the
NPT and RMP are both programmed with a 2MB size but the guest tries to
PVALIDATE the page as 4KB. In that case, we will see an #NPF with a
page-size mismatch and have to perform a PSMASH.
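A toy predicate for that mismatch case (sketch only; in the real flow the #NPF handler makes this decision based on the fault's error code):

```c
#include <assert.h>
#include <stdbool.h>

enum pg_level { PG_LEVEL_4K = 1, PG_LEVEL_2M = 2 };

/*
 * A 2MB RMP entry combined with a 4KB access (a 4KB PVALIDATE or a 4KB
 * NPT mapping) triggers a size-mismatch #NPF; the host must then PSMASH
 * the RMP entry down to 4KB granularity before the guest can make
 * progress.
 */
static bool psmash_needed(enum pg_level rmp_level, enum pg_level access_level)
{
	return rmp_level == PG_LEVEL_2M && access_level == PG_LEVEL_4K;
}
```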


>
> That other obvious bug is that this doesn't play nice with 1gb pages.  A 2mb RMP
> entry should _not_ force KVM to use a 2mb page instead of a 1gb page.
>
>> +}

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 32/40] KVM: SVM: Add support to handle GHCB GPA register VMGEXIT
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 32/40] KVM: SVM: Add support to handle GHCB GPA register VMGEXIT Brijesh Singh
@ 2021-07-16 20:45   ` Sean Christopherson
  2021-07-17  0:44     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 20:45 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> SEV-SNP guests are required to perform a GHCB GPA registration (see
> section 2.5.2 in GHCB specification). Before using a GHCB GPA for a vCPU

It's section 2.3.2 in version 2.0 of the spec.

> the first time, a guest must register the vCPU GHCB GPA. If hypervisor
> can work with the guest requested GPA then it must respond back with the
> same GPA otherwise return -1.
>
> On VMEXIT, Verify that GHCB GPA matches with the registered value. If a
> mismatch is detected then abort the guest.
> 
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/include/asm/sev-common.h |  2 ++
>  arch/x86/kvm/svm/sev.c            | 25 +++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.h            |  7 +++++++
>  3 files changed, 34 insertions(+)
> 
> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
> index 466baa9cd0f5..6990d5a9d73c 100644
> --- a/arch/x86/include/asm/sev-common.h
> +++ b/arch/x86/include/asm/sev-common.h
> @@ -60,8 +60,10 @@
>  	GHCB_MSR_GPA_REG_REQ)
>  
>  #define GHCB_MSR_GPA_REG_RESP		0x013
> +#define GHCB_MSR_GPA_REG_ERROR		GENMASK_ULL(51, 0)
>  #define GHCB_MSR_GPA_REG_RESP_VAL(v)	((v) >> GHCB_MSR_GPA_REG_VALUE_POS)
>  
> +
>  /* SNP Page State Change */
>  #define GHCB_MSR_PSC_REQ		0x014
>  #define SNP_PAGE_STATE_PRIVATE		1
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index fd2d00ad80b7..3af5d1ad41bf 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2922,6 +2922,25 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>  				GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
>  		break;
>  	}
> +	case GHCB_MSR_GPA_REG_REQ: {

Shouldn't KVM also support "Get preferred GHCB GPA", at least to the point where
it responds with "No preferred GPA".  AFAICT, this series doesn't cover that,
i.e. KVM will kill a guest that requests the VMM's preferred GPA.

> +		kvm_pfn_t pfn;
> +		u64 gfn;
> +
> +		gfn = get_ghcb_msr_bits(svm, GHCB_MSR_GPA_REG_GFN_MASK,
> +					GHCB_MSR_GPA_REG_VALUE_POS);

This is confusing, the MASK/POS reference both GPA and GFN.
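To make the field layout concrete, here is a user-space sketch of packing the registration response, assuming the GHCB v2 MSR protocol layout (response code 0x013 in GHCBData[11:0], the GFN in GHCBData[63:12], and an all-ones GFN field signaling failure); the constant names are illustrative, not necessarily KVM's:

```c
#include <assert.h>
#include <stdint.h>

#define GHCB_MSR_INFO_MASK	0xfffULL		/* GHCBData[11:0]: request/response code */
#define GHCB_MSR_GPA_REG_RESP	0x013ULL
#define GHCB_MSR_GPA_VALUE_POS	12			/* GFN lives in GHCBData[63:12] */
#define GHCB_MSR_GPA_REG_ERROR	((1ULL << 52) - 1)	/* all-ones 52-bit GFN field */

/*
 * Build the MSR-protocol response: the GFN (not a full GPA) goes in
 * bits 63:12, the response code in bits 11:0.  On failure the GFN
 * field is set to all ones.
 */
static uint64_t ghcb_gpa_reg_resp(uint64_t gfn, int failed)
{
	uint64_t val = failed ? GHCB_MSR_GPA_REG_ERROR : gfn;

	return (val << GHCB_MSR_GPA_VALUE_POS) |
	       (GHCB_MSR_GPA_REG_RESP & GHCB_MSR_INFO_MASK);
}
```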

> +
> +		pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);
> +		if (is_error_noslot_pfn(pfn))

Checking the mapped PFN at this time isn't wrong, but it's also not complete,
e.g. nothing prevents userspace from changing the gpa->hva mapping after the
initial registration.  Not that that's likely to happen (or not break the guest),
but my point is that random checks on the backing PFN really have no meaning in
KVM unless KVM can guarantee that the PFN is stable for the duration of its use.

And conversely, the GHCB spec doesn't require the GHCB to be shared until the first
use.  E.g. arguably KVM should fully check the usability of the GPA, but the
GHCB spec disallows that.  And I honestly can't see why SNP is special with
respect to the GHCB.  ES guests will explode just as badly if the GPA points at
garbage.

I guess I'm not against the check, but it feels extremely arbitrary.

> +			gfn = GHCB_MSR_GPA_REG_ERROR;
> +		else
> +			svm->ghcb_registered_gpa = gfn_to_gpa(gfn);
> +
> +		set_ghcb_msr_bits(svm, gfn, GHCB_MSR_GPA_REG_GFN_MASK,
> +				  GHCB_MSR_GPA_REG_VALUE_POS);
> +		set_ghcb_msr_bits(svm, GHCB_MSR_GPA_REG_RESP, GHCB_MSR_INFO_MASK,
> +				  GHCB_MSR_INFO_POS);
> +		break;
> +	}
>  	case GHCB_MSR_TERM_REQ: {
>  		u64 reason_set, reason_code;
>  
> @@ -2970,6 +2989,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
>  		return -EINVAL;
>  	}
>  
> +	/* SEV-SNP guest requires that the GHCB GPA must be registered */
> +	if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) {
> +		vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa);

I saw this a few other place.  vcpu_unimpl() is not the right API.  KVM supports
the guest request, the problem is that the GHCB spec _requires_ KVM to terminate
the guest in this case.

> +		return -EINVAL;
> +	}
> +
>  	svm->ghcb = svm->ghcb_map.hva;
>  	ghcb = svm->ghcb_map.hva;
>  
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 32abcbd774d0..af4cce39b30f 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -185,6 +185,8 @@ struct vcpu_svm {
>  	bool ghcb_sa_free;
>  
>  	bool guest_state_loaded;
> +
> +	u64 ghcb_registered_gpa;
>  };
>  
>  struct svm_cpu_data {
> @@ -245,6 +247,11 @@ static inline bool sev_snp_guest(struct kvm *kvm)
>  #endif
>  }
>  
> +static inline bool ghcb_gpa_is_registered(struct vcpu_svm *svm, u64 val)
> +{
> +	return svm->ghcb_registered_gpa == val;
> +}
> +
>  static inline void vmcb_mark_all_dirty(struct vmcb *vmcb)
>  {
>  	vmcb->control.clean = 0;
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 33/40] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 33/40] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT Brijesh Singh
@ 2021-07-16 21:00   ` Sean Christopherson
  2021-07-19 14:19     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 21:00 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> +static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level)

I can live with e.g. GHCB_MSR_PSC_REQ, but I'd strongly prefer to spell this out,
e.g. __snp_handle_page_state_change() or whatever.  I had a hell of a time figuring
out what PSC was the first time I saw it in some random context.

> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	int rc, tdp_level;
> +	kvm_pfn_t pfn;
> +	gpa_t gpa_end;
> +
> +	gpa_end = gpa + page_level_size(level);
> +
> +	while (gpa < gpa_end) {
> +		/*
> +		 * Get the pfn and level for the gpa from the nested page table.
> +		 *
> +		 * If the TDP walk failed, then its safe to say that we don't have a valid
> +		 * mapping for the gpa in the nested page table. Create a fault to map the
> +		 * page is nested page table.
> +		 */
> +		if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) {
> +			pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_USER_MASK, level);
> +			if (is_error_noslot_pfn(pfn))
> +				goto out;
> +
> +			if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level))
> +				goto out;
> +		}
> +
> +		/* Adjust the level so that we don't go higher than the backing page level */
> +		level = min_t(size_t, level, tdp_level);
> +
> +		write_lock(&kvm->mmu_lock);

Retrieving the PFN and level outside of mmu_lock is not correct.  Because the
pages are pinned and the VMM is not malicious, it will function as intended, but
it is far from correct.

The overall approach also feels wrong, e.g. a guest won't be able to convert a
2mb chunk back to a 2mb large page if KVM mapped the GPA as a 4kb page in the
past (from a different conversion).

I'd also strongly prefer to have a common flow between SNP and TDX for converting
between shared/prviate.

I'll circle back to this next week, it'll probably take a few hours of staring
to figure out a solution, if a common one for SNP+TDX is even possible.

> +
> +		switch (op) {
> +		case SNP_PAGE_STATE_SHARED:
> +			rc = snp_make_page_shared(vcpu, gpa, pfn, level);
> +			break;
> +		case SNP_PAGE_STATE_PRIVATE:
> +			rc = snp_make_page_private(vcpu, gpa, pfn, level);
> +			break;
> +		default:
> +			rc = -EINVAL;
> +			break;
> +		}
> +
> +		write_unlock(&kvm->mmu_lock);
> +
> +		if (rc) {
> +			pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
> +					   op, gpa, pfn, level, rc);
> +			goto out;
> +		}
> +
> +		gpa = gpa + page_level_size(level);
> +	}
> +
> +out:
> +	return rc;
> +}
> +
>  static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>  {
>  	struct vmcb_control_area *control = &svm->vmcb->control;
> @@ -2941,6 +3063,25 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>  				  GHCB_MSR_INFO_POS);
>  		break;
>  	}
> +	case GHCB_MSR_PSC_REQ: {
> +		gfn_t gfn;
> +		int ret;
> +		u8 op;
> +
> +		gfn = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_GFN_MASK, GHCB_MSR_PSC_GFN_POS);
> +		op = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_OP_MASK, GHCB_MSR_PSC_OP_POS);
> +
> +		ret = __snp_handle_psc(vcpu, op, gfn_to_gpa(gfn), PG_LEVEL_4K);
> +
> +		/* If failed to change the state then spec requires to return all F's */

That doesn't mesh with what I could find:

  o 0x015 – SNP Page State Change Response
    ▪ GHCBData[63:32] – Error code
    ▪ GHCBData[31:12] – Reserved, must be zero
  Written by the hypervisor in response to a Page State Change request. Any non-
  zero value for the error code indicates that the page state change was not
  successful.

And if "all Fs" is indeed the error code, 'int ret' probably only works by luck
since the return value is a 64-bit value, whereas ret is a 32-bit signed int.
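Packing the response per the spec excerpt, with the error code kept as an explicit 32-bit value so no sign extension can leak into the reserved bits, could look like this user-space sketch (constant names illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define GHCB_MSR_PSC_RESP	0x015ULL	/* GHCBData[11:0] */
#define GHCB_MSR_PSC_ERROR_POS	32		/* error code in GHCBData[63:32] */

/*
 * Bits 31:12 must be zero; using an unsigned 32-bit error code and a
 * shift guarantees that.  Passing a signed 'int ret = -1' through a
 * 64-bit OR without masking would instead set bits 63:32 *and* smear
 * into the reserved field via sign extension.
 */
static uint64_t ghcb_psc_resp(uint32_t error_code)
{
	return ((uint64_t)error_code << GHCB_MSR_PSC_ERROR_POS) | GHCB_MSR_PSC_RESP;
}
```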

> +		if (ret)
> +			ret = -1;

Uh, this is fubar.   You've created a shadow of 'ret', i.e. the outer ret is likely
uninitialized.

> +
> +		set_ghcb_msr_bits(svm, ret, GHCB_MSR_PSC_ERROR_MASK, GHCB_MSR_PSC_ERROR_POS);
> +		set_ghcb_msr_bits(svm, 0, GHCB_MSR_PSC_RSVD_MASK, GHCB_MSR_PSC_RSVD_POS);
> +		set_ghcb_msr_bits(svm, GHCB_MSR_PSC_RESP, GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
> +		break;
> +	}
>  	case GHCB_MSR_TERM_REQ: {
>  		u64 reason_set, reason_code;
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 21/40] KVM: SVM: Add initial SEV-SNP support
  2021-07-16 19:31       ` Sean Christopherson
@ 2021-07-16 21:03         ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 21:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 2:31 PM, Sean Christopherson wrote:
> That's not what I was asking.  My question is if KVM will break/fail if someone
> runs a KVM build with SNP enabled halfway through the series.  E.g. if I make a
> KVM build at patch 22, "KVM: SVM: Add KVM_SNP_INIT command", what will happen if
> I attempt to launch an SNP guest?  Obviously it won't fully succeed, but will KVM
> fail gracefully and do all the proper cleanup?  Repeat the question for all patches
> between this one and the final patch of the series.
>
> SNP simply not working is ok, but if KVM explodes or does weird things without
> "full" SNP support, then at minimum the module param should be off by default
> until it's safe to enable.  E.g. for the TDP MMU, I believe the approach was to
> put all the machinery in place but not actually let userspace flip on the module
> param until the full implementation was ready.  Bisecting and testing the
> individual commits is a bit painful because it requires modifying KVM code, but
> on the plus side unrelated bisects won't stumble into a half-baked state.

There are one or two patches I can think of where we may break KVM if an
SNP guest is created before the full series is applied. In one patch we
add LAUNCH_UPDATE, but the reclaim is done in the next patch. I like
your idea of pushing the module init later in the series.


>
> Ya, got that, but again not what I was asking :-)  Why use cpu_feature_enabled()
> instead of boot_cpu_has()?  As a random developer, I would fully expect that
> boot_cpu_has(X86_FEATURE_SEV_SNP) is true iff SNP is fully enabled by the kernel.

I have to check, but I think boot_cpu_has(X86_FEATURE_SEV_SNP) will
return true even when CONFIG_MEM_ENCRYPT is disabled.


>
>> The approach here is similar to SEV/ES. IIRC, it was done mainly to
>> avoid adding dead code when CONFIG_KVM_AMD_SEV is disabled.
> But this is already in an #ifdef, checking sev_es_guest() is pointless.


Ah, good point.



^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 34/40] KVM: SVM: Add support to handle Page State Change VMGEXIT
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 34/40] KVM: SVM: Add support to handle " Brijesh Singh
@ 2021-07-16 21:14   ` Sean Christopherson
  2021-07-19 14:24     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-16 21:14 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> +static unsigned long snp_handle_psc(struct vcpu_svm *svm, struct ghcb *ghcb)
> +{
> +	struct kvm_vcpu *vcpu = &svm->vcpu;
> +	int level, op, rc = PSC_UNDEF_ERR;
> +	struct snp_psc_desc *info;
> +	struct psc_entry *entry;
> +	gpa_t gpa;
> +
> +	if (!sev_snp_guest(vcpu->kvm))
> +		goto out;
> +
> +	if (!setup_vmgexit_scratch(svm, true, sizeof(ghcb->save.sw_scratch))) {
> +		pr_err("vmgexit: scratch area is not setup.\n");
> +		rc = PSC_INVALID_HDR;
> +		goto out;
> +	}
> +
> +	info = (struct snp_psc_desc *)svm->ghcb_sa;
> +	entry = &info->entries[info->hdr.cur_entry];

Grabbing "entry" here is unnecessary and confusing.

> +
> +	if ((info->hdr.cur_entry >= VMGEXIT_PSC_MAX_ENTRY) ||
> +	    (info->hdr.end_entry >= VMGEXIT_PSC_MAX_ENTRY) ||
> +	    (info->hdr.cur_entry > info->hdr.end_entry)) {

There's a TOCTOU bug here if the guest uses the GHCB instead of a scratch area.
If the guest uses the scratch area, then KVM makes a full copy into kernel memory.
But if the guest uses the GHCB, then KVM maps the GHCB into kernel address space
but doesn't make a full copy, i.e. the guest can modify the data while it's being
processed by KVM.

IIRC, Peter and I discussed the sketchiness of the GHCB mapping offline a few
times, but determined that there were no existing SEV-ES bugs because the guest
could only submarine its own emulation request.  But here, it could coerce KVM
into running off the end of a buffer.

I think you can get away with capturing cur_entry/end_entry locally, though
copying the GHCB would be more robust.  That would also make the code a bit
prettier, e.g.

	cur = info->hdr.cur_entry;
	end = info->hdr.end_entry;

> +		rc = PSC_INVALID_ENTRY;
> +		goto out;
> +	}
> +
> +	while (info->hdr.cur_entry <= info->hdr.end_entry) {

Make this a for loop?

	for ( ; cur <= end; cur++)

> +		entry = &info->entries[info->hdr.cur_entry];

Does this need array_index_nospec() treatment?

> +		gpa = gfn_to_gpa(entry->gfn);
> +		level = RMP_TO_X86_PG_LEVEL(entry->pagesize);
> +		op = entry->operation;
> +
> +		if (!IS_ALIGNED(gpa, page_level_size(level))) {
> +			rc = PSC_INVALID_ENTRY;
> +			goto out;
> +		}
> +
> +		rc = __snp_handle_psc(vcpu, op, gpa, level);
> +		if (rc)
> +			goto out;
> +
> +		info->hdr.cur_entry++;
> +	}
> +
> +out:

And for the copy case:

	info->hdr.cur_entry = cur;

> +	return rc ? map_to_psc_vmgexit_code(rc) : 0;
> +}
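The snapshot-then-validate approach suggested above can be sketched as follows (a user-space model with a simplified header layout; the VMGEXIT_PSC_MAX_ENTRY value of 253 is an assumption based on the GHCB page-state-change buffer size):

```c
#include <assert.h>
#include <stdint.h>

#define VMGEXIT_PSC_MAX_ENTRY	253

struct psc_hdr {
	uint16_t cur_entry;
	uint16_t end_entry;
};

/*
 * Snapshot the header indices once, so a guest racing on a shared GHCB
 * cannot move them after validation (the TOCTOU issue discussed above).
 * All later iteration uses the local copies, never the guest-writable
 * header.  Returns the number of entries to process, or -1 if the
 * header is malformed.
 */
static int psc_snapshot(const struct psc_hdr *hdr, uint16_t *cur, uint16_t *end)
{
	*cur = hdr->cur_entry;
	*end = hdr->end_entry;

	if (*cur >= VMGEXIT_PSC_MAX_ENTRY ||
	    *end >= VMGEXIT_PSC_MAX_ENTRY ||
	    *cur > *end)
		return -1;

	return *end - *cur + 1;
}
```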

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 22/40] KVM: SVM: Add KVM_SNP_INIT command
  2021-07-16 19:33   ` Sean Christopherson
@ 2021-07-16 21:25     ` Brijesh Singh
  2021-07-19 20:24       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 21:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 2:33 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 3fd9a7e9d90c..989a64aa1ae5 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1678,6 +1678,9 @@ enum sev_cmd_id {
>>  	/* Guest Migration Extension */
>>  	KVM_SEV_SEND_CANCEL,
>>  
>> +	/* SNP specific commands */
>> +	KVM_SEV_SNP_INIT = 256,
> Is there any meaning behind '256'?  If not, why skip a big chunk?  I wouldn't be
> concerned if it weren't for KVM_SEV_NR_MAX, whose existence arguably implies that
> 0-KVM_SEV_NR_MAX-1 are all valid SEV commands.

In previous patches, Peter highlighted that we should keep some gap
between the SEV/ES and SNP commands to leave room for legacy SEV/ES
expansion. I was not sure how many we need to reserve without knowing
what will come in the future, especially since recently some of the
command additions are not linked to the firmware. I am okay with
reducing the gap or removing it altogether.


^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 23/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command
  2021-07-16 19:43   ` Sean Christopherson
@ 2021-07-16 21:42     ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 21:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 2:43 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> @@ -1527,6 +1530,100 @@ static int sev_receive_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>  	return sev_issue_cmd(kvm, SEV_CMD_RECEIVE_FINISH, &data, &argp->error);
>>  }
>>  
>> +static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
>> +{
>> +	struct sev_data_snp_gctx_create data = {};
>> +	void *context;
>> +	int rc;
>> +
>> +	/* Allocate memory for context page */
> Eh, I'd drop this comment.  It's quite obvious that a page is being allocated
> and that it's being assigned to the context.
>
>> +	context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
>> +	if (!context)
>> +		return NULL;
>> +
>> +	data.gctx_paddr = __psp_pa(context);
>> +	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
>> +	if (rc) {
>> +		snp_free_firmware_page(context);
>> +		return NULL;
>> +	}
>> +
>> +	return context;
>> +}
>> +
>> +static int snp_bind_asid(struct kvm *kvm, int *error)
>> +{
>> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> +	struct sev_data_snp_activate data = {};
>> +	int asid = sev_get_asid(kvm);
>> +	int ret, retry_count = 0;
>> +
>> +	/* Activate ASID on the given context */
>> +	data.gctx_paddr = __psp_pa(sev->snp_context);
>> +	data.asid   = asid;
>> +again:
>> +	ret = sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
>> +
>> +	/* Check if the DF_FLUSH is required, and try again */
> Please provide more info on why this may be necessary.  I can see from the code
> that it does a flush and retries, but I have no idea why a flush would be required
> in the first place, e.g. why can't KVM guarantee that everything is in the proper
> state before attempting to bind an ASID?


Ah, good question. We already have a function to recycle the ASIDs; the
recycling happens during ASID allocation, and while recycling it issues
the SEV/ES DF_FLUSH command. That function needs to be enhanced to use
the SNP-specific DF_FLUSH command when ASIDs are being reused. I wish we
had one DF_FLUSH that internally took care of both cases. Thinking out
loud, maybe the firmware team decided to add a new one for the case
where someone is not using SEV and SEV-ES, or where some firmware does
not support the legacy SEV commands. I will fix it and remove the
DF_FLUSH from launch_start.


>
>> +	if (ret && (*error == SEV_RET_DFFLUSH_REQUIRED) && (!retry_count)) {
>> +		/* Guard DEACTIVATE against WBINVD/DF_FLUSH used in ASID recycling */
>> +		down_read(&sev_deactivate_lock);
>> +		wbinvd_on_all_cpus();
>> +		ret = snp_guest_df_flush(error);
>> +		up_read(&sev_deactivate_lock);
>> +
>> +		if (ret)
>> +			return ret;
>> +
>> +		/* only one retry */
> Again, please explain why.  Is this arbitrary?  Is retrying more than once
> guaranteed to be useless?
>
>> +		retry_count = 1;
>> +
>> +		goto again;
>> +	}
>> +
>> +	return ret;
>> +}
> ...
>
>>  void sev_vm_destroy(struct kvm *kvm)
>>  {
>>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> @@ -1847,7 +1969,15 @@ void sev_vm_destroy(struct kvm *kvm)
>>  
>>  	mutex_unlock(&kvm->lock);
>>  
>> -	sev_unbind_asid(kvm, sev->handle);
>> +	if (sev_snp_guest(kvm)) {
>> +		if (snp_decommission_context(kvm)) {
>> +			pr_err("Failed to free SNP guest context, leaking asid!\n");
> I agree with Peter that this likely warrants a WARN.  If a WARN isn't justified,
> e.g. this can happen without a KVM/CPU bug, then there absolutely needs to be a
> massive comment explaining why we have code that result in memory leaks.


Ack.

>
>> +			return;
>> +		}
>> +	} else {
>> +		sev_unbind_asid(kvm, sev->handle);
>> +	}
>> +
>>  	sev_asid_free(sev);
>>  }
>>  
>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>> index b9ea99f8579e..bc5582b44356 100644
>> --- a/arch/x86/kvm/svm/svm.h
>> +++ b/arch/x86/kvm/svm/svm.h
>> @@ -67,6 +67,7 @@ struct kvm_sev_info {
>>  	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
>>  	struct kvm *enc_context_owner; /* Owner of copied encryption context */
>>  	struct misc_cg *misc_cg; /* For misc cgroup accounting */
>> +	void *snp_context;      /* SNP guest context page */
>>  };
>>  
>>  struct kvm_svm {
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 989a64aa1ae5..dbd05179d8fa 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1680,6 +1680,7 @@ enum sev_cmd_id {
>>  
>>  	/* SNP specific commands */
>>  	KVM_SEV_SNP_INIT = 256,
>> +	KVM_SEV_SNP_LAUNCH_START,
>>  
>>  	KVM_SEV_NR_MAX,
>>  };
>> @@ -1781,6 +1782,14 @@ struct kvm_snp_init {
>>  	__u64 flags;
>>  };
>>  
>> +struct kvm_sev_snp_launch_start {
>> +	__u64 policy;
>> +	__u64 ma_uaddr;
>> +	__u8 ma_en;
>> +	__u8 imi_en;
>> +	__u8 gosvw[16];
> Hmm, I'd prefer to pad this out to be 8-byte sized.

Noted.


>> +};
>> +
>>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
>>  #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
>>  #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
>> -- 
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 24/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
  2021-07-16 20:01   ` Sean Christopherson
@ 2021-07-16 22:00     ` Brijesh Singh
  2021-07-19 20:51       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 22:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 3:01 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> +static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
>> +{
>> +	unsigned long npages, vaddr, vaddr_end, i, next_vaddr;
>> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> +	struct sev_data_snp_launch_update data = {};
>> +	struct kvm_sev_snp_launch_update params;
>> +	int *error = &argp->error;
>> +	struct kvm_vcpu *vcpu;
>> +	struct page **inpages;
>> +	struct rmpupdate e;
>> +	int ret;
>> +
>> +	if (!sev_snp_guest(kvm))
>> +		return -ENOTTY;
>> +
>> +	if (!sev->snp_context)
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, sizeof(params)))
>> +		return -EFAULT;
>> +
>> +	data.gctx_paddr = __psp_pa(sev->snp_context);
>> +
>> +	/* Lock the user memory. */
>> +	inpages = sev_pin_memory(kvm, params.uaddr, params.len, &npages, 1);
> params.uaddr needs to be checked for validity, e.g. proper alignment.
> sev_pin_memory() does some checks, but not all checks.
>
Noted


>> +	if (!inpages)
>> +		return -ENOMEM;
>> +
>> +	vcpu = kvm_get_vcpu(kvm, 0);
>> +	vaddr = params.uaddr;
>> +	vaddr_end = vaddr + params.len;
>> +
>> +	for (i = 0; vaddr < vaddr_end; vaddr = next_vaddr, i++) {
>> +		unsigned long psize, pmask;
>> +		int level = PG_LEVEL_4K;
>> +		gpa_t gpa;
>> +
>> +		if (!hva_to_gpa(kvm, vaddr, &gpa)) {
> I'm having a bit of deja vu...  This flow needs to hold kvm->srcu to do a memslot
> lookup.
>
> That said, IMO having KVM do the hva->gpa is not a great ABI.  The memslots are
> completely arbitrary (from a certain point of view) and have no impact on the
> validity of the memory pinning or PSP command.  E.g. a memslot update while this
> code is in-flight would be all kinds of weird.
>
> In other words, make userspace provide both the hva (because it's sadly needed
> to pin memory) as well as the target gpa.  That prevents KVM from having to deal
> with memslot lookups and also means that userspace can issue the command before
> configuring the memslots (though I've no idea if that's actually feasible for
> any userspace VMM).

The operation happens during guest creation, so I was not sure whether a
memslot would be updated while this command is executing. But I guess it
is possible that a VMM runs a separate thread that updates a memslot
while another thread issues the encryption command. I'll have userspace
provide both the HVA and the GPA as you recommended.


>> +			ret = -EINVAL;
>> +			goto e_unpin;
>> +		}
>> +
>> +		psize = page_level_size(level);
>> +		pmask = page_level_mask(level);
> Is there any hope of this path supporting 2mb/1gb pages in the not-too-distant
> future?  If not, then I vote to do away with the indirection and just hardcode
> 4kg sizes in the flow.  I.e. if this works on 4kb chunks, make that obvious.

No plans to do 1g/2mb in this path. I will make that obvious by
hardcoding it.


>> +		gpa = gpa & pmask;
>> +
>> +		/* Transition the page state to pre-guest */
>> +		memset(&e, 0, sizeof(e));
>> +		e.assigned = 1;
>> +		e.gpa = gpa;
>> +		e.asid = sev_get_asid(kvm);
>> +		e.immutable = true;
>> +		e.pagesize = X86_TO_RMP_PG_LEVEL(level);
>> +		ret = rmpupdate(inpages[i], &e);
> What happens if userspace pulls a stupid and assigns the same page to multiple
> SNP guests?  Does RMPUPDATE fail?  Can one RMPUPDATE overwrite another?

RMPUPDATE is available to the hypervisor, which can call it at any time
with whatever arguments it wants. The important thing is that the
RMPUPDATE + PVALIDATE combination is what locks the page. In this case,
the PSP firmware updates the RMP table and also validates the page.

If someone attempts to issue another RMPUPDATE, the Validated bit is
cleared and the page can no longer be used as private. A guest access to
an unvalidated page causes a #VC exception.
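
The RMPUPDATE/PVALIDATE interaction described above can be modeled with
a toy sketch. The struct and helpers below are illustrative assumptions,
not the hardware RMP entry layout or the real kernel interfaces:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one RMP entry; field names are illustrative. */
struct rmp_entry {
	bool assigned;
	bool validated;
	unsigned int asid;
};

/* RMPUPDATE always clears the Validated bit, per the APM pseudocode. */
static void toy_rmpupdate(struct rmp_entry *e, bool assigned,
			  unsigned int asid)
{
	e->assigned = assigned;
	e->asid = asid;
	e->validated = false;
}

/* PVALIDATE: only an assigned page can be validated by its guest. */
static void toy_pvalidate(struct rmp_entry *e)
{
	if (e->assigned)
		e->validated = true;
}

/* A private access succeeds only for a validated page owned by the ASID. */
static bool private_access_ok(const struct rmp_entry *e, unsigned int asid)
{
	return e->assigned && e->asid == asid && e->validated;
}

/* Demonstrates that a second RMPUPDATE invalidates the page. */
static bool reassign_clears_validated(void)
{
	struct rmp_entry e = { 0 };

	toy_rmpupdate(&e, true, 1);
	toy_pvalidate(&e);
	if (!private_access_ok(&e, 1))
		return false;

	/* A competing RMPUPDATE clears Validated; access now faults. */
	toy_rmpupdate(&e, true, 2);
	return !private_access_ok(&e, 2);
}
```

This is why assigning the same page to a second guest cannot silently
succeed: the second guest would still have to PVALIDATE the page, and
the first guest's accesses would fault.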


>
>> +		if (ret) {
>> +			ret = -EFAULT;
>> +			goto e_unpin;
>> +		}
>> +
>> +		data.address = __sme_page_pa(inpages[i]);
>> +		data.page_size = e.pagesize;
>> +		data.page_type = params.page_type;
>> +		data.vmpl3_perms = params.vmpl3_perms;
>> +		data.vmpl2_perms = params.vmpl2_perms;
>> +		data.vmpl1_perms = params.vmpl1_perms;
>> +		ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE, &data, error);
>> +		if (ret) {
>> +			snp_page_reclaim(inpages[i], e.pagesize);
>> +			goto e_unpin;
>> +		}
>> +
>> +		next_vaddr = (vaddr & pmask) + psize;
>> +	}
>> +
>> +e_unpin:
>> +	/* Content of memory is updated, mark pages dirty */
>> +	memset(&e, 0, sizeof(e));
>> +	for (i = 0; i < npages; i++) {
>> +		set_page_dirty_lock(inpages[i]);
>> +		mark_page_accessed(inpages[i]);
>> +
>> +		/*
>> +		 * If its an error, then update RMP entry to change page ownership
>> +		 * to the hypervisor.
>> +		 */
>> +		if (ret)
>> +			rmpupdate(inpages[i], &e);
> This feels wrong since it's purging _all_ RMP entries, not just those that were
> successfully modified.  And maybe add a RMP "reset" helper, e.g. why is zeroing
> the RMP entry the correct behavior?

By default, all pages are hypervisor-owned (i.e., their RMP entries are
zero). If the LAUNCH_UPDATE was successful, the page has transitioned
from hypervisor-owned to guest-valid. By zeroing the entry, we revert it
back to hypervisor-owned.

I agree; I will optimize it to clear only the modified entries and leave
everything else at its default.

thanks

>> +	}
>> +
>> +	/* Unlock the user pages */
>> +	sev_unpin_memory(kvm, inpages, npages);
>> +
>> +	return ret;
>> +}
>> +


* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-16 20:09   ` Sean Christopherson
@ 2021-07-16 22:16     ` Brijesh Singh
  2021-07-17  0:46       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 22:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 3:09 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> The guest pages of the SEV-SNP VM maybe added as a private page in the
>> RMP entry (assigned bit is set). The guest private pages must be
>> transitioned to the hypervisor state before its freed.
> Isn't this patch needed much earlier in the series, i.e. when the first RMPUPDATE
> usage goes in?

Yes, the first RMPUPDATE usage is in the LAUNCH_UPDATE patch, and this
should be squashed into that patch.


>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> ---
>>  arch/x86/kvm/svm/sev.c | 39 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 39 insertions(+)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 1f0635ac9ff9..4468995dd209 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -1940,6 +1940,45 @@ find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
>>  static void __unregister_enc_region_locked(struct kvm *kvm,
>>  					   struct enc_region *region)
>>  {
>> +	struct rmpupdate val = {};
>> +	unsigned long i, pfn;
>> +	struct rmpentry *e;
>> +	int level, rc;
>> +
>> +	/*
>> +	 * The guest memory pages are assigned in the RMP table. Unassign it
>> +	 * before releasing the memory.
>> +	 */
>> +	if (sev_snp_guest(kvm)) {
>> +		for (i = 0; i < region->npages; i++) {
>> +			pfn = page_to_pfn(region->pages[i]);
>> +
>> +			if (need_resched())
>> +				schedule();
> This can simply be "cond_resched();"

Yes.


>
>> +
>> +			e = snp_lookup_page_in_rmptable(region->pages[i], &level);
>> +			if (unlikely(!e))
>> +				continue;
>> +
>> +			/* If its not a guest assigned page then skip it. */
>> +			if (!rmpentry_assigned(e))
>> +				continue;
>> +
>> +			/* Is the page part of a 2MB RMP entry? */
>> +			if (level == PG_LEVEL_2M) {
>> +				val.pagesize = RMP_PG_SIZE_2M;
>> +				pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
>> +			} else {
>> +				val.pagesize = RMP_PG_SIZE_4K;
> This raises yet more questions (for me) as to the interaction between Page-Size
> and Hyperivsor-Owned flags in the RMP.  It also raises questions on the correctness
> of zeroing the RMP entry if KVM_SEV_SNP_LAUNCH_START (in the previous patch).

I assume you mean LAUNCH_UPDATE, because that's when we need to perform
the RMPUPDATE. Hypervisor-owned means the RMP entry is all zeroes.


>> +			}
>> +
>> +			/* Transition the page to hypervisor owned. */
>> +			rc = rmpupdate(pfn_to_page(pfn), &val);
>> +			if (rc)
>> +				pr_err("Failed to release pfn 0x%lx ret=%d\n", pfn, rc);
> This is not robust, e.g. KVM will unpin the memory and release it back to the
> kernel with a stale RMP entry.  Shouldn't this be a WARN+leak situation?

Yes. Maybe we should increase the page refcount to ensure that the page
is not reused after the process is terminated?


>> +		}
>> +	}
>> +
>>  	sev_unpin_memory(kvm, region->pages, region->npages);
>>  	list_del(&region->list);
>>  	kfree(region);
>> -- 
>> 2.17.1
>>


* Re: [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
  2021-07-16 20:18   ` Sean Christopherson
@ 2021-07-16 22:48     ` Brijesh Singh
  2021-07-19 16:54       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-16 22:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 3:18 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> +        struct kvm_sev_snp_launch_finish {
>> +                __u64 id_block_uaddr;
>> +                __u64 id_auth_uaddr;
>> +                __u8 id_block_en;
>> +                __u8 auth_key_en;
>> +                __u8 host_data[32];
> Pad this one too?

Noted.


>
>> +        };
>> +
>> +
>> +See SEV-SNP specification for further details on launch finish input parameters.
> ...
>
>> +	data->gctx_paddr = __psp_pa(sev->snp_context);
>> +	ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
> Shouldn't KVM unwind everything it did if LAUNCH_FINISH fails?  And if that's
> not possible, take steps to make the VM unusable?

Well, I am not sure the VM needs to unwind. If the command fails but the
VMM decides to ignore the error, then VMRUN will probably fail and the
user will get a KVM shutdown event. The LAUNCH_FINISH command finalizes
the VM launch process; the firmware will probably not load the memory
encryption keys until the guest moves to the running state.


>> +
>> +	kfree(id_auth);
>> +
>> +e_free_id_block:
>> +	kfree(id_block);
>> +
>> +e_free:
>> +	kfree(data);
>> +
>> +	return ret;
>> +}
>> +
> ...
>
>> @@ -2346,8 +2454,25 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
>>  
>>  	if (vcpu->arch.guest_state_protected)
>>  		sev_flush_guest_memory(svm, svm->vmsa, PAGE_SIZE);
>> +
>> +	/*
>> +	 * If its an SNP guest, then VMSA was added in the RMP entry as a guest owned page.
>> +	 * Transition the page to hyperivosr state before releasing it back to the system.
> "hyperivosr" typo.  And please wrap at 80 chars.

Noted.


>
>> +	 */
>> +	if (sev_snp_guest(vcpu->kvm)) {
>> +		struct rmpupdate e = {};
>> +		int rc;
>> +
>> +		rc = rmpupdate(virt_to_page(svm->vmsa), &e);
> So why does this not need to go through snp_page_reclaim()?

As I said in previous comments, by default all memory is in the
hypervisor state. If rmpupdate() failed, nothing was changed in the RMP
and there is no need to reclaim. The reclaim is required only if the
pages are assigned in the RMP table.


>
>> +		if (rc) {
>> +			pr_err("Failed to release SNP guest VMSA page (rc %d), leaking it\n", rc);
> Seems like a WARN would be simpler.  But the more I see the rmpupdate(..., {0})
> pattern, the more I believe that nuking an RMP entry needs a dedicated helper.


Yes, let me try coming up with a helper for it.


>
>> +			goto skip_vmsa_free;


* Re: [PATCH Part2 RFC v4 30/40] KVM: X86: Define new RMP check related #NPF error bits
  2021-07-16 20:22   ` Sean Christopherson
@ 2021-07-17  0:34     ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-17  0:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 3:22 PM, Sean Christopherson wrote:
> Nit, please use "KVM: x86:" for the shortlogs.  And ubernit, the "new" part is
> redundant and/or misleading, e.g. implies that more error code bits are being
> added to existing SNP/RMP checks.  E.g.
>
>   KVM: x86: Define RMP page fault error code bits for #NPT

Noted.


> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> When SEV-SNP is enabled globally, the hardware places restrictions on all
>> memory accesses based on the RMP entry, whether the hyperviso or a VM,
> Another typo.

Noted. thanks




* Re: [PATCH Part2 RFC v4 31/40] KVM: X86: update page-fault trace to log the 64-bit error code
  2021-07-16 20:25   ` Sean Christopherson
@ 2021-07-17  0:35     ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-17  0:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 3:25 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> The page-fault error code is a 64-bit value, but the trace prints only
> It's worth clarifying that #NPT has a 64-bit error code, and so KVM also passes
> around a 64-bit PFEC.  E.g. the above statement is wrong for legacy #PF.
>
>> the lower 32-bits. Some of the SEV-SNP RMP fault error codes are
>> available in the upper 32-bits.
> Can you send this separately with Cc: stable@?  And I guess tweak the changelog
> to replace "SEV-SNP RMP" with a reference to e.g. PFERR_GUEST_FINAL_MASK.  KVM
> already has error codes that can set the upper bits.

Will do.



* Re: [PATCH Part2 RFC v4 32/40] KVM: SVM: Add support to handle GHCB GPA register VMGEXIT
  2021-07-16 20:45   ` Sean Christopherson
@ 2021-07-17  0:44     ` Brijesh Singh
  2021-07-19 20:04       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-17  0:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 3:45 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> SEV-SNP guests are required to perform a GHCB GPA registration (see
>> section 2.5.2 in GHCB specification). Before using a GHCB GPA for a vCPU
> It's section 2.3.2 in version 2.0 of the spec.
Ah, I will fix it.
>
>> the first time, a guest must register the vCPU GHCB GPA. If hypervisor
>> can work with the guest requested GPA then it must respond back with the
>> same GPA otherwise return -1.
>>
>> On VMEXIT, Verify that GHCB GPA matches with the registered value. If a
>> mismatch is detected then abort the guest.
>>
>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> ---
>>  arch/x86/include/asm/sev-common.h |  2 ++
>>  arch/x86/kvm/svm/sev.c            | 25 +++++++++++++++++++++++++
>>  arch/x86/kvm/svm/svm.h            |  7 +++++++
>>  3 files changed, 34 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
>> index 466baa9cd0f5..6990d5a9d73c 100644
>> --- a/arch/x86/include/asm/sev-common.h
>> +++ b/arch/x86/include/asm/sev-common.h
>> @@ -60,8 +60,10 @@
>>  	GHCB_MSR_GPA_REG_REQ)
>>  
>>  #define GHCB_MSR_GPA_REG_RESP		0x013
>> +#define GHCB_MSR_GPA_REG_ERROR		GENMASK_ULL(51, 0)
>>  #define GHCB_MSR_GPA_REG_RESP_VAL(v)	((v) >> GHCB_MSR_GPA_REG_VALUE_POS)
>>  
>> +
>>  /* SNP Page State Change */
>>  #define GHCB_MSR_PSC_REQ		0x014
>>  #define SNP_PAGE_STATE_PRIVATE		1
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index fd2d00ad80b7..3af5d1ad41bf 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2922,6 +2922,25 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>>  				GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
>>  		break;
>>  	}
>> +	case GHCB_MSR_GPA_REG_REQ: {
> Shouldn't KVM also support "Get preferred GHCB GPA", at least to the point where
> it responds with "No preferred GPA".  AFAICT, this series doesn't cover that,
> i.e. KVM will kill a guest that requests the VMM's preferred GPA.

Good point. For completeness, we should add the preferred GPA MSR and
return the appropriate response code to cover the cases where a
non-Linux guest may use this VMGEXIT to determine the GHCB GPA.


>
>> +		kvm_pfn_t pfn;
>> +		u64 gfn;
>> +
>> +		gfn = get_ghcb_msr_bits(svm, GHCB_MSR_GPA_REG_GFN_MASK,
>> +					GHCB_MSR_GPA_REG_VALUE_POS);
> This is confusing, the MASK/POS reference both GPA and GFN.

Let me see if I can improve it to avoid the naming confusion. Most of
the naming recommendations came during the part 1 review; I will check
with Boris and others to see if they are okay with new names.


>
>> +
>> +		pfn = kvm_vcpu_gfn_to_pfn(vcpu, gfn);
>> +		if (is_error_noslot_pfn(pfn))
> Checking the mapped PFN at this time isn't wrong, but it's also not complete,
> e.g. nothing prevents userspace from changing the gpa->hva mapping after the
> initial registration.  Not that that's likely to happen (or not break the guest),
> but my point is that random checks on the backing PFN really have no meaning in
> KVM unless KVM can guarantee that the PFN is stable for the duration of its use.
>
> And conversely, the GHCB doesn't require the GHCB to be shared until the first
> use.  E.g. arguably KVM should fully check the usability of the GPA, but the
> GHCB spec disallows that.  And I honestly can't see why SNP is special with
> respect to the GHCB.  ES guests will explode just as badly if the GPA points at
> garbage.
>
> I guess I'm not against the check, but it feels extremely arbitrary.
>
>> +			gfn = GHCB_MSR_GPA_REG_ERROR;
>> +		else
>> +			svm->ghcb_registered_gpa = gfn_to_gpa(gfn);
>> +
>> +		set_ghcb_msr_bits(svm, gfn, GHCB_MSR_GPA_REG_GFN_MASK,
>> +				  GHCB_MSR_GPA_REG_VALUE_POS);
>> +		set_ghcb_msr_bits(svm, GHCB_MSR_GPA_REG_RESP, GHCB_MSR_INFO_MASK,
>> +				  GHCB_MSR_INFO_POS);
>> +		break;
>> +	}
>>  	case GHCB_MSR_TERM_REQ: {
>>  		u64 reason_set, reason_code;
>>  
>> @@ -2970,6 +2989,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
>>  		return -EINVAL;
>>  	}
>>  
>> +	/* SEV-SNP guest requires that the GHCB GPA must be registered */
>> +	if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) {
>> +		vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa);
> I saw this a few other place.  vcpu_unimpl() is not the right API.  KVM supports
> the guest request, the problem is that the GHCB spec _requires_ KVM to terminate
> the guest in this case.

What is the preferred method to log it so that someone debugging knows
what went wrong?

thanks



* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-16 22:16     ` Brijesh Singh
@ 2021-07-17  0:46       ` Sean Christopherson
  2021-07-19 12:55         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-17  0:46 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Fri, Jul 16, 2021, Brijesh Singh wrote:
> 
> On 7/16/21 3:09 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> >> +			e = snp_lookup_page_in_rmptable(region->pages[i], &level);
> >> +			if (unlikely(!e))
> >> +				continue;
> >> +
> >> +			/* If its not a guest assigned page then skip it. */
> >> +			if (!rmpentry_assigned(e))
> >> +				continue;
> >> +
> >> +			/* Is the page part of a 2MB RMP entry? */
> >> +			if (level == PG_LEVEL_2M) {
> >> +				val.pagesize = RMP_PG_SIZE_2M;
> >> +				pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
> >> +			} else {
> >> +				val.pagesize = RMP_PG_SIZE_4K;
> > This raises yet more questions (for me) as to the interaction between Page-Size
> > and Hyperivsor-Owned flags in the RMP.  It also raises questions on the correctness
> > of zeroing the RMP entry if KVM_SEV_SNP_LAUNCH_START (in the previous patch).
> 
> I assume you mean the LAUNCH_UPDATE because that's when we need to
> perform the RMPUPDATE.

Doh, yes.

> The hypervisor owned means all zero in the RMP entry.

Figured out where I went wrong after reading the RMPUPDATE pseudocode.  RMPUPDATE
takes the page size as a parameter even though it unconditionally zeros the page
size flag in the RMP entry for unassigned pages.

A wrapper around rmpupdate() would definitely help, e.g. (though level might need
to be an "int" to avoid a bunch of casts).

  int rmp_make_shared(u64 pfn, enum pg_level level);

Wrappers for "private" and "firmware" would probably be helpful too.  And if you
do that, I think you can bury both "struct rmpupdate", rmpupdate(), and
X86_TO_RMP_PG_LEVEL() in arch/x86/kernel/sev.c.  snp_set_rmptable_state() might
need some refactoring to avoid three booleans, but I guess maybe that could be
an exception?  Not sure.  Anyways, was thinking something like:

  int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid);
  int rmp_make_firmware(u64 pfn);

It would consolidate a bit of code, and more importantly it would give visual
cues to the reader, e.g. it's easy to overlook "val = {0}" meaning "make shared".
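
As a rough userspace sketch of that suggestion (the field names and the
rmpupdate() stub below are illustrative assumptions, not the actual
kernel or APM definitions), the wrappers might look like:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, PG_LEVEL_2M };

/* Illustrative stand-in for struct rmpupdate; not the real layout. */
struct rmpupdate {
	u64 gpa;
	unsigned int asid;
	unsigned int pagesize;	/* 0 = 4kb, 1 = 2mb */
	unsigned int assigned;
	unsigned int immutable;
};

/* Record the last request so the sketch is testable. */
static struct rmpupdate last;

static int rmpupdate(u64 pfn, const struct rmpupdate *val)
{
	(void)pfn;
	last = *val;
	return 0;
}

int rmp_make_shared(u64 pfn, enum pg_level level)
{
	/* "Shared" is the all-zeroes (hypervisor-owned) entry. */
	struct rmpupdate val = { 0 };

	val.pagesize = (level == PG_LEVEL_2M);
	return rmpupdate(pfn, &val);
}

int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid)
{
	struct rmpupdate val = { 0 };

	val.assigned = 1;
	val.gpa = gpa;
	val.asid = asid;
	val.pagesize = (level == PG_LEVEL_2M);
	return rmpupdate(pfn, &val);
}

int rmp_make_firmware(u64 pfn)
{
	/* Assumption: firmware pages are assigned to ASID 0, immutable. */
	struct rmpupdate val = { .assigned = 1, .immutable = 1 };

	return rmpupdate(pfn, &val);
}
```

With helpers like these, call sites read as intent ("make shared") rather
than as an opaque zero-initialized struct.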

Side topic, what happens if a firmware entry is configured with page_size=1?

And one architectural question: what prevents a malicious VMM from punching a 4k
shared page into a 2mb private page?  E.g.

  rmpupdate(1 << 20, [private, 2mb]);
  rmpupdate(1 << 20 + 4096, [shared, 4kb]);

I don't see any checks in the pseudocode that will detect this, and presumably the
whole point of a 2mb private RMP entry is to not have to go walk the individual
4kb entries on a private access.

  NEW_RMP = READ_MEM.o [NEW_RMP_PTR]

  IF ((NEW_RMP.PAGE_SIZE == 2MB) && (SYSTEM_PA[20:12] != 0))  <-- not taken, 4kb entry
          EAX = FAIL_INPUT
          EXIT

  IF (!NEW_RMP.ASSIGNED && (NEW_RMP.IMMUTABLE || (NEW_RMP.ASID != 0))  <-- not taken, new entry valid
          EAX = FAIL_INPUT
          EXIT

  RMP_ENTRY_PA = RMP_BASE + 0x4000 + (SYSTEM_PA / 0x1000) * 16
  IF (RMP_ENTRY_PA > RMP_END)
          EAX = FAIL_INPUT
          EXIT

  // System address must have an RMP entry
  OLD_RMP = READ_MEM_PA.o [RMP_ENTRY_PA]
  IF (OLD_RMP.IMMUTABLE) <-- passes, private entry not immutable
          EAX = FAIL_PERMISSION
          EXIT

  IF (NEW_RMP.PAGE_SIZE == 4KB)
          IF ((SYSTEM_PA[20:12] == 0) && (OLD_RMP.PAGE_SIZE == 2MB)) <- not taken, PA[12] == 1
                  EAX = FAIL_OVERLAP
                  EXIT
  ELSE
          IF (Any 4KB RMP entry with (RMP.ASSIGNED == 1) exists in 2MB region)
                  EAX = FAIL_OVERLAP
                  EXIT
          ELSE
                  FOR (I = 1; I < 512, I++) {
                          temp_RMP = 0
                          temp_RMP.ASSIGNED = NEW_RMP.ASSIGNED
                          WRITE_MEM.o [RMP_ENTRY_PA + I * 16] = temp_RMP;
                  }


* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-17  0:46       ` Sean Christopherson
@ 2021-07-19 12:55         ` Brijesh Singh
  2021-07-19 17:18           ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-19 12:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/16/21 7:46 PM, Sean Christopherson wrote:

> takes the page size as a parameter even though it unconditionally zeros the page
> size flag in the RMP entry for unassigned pages.
>
> A wrapper around rmpupdate() would definitely help, e.g. (though level might need
> to be an "int" to avoid a bunch of casts).
>
>   int rmp_make_shared(u64 pfn, enum pg_level level);
>
> Wrappers for "private" and "firmware" would probably be helpful too.  And if you
> do that, I think you can bury both "struct rmpupdate", rmpupdate(), and
> X86_TO_RMP_PG_LEVEL() in arch/x86/kernel/sev.c.  snp_set_rmptable_state() might
> need some refactoring to avoid three booleans, but I guess maybe that could be
> an exception?  Not sure.  Anyways, was thinking something like:
>
>   int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid);
>   int rmp_make_firmware(u64 pfn);
>
> It would consolidate a bit of code, and more importantly it would give visual
> cues to the reader, e.g. it's easy to overlook "val = {0}" meaning "make shared".

Okay, I will add helpers to make things easier. One case where we will
need to call rmpupdate() directly is during the LAUNCH_UPDATE command.
In that case the page is private and its Immutable bit is also set,
because the firmware makes changes to the page and we are required to
set the Immutable bit before the call.


>
> Side topic, what happens if a firmware entry is configured with page_size=1?

It is not any different from the guest requesting a private page with
page_size=1. Some firmware commands require page_size=0, and others can
work with either page_size=1 or page_size=0.


>
> And one architectural question: what prevents a malicious VMM from punching a 4k
> shared page into a 2mb private page?  E.g.
>
>   rmpupdate(1 << 20, [private, 2mb]);
>   rmpupdate(1 << 20 + 4096, [shared, 4kb]);
>
> I don't see any checks in the pseudocode that will detect this, and presumably the
> whole point of a 2mb private RMP entry is to not have to go walk the individual
> 4kb entries on a private access.

I believe the pseudocode is not meant to be exactly accurate and
comprehensive; it is intended to summarize the HW behavior and explain
what can cause the different fault cases. In the real design there may
be separate checks to catch the above issue. I just tested on hardware
to ensure that the HW correctly detects the above error condition.
However, the pseudocode is missing a significant check (at least the
check that the 2M region is not already assigned). I have raised the
concern with the hardware team to look into updating the APM. Thank you
so much for bringing this up.

thanks


* Re: [PATCH Part2 RFC v4 33/40] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
  2021-07-16 21:00   ` Sean Christopherson
@ 2021-07-19 14:19     ` Brijesh Singh
  2021-07-19 18:55       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-19 14:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/16/21 4:00 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> +static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level)
> 
> I can live with e.g. GHCB_MSR_PSC_REQ, but I'd strongly prefer to spell this out,
> e.g. __snp_handle_page_state_change() or whatever.  I had a hell of a time figuring
> out what PSC was the first time I saw it in some random context.

Based on the previous review feedback I renamed
__snp_handle_page_state_change() to __snp_handle_psc(). I will see what
others say and rename accordingly.

> 
>> +{
>> +	struct kvm *kvm = vcpu->kvm;
>> +	int rc, tdp_level;
>> +	kvm_pfn_t pfn;
>> +	gpa_t gpa_end;
>> +
>> +	gpa_end = gpa + page_level_size(level);
>> +
>> +	while (gpa < gpa_end) {
>> +		/*
>> +		 * Get the pfn and level for the gpa from the nested page table.
>> +		 *
>> +		 * If the TDP walk failed, then it's safe to say that we don't have a valid
>> +		 * mapping for the gpa in the nested page table. Create a fault to map the
>> +		 * page in the nested page table.
>> +		 */
>> +		if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) {
>> +			pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_USER_MASK, level);
>> +			if (is_error_noslot_pfn(pfn))
>> +				goto out;
>> +
>> +			if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level))
>> +				goto out;
>> +		}
>> +
>> +		/* Adjust the level so that we don't go higher than the backing page level */
>> +		level = min_t(size_t, level, tdp_level);
>> +
>> +		write_lock(&kvm->mmu_lock);
> 
> Retrieving the PFN and level outside of mmu_lock is not correct.  Because the
> pages are pinned and the VMM is not malicious, it will function as intended, but
> it is far from correct.
> 

Good point, I should have retrieved the pfn and level inside the lock.

> The overall approach also feels wrong, e.g. a guest won't be able to convert a
> 2mb chunk back to a 2mb large page if KVM mapped the GPA as a 4kb page in the
> past (from a different conversion).
> 

Maybe I am missing something; I am not able to follow 'guest won't be
able to convert a 2mb chunk back to a 2mb large page'. The page-size
used inside the guest has no relationship with the RMP/NPT page-size;
e.g., a guest can validate the page range as 4k and still map the page
range as 2mb or 1gb in its page table.


> I'd also strongly prefer to have a common flow between SNP and TDX for converting
> between shared/private.
> 
> I'll circle back to this next week, it'll probably take a few hours of staring
> to figure out a solution, if a common one for SNP+TDX is even possible.
> 

Sounds good.

>> +
>> +		switch (op) {
>> +		case SNP_PAGE_STATE_SHARED:
>> +			rc = snp_make_page_shared(vcpu, gpa, pfn, level);
>> +			break;
>> +		case SNP_PAGE_STATE_PRIVATE:
>> +			rc = snp_make_page_private(vcpu, gpa, pfn, level);
>> +			break;
>> +		default:
>> +			rc = -EINVAL;
>> +			break;
>> +		}
>> +
>> +		write_unlock(&kvm->mmu_lock);
>> +
>> +		if (rc) {
>> +			pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n",
>> +					   op, gpa, pfn, level, rc);
>> +			goto out;
>> +		}
>> +
>> +		gpa = gpa + page_level_size(level);
>> +	}
>> +
>> +out:
>> +	return rc;
>> +}
>> +
>>   static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>>   {
>>   	struct vmcb_control_area *control = &svm->vmcb->control;
>> @@ -2941,6 +3063,25 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>>   				  GHCB_MSR_INFO_POS);
>>   		break;
>>   	}
>> +	case GHCB_MSR_PSC_REQ: {
>> +		gfn_t gfn;
>> +		int ret;
>> +		u8 op;
>> +
>> +		gfn = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_GFN_MASK, GHCB_MSR_PSC_GFN_POS);
>> +		op = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_OP_MASK, GHCB_MSR_PSC_OP_POS);
>> +
>> +		ret = __snp_handle_psc(vcpu, op, gfn_to_gpa(gfn), PG_LEVEL_4K);
>> +
>> +		/* If failed to change the state then spec requires to return all F's */
> 
> That doesn't mesh with what I could find:
> 
>    o 0x015 – SNP Page State Change Response
>      ▪ GHCBData[63:32] – Error code
>      ▪ GHCBData[31:12] – Reserved, must be zero
>    Written by the hypervisor in response to a Page State Change request. Any non-
>    zero value for the error code indicates that the page state change was not
>    successful.
> 
> And if "all Fs" is indeed the error code, 'int ret' probably only works by luck
> since the return value is a 64-bit value, where as ret is a 32-bit signed int.
> 
>> +		if (ret)
>> +			ret = -1;
> 
> Uh, this is fubar.   You've created a shadow of 'ret', i.e. the outer ret is likely
> uninitialized.
> 

Ah, let me fix it in next rev.
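A minimal sketch of what the corrected response encoding could look like, based only on the spec text quoted above (response 0x015 in the low GHCBData bits, error code in bits 63:32); the names here are hypothetical, not KVM's:

```c
#include <assert.h>
#include <stdint.h>

#define GHCB_MSR_PSC_RESP	0x015ULL
#define GHCB_MSR_PSC_ERROR_POS	32

/*
 * Build the SNP Page State Change response MSR value.  Per the quoted
 * spec: GHCBData[63:32] carries the error code (non-zero means the page
 * state change failed), bits 31:12 are reserved-zero, and the low bits
 * identify the response (0x015).  Using a 64-bit type throughout avoids
 * the 32-bit 'int ret' truncation noted in the review.
 */
static uint64_t snp_psc_resp(uint32_t error)
{
	return ((uint64_t)error << GHCB_MSR_PSC_ERROR_POS) | GHCB_MSR_PSC_RESP;
}
```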

thanks

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 34/40] KVM: SVM: Add support to handle Page State Change VMGEXIT
  2021-07-16 21:14   ` Sean Christopherson
@ 2021-07-19 14:24     ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-19 14:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/16/21 4:14 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> +static unsigned long snp_handle_psc(struct vcpu_svm *svm, struct ghcb *ghcb)
>> +{
>> +	struct kvm_vcpu *vcpu = &svm->vcpu;
>> +	int level, op, rc = PSC_UNDEF_ERR;
>> +	struct snp_psc_desc *info;
>> +	struct psc_entry *entry;
>> +	gpa_t gpa;
>> +
>> +	if (!sev_snp_guest(vcpu->kvm))
>> +		goto out;
>> +
>> +	if (!setup_vmgexit_scratch(svm, true, sizeof(ghcb->save.sw_scratch))) {
>> +		pr_err("vmgexit: scratch area is not setup.\n");
>> +		rc = PSC_INVALID_HDR;
>> +		goto out;
>> +	}
>> +
>> +	info = (struct snp_psc_desc *)svm->ghcb_sa;
>> +	entry = &info->entries[info->hdr.cur_entry];
> 
> Grabbing "entry" here is unnecessary and confusing.

Noted.

> 
>> +
>> +	if ((info->hdr.cur_entry >= VMGEXIT_PSC_MAX_ENTRY) ||
>> +	    (info->hdr.end_entry >= VMGEXIT_PSC_MAX_ENTRY) ||
>> +	    (info->hdr.cur_entry > info->hdr.end_entry)) {
> 
> There's a TOCTOU bug here if the guest uses the GHCB instead of a scratch area.
> If the guest uses the scratch area, then KVM makes a full copy into kernel memory.
> But if the guest uses the GHCB, then KVM maps the GHCB into kernel address space
> but doesn't make a full copy, i.e. the guest can modify the data while it's being
> processed by KVM.
> 
Sure, I can make a full copy of the page-state change buffer.


> IIRC, Peter and I discussed the sketchiness of the GHCB mapping offline a few
> times, but determined that there were no existing SEV-ES bugs because the guest
> could only submarine its own emulation request.  But here, it could coerce KVM
> into running off the end of a buffer.
> 
> I think you can get away with capturing cur_entry/end_entry locally, though
> copying the GHCB would be more robust.  That would also make the code a bit
> prettier, e.g.
> 
> 	cur = info->hdr.cur_entry;
> 	end = info->hdr.end_entry;
> 
>> +		rc = PSC_INVALID_ENTRY;
>> +		goto out;
>> +	}
>> +
>> +	while (info->hdr.cur_entry <= info->hdr.end_entry) {
> 
> Make this a for loop?

Sure, I can use a for loop. IIRC, in previous review feedback I got the
impression that while() was preferred in part 1, so I used a similar
approach here.
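A user-space sketch of the suggested shape: snapshot cur_entry/end_entry into locals before validating, then iterate with a bounds-checked for loop, so a guest rewriting the shared descriptor concurrently cannot move the loop bounds (the TOCTOU concern above). The struct layout and names here are illustrative, not the real GHCB definitions:

```c
#include <assert.h>
#include <stdint.h>

#define VMGEXIT_PSC_MAX_ENTRY	253	/* illustrative limit */
#define PSC_INVALID_ENTRY	0x100

struct psc_entry { uint64_t data; };

struct snp_psc_desc {
	struct {
		uint16_t cur_entry;
		uint16_t end_entry;
	} hdr;
	struct psc_entry entries[VMGEXIT_PSC_MAX_ENTRY];
};

static int seen;
static int count_entry(struct psc_entry *e)
{
	(void)e;
	seen++;
	return 0;
}

/*
 * Process a page-state-change descriptor.  The header fields are read
 * once into locals; even if the guest modifies the descriptor while it
 * is being processed, the loop bounds cannot change under us.
 */
static int process_psc(struct snp_psc_desc *desc,
		       int (*one_entry)(struct psc_entry *))
{
	uint16_t cur = desc->hdr.cur_entry;
	uint16_t end = desc->hdr.end_entry;
	int rc;

	if (cur >= VMGEXIT_PSC_MAX_ENTRY ||
	    end >= VMGEXIT_PSC_MAX_ENTRY || cur > end)
		return PSC_INVALID_ENTRY;

	for (; cur <= end; cur++) {
		rc = one_entry(&desc->entries[cur]);
		if (rc)
			return rc;
		desc->hdr.cur_entry = cur;	/* report progress to the guest */
	}
	return 0;
}
```

Copying the whole descriptor into kernel memory, as suggested above, would be the more robust variant of the same idea.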

> 
> 	for ( ; cur_entry < end_entry; cur_entry++)
> 
>> +		entry = &info->entries[info->hdr.cur_entry];
> 
> Does this need array_index_nospec() treatment?
> 

I don't think so.

thanks

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
  2021-07-16 22:48     ` Brijesh Singh
@ 2021-07-19 16:54       ` Sean Christopherson
  2021-07-19 18:29         ` Brijesh Singh
  2021-07-21 17:53         ` Marc Orr
  0 siblings, 2 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 16:54 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Fri, Jul 16, 2021, Brijesh Singh wrote:
> 
> On 7/16/21 3:18 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> >> +	data->gctx_paddr = __psp_pa(sev->snp_context);
> >> +	ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
> > Shouldn't KVM unwind everything it did if LAUNCH_FINISH fails?  And if that's
> > not possible, take steps to make the VM unusable?
> 
> Well, I am not sure the VM needs to unwind. If the command fails but the VMM
> decides to ignore the error, then VMRUN will probably fail and the user will
> get a KVM shutdown event. The LAUNCH_FINISH command finalizes the VM launch
> process; the firmware will probably not load the memory encryption keys until
> it moves to the running state.

Within reason, KVM needs to provide consistent, deterministic behavior.  Yes, more
than likely failure at this point will be fatal to the VM, but that doesn't justify
leaving the VM in a random/bogus state.  In addition to being a poor ABI, it also
makes it more difficult to reason about what is/isn't possible in KVM.

> >> +	 */
> >> +	if (sev_snp_guest(vcpu->kvm)) {
> >> +		struct rmpupdate e = {};
> >> +		int rc;
> >> +
> >> +		rc = rmpupdate(virt_to_page(svm->vmsa), &e);
> > So why does this not need to go through snp_page_reclaim()?
> 
> As I said in previous comments, by default all the memory is in the
> hypervisor state. If the rmpupdate() failed, that means nothing changed in
> the RMP and there is no need to reclaim. The reclaim is required only if the
> pages are assigned in the RMP table.

I wasn't referring to RMPUPDATE failing here (or anywhere).  This is the vCPU free
path, which I think means the svm->vmsa page was successfully updated in the RMP
during LAUNCH_UPDATE.  snp_launch_update_vmsa() goes through snp_page_reclaim()
on LAUNCH_UPDATE failure, whereas this happy path does not.  Is there some other
transition during teardown that obviates the need for reclaim?  If so, a comment
to explain that would be very helpful.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-19 12:55         ` Brijesh Singh
@ 2021-07-19 17:18           ` Sean Christopherson
  2021-07-19 18:34             ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 17:18 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Mon, Jul 19, 2021, Brijesh Singh wrote:
> 
> On 7/16/21 7:46 PM, Sean Christopherson wrote:
> 
> > takes the page size as a parameter even though it unconditionally zeros the page
> > size flag in the RMP entry for unassigned pages.
> >
> > A wrapper around rmpupdate() would definitely help, e.g. (though level might need
> > to be an "int" to avoid a bunch of casts).
> >
> >   int rmp_make_shared(u64 pfn, enum pg_level level);
> >
> > Wrappers for "private" and "firmware" would probably be helpful too.  And if you
> > do that, I think you can bury both "struct rmpupdate", rmpupdate(), and
> > X86_TO_RMP_PG_LEVEL() in arch/x86/kernel/sev.c.  snp_set_rmptable_state() might
> > need some refactoring to avoid three booleans, but I guess maybe that could be
> > an exception?  Not sure.  Anyways, was thinking something like:
> >
> >   int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, int asid);
> >   int rmp_make_firmware(u64 pfn);
> >
> > It would consolidate a bit of code, and more importantly it would give visual
> > cues to the reader, e.g. it's easy to overlook "val = {0}" meaning "make shared".
> 
> Okay, I will add a helper to make things easier. One case where we will
> need to call rmpupdate() directly is during the LAUNCH_UPDATE
> command. In that case the page is private and its immutable bit is also
> set. This is because the firmware makes changes to the page, and we are
> required to set the immutable bit before the call.

Or do "int rmp_make_firmware(u64 pfn, bool immutable)"?

> > And one architectural question: what prevents a malicious VMM from punching a 4k
> > shared page into a 2mb private page?  E.g.
> >
> >   rmpupdate(1 << 20, [private, 2mb]);
> >   rmpupdate(1 << 20 + 4096, [shared, 4kb]);
> >
> > I don't see any checks in the pseudocode that will detect this, and presumably the
> > whole point of a 2mb private RMP entry is to not have to go walk the individual
> > 4kb entries on a private access.
> 
> I believe pseudo-code is not meant to be exactly accurate and
> comprehensive, but it is intended to summarize the HW behavior and
> explain what can cause the different fault cases. In the real design we
> may have separate checks to catch the above issue. I just tested on
> the hardware to ensure that HW correctly detects the above error
> condition. However, in this case we are missing a significant check (at
> least the check that the 2M region is not already assigned). I have
> raised the concern with the hardware team to look into updating the APM.

Thanks!  While you have their ear, please emphasize the importance of the pseudocode
for us software folks.  It's perfectly ok to omit or gloss over microarchitectural
details, but ISA pseudocode is often the source of truth for behavior that is
architecturally visible.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
  2021-07-19 16:54       ` Sean Christopherson
@ 2021-07-19 18:29         ` Brijesh Singh
  2021-07-19 19:14           ` Sean Christopherson
  2021-07-21 17:53         ` Marc Orr
  1 sibling, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-19 18:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/19/21 11:54 AM, Sean Christopherson wrote:
>> As I said in previous comments, by default all the memory is in the
>> hypervisor state. If the rmpupdate() failed, that means nothing changed in
>> the RMP and there is no need to reclaim. The reclaim is required only if the
>> pages are assigned in the RMP table.
> 
> I wasn't referring to RMPUPDATE failing here (or anywhere).  This is the vCPU free
> path, which I think means the svm->vmsa page was successfully updated in the RMP
> during LAUNCH_UPDATE.  snp_launch_update_vmsa() goes through snp_page_reclaim()
> on LAUNCH_UPDATE failure, whereas this happy path does not.  Is there some other
> transition during teardown that obviates the need for reclaim?  If so, a comment
> to explain that would be very helpful.
> 

In this patch, the sev_free_vcpu() hunk takes care of reclaiming the
vmsa pages before releasing them. I think it will be more obvious after
I add a helper, so that we don't depend on the reader going through the
comment block to see what it's doing.

-Brijesh

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-19 17:18           ` Sean Christopherson
@ 2021-07-19 18:34             ` Brijesh Singh
  2021-07-19 19:03               ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-19 18:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/19/21 12:18 PM, Sean Christopherson wrote:
>>
>> Okay, I will add a helper to make things easier. One case where we will
>> need to call rmpupdate() directly is during the LAUNCH_UPDATE
>> command. In that case the page is private and its immutable bit is also
>> set. This is because the firmware makes changes to the page, and we are
>> required to set the immutable bit before the call.
> 
> Or do "int rmp_make_firmware(u64 pfn, bool immutable)"?
> 

That's not what we need.

We need 'rmp_make_private() + immutable' all in one RMPUPDATE.  Here is 
the snippet from SNP_LAUNCH_UPDATE.


+	/* Transition the page state to pre-guest */
+	memset(&e, 0, sizeof(e));
+	e.assigned = 1;
+	e.gpa = gpa;
+	e.asid = sev_get_asid(kvm);
+	e.immutable = true;
+	e.pagesize = X86_TO_RMP_PG_LEVEL(level);
+	ret = rmpupdate(inpages[i], &e);

thanks

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 33/40] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
  2021-07-19 14:19     ` Brijesh Singh
@ 2021-07-19 18:55       ` Sean Christopherson
  2021-07-19 19:15         ` Brijesh Singh
  2021-08-13 16:32         ` Borislav Petkov
  0 siblings, 2 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 18:55 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Mon, Jul 19, 2021, Brijesh Singh wrote:
> 
> On 7/16/21 4:00 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> > > +static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level)
> > 
> > I can live with e.g. GHCB_MSR_PSC_REQ, but I'd strongly prefer to spell this out,
> > e.g. __snp_handle_page_state_change() or whatever.  I had a hell of a time figuring
> > out what PSC was the first time I saw it in some random context.
> 
> Based on the previous review feedback I renamed
> __snp_handle_page_state_change() to __snp_handle_psc(). I will see what
> others say and rename accordingly.

I've no objection to using PSC for enums and whatnot, and I'll happily defer to
Boris for functions in the core kernel and guest, but for KVM I'd really like to
spell out the name for the two or so main handler functions.

> > > +	while (gpa < gpa_end) {
> > > +		/*
> > > +		 * Get the pfn and level for the gpa from the nested page table.
> > > +		 *
> > > +		 * If the TDP walk failed, then it's safe to say that we don't have a valid
> > > +		 * mapping for the gpa in the nested page table. Create a fault to map the
> > > +		 * page in the nested page table.
> > > +		 */
> > > +		if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) {
> > > +			pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_USER_MASK, level);
> > > +			if (is_error_noslot_pfn(pfn))
> > > +				goto out;
> > > +
> > > +			if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level))
> > > +				goto out;
> > > +		}
> > > +
> > > +		/* Adjust the level so that we don't go higher than the backing page level */
> > > +		level = min_t(size_t, level, tdp_level);
> > > +
> > > +		write_lock(&kvm->mmu_lock);
> > 
> > Retrieving the PFN and level outside of mmu_lock is not correct.  Because the
> > pages are pinned and the VMM is not malicious, it will function as intended, but
> > it is far from correct.
> 
> Good point, I should have retrieved the pfn and level inside the lock.
> 
> > The overall approach also feels wrong, e.g. a guest won't be able to convert a
> > 2mb chunk back to a 2mb large page if KVM mapped the GPA as a 4kb page in the
> > past (from a different conversion).
> > 
> 
> Maybe I am missing something; I am not able to follow 'guest won't be able
> to convert a 2mb chunk back to a 2mb large page'. The page-size used inside
> the guest has no relationship with the RMP/NPT page-size; e.g., a guest can
> validate the page range as 4k and still map the page range as 2mb or 1gb
> in its page table.

The proposed code walks KVM's TDP and adjusts the RMP level to be the min of the
guest+host levels.  Once KVM has installed a 4kb TDP SPTE, that walk will find
the 4kb TDP SPTE and thus operate on the RMP at a 4kb granularity.  To allow full
restoration of 2mb PTE+SPTE+RMP, KVM needs to zap the 4kb SPTE(s) at some point
to allow rebuilding a 2mb SPTE.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-19 18:34             ` Brijesh Singh
@ 2021-07-19 19:03               ` Sean Christopherson
  2021-07-19 19:14                 ` Sean Christopherson
  2021-07-19 19:37                 ` Brijesh Singh
  0 siblings, 2 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 19:03 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Mon, Jul 19, 2021, Brijesh Singh wrote:
> 
> On 7/19/21 12:18 PM, Sean Christopherson wrote:
> > > 
> > > Okay, I will add a helper to make things easier. One case where we will
> > > need to call rmpupdate() directly is during the LAUNCH_UPDATE
> > > command. In that case the page is private and its immutable bit is also
> > > set. This is because the firmware makes changes to the page, and we are
> > > required to set the immutable bit before the call.
> > 
> > Or do "int rmp_make_firmware(u64 pfn, bool immutable)"?
> 
> That's not what we need.
> 
> We need 'rmp_make_private() + immutable' all in one RMPUPDATE.  Here is the
> snippet from SNP_LAUNCH_UPDATE.

Ah, not firmware, gotcha.  But we can still use a helper, e.g. an inner
double-underscore helper, __rmp_make_private().

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-19 19:03               ` Sean Christopherson
@ 2021-07-19 19:14                 ` Sean Christopherson
  2021-07-19 19:37                 ` Brijesh Singh
  1 sibling, 0 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 19:14 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Mon, Jul 19, 2021, Sean Christopherson wrote:
> On Mon, Jul 19, 2021, Brijesh Singh wrote:
> > 
> > On 7/19/21 12:18 PM, Sean Christopherson wrote:
> > > > 
> > > > Okay, I will add a helper to make things easier. One case where we will
> > > > need to call rmpupdate() directly is during the LAUNCH_UPDATE
> > > > command. In that case the page is private and its immutable bit is also
> > > > set. This is because the firmware makes changes to the page, and we are
> > > > required to set the immutable bit before the call.
> > > 
> > > Or do "int rmp_make_firmware(u64 pfn, bool immutable)"?
> > 
> > That's not what we need.
> > 
> > We need 'rmp_make_private() + immutable' all in one RMPUPDATE.  Here is the
> > snippet from SNP_LAUNCH_UPDATE.
> 
> Ah, not firmware, gotcha.  But we can still use a helper, e.g. an inner
> double-underscore helper, __rmp_make_private().

Hmm, looking at it again, I think I also got confused by the comment for the VMSA
page:

	/* Transition the VMSA page to a firmware state. */
 	e.assigned = 1;
	e.immutable = 1;
	e.asid = sev->asid;
	e.gpa = -1;
	e.pagesize = RMP_PG_SIZE_4K;

Unlike __snp_alloc_firmware_pages() in the CCP code, the VMSA is associated with
the guest's ASID, just not a GPA.  I.e. the VMSA is more of a specialized guest
private page, as opposed to a dedicated firmware page.  So a __rmp_make_private()
and/or rmp_make_private_immutable() definitely seems like a good idea.
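Putting the two suggestions together, a user-space sketch of an inner double-underscore helper that exposes the immutable bit, with the common case layered on top (all names hypothetical, and the mock `rmpupdate()` replaces the real kernel routine):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RMP_PG_SIZE_4K	0

/* Illustrative model of the RMPUPDATE request, not the kernel structure. */
struct rmpupdate {
	bool assigned;
	bool immutable;
	uint32_t asid;
	uint64_t gpa;
	int pagesize;
};

static struct rmpupdate last;	/* mock: capture instead of issuing RMPUPDATE */

static int rmpupdate(uint64_t pfn, const struct rmpupdate *e)
{
	(void)pfn;
	last = *e;
	return 0;
}

/* Inner helper: all private-page transitions funnel through here. */
static int __rmp_make_private(uint64_t pfn, uint64_t gpa, int level,
			      uint32_t asid, bool immutable)
{
	struct rmpupdate e = {
		.assigned  = true,
		.immutable = immutable,
		.asid	   = asid,
		.gpa	   = gpa,
		.pagesize  = level,
	};

	return rmpupdate(pfn, &e);
}

/* Common case: guest-owned, mutable private page. */
static int rmp_make_private(uint64_t pfn, uint64_t gpa, int level,
			    uint32_t asid)
{
	return __rmp_make_private(pfn, gpa, level, asid, false);
}
```

LAUNCH_UPDATE and the VMSA case would then call `__rmp_make_private()` directly with `immutable` set, rather than open-coding the entry.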

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
  2021-07-19 18:29         ` Brijesh Singh
@ 2021-07-19 19:14           ` Sean Christopherson
  2021-07-19 19:49             ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 19:14 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Mon, Jul 19, 2021, Brijesh Singh wrote:
> 
> 
> On 7/19/21 11:54 AM, Sean Christopherson wrote:
> > > As I said in previous comments, by default all the memory is in the
> > > hypervisor state. If the rmpupdate() failed, that means nothing changed in
> > > the RMP and there is no need to reclaim. The reclaim is required only if the
> > > pages are assigned in the RMP table.
> > 
> > I wasn't referring to RMPUPDATE failing here (or anywhere).  This is the vCPU free
> > path, which I think means the svm->vmsa page was successfully updated in the RMP
> > during LAUNCH_UPDATE.  snp_launch_update_vmsa() goes through snp_page_reclaim()
> > on LAUNCH_UPDATE failure, whereas this happy path does not.  Is there some other
> > transition during teardown that obviates the need for reclaim?  If so, a comment
> > to explain that would be very helpful.
> > 
> 
> In this patch, the sev_free_vcpu() hunk takes care of reclaiming the vmsa
> pages before releasing them. I think it will be more obvious after I add
> a helper, so that we don't depend on the reader going through the comment
> block to see what it's doing.

Where?  I feel like I'm missing something.  The only change to sev_free_vcpu() I
see is that addition of the rmpupdate(), I don't see any reclaim path.

@@ -2346,8 +2454,25 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)

        if (vcpu->arch.guest_state_protected)
                sev_flush_guest_memory(svm, svm->vmsa, PAGE_SIZE);
+
+       /*
+        * If it's an SNP guest, then the VMSA was added in the RMP entry as a guest owned page.
+        * Transition the page to hypervisor state before releasing it back to the system.
+        */
+       if (sev_snp_guest(vcpu->kvm)) {
+               struct rmpupdate e = {};
+               int rc;
+
+               rc = rmpupdate(virt_to_page(svm->vmsa), &e);
+               if (rc) {
+                       pr_err("Failed to release SNP guest VMSA page (rc %d), leaking it\n", rc);
+                       goto skip_vmsa_free;
+               }
+       }
+
        __free_page(virt_to_page(svm->vmsa));

+skip_vmsa_free:
        if (svm->ghcb_sa_free)
                kfree(svm->ghcb_sa);
 }

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 33/40] KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
  2021-07-19 18:55       ` Sean Christopherson
@ 2021-07-19 19:15         ` Brijesh Singh
  2021-08-13 16:32         ` Borislav Petkov
  1 sibling, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-19 19:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/19/21 1:55 PM, Sean Christopherson wrote:
> 
> I've no objection to using PSC for enums and whatnot, and I'll happily defer to
> Boris for functions in the core kernel and guest, but for KVM I'd really like to
> spell out the name for the two or so main handler functions.

Noted.


>>
>> Maybe I am missing something; I am not able to follow 'guest won't be able
>> to convert a 2mb chunk back to a 2mb large page'. The page-size used inside
>> the guest has no relationship with the RMP/NPT page-size; e.g., a guest can
>> validate the page range as 4k and still map the page range as 2mb or 1gb
>> in its page table.
> 
> The proposed code walks KVM's TDP and adjusts the RMP level to be the min of the
> guest+host levels.  Once KVM has installed a 4kb TDP SPTE, that walk will find
> the 4kb TDP SPTE and thus operate on the RMP at a 4kb granularity.  To allow full
> restoration of 2mb PTE+SPTE+RMP, KVM needs to zap the 4kb SPTE(s) at some point
> to allow rebuilding a 2mb SPTE.
> 

Ah I see. In that case, the SNP firmware provides a command,
"SNP_PAGE_UNSMASH", that the hypervisor can use to combine multiple 4k
entries into a single 2mb entry without affecting the validation.
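
For illustration, here is a minimal userspace sketch of the precondition a
hypervisor would check before attempting such a merge: the 512 4k entries must
cover one naturally aligned 2mb region, all assigned to the same guest. The
struct and function names are hypothetical; the real RMP entry layout lives in
the firmware ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RMP_4K_PER_2MB 512

/* Hypothetical view of an RMP entry: only the fields this check needs. */
struct rmp_entry_view {
	uint64_t gpa;		/* guest physical address for this entry */
	bool assigned;
	uint32_t asid;
};

/*
 * Returns true when 512 contiguous 4k entries cover one naturally
 * aligned 2mb region for the same guest, i.e. when a single
 * SNP_PAGE_UNSMASH-style merge back to a 2mb entry is plausible.
 */
static bool can_unsmash_2mb(const struct rmp_entry_view *e)
{
	uint64_t base = e[0].gpa;
	int i;

	if (base & ((1ull << 21) - 1))	/* must be 2mb aligned */
		return false;

	for (i = 0; i < RMP_4K_PER_2MB; i++) {
		if (!e[i].assigned || e[i].asid != e[0].asid ||
		    e[i].gpa != base + (uint64_t)i * 4096)
			return false;
	}
	return true;
}
```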

-Brijesh


* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-19 19:03               ` Sean Christopherson
  2021-07-19 19:14                 ` Sean Christopherson
@ 2021-07-19 19:37                 ` Brijesh Singh
  2021-07-20 16:40                   ` Sean Christopherson
  1 sibling, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-19 19:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/19/21 2:03 PM, Sean Christopherson wrote:
> On Mon, Jul 19, 2021, Brijesh Singh wrote:
>>
>> On 7/19/21 12:18 PM, Sean Christopherson wrote:
>>>>
>>>> Okay, I will add a helper to make things easier. One case where we will
>>>> need to directly call rmpupdate() is during the LAUNCH_UPDATE
>>>> command. In that case the page is private and its immutable bit is also
>>>> set. This is because the firmware makes changes to the page, and we are
>>>> required to set the immutable bit before the call.
>>>
>>> Or do "int rmp_make_firmware(u64 pfn, bool immutable)"?
>>
>> That's not what we need.
>>
>> We need 'rmp_make_private() + immutable' all in one RMPUPDATE.  Here is the
>> snippet from SNP_LAUNCH_UPDATE.
> 
> Ah, not firmware, gotcha.  But we can still use a helper, e.g. an inner
> double-underscore helper, __rmp_make_private().
> 

In that case we are basically passing all the fields defined in
'struct rmpupdate' as individual arguments. How about something like this:

* core kernel exports the rmpupdate()
* the include/linux/sev.h header file defines the helper functions

   int rmp_make_private(u64 pfn, u64 gpa, int psize, int asid)
   int rmp_make_firmware(u64 pfn, int psize);
   int rmp_make_shared(u64 pfn, int psize);

In most cases the above three helpers are sufficient. If a driver finds that 
they do not fit its need (such as SNP_LAUNCH_UPDATE), it can call 
rmpupdate() directly without going through a helper.
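
To make the proposal concrete, here is a compilable sketch of how the helpers
could wrap a single rmpupdate() call. The field names mirror those used
elsewhere in this thread (assigned, gpa, asid, pagesize, immutable), but the
struct layout and the stubbed rmpupdate_stub() are stand-ins so the sketch
builds outside the kernel; the real function is the exported kernel rmpupdate().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's 'struct rmpupdate'; field names follow the
 * thread, the real layout is defined by the host SNP support. */
struct rmpupdate {
	bool assigned;
	uint64_t gpa;
	uint32_t asid;
	int pagesize;
	bool immutable;
};

static struct rmpupdate last_req;	/* stub: remember the last request */

static int rmpupdate_stub(uint64_t pfn, const struct rmpupdate *e)
{
	(void)pfn;
	last_req = *e;	/* a real implementation issues RMPUPDATE here */
	return 0;
}

static int rmp_make_private(uint64_t pfn, uint64_t gpa, int psize, uint32_t asid)
{
	struct rmpupdate e = { .assigned = true, .gpa = gpa,
			       .asid = asid, .pagesize = psize };
	return rmpupdate_stub(pfn, &e);
}

static int rmp_make_shared(uint64_t pfn, int psize)
{
	/* all-clear entry = hypervisor owned */
	struct rmpupdate e = { .pagesize = psize };
	return rmpupdate_stub(pfn, &e);
}
```

A caller that needs the immutable bit as well (the SNP_LAUNCH_UPDATE case
above) would fill the struct itself and call rmpupdate() directly.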

-Brijesh




* Re: [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
  2021-07-19 19:14           ` Sean Christopherson
@ 2021-07-19 19:49             ` Brijesh Singh
  2021-07-19 20:13               ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-19 19:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/19/21 2:14 PM, Sean Christopherson wrote:

> 
> Where?  I feel like I'm missing something.  The only change to sev_free_vcpu() I
> see is that addition of the rmpupdate(), I don't see any reclaim path.

Clearing of the immutable bit (aka reclaim) is done by the firmware 
after the command is successful. See section 8.14.2.1 of the 
SEV-SNP spec[1].

   The firmware encrypts the page with the VEK in place. The firmware
   sets the RMP.VMSA of the page to 1. The firmware sets the VMPL
   permissions for the page and transitions the page to Guest-Valid.

The Guest-Valid state means the immutable bit is cleared.  In this case,
the hypervisor just needs to make the page shared, and that is what 
sev_free_vcpu() does to ensure the page is transitioned from 
Guest-Valid to Hypervisor.

[1] https://www.amd.com/system/files/TechDocs/56860.pdf

thanks


* Re: [PATCH Part2 RFC v4 32/40] KVM: SVM: Add support to handle GHCB GPA register VMGEXIT
  2021-07-17  0:44     ` Brijesh Singh
@ 2021-07-19 20:04       ` Sean Christopherson
  0 siblings, 0 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 20:04 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Fri, Jul 16, 2021, Brijesh Singh wrote:
> 
> On 7/16/21 3:45 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> >> +	/* SEV-SNP guest requires that the GHCB GPA must be registered */
> >> +	if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) {
> >> +		vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa);
> > I saw this a few other place.  vcpu_unimpl() is not the right API.  KVM supports
> > the guest request, the problem is that the GHCB spec _requires_ KVM to terminate
> > the guest in this case.
> 
> What is the preferred method to log it so that someone debugging knows
> what went wrong?

Using the kernel log is probably a bad choice in general for this error.  Because
this and the other GHCB GPA sanity checks can be triggered from the guest, any
kernel logging needs to be ratelimited.  Ratelimiting is problematic because it
means some errors may not be logged; that's quite unlikely in this case, but it's
less than ideal.

The other issue is that KVM can't dump the RIP because guest state is encrypted,
e.g. KVM can provide the task PID, but that's it.

The best solution I can think of at the moment would be some form of
KVM_EXIT_INTERNAL_ERROR, i.e. kick out to userspace with a meaningful error code
and the bad GPA so that userspace can take action.
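
As a sketch of that idea, the handler could fill the vcpu's run structure with
an internal-error exit carrying the offending GPA. The struct below is a local
model of the relevant part of struct kvm_run, and the suberror constant is
hypothetical; KVM_EXIT_INTERNAL_ERROR itself is the existing uapi exit reason.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KVM_EXIT_INTERNAL_ERROR 17	/* value from include/uapi/linux/kvm.h */

/* Local model of the internal-error part of struct kvm_run. */
struct kvm_run_model {
	uint32_t exit_reason;
	struct {
		uint32_t suberror;
		uint32_t ndata;
		uint64_t data[16];
	} internal;
};

/* Hypothetical suberror code for an unregistered GHCB GPA. */
#define SUBERR_UNREGISTERED_GHCB_GPA 4

static void report_bad_ghcb_gpa(struct kvm_run_model *run, uint64_t ghcb_gpa)
{
	memset(run, 0, sizeof(*run));
	run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
	run->internal.suberror = SUBERR_UNREGISTERED_GHCB_GPA;
	run->internal.ndata = 1;
	run->internal.data[0] = ghcb_gpa;	/* userspace can log/act on it */
}
```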

I believe Jim also has some thoughts on how to improve "logging" of guest errors,
but he's on vacation for a few weeks.


* Re: [PATCH Part2 RFC v4 26/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
  2021-07-19 19:49             ` Brijesh Singh
@ 2021-07-19 20:13               ` Sean Christopherson
  0 siblings, 0 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 20:13 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Mon, Jul 19, 2021, Brijesh Singh wrote:
> 
> On 7/19/21 2:14 PM, Sean Christopherson wrote:
> > 
> > Where?  I feel like I'm missing something.  The only change to sev_free_vcpu() I
> > see is that addition of the rmpupdate(), I don't see any reclaim path.
> 
> Clearing of the immutable bit (aka reclaim) is done by the firmware after
> the command was successful.

Ah, which is why the failure path has to do manual reclaim of the immutable page.
Thanks!


* Re: [PATCH Part2 RFC v4 22/40] KVM: SVM: Add KVM_SNP_INIT command
  2021-07-16 21:25     ` Brijesh Singh
@ 2021-07-19 20:24       ` Sean Christopherson
  0 siblings, 0 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 20:24 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Fri, Jul 16, 2021, Brijesh Singh wrote:
> 
> On 7/16/21 2:33 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 3fd9a7e9d90c..989a64aa1ae5 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -1678,6 +1678,9 @@ enum sev_cmd_id {
> >>  	/* Guest Migration Extension */
> >>  	KVM_SEV_SEND_CANCEL,
> >>  
> >> +	/* SNP specific commands */
> >> +	KVM_SEV_SNP_INIT = 256,
> > Is there any meaning behind '256'?  If not, why skip a big chunk?  I wouldn't be
> > concerned if it weren't for KVM_SEV_NR_MAX, whose existence arguably implies that
> > 0-KVM_SEV_NR_MAX-1 are all valid SEV commands.
> 
> In previous patches, Peter highlighted that we should keep some gap
> between the SEV/ES and SNP commands to leave room for legacy SEV/ES
> expansion. I was not sure how many we need to reserve without knowing
> what will come in the future, especially since some recently added
> commands are not linked to the firmware. I am okay to reduce the gap
> or remove it altogether.

Unless the numbers themselves have meaning, which I don't think they do, I vote
to keep the arbitrary numbers contiguous.  KVM_SEV_NR_MAX makes me nervous, and
there are already cases of related commands being discontiguous, e.g. KVM_SEND_CANCEL.

Peter or Paolo, any thoughts?


* Re: [PATCH Part2 RFC v4 24/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
  2021-07-16 22:00     ` Brijesh Singh
@ 2021-07-19 20:51       ` Sean Christopherson
  2021-07-19 21:34         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 20:51 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Fri, Jul 16, 2021, Brijesh Singh wrote:
> 
> On 7/16/21 3:01 PM, Sean Christopherson wrote:
> > I'm having a bit of deja vu...  This flow needs to hold kvm->srcu to do a memslot
> > lookup.
> >
> > That said, IMO having KVM do the hva->gpa is not a great ABI.  The memslots are
> > completely arbitrary (from a certain point of view) and have no impact on the
> > validity of the memory pinning or PSP command.  E.g. a memslot update while this
> > code is in-flight would be all kinds of weird.
> >
> > In other words, make userspace provide both the hva (because it's sadly needed
> > to pin memory) as well as the target gpa.  That prevents KVM from having to deal
> > with memslot lookups and also means that userspace can issue the command before
> > configuring the memslots (though I've no idea if that's actually feasible for
> > any userspace VMM).
> 
> The operation happens during guest creation time, so I was not sure if a
> memslot would be updated while we are executing this command. But I guess
> it is possible that a VMM may run one thread that updates a memslot while
> another thread calls the encryption ioctl. I'll let userspace provide
> both the HVA and GPA as you recommended.

I'm not worried about a well-behaved userspace VMM, I'm worried about the code
KVM has to carry to guard against a misbehaving VMM.
 
> >> +			ret = -EINVAL;
> >> +			goto e_unpin;
> >> +		}
> >> +
> >> +		psize = page_level_size(level);
> >> +		pmask = page_level_mask(level);
> > Is there any hope of this path supporting 2mb/1gb pages in the not-too-distant
> > future?  If not, then I vote to do away with the indirection and just hardcode
> > 4kb sizes in the flow.  I.e. if this works on 4kb chunks, make that obvious.
> 
> No plans to do 1g/2mb in this path. I will make that obvious by
> hardcoding it.
> 
> 
> >> +		gpa = gpa & pmask;
> >> +
> >> +		/* Transition the page state to pre-guest */
> >> +		memset(&e, 0, sizeof(e));
> >> +		e.assigned = 1;
> >> +		e.gpa = gpa;
> >> +		e.asid = sev_get_asid(kvm);
> >> +		e.immutable = true;
> >> +		e.pagesize = X86_TO_RMP_PG_LEVEL(level);
> >> +		ret = rmpupdate(inpages[i], &e);
> > What happens if userspace pulls a stupid and assigns the same page to multiple
> > SNP guests?  Does RMPUPDATE fail?  Can one RMPUPDATE overwrite another?
> 
> The RMPUPDATE instruction is available to the hypervisor, which can call
> it anytime with whatever values it wants. The important thing is that the
> RMPUPDATE + PVALIDATE combination is what locks the page. In this case,
> the PSP firmware updates the RMP table and also validates the page.
> 
> If someone else attempts to issue another RMPUPDATE, the Validated bit
> will be cleared and the page can no longer be used as private. Access to
> an unvalidated page will cause a #VC.

Hmm, and there's no indication on success that the previous entry was assigned?
Adding a tracepoint in rmpupdate() to allow tracking transitions is probably a
good idea, otherwise debugging RMP violations and/or unexpected #VC is going to
be painful.

And/or if the kernel/KVM behavior is to never reassign directly and reading an RMP
entry isn't prohibitively expensive, then we could add a sanity check that the RMP
is unassigned and reject rmpupdate() if the page is already assigned.  Probably
not worth it if the overhead is noticeable, but it could be nice to have if things
go sideways.
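
A toy model of that sanity check, with the RMP reduced to a flat array: the
wrapper reads the current entry and refuses to assign a page that is already
assigned, forcing the caller to make it shared first. The names and error
values here are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPAGES 1024

struct rmp_state { bool assigned; uint32_t asid; };
static struct rmp_state rmp[NPAGES];	/* toy model of the RMP table */

/*
 * Assign a pfn to a guest, but reject the update when the entry is
 * already assigned: reassigning directly would silently steal the
 * page from another (or the same) guest.
 */
static int rmp_assign_checked(uint64_t pfn, uint32_t asid)
{
	if (pfn >= NPAGES)
		return -1;
	if (rmp[pfn].assigned)
		return -2;	/* caller must make the page shared first */
	rmp[pfn].assigned = true;
	rmp[pfn].asid = asid;
	return 0;
}
```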

> >> +e_unpin:
> >> +  /* Content of memory is updated, mark pages dirty */
> >> +  memset(&e, 0, sizeof(e));
> >> +  for (i = 0; i < npages; i++) {
> >> +          set_page_dirty_lock(inpages[i]);
> >> +          mark_page_accessed(inpages[i]);
> >> +
> >> +          /*
> >> +           * If its an error, then update RMP entry to change page ownership
> >> +           * to the hypervisor.
> >> +           */
> >> +          if (ret)
> >> +                  rmpupdate(inpages[i], &e);
> > This feels wrong since it's purging _all_ RMP entries, not just those that were
> > successfully modified.  And maybe add a RMP "reset" helper, e.g. why is zeroing
> > the RMP entry the correct behavior?
> 
> By default all the pages are hypervisor owned (i.e. zero). If the
> LAUNCH_UPDATE was successful then the page should have transitioned from
> hypervisor owned to Guest-Valid. By zeroing the entry we are reverting it
> back to hypervisor owned.
>
> I agree; I will optimize it to clear the modified entries only and leave
> everything else at the default.

To be clear, it's not just an optimization.  Pages that haven't yet been touched
may be already owned by a different VM (or even this VM).  I.e. "reverting" those
pages would actually result in a form of corruption.  It's somewhat of a moot point
because assigning a single page to multiple guests is going to be fatal anyways,
but potentially making a bug worse by introducing even more noise/confusion is not
good.


* Re: [PATCH Part2 RFC v4 24/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
  2021-07-19 20:51       ` Sean Christopherson
@ 2021-07-19 21:34         ` Brijesh Singh
  2021-07-19 21:36           ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-19 21:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/19/21 3:51 PM, Sean Christopherson wrote:
> 
> Hmm, and there's no indication on success that the previous entry was assigned?
> Adding a tracepoint in rmpupdate() to allow tracking transitions is probably a
> good idea, otherwise debugging RMP violations and/or unexpected #VC is going to
> be painful.
> 

Absolutely agree. It's on my TODO list for v5. I have been using private 
debug patches with all those tracepoints and will try to pull some of 
them into v5.

> And/or if the kernel/KVM behavior is to never reassign directly and reading an RMP
> entry isn't prohibitively expensive, then we could add a sanity check that the RMP
> is unassigned and reject rmpupdate() if the page is already assigned.  Probably
> not worth it if the overhead is noticeable, but it could be nice to have if things
> go sideways.
> 

In later patches you will see that during the page-state change I do try 
to read the RMP entry to detect some of these conditions and warn the user 
about them. The GHCB specification lets the hypervisor choose how it wants 
to handle the case of a guest wanting to add a previously validated page.

> 
> To be clear, it's not just an optimization.  Pages that haven't yet been touched
> may be already owned by a different VM (or even this VM).  I.e. "reverting" those
> pages would actually result in a form of corruption.  It's somewhat of a moot point
> because assigning a single page to multiple guests is going to be fatal anyways,
> but potentially making a bug worse by introducing even more noise/confusion is not
> good.
> 

As you said, if a process is assigning the same page to multiple VMs 
then it's fatal, but I agree that we should do the right thing in the 
kernel ioctl handling. I will clear the RMP entries only for the pages 
we touched.

thanks


* Re: [PATCH Part2 RFC v4 24/40] KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
  2021-07-19 21:34         ` Brijesh Singh
@ 2021-07-19 21:36           ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-19 21:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/19/21 4:34 PM, Brijesh Singh wrote:
> 
> 
> On 7/19/21 3:51 PM, Sean Christopherson wrote:
>>
>> Hmm, and there's no indication on success that the previous entry was 
>> assigned?

I missed commenting on this.

Yes, there is no hint that the page was previously assigned or validated.

thanks


* Re: [PATCH Part2 RFC v4 38/40] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 38/40] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event Brijesh Singh
@ 2021-07-19 22:50   ` Sean Christopherson
  2021-07-20 14:37     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 22:50 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> Version 2 of the GHCB specification added support for two SNP Guest Request
> Message NAE events. The events allow an SEV-SNP guest to make requests
> to the SEV-SNP firmware through the hypervisor using the SNP_GUEST_REQUEST
> API defined in the SEV-SNP firmware specification.

IIUC, this snippet in the spec means KVM can't restrict what requests are made
by the guests.  If so, that makes it difficult to detect/ratelimit a misbehaving
guest, and also limits our options if there are firmware issues (hopefully there
aren't).  E.g. ratelimiting a guest after KVM has explicitly requested it to
migrate is not exactly desirable.

  The hypervisor cannot alter the messages without detection nor read the
  plaintext of the messages.

> The SNP_GUEST_REQUEST requires two unique pages, one page for the request
> and one page for the response. The response page needs to be in the
> firmware state. The GHCB specification says that both pages need to be in
> the hypervisor state, but before executing the SEV-SNP command the
> response page needs to be in the firmware state.
 
...

> Now that KVM supports all the VMGEXIT NAEs required for the base SEV-SNP
> feature, set the hypervisor feature to advertise it.

It would be helpful if this changelog listed the Guest Requests that are required
for "base" SNP, e.g. to provide some insight as to why we care about guest
requests.

>  static int snp_bind_asid(struct kvm *kvm, int *error)
> @@ -1618,6 +1631,12 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  	if (rc)
>  		goto e_free_context;
>  
> +	/* Used for rate limiting SNP guest message request, use the default settings */
> +	ratelimit_default_init(&sev->snp_guest_msg_rs);

Is this exposed to userspace in any way?  This feels very much like a knob that
needs to be configurable per-VM.

Also, what are the estimated latencies of a guest request?  If the worst case
latency is >200ms, a default ratelimit frequency of 5hz isn't going to do a whole
lot.
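
For reference, ratelimit_default_init() uses the kernel defaults of a 5-second
interval with a burst of 10 (DEFAULT_RATELIMIT_INTERVAL / DEFAULT_RATELIMIT_BURST).
Below is a minimal fixed-window sketch of that scheme with an injectable clock,
to make the latency concern concrete: with a 5s/10 budget, any request slower
than ~500ms makes the limiter irrelevant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the kernel defaults: at most 10 events per 5-second window. */
struct ratelimit {
	uint64_t interval_ms;	/* DEFAULT_RATELIMIT_INTERVAL = 5 * HZ */
	int burst;		/* DEFAULT_RATELIMIT_BURST = 10 */
	uint64_t window_start;
	int used;
};

static bool ratelimit_ok(struct ratelimit *rl, uint64_t now_ms)
{
	if (now_ms - rl->window_start >= rl->interval_ms) {
		rl->window_start = now_ms;	/* open a fresh window */
		rl->used = 0;
	}
	if (rl->used >= rl->burst)
		return false;	/* over budget: caller fails the request */
	rl->used++;
	return true;
}
```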

> +static void snp_handle_guest_request(struct vcpu_svm *svm, struct ghcb *ghcb,
> +				     gpa_t req_gpa, gpa_t resp_gpa)
> +{
> +	struct sev_data_snp_guest_request data = {};
> +	struct kvm_vcpu *vcpu = &svm->vcpu;
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvm_sev_info *sev;
> +	int rc, err = 0;
> +
> +	if (!sev_snp_guest(vcpu->kvm)) {
> +		rc = -ENODEV;
> +		goto e_fail;
> +	}
> +
> +	sev = &to_kvm_svm(kvm)->sev_info;
> +
> +	if (!__ratelimit(&sev->snp_guest_msg_rs)) {
> +		pr_info_ratelimited("svm: too many guest message requests\n");
> +		rc = -EAGAIN;

What guarantee do we have that the guest actually understands -EAGAIN?  Ditto
for -EINVAL returned by snp_build_guest_buf().  AFAICT, our options are to return
one of the error codes defined in "Table 95. Status Codes for SNP_GUEST_REQUEST"
of the firmware ABI, kill the guest, or ratelimit the guest without returning
control to the guest.

> +		goto e_fail;
> +	}
> +
> +	rc = snp_build_guest_buf(svm, &data, req_gpa, resp_gpa);
> +	if (rc)
> +		goto e_fail;
> +
> +	sev = &to_kvm_svm(kvm)->sev_info;
> +
> +	mutex_lock(&kvm->lock);

Question on the VMPCK sequences.  The firmware ABI says:

   Each guest has four VMPCKs ... Each message contains a sequence number per
   VMPCK. The sequence number is incremented with each message sent. Messages
   sent by the guest to the firmware and by the firmware to the guest must be
   delivered in order. If not, the firmware will reject subsequent messages ...

Does that mean there are four independent sequences, i.e. four streams the guest
can use "concurrently", or does it mean the overall freshness/integrity check is
composed from four VMPCK sequences, all of which must be correct for the message
to be valid?

If it's the latter, then a traditional mutex isn't really necessary because the
guest must implement its own serialization, e.g. its own mutex or whatever, to
ensure there is at most one request in-flight at any given time.  And on the KVM
side it means KVM can simply reject requests if there is already an in-flight
request.  It might also give us more/better options for ratelimiting?
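
The in-order requirement quoted above can be sketched as one monotonic counter
per VMPCK: a message is accepted only when its sequence number is exactly one
more than the last accepted message on that key, and the four counters are
independent. This is a behavioral model only, not the firmware's actual check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_VMPCK 4

/* One independent sequence counter per VMPCK, as the spec describes. */
static uint64_t last_seq[NUM_VMPCK];

/*
 * Accept a guest message only if it is the next in-order message on
 * the given key; otherwise reject it, as the firmware would reject
 * out-of-order or replayed messages.
 */
static bool accept_msg(int vmpck, uint64_t seq)
{
	if (vmpck < 0 || vmpck >= NUM_VMPCK)
		return false;
	if (seq != last_seq[vmpck] + 1)
		return false;	/* out of order or replay: rejected */
	last_seq[vmpck] = seq;
	return true;
}
```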

> +	rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err);
> +	if (rc) {
> +		mutex_unlock(&kvm->lock);

I suspect you reused this pattern from other, more complex code, but here it's
overkill.  E.g.

	if (!rc)
		rc = kvm_write_guest(kvm, resp_gpa, sev->snp_resp_page, PAGE_SIZE);
	else if (err)
		rc = err;

	mutex_unlock(&kvm->lock);

	ghcb_set_sw_exit_info_2(ghcb, rc);

> +		/* If we have a firmware error code then use it. */
> +		if (err)
> +			rc = err;
> +
> +		goto e_fail;
> +	}
> +
> +	/* Copy the response after the firmware returns success. */
> +	rc = kvm_write_guest(kvm, resp_gpa, sev->snp_resp_page, PAGE_SIZE);
> +
> +	mutex_unlock(&kvm->lock);
> +
> +e_fail:
> +	ghcb_set_sw_exit_info_2(ghcb, rc);
> +}
> +
> +static void snp_handle_ext_guest_request(struct vcpu_svm *svm, struct ghcb *ghcb,
> +					 gpa_t req_gpa, gpa_t resp_gpa)
> +{
> +	struct sev_data_snp_guest_request req = {};
> +	struct kvm_vcpu *vcpu = &svm->vcpu;
> +	struct kvm *kvm = vcpu->kvm;
> +	unsigned long data_npages;
> +	struct kvm_sev_info *sev;
> +	unsigned long err;
> +	u64 data_gpa;
> +	int rc;
> +
> +	if (!sev_snp_guest(vcpu->kvm)) {
> +		rc = -ENODEV;
> +		goto e_fail;
> +	}
> +
> +	sev = &to_kvm_svm(kvm)->sev_info;
> +
> +	if (!__ratelimit(&sev->snp_guest_msg_rs)) {
> +		pr_info_ratelimited("svm: too many guest message requests\n");
> +		rc = -EAGAIN;
> +		goto e_fail;
> +	}
> +
> +	if (!sev->snp_certs_data) {
> +		pr_err("svm: certs data memory is not allocated\n");
> +		rc = -EFAULT;

Another instance where the kernel's error numbers will not suffice.

> +		goto e_fail;
> +	}
> +
> +	data_gpa = ghcb_get_rax(ghcb);
> +	data_npages = ghcb_get_rbx(ghcb);


* Re: [PATCH Part2 RFC v4 35/40] KVM: Add arch hooks to track the host write to guest memory
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 35/40] KVM: Add arch hooks to track the host write to guest memory Brijesh Singh
@ 2021-07-19 23:30   ` Sean Christopherson
  2021-07-20 15:15     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-19 23:30 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> The kvm_write_guest{_page} and kvm_vcpu_write_guest{_page} are used by
> the hypervisor to write to the guest memory. The kvm_vcpu_map() and
> kvm_map_gfn() are used by the hypervisor to map the guest memory and
> access it later.
> 
> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> be either private or shared. A write from the hypervisor goes through
> the RMP checks. If the hardware sees that the hypervisor is attempting
> to write to a guest private page, then it triggers an RMP violation
> (i.e. #PF with the RMP bit set).
> 
> Enhance the KVM guest write helpers to invoke architecture-specific
> hooks (kvm_arch_write_gfn_{begin,end}) to track write access from the
> hypervisor.
> 
> When SEV-SNP is enabled, the guest uses the PAGE_STATE vmgexit to ask the
> hypervisor to change the page state from shared to private or vice versa.
> While changing the page state to private, use
> kvm_host_write_track_is_active() to check whether the page is being
> tracked for host write access (i.e. either mapped or a kvm_write_guest
> is in progress). If it is tracked, then do not change the page state.
> 
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---

...

> @@ -3468,3 +3489,33 @@ int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
>  
>  	return min_t(uint32_t, level, max_level);
>  }
> +
> +void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn)
> +{
> +	struct rmpentry *e;
> +	int level, rc;
> +	kvm_pfn_t pfn;
> +
> +	if (!sev_snp_guest(kvm))
> +		return;
> +
> +	pfn = gfn_to_pfn(kvm, gfn);
> +	if (is_error_noslot_pfn(pfn))
> +		return;
> +
> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
> +	if (unlikely(!e))
> +		return;
> +
> +	/*
> > +	 * A hypervisor should never write to a guest private page. A write to a
> > +	 * guest private page will cause an RMP violation. If the guest page is
> > +	 * private, then make it shared.

NAK on converting RMP entries in response to guest accesses.  Corrupting guest
data (due to dropping the "validated" flag) on a rogue/incorrect guest emulation
request or misconfigured PV feature is double ungood.  The potential kernel panic
below isn't much better.

And I also don't think we need this heavyweight flow for user access, e.g.
__copy_to_user(), just eat the RMP violation #PF like all other #PFs and exit
to userspace with -EFAULT.

kvm_vcpu_map() and friends might need the manual lookup, at least initially, but
in an ideal world that would be naturally handled by gup(), e.g. by unmapping
guest private memory or whatever approach TDX ends up employing to avoid #MCs.

> +	 */
> +	if (rmpentry_assigned(e)) {
> +		pr_err("SEV-SNP: write to guest private gfn %llx\n", gfn);
> +		rc = snp_make_page_shared(kvm_get_vcpu(kvm, 0),
> +				gfn << PAGE_SHIFT, pfn, PG_LEVEL_4K);
> +		BUG_ON(rc != 0);
> +	}
> +}

...

> +void kvm_arch_write_gfn_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn)
> +{
> +	update_gfn_track(slot, gfn, KVM_PAGE_TRACK_WRITE, 1);

Tracking only writes isn't correct, as KVM reads to guest private memory will
return garbage.  Pulling the rug out from under KVM reads won't fail as
spectacularly as writes (at least not right away), but they'll still fail.  I'm
actually ok reading garbage if the guest screws up, but KVM needs consistent
semantics.

Good news is that per-gfn tracking is probably overkill anyways.  As mentioned
above, user accesses don't need extra magic, they either fail or they don't.

For kvm_vcpu_map(), one thought would be to add a "short-term" map variant that
is not allowed to be retained across VM-Entry, and then use e.g. SRCU to block
PSC requests until there are no consumers.
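
A toy model of that gating idea, with a plain reader counter standing in for
SRCU: short-term mappers take a reference, and a page-state change may proceed
only when no short-term mapping is live. All names here are illustrative; real
code would sleep on the grace period rather than poll, and would need proper
synchronization.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: not thread-safe; SRCU would provide the real grace period. */
static int short_term_users;

static void short_term_map_begin(void) { short_term_users++; }
static void short_term_map_end(void)   { short_term_users--; }

/* A PSC request may proceed only with no short-term mapping outstanding. */
static bool psc_may_proceed(void)
{
	return short_term_users == 0;
}
```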

> +	if (kvm_x86_ops.write_page_begin)
> +		kvm_x86_ops.write_page_begin(kvm, slot, gfn);
> +}


* Re: [PATCH Part2 RFC v4 37/40] KVM: SVM: Add support to handle the RMP nested page fault
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 37/40] KVM: SVM: Add support to handle the RMP nested page fault Brijesh Singh
@ 2021-07-20  0:10   ` Sean Christopherson
  2021-07-20 17:55     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-20  0:10 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> Follow the recommendation from APM2 section 15.36.10 and 15.36.11 to
> resolve the RMP violation encountered during the NPT table walk.

Heh, please elaborate on exactly what that recommendation is.  A recommendation
isn't exactly architectural, i.e. is subject to change :-)

And, do we have to follow the APM's recommendation?  Specifically, can KVM treat
#NPF RMP violations as guest errors, or is that not allowed by the GHCB spec?
I.e. can we mandate accesses be preceded by page state change requests?  It would
simplify KVM (albeit not much of a simplification) and would also make debugging
easier since transitions would require an explicit guest request and guest bugs
would result in errors instead of random corruption/weirdness.

> index 46323af09995..117e2e08d7ed 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1399,6 +1399,9 @@ struct kvm_x86_ops {
>  
>  	void (*write_page_begin)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
>  	void (*write_page_end)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
> +
> +	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
> +			int level, u64 error_code);
>  };
>  
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e60f54455cdc..b6a676ba1862 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5096,6 +5096,18 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>  	write_unlock(&vcpu->kvm->mmu_lock);
>  }
>  
> +static int handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
> +{
> +	kvm_pfn_t pfn;
> +	int level;
> +
> +	if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &level)))
> +		return RET_PF_RETRY;
> +
> +	kvm_x86_ops.handle_rmp_page_fault(vcpu, gpa, pfn, level, error_code);
> +	return RET_PF_RETRY;
> +}
> +
>  int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
>  		       void *insn, int insn_len)
>  {
> @@ -5112,6 +5124,14 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
>  			goto emulate;
>  	}
>  
> +	if (unlikely(error_code & PFERR_GUEST_RMP_MASK)) {
> +		r = handle_rmp_page_fault(vcpu, cr2_or_gpa, error_code);

Adding a kvm_x86_ops hook is silly, there's literally one path, npf_interception()
that can encounter RMP violations.  Just invoke snp_handle_rmp_page_fault() from
there.  That works even if kvm_mmu_get_tdp_walk() stays around since it was
exported earlier.

> +		if (r == RET_PF_RETRY)
> +			return 1;
> +		else
> +			return r;
> +	}
> +
>  	if (r == RET_PF_INVALID) {
>  		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
>  					  lower_32_bits(error_code), false);
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 839cf321c6dd..53a60edc810e 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3519,3 +3519,60 @@ void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn
>  		BUG_ON(rc != 0);
>  	}
>  }
> +
> +int snp_handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
> +			      int level, u64 error_code)
> +{
> +	struct rmpentry *e;
> +	int rlevel, rc = 0;
> +	bool private;
> +	gfn_t gfn;
> +
> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rlevel);
> +	if (!e)
> +		return 1;
> +
> +	private = !!(error_code & PFERR_GUEST_ENC_MASK);
> +
> +	/*
> +	 * See APM section 15.36.11 on how to handle the RMP fault for the large pages.

Please do not punt the reader to the APM for things like this.  It's ok when there
are gory details about CPU behavior that aren't worth commenting, but under no
circumstance should KVM's software implementation be "documented" in a CPU spec.

> +	 *
> +	 *  npt	     rmp    access      action
> +	 *  --------------------------------------------------
> +	 *  4k       2M     C=1       psmash
> +	 *  x        x      C=1       if page is not private then add a new RMP entry
> +	 *  x        x      C=0       if page is private then make it shared
> +	 *  2M       4k     C=x       zap
> +	 */
> +	if ((error_code & PFERR_GUEST_SIZEM_MASK) ||
> +	    ((level == PG_LEVEL_4K) && (rlevel == PG_LEVEL_2M) && private)) {
> +		rc = snp_rmptable_psmash(vcpu, pfn);
> +		goto zap_gfn;
> +	}
> +
> +	/*
> +	 * If it's a private access, and the page is not assigned in the RMP table, create a
> +	 * new private RMP entry.
> +	 */
> +	if (!rmpentry_assigned(e) && private) {
> +		rc = snp_make_page_private(vcpu, gpa, pfn, PG_LEVEL_4K);
> +		goto zap_gfn;
> +	}
> +
> +	/*
> +	 * If it's a shared access, then make the page shared in the RMP table.
> +	 */
> +	if (rmpentry_assigned(e) && !private)
> +		rc = snp_make_page_shared(vcpu, gpa, pfn, PG_LEVEL_4K);

Hrm, this really feels like it needs to be protected by mmu_lock.  Functionally,
it might all work out in the end after enough RMP violations, but it's extremely
difficult to reason about and probably even more difficult if multiple vCPUs end
up fighting over a gfn.

My gut reaction is that this is also backwards, i.e. KVM should update the RMP
to match its TDP SPTEs, not the other way around.

The one big complication is that the TDP MMU only takes mmu_lock for read.  A few
options come to mind but none of them are all that pretty.  I'll wait to hear back
on whether or not we can make PSC request mandatory before thinking too hard on
this one.

> +zap_gfn:
> +	/*
> +	 * Now that we have updated the RMP pagesize, zap the existing rmaps for
> +	 * large entry ranges so that nested page table gets rebuilt with the updated RMP
> +	 * pagesize.
> +	 */
> +	gfn = gpa_to_gfn(gpa) & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
> +	kvm_zap_gfn_range(vcpu->kvm, gfn, gfn + 512);
> +
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 38/40] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
  2021-07-19 22:50   ` Sean Christopherson
@ 2021-07-20 14:37     ` Brijesh Singh
  2021-07-20 16:28       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-20 14:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/19/21 5:50 PM, Sean Christopherson wrote:
...
> 
> IIUC, this snippet in the spec means KVM can't restrict what requests are made
> by the guests.  If so, that makes it difficult to detect/ratelimit a misbehaving
> guest, and also limits our options if there are firmware issues (hopefully there
> aren't).  E.g. ratelimiting a guest after KVM has explicitly requested it to
> migrate is not exactly desirable.
> 

The guest message page contains a message header followed by the 
encrypted payload. So, technically KVM can peek into the message header 
format to determine the message request type. If needed, we can 
ratelimit based on the message type.

In the current series we don't support migration etc so I decided to 
ratelimit unconditionally.

...
> 
>> Now that KVM supports all the VMGEXIT NAEs required for the base SEV-SNP
>> feature, set the hypervisor feature to advertise it.
> 
> It would helpful if this changelog listed the Guest Requests that are required
> for "base" SNP, e.g. to provide some insight as to why we care about guest
> requests.
> 

Sure, I'll add more.


>>   static int snp_bind_asid(struct kvm *kvm, int *error)
>> @@ -1618,6 +1631,12 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>   	if (rc)
>>   		goto e_free_context;
>>   
>> +	/* Used for rate limiting SNP guest message request, use the default settings */
>> +	ratelimit_default_init(&sev->snp_guest_msg_rs);
> 
> Is this exposed to userspace in any way?  This feels very much like a knob that
> needs to be configurable per-VM.
> 

It's not exposed to userspace and I am not sure if userspace cares 
about this knob.


> Also, what are the estimated latencies of a guest request?  If the worst case
> latency is >200ms, a default ratelimit frequency of 5hz isn't going to do a whole
> lot.
> 

The latency will depend on what else is going on in the system at the time 
the request comes to the hypervisor. Access to the PSP is serialized so 
other parallel PSP command execution will contribute to the latency.

...
>> +
>> +	if (!__ratelimit(&sev->snp_guest_msg_rs)) {
>> +		pr_info_ratelimited("svm: too many guest message requests\n");
>> +		rc = -EAGAIN;
> 
> What guarantee do we have that the guest actually understands -EAGAIN?  Ditto
> for -EINVAL returned by snp_build_guest_buf().  AFAICT, our options are to return
> one of the error codes defined in "Table 95. Status Codes for SNP_GUEST_REQUEST"
> of the firmware ABI, kill the guest, or ratelimit the guest without returning
> control to the guest.
> 

Yes, let me look into passing one of the status code defined in the spec.

>> +		goto e_fail;
>> +	}
>> +
>> +	rc = snp_build_guest_buf(svm, &data, req_gpa, resp_gpa);
>> +	if (rc)
>> +		goto e_fail;
>> +
>> +	sev = &to_kvm_svm(kvm)->sev_info;
>> +
>> +	mutex_lock(&kvm->lock);
> 
> Question on the VMPCK sequences.  The firmware ABI says:
> 
>     Each guest has four VMPCKs ... Each message contains a sequence number per
>     VMPCK. The sequence number is incremented with each message sent. Messages
>     sent by the guest to the firmware and by the firmware to the guest must be
>     delivered in order. If not, the firmware will reject subsequent messages ...
> 
> Does that mean there are four independent sequences, i.e. four streams the guest
>     can use "concurrently", or does it mean the overall freshness/integrity check is
> composed from four VMPCK sequences, all of which must be correct for the message
> to be valid?
> 

There are four independent sequence counters and in theory the guest can 
use them concurrently. But access to the PSP must be serialized. 
Currently, the guest driver uses the VMPCK0 key to communicate with the PSP.


> If it's the latter, then a traditional mutex isn't really necessary because the
> guest must implement its own serialization, e.g. it's own mutex or whatever, to
> ensure there is at most one request in-flight at any given time.  

The guest driver uses its own serialization to ensure that there is 
*exactly* one request in-flight.

The mutex used here is to protect the KVM's internal firmware response 
buffer.


> And on the KVM side it means KVM can simply reject requests if there is
> already an in-flight request.  It might also give us more/better options
> for ratelimiting?
> 

I don't think we should be running into this scenario unless there is a 
bug in the guest kernel. The guest kernel support and CCP driver both 
ensure that requests to the PSP are serialized.

In normal operation we may see 1 to 2 guest requests for the entire 
guest lifetime. I am thinking first request maybe for the attestation 
report and second maybe to derive keys etc. It may change slightly when 
we add the migration command; I have not looked into a great detail yet.

thanks

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 35/40] KVM: Add arch hooks to track the host write to guest memory
  2021-07-19 23:30   ` Sean Christopherson
@ 2021-07-20 15:15     ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-20 15:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/19/21 6:30 PM, Sean Christopherson wrote:
...
> NAK on converting RMP entries in response to guest accesses.  Corrupting guest
> data (due to dropping the "validated" flag) on a rogue/incorrect guest emulation
> request or misconfigured PV feature is double ungood.  The potential kernel panic
> below isn't much better.
> 

I also debated whether it's okay to transition the page state to 
shared to complete the write operation. I am good with removing the 
RMP-entry conversion from the patch, and that will also remove the 
kernel panic code.


> And I also don't think we need this heavyweight flow for user access, e.g.
> __copy_to_user(), just eat the RMP violation #PF like all other #PFs and exit
> to userspace with -EFAULT.
>

Yes, we could improve __copy_to_user() to eat the RMP violation. I 
was tempted to go down that path but struggled to find a strong reason 
for it and was not sure if it would be accepted. I can add that support 
in the next rev.



> kvm_vcpu_map() and friends might need the manual lookup, at least initially, 

Yes, the enhancement to __copy_to_user() does not solve this problem, 
and for it we need to do the manual lookup.


> but in an ideal world that would be naturally handled by gup(), e.g. by unmapping
> guest private memory or whatever approach TDX ends up employing to avoid #MCs.

> 
>> +	 */
>> +	if (rmpentry_assigned(e)) {
>> +		pr_err("SEV-SNP: write to guest private gfn %llx\n", gfn);
>> +		rc = snp_make_page_shared(kvm_get_vcpu(kvm, 0),
>> +				gfn << PAGE_SHIFT, pfn, PG_LEVEL_4K);
>> +		BUG_ON(rc != 0);
>> +	}
>> +}
> 
> ...
> 
>> +void kvm_arch_write_gfn_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn)
>> +{
>> +	update_gfn_track(slot, gfn, KVM_PAGE_TRACK_WRITE, 1);
> 
> Tracking only writes isn't correct, as KVM reads to guest private memory will
> return garbage.  Pulling the rug out from under KVM reads won't fail as
> spectacularly as writes (at least not right away), but they'll still fail.  I'm
> actually ok reading garbage if the guest screws up, but KVM needs consistent
> semantics.
> 
> Good news is that per-gfn tracking is probably overkill anyways.  As mentioned
> above, user accesses don't need extra magic, they either fail or they don't.
> 
> For kvm_vcpu_map(), one thought would be to add a "short-term" map variant that
> is not allowed to be retained across VM-Entry, and then use e.g. SRCU to block
> PSC requests until there are no consumers.
> 

Sounds good to me, i will add the support in the next rev.

thanks

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 38/40] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
  2021-07-20 14:37     ` Brijesh Singh
@ 2021-07-20 16:28       ` Sean Christopherson
  2021-07-20 18:21         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-20 16:28 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Tue, Jul 20, 2021, Brijesh Singh wrote:
> 
> On 7/19/21 5:50 PM, Sean Christopherson wrote:
> ...
> > 
> > IIUC, this snippet in the spec means KVM can't restrict what requests are made
> > by the guests.  If so, that makes it difficult to detect/ratelimit a misbehaving
> > guest, and also limits our options if there are firmware issues (hopefully there
> > aren't).  E.g. ratelimiting a guest after KVM has explicitly requested it to
> > migrate is not exactly desirable.
> > 
> 
> The guest message page contains a message header followed by the encrypted
> payload. So, technically KVM can peek into the message header format to
> determine the message request type. If needed, we can ratelimit based on the
> message type.

Ah, I got confused by this code in snp_build_guest_buf():

	data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT);

I was thinking that setting the C-bit meant the memory was guest private, but
that's setting the C-bit for the HPA, which is correct since KVM installs guest
memory with C-bit=1 in the NPT, i.e. encrypts shared memory with the host key.

Tangentially related question, is it correct to say that the host can _read_ memory
from a page that is assigned=1, but has asid=0?  I.e. KVM can read the response
page in order to copy it into the guest, even though it is a firmware page?

	/* Copy the response after the firmware returns success. */
	rc = kvm_write_guest(kvm, resp_gpa, sev->snp_resp_page, PAGE_SIZE);

> In the current series we don't support migration etc so I decided to
> ratelimit unconditionally.

Since KVM can peek at the request header, KVM should flat out disallow requests
that KVM doesn't explicitly support.  E.g. migration requests should not be sent
to the PSP.

One concern though: How does the guest query what requests are supported?  This
snippet implies there's some form of enumeration:

  Note: This guest message may be removed in future versions as it is redundant
  with the CPUID page in SNP_LAUNCH_UPDATE (see Section 8.14).

But all I can find is a "Message Version" in "Table 94. Message Type Encodings",
which implies that request support is all or nothing for a given version.  That
would be rather unfortunate as KVM has no way to tell the guest that something
is unsupported :-(

> > Is this exposed to userspace in any way?  This feels very much like a knob that
> > needs to be configurable per-VM.
> 
> It's not exposed to userspace and I am not sure if userspace cares about
> this knob.

Userspace definitely cares, otherwise the system would need to be rebooted just to
tune the ratelimiting.  And userspace may want to disable ratelimiting entirely,
e.g. if the entire system is dedicated to a single VM.

> > Also, what are the estimated latencies of a guest request?  If the worst case
> > latency is >200ms, a default ratelimit frequency of 5hz isn't going to do a whole
> > lot.
> > 
> 
> The latency will depend on what else is going in the system at the time the
> request comes to the hypervisor. Access to the PSP is serialized so other
> parallel PSP command execution will contribute to the latency.

I get that it will be variable, but what are some ballpark latencies?  E.g. what's
the latency of the slowest command without PSP contention?

> > Question on the VMPCK sequences.  The firmware ABI says:
> > 
> >     Each guest has four VMPCKs ... Each message contains a sequence number per
> >     VMPCK. The sequence number is incremented with each message sent. Messages
> >     sent by the guest to the firmware and by the firmware to the guest must be
> >     delivered in order. If not, the firmware will reject subsequent messages ...
> > 
> > Does that mean there are four independent sequences, i.e. four streams the guest
> >     can use "concurrently", or does it mean the overall freshness/integrity check is
> > composed from four VMPCK sequences, all of which must be correct for the message
> > to be valid?
> > 
> 
> There are four independent sequence counters and in theory the guest can use them
> concurrently. But the access to the PSP must be serialized.

Technically that's not required from the guest's perspective, correct?  The guest
only cares about the sequence numbers for a given VMPCK, e.g. it can have one
in-flight request per VMPCK and expect that to work, even without fully serializing
its own requests.

Out of curiosity, why 4 VMPCKs?  It seems completely arbitrary.

> Currently, the guest driver uses the VMPCK0 key to communicate with the PSP.
> 
> 
> > If it's the latter, then a traditional mutex isn't really necessary because the
> > guest must implement its own serialization, e.g. it's own mutex or whatever, to
> > ensure there is at most one request in-flight at any given time.
> 
> The guest driver uses its own serialization to ensure that there is
> *exactly* one request in-flight.

But KVM can't rely on that because it doesn't control the guest, e.g. it may be
running a non-Linux guest.

> The mutex used here is to protect the KVM's internal firmware response
> buffer.

Ya, where I was going with my question was that if the guest was architecturally
restricted to a single in-flight request, then KVM could do something like this
instead of taking kvm->lock (bad pseudocode):

	if (test_and_set(sev->guest_request)) {
		rc = AEAD_OFLOW;
		goto fail;
	}

	<do request>

	clear_bit(...)

I.e. multiple in-flight requests can't work because the guest can guarantee
ordering between vCPUs.  But, because the guest can theoretically have up to four
in-flight requests, it's not that simple.

The reason I'm going down this path is that taking kvm->lock inside vcpu->mutex
violates KVM's locking rules, i.e. is susceptible to deadlocks.  Per kvm/locking.rst,

  - kvm->lock is taken outside vcpu->mutex

That means a different mutex is needed to protect the guest request pages.

	
> > And on the KVM side it means KVM can simpy reject requests if there is
> > already an in-flight request.  It might also give us more/better options
> > for ratelimiting?
> 
> I don't think we should be running into this scenario unless there is a bug
> in the guest kernel. The guest kernel support and CCP driver both ensure
> that request to the PSP is serialized.

Again, what the Linux kernel does is irrelevant.  What matters is what is
architecturally allowed.

> In normal operation we may see 1 to 2 guest requests for the entire guest
> lifetime. I am thinking first request maybe for the attestation report and
> second maybe to derive keys etc. It may change slightly when we add the
> migration command; I have not looked into a great detail yet.



^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-19 19:37                 ` Brijesh Singh
@ 2021-07-20 16:40                   ` Sean Christopherson
  2021-07-20 18:23                     ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-20 16:40 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Mon, Jul 19, 2021, Brijesh Singh wrote:
> 
> On 7/19/21 2:03 PM, Sean Christopherson wrote:
> > On Mon, Jul 19, 2021, Brijesh Singh wrote:
> > Ah, not firmwrare, gotcha.  But we can still use a helper, e.g. an inner
> > double-underscore helper, __rmp_make_private().
> 
> In that case we are basically passing all the fields defined in the
> 'struct rmpupdate' as individual arguments.

Yes, but (a) not _all_ fields, (b) it would allow hiding "struct rmpupdate", and
(c) this is much friendlier to readers:

	__rmp_make_private(pfn, gpa, PG_LEVEL_4K, svm->asid, true);

than:

	rmpupdate(&rmpupdate);

For the former, I can see in a single line of code that KVM is creating a 4k
private, immutable guest page.  With the latter, I need to go hunt down all code
that modifies rmpupdate to understand what the code is doing.

> How about something like this:
> 
> * core kernel exports the rmpupdate()
> * the include/linux/sev.h header file defines the helper functions
> 
>   int rmp_make_private(u64 pfn, u64 gpa, int psize, int asid)

I think we'll want s/psize/level, i.e. make it more obvious that the input
is PG_LEVEL_*.  

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 37/40] KVM: SVM: Add support to handle the RMP nested page fault
  2021-07-20  0:10   ` Sean Christopherson
@ 2021-07-20 17:55     ` Brijesh Singh
  2021-07-20 22:31       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-20 17:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/19/21 7:10 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> Follow the recommendation from APM2 section 15.36.10 and 15.36.11 to
>> resolve the RMP violation encountered during the NPT table walk.
> 
> Heh, please elaborate on exactly what that recommendation is.  A recommendation
> isn't exactly architectural, i.e. is subject to change :-)

I will try to expand it :)

> 
> And, do we have to follow the APM's recommendation?  

Yes, unless we want to be very strict on what a guest can do.


> Specifically, can KVM treat #NPF RMP violations as guest errors, or is that
> not allowed by the GHCB spec?

The GHCB spec does not say anything about the #NPF RMP violation error. 
And not all #NPF RMP violations are guest errors (e.g. the size-mismatch 
cases).

> I.e. can we mandate accesses be preceded by page state change requests?  

This is a good question, the GHCB spec does not enforce that a guest 
*must* use page state changes. If a page state change is not done by the 
guest then it will cause a #NPF and it's up to the hypervisor to decide 
what it wants to do.


> It would simplify KVM (albeit not much of a simplification) and would also make debugging
> easier since transitions would require an explicit guest request and guest bugs
> would result in errors instead of random corruption/weirdness.
> 

I am good with enforcing this from the KVM. But the question is, what 
fault should we inject into the guest when KVM detects an access for 
which the guest has not issued a page state change.


>> index 46323af09995..117e2e08d7ed 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1399,6 +1399,9 @@ struct kvm_x86_ops {
>>   
>>   	void (*write_page_begin)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
>>   	void (*write_page_end)(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
>> +
>> +	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
>> +			int level, u64 error_code);
>>   };
>>   
>>   struct kvm_x86_nested_ops {
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index e60f54455cdc..b6a676ba1862 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5096,6 +5096,18 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>>   	write_unlock(&vcpu->kvm->mmu_lock);
>>   }
>>   
>> +static int handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
>> +{
>> +	kvm_pfn_t pfn;
>> +	int level;
>> +
>> +	if (unlikely(!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &level)))
>> +		return RET_PF_RETRY;
>> +
>> +	kvm_x86_ops.handle_rmp_page_fault(vcpu, gpa, pfn, level, error_code);
>> +	return RET_PF_RETRY;
>> +}
>> +
>>   int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
>>   		       void *insn, int insn_len)
>>   {
>> @@ -5112,6 +5124,14 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
>>   			goto emulate;
>>   	}
>>   
>> +	if (unlikely(error_code & PFERR_GUEST_RMP_MASK)) {
>> +		r = handle_rmp_page_fault(vcpu, cr2_or_gpa, error_code);
> 
> Adding a kvm_x86_ops hook is silly, there's literally one path, npf_interception()
> that can encounter RMP violations.  Just invoke snp_handle_rmp_page_fault() from
> there.  That works even if kvm_mmu_get_tdp_walk() stays around since it was
> exported earlier.
> 

Noted.



>> +
>> +	/*
>> +	 * If it's a shared access, then make the page shared in the RMP table.
>> +	 */
>> +	if (rmpentry_assigned(e) && !private)
>> +		rc = snp_make_page_shared(vcpu, gpa, pfn, PG_LEVEL_4K);
> 
> Hrm, this really feels like it needs to be protected by mmu_lock.  Functionally,
> it might all work out in the end after enough RMP violations, but it's extremely
> difficult to reason about and probably even more difficult if multiple vCPUs end
> up fighting over a gfn.
> 

Let's see what your thoughts are on enforcing the page state change in 
KVM. If we require the guest to issue the page state change before the 
access, then this case will simply need to inject an error into the 
guest and we can remove all of this.

> My gut reaction is that this is also backwards, i.e. KVM should update the RMP
> to match its TDP SPTEs, not the other way around.
> 
> The one big complication is that the TDP MMU only takes mmu_lock for read.  A few
> options come to mind but none of them are all that pretty.  I'll wait to hear back
> on whether or not we can make PSC request mandatory before thinking too hard on
> this one.
> 


^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 20/40] KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
  2021-07-07 18:35 ` [PATCH Part2 RFC v4 20/40] KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe Brijesh Singh
  2021-07-14 13:35   ` Marc Orr
@ 2021-07-20 18:02   ` Sean Christopherson
  2021-08-03 14:38     ` Brijesh Singh
  1 sibling, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-20 18:02 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

IMO, the CPU behavior is a bug, even if the behavior is working as intended for
the microarchitecture.  I.e. this should be treated as an erratum.

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> When SEV-SNP is globally enabled on a system, the VMRUN instruction
> performs additional security checks on AVIC backing, VMSA, and VMCB page.
> On a successful VMRUN, these pages are marked "in-use" by the
> hardware in the RMP entry

There's a lot of "noise" in this intro.  That the CPU does additional checks at
VMRUN isn't all that interesting, what's relevant is that the CPU tags the
associated RMP entry with a special flag.  And IIUC, it does that for _all_ VMs,
not just SNP VMs.

Also, what happens if the pages aren't covered by the RMP?  Table 15-41 states
that the VMCB, AVIC, and VMSA for non-SNP guests need to be Hypervisor-Owned,
but doesn't explicitly require them to be RMP covered.  On the other hand, it
does state that the VMSA for an SNP guest must be Guest-Owned and RMP-Covered.
That implies that the Hypervisor-Owned pages do not need to be contained within the
RMP, but how does that work if the CPU is setting a magic flag in the RMP?  Does
VMRUN explode?  Does the CPU corrupt random memory?

Is the in-use flag visible to software?  We've already established that "struct
rmpentry" is microarchitectural, so why not document it in the PPR?  It could be
useful info for debugging unexpected RMP violations, even if the flag isn't stable.

Are there other possible collisions with the in-use flag?  The APM states that
the in-use flag results in RMPUPDATE failing with FAIL_INUSE.  That's the same
error code that's returned if two CPUs attempt RMPUPDATE on the same entry.  That
implies that the RMPUPDATE also sets the in-use flag.  If that's true, then isn't
it possible that the spurious RMP violation #PF could happen if the kernel accesses
a hugepage at the same time a CPU is doing RMPUPDATE on the associated 2mb-aligned
entry?

> and any attempt to modify the RMP entry for these pages will result in
> page-fault (RMP violation check).

Again, not that relevant since KVM isn't attempting to modify the RMP entry.
I've no objection to mentioning this behavior in passing, but it should not be
the focal point of the intro.

> While performing the RMP check, hardware will try to create a 2MB TLB
> entry for the large page accesses. When it does this, it first reads
> the RMP for the base of 2MB region and verifies that all this memory is
> safe. If AVIC backing, VMSA, and VMCB memory happen to be the base of
> 2MB region, then RMP check will fail because of the "in-use" marking for
> the base entry of this 2MB region.

There's a critical piece missing here, which is why an RMP violation is thrown
on "in-use" pages.  E.g. are any translations problematic, or just writable
translations?  It may not affect the actual KVM workaround, but knowing exactly
what goes awry is important.

> e.g.
> 
> 1. A VMCB was allocated on 2MB-aligned address.
> 2. The VMRUN instruction marks this RMP entry as "in-use".
> 3. Another process allocated some other page of memory that happened to be
>    within the same 2MB region.
> 4. That process tried to write its page using physmap.

Please explicitly call out the relevance of the physmap.  IIUC, only the physmap,
a.k.a. direct map, is problematic because that's the only scenario where a large
page can overlap one of the magic pages.  That should be explicitly stated.

> If the physmap entry in step #4 uses a large (1G/2M) page, then the

Be consistent with 2MB vs. 2M, i.e. choose one.

> hardware will attempt to create a 2M TLB entry. The hardware will find
> that the "in-use" bit is set in the RMP entry (because it was a
> VMCB page) and will cause an RMP violation check.

So what happens if the problematic page isn't 2mb aligned?  The lack of an RMP
violation on access implies that the hypervisor can bypass the in-use restriction
and create a 2mb hugepage, i.e. access the in-use page.  Same question for if the
TLB entry exists before the page is marked in-use, which also begs the question
of why the in-use flag is being checked at all on RMP lookups.
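To make the failure mode concrete, here is a rough user-space model of when the spurious fault would fire, per the changelog's description (this encodes my reading of the erratum, not verified CPU behavior; whether only writable translations matter is one of the open questions above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_MASK (~((1ULL << 21) - 1))	/* 2MB region mask */

/*
 * A spurious RMP violation occurs when the in-use page sits at the
 * 2MB-aligned base of a region and the host accesses any part of that
 * region through a hugepage mapping: the CPU reads the base RMP entry
 * for the whole 2MB range and sees the in-use flag.
 */
static bool spurious_rmp_fault(uint64_t inuse_pa, uint64_t access_pa,
			       bool hugepage_mapping)
{
	if (!hugepage_mapping)
		return false;		/* 4kb mappings read the exact entry */
	if (inuse_pa & ~PMD_MASK)
		return false;		/* in-use page is not the 2MB base */
	return (access_pa & PMD_MASK) == inuse_pa;
}
```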

> See APM2 section 15.36.12 for more information on VMRUN checks when
> SEV-SNP is globally active.
> 
> A generic allocator can return a page which are 2M aligned and will not
> be safe to be used when SEV-SNP is globally enabled.

> Add a snp_safe_alloc_page() helper that can be used for allocating the SNP
> safe memory. The helper allocated 2 pages and splits them into order-1
> allocation. It frees one page and keeps one of the page which is not 2M
> aligned.

I know it's personal preference as to whether to lead with the solution or the
problem statement, but in this case it would be very helpful to at least provide
a brief summary early on so that the reader has some idea of where the changelog
is headed.  As is, the actual change is buried after a big pile of hardware
details.

E.g. something like this

  Implement a workaround for an SNP erratum where the CPU will incorrectly
  signal an RMP violation #PF if a hugepage (2mb or 1gb) collides with the
  RMP entry of a VMCB, VMSA, or AVIC backing page.

  When SEV-SNP is globally enabled, the CPU marks the VMCB, VMSA, and AVIC
  backing pages as "in-use" in the RMP after a successful VMRUN.  This is
  done for _all_ VMs, not just SNP-Active VMs.

  If the hypervisor accesses an in-use page through a writable translation,
  the CPU will throw an RMP violation #PF.  On early SNP hardware, if an
  in-use page is 2mb aligned and software accesses any part of the associated
  2mb region with a hugepage, the CPU will incorrectly treat the entire 2mb
  region as in-use and signal a spurious RMP violation #PF.

  <gory details on the workaround>

> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/lapic.c            |  5 ++++-
>  arch/x86/kvm/svm/sev.c          | 27 +++++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.c          | 16 ++++++++++++++--
>  arch/x86/kvm/svm/svm.h          |  1 +
>  5 files changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 55efbacfc244..188110ab2c02 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1383,6 +1383,7 @@ struct kvm_x86_ops {
>  	int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
>  
>  	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
> +	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
>  };
>  
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index c0ebef560bd1..d4c77f66d7d5 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2441,7 +2441,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns)
>  
>  	vcpu->arch.apic = apic;
>  
> -	apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> +	if (kvm_x86_ops.alloc_apic_backing_page)
> +		apic->regs = kvm_x86_ops.alloc_apic_backing_page(vcpu);

This can be a static_call().

> +	else
> +		apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
>  	if (!apic->regs) {
>  		printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
>  		       vcpu->vcpu_id);
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index b8505710c36b..411ed72f63af 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2692,3 +2692,30 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>  		break;
>  	}
>  }
> +
> +struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long pfn;
> +	struct page *p;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> +		return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> +
> +	p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
> +	if (!p)
> +		return NULL;
> +
> +	/* split the page order */
> +	split_page(p, 1);
> +
> +	/* Find a non-2M aligned page */

This isn't "finding" anything, it's identifying which of the two pages is
_guaranteed_ to be unaligned.  The whole function needs a much bigger comment to
explain what's going on.

> +	pfn = page_to_pfn(p);
> +	if (IS_ALIGNED(__pfn_to_phys(pfn), PMD_SIZE)) {
> +		pfn++;
> +		__free_page(p);
> +	} else {
> +		__free_page(pfn_to_page(pfn + 1));
> +	}
> +
> +	return pfn_to_page(pfn);
> +}
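E.g. the selection logic could be modeled and commented along these lines (a user-space sketch with the kernel allocator calls stripped out; the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SHIFT	21	/* 2MB hugepage */
#define PAGE_SHIFT	12

/*
 * Given an order-1 (two contiguous page) allocation starting at 'pfn',
 * exactly one of the two pages is guaranteed NOT to be the base of a
 * 2MB region, because 2MB-aligned pfns are 512 pages apart.  Return the
 * pfn of the unaligned page; the other page would be freed back to the
 * allocator.
 */
static uint64_t pick_non_2mb_aligned(uint64_t pfn)
{
	/* 2MB-aligned <=> the low 9 bits of the pfn are all zero. */
	if (((pfn << PAGE_SHIFT) & ((1ULL << PMD_SHIFT) - 1)) == 0)
		return pfn + 1;	/* first page is 2MB-aligned, keep second */

	return pfn;		/* first page is already unaligned, keep it */
}
```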

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 38/40] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
  2021-07-20 16:28       ` Sean Christopherson
@ 2021-07-20 18:21         ` Brijesh Singh
  2021-07-20 22:09           ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-20 18:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/20/21 11:28 AM, Sean Christopherson wrote:

> 
> Ah, I got confused by this code in snp_build_guest_buf():
> 
> 	data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT);
> 
> I was thinking that setting the C-bit meant the memory was guest private, but
> that's setting the C-bit for the HPA, which is correct since KVM installs guest
> memory with C-bit=1 in the NPT, i.e. encrypts shared memory with the host key.
> 
> Tangentially related question: is it correct to say that the host can _read_ memory
> from a page that is assigned=1, but has asid=0?  I.e. KVM can read the response
> page in order to copy it into the guest, even though it is a firmware page?
> 

Yes. A firmware page means that x86 cannot write to it; reads are 
still allowed.


> 	/* Copy the response after the firmware returns success. */
> 	rc = kvm_write_guest(kvm, resp_gpa, sev->snp_resp_page, PAGE_SIZE);
> 
>> In the current series we don't support migration etc so I decided to
>> ratelimit unconditionally.
> 
> Since KVM can peek at the request header, KVM should flat out disallow requests
> that KVM doesn't explicitly support.  E.g. migration requests should not be sent
> to the PSP.
> 

That is acceptable.


> One concern though: How does the guest query what requests are supported?  This
> snippet implies there's some form of enumeration:
> 
>    Note: This guest message may be removed in future versions as it is redundant
>    with the CPUID page in SNP_LAUNCH_UPDATE (see Section 8.14).
> 
> But all I can find is a "Message Version" in "Table 94. Message Type Encodings",
> which implies that request support is all or nothing for a given version.  That
> would be rather unfortunate as KVM has no way to tell the guest that something
> is unsupported :-(
> 

The firmware supports all the commands listed in the spec. The HV 
support is always going to be behind what the firmware or hardware is 
capable of doing. As far as the spec is concerned, it says:

   The firmware checks that MSG_TYPE is a valid message type. The
   firmware then checks that MSG_SIZE is large enough to hold the
   indicated message type at the indicated message version. If
   not, the firmware returns INVALID_PARAM.

So, a hypervisor could potentially return INVALID_PARAM to indicate to 
the guest that a message type is not supported.


>>> Is this exposed to userspace in any way?  This feels very much like a knob that
>>> needs to be configurable per-VM.
>>
>> It's not exposed to the userspace and I am not sure if userspace care about
>> this knob.
> 
> Userspace definitely cares, otherwise the system would need to be rebooted just to
> tune the ratelimiting.  And userspace may want to disable ratelimiting entirely,
> e.g. if the entire system is dedicated to a single VM.

Ok.

> 
>>> Also, what are the estimated latencies of a guest request?  If the worst case
>>> latency is >200ms, a default ratelimit frequency of 5hz isn't going to do a whole
>>> lot.
>>>
>>
>> The latency will depend on what else is going in the system at the time the
>> request comes to the hypervisor. Access to the PSP is serialized so other
>> parallel PSP command execution will contribute to the latency.
> 
> I get that it will be variable, but what are some ballpark latencies?  E.g. what's
> the latency of the slowest command without PSP contention?
> 

With a single VM, I am seeing a guest request latency of around ~35ms.

>>> Question on the VMPCK sequences.  The firmware ABI says:
>>>
>>>      Each guest has four VMPCKs ... Each message contains a sequence number per
>>>      VMPCK. The sequence number is incremented with each message sent. Messages
>>>      sent by the guest to the firmware and by the firmware to the guest must be
>>>      delivered in order. If not, the firmware will reject subsequent messages ...
>>>
>>> Does that mean there are four independent sequences, i.e. four streams the guest
>>> can use "concurrently", or does it mean the overall freshess/integrity check is
>>> composed from four VMPCK sequences, all of which must be correct for the message
>>> to be valid?
>>>
>>
>> There are four independent sequence counter and in theory guest can use them
>> concurrently. But the access to the PSP must be serialized.
> 
> Technically that's not required from the guest's perspective, correct?  

Correct.

> The guest
> only cares about the sequence numbers for a given VMPCK, e.g. it can have one
> in-flight request per VMPCK and expect that to work, even without fully serializing
> its own requests.
> 
> Out of curiosity, why 4 VMPCKs?  It seems completely arbitrary.
> 

I believe the thought process was that by providing 4 keys, each VMPL 
level has the flexibility to use a different key (if it wishes). The 
firmware does not care about the VMPL level during guest request 
handling; it just wants to know which key was used for encrypting the 
payload so that it can decrypt it and provide the response.


>> Currently, the guest driver uses the VMPCK0 key to communicate with the PSP.
>>
>>
>>> If it's the latter, then a traditional mutex isn't really necessary because the
>>> guest must implement its own serialization, e.g. it's own mutex or whatever, to
>>> ensure there is at most one request in-flight at any given time.
>>
>> The guest driver uses its own serialization to ensure that there is
>> *exactly* one request in-flight.
> 
> But KVM can't rely on that because it doesn't control the guest, e.g. it may be
> running a non-Linux guest.
>

Yes, KVM should not rely on it. I mentioned that mainly because you said 
that the guest must implement its own serialization. In the case of KVM, 
the CCP driver ensures that commands sent to the PSP are serialized.


>> The mutex used here is to protect the KVM's internal firmware response
>> buffer.
> 
> Ya, where I was going with my question was that if the guest was architecturally
> restricted to a single in-flight request, then KVM could do something like this
> instead of taking kvm->lock (bad pseudocode):
> 
> 	if (test_and_set(sev->guest_request)) {
> 		rc = AEAD_OFLOW;
> 		goto fail;
> 	}
> 
> 	<do request>
> 
> 	clear_bit(...)
> 
> I.e. multiple in-flight requests can't work because the guest can guarantee
> ordering between vCPUs.  But, because the guest can theoretically have up to four
> in-flight requests, it's not that simple.
> 
> The reason I'm going down this path is that taking kvm->lock inside vcpu->mutex
> violates KVM's locking rules, i.e. is susceptible to deadlocks.  Per kvm/locking.rst,
> 
>    - kvm->lock is taken outside vcpu->mutex
> 
> That means a different mutex is needed to protect the guest request pages.
> 

Ah, I see your point on the locking. Architecturally, a guest can 
issue multiple requests in parallel, so having a separate lock to 
protect the guest request pages makes sense.


-Brijesh

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 25/40] KVM: SVM: Reclaim the guest pages when SEV-SNP VM terminates
  2021-07-20 16:40                   ` Sean Christopherson
@ 2021-07-20 18:23                     ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-20 18:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/20/21 11:40 AM, Sean Christopherson wrote:
> On Mon, Jul 19, 2021, Brijesh Singh wrote:
>>
>> On 7/19/21 2:03 PM, Sean Christopherson wrote:
>>> On Mon, Jul 19, 2021, Brijesh Singh wrote:
>>> Ah, not firmware, gotcha.  But we can still use a helper, e.g. an inner
>>> double-underscore helper, __rmp_make_private().
>>
>> In that case we are basically passing the all the fields defined in the
>> 'struct rmpupdate' as individual arguments.
> 
> Yes, but (a) not _all_ fields, (b) it would allow hiding "struct rmpupdate", and
> (c) this is much friendlier to readers:
> 
> 	__rmp_make_private(pfn, gpa, PG_LEVEL_4K, svm->asid, true);
> 
> than:
> 
> 	rmpupdate(&rmpupdate);
> 

Ok.

> For the former, I can see in a single line of code that KVM is creating a 4k
> private, immutable guest page.  With the latter, I need to go hunt down all code
> that modifies rmpupdate to understand what the code is doing.
> 
>> How about something like this:
>>
>> * core kernel exports the rmpupdate()
>> * the include/linux/sev.h header file defines the helper functions
>>
>>    int rmp_make_private(u64 pfn, u64 gpa, int psize, int asid)
> 
> I think we'll want s/psize/level, i.e. make it more obvious that the input
> is PG_LEVEL_*.
> 

ok, I will stick to x86 PG_LEVEL_*

thanks

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 27/40] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP
  2021-07-16 20:41     ` Brijesh Singh
@ 2021-07-20 19:38       ` Sean Christopherson
  2021-07-20 20:06         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-20 19:38 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Fri, Jul 16, 2021, Brijesh Singh wrote:
> On 7/16/21 2:19 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> > Another option would be to drop the kvm_x86_ops hooks entirely and call
> > snp_lookup_page_in_rmptable() directly from MMU code.  That would require tracking
> > that a VM is SNP-enabled in arch code, but I'm pretty sure info has already bled
> > into common KVM in one form or another.
> 
> I would prefer this as it eliminates some of the other unnecessary call
> sites. Unfortunately, currently there is no generic way to know if it's
> an SEV guest (outside the svm/*).  So far there was no need as such but
> with SNP having such information would help. Should we extend the
> 'struct kvm' to include a new field that can be used to determine the
> guest type. Something like
> 
> enum {
>    GUEST_TYPE_SEV,
>    GUEST_TYPE_SEV_ES,
>    GUEST_TYPE_SEV_SNP,
> };
> 
> struct kvm {
>    ...
>   u64 enc_type;
> };
> 
> bool kvm_guest_enc_type(struct kvm *kvm, enum type)
> {
>     return !!kvm->enc_type & type;
> }
> 
> The mmu.c can then call kvm_guest_enc_type() to check if it's an SNP guest
> and use the SNP lookup directly to determine the page size.
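One nit on the sketch above: `!` binds tighter than `&` in C, so `!!kvm->enc_type & type` evaluates as `(!!kvm->enc_type) & type` and would return false for any flag above bit 0. The mask needs parentheses, and the enum needs explicit bit values for the bitwise test to work at all. A minimal user-space model (values are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative encoding: bit flags rather than a 0,1,2 enum sequence. */
#define GUEST_TYPE_SEV		(1ULL << 0)
#define GUEST_TYPE_SEV_ES	(1ULL << 1)
#define GUEST_TYPE_SEV_SNP	(1ULL << 2)

static int kvm_guest_enc_type(uint64_t enc_type, uint64_t type)
{
	/*
	 * Parenthesize the mask: "!!enc_type & type" would evaluate as
	 * "(!!enc_type) & type" because ! binds tighter than &.
	 */
	return !!(enc_type & type);
}
```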

The other option is to use vm_type, which TDX is already planning on leveraging.
Paolo raised the question of whether or not the TDX type could be reused for SNP.
We should definitely sort that out before merging either series.  I'm personally
in favor of separating TDX and SNP, it seems inevitable that common code will
want to differentiate between the two.

https://lkml.kernel.org/r/8eb87cd52a89d957af03f93a9ece5634426a7757.1625186503.git.isaku.yamahata@intel.com

> > As the APM is currently worded, this is wrong, and the whole "tdp_max_page_level"
> > name is wrong.  As noted above, the Page-Size bullet points states that 2mb/1gb
> > pages in the NPT _must_ have RMP.page_size=1, and 4kb pages in the NPT _must_
> > have RMP.page_size=0.  That means that the RMP adjustment is not a constraint,
> > it's an exact requirement.  Specifically, if the RMP is a 2mb page then KVM must
> > install a 2mb (or 1gb) page.  Maybe it works because KVM will PSMASH the RMP
> > after installing a bogus 4kb NPT and taking an RMP violation, but that's a very
> > convoluted and sub-optimal solution.
> 
> This is why I was passing the preferred max_level in the pre-fault
> handler and then later querying the NPT level; the NPT level is used in
> the RMP to make sure they are in sync.
> 
> There is yet another reason why we can't avoid the PSMASH after doing
> everything to ensure that NPT and RMP are in sync. e.g if NPT and RMP
> are programmed with 2mb size but the guest tries to PVALIDATE the page
> as a 4k. In that case, we will see #NPF with page size mismatch and have
> to perform psmash.

Boo, there's no way to communicate to the guest that it's doing PVALIDATE wrong,
is there?

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 27/40] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP
  2021-07-20 19:38       ` Sean Christopherson
@ 2021-07-20 20:06         ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-20 20:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh



On 7/20/21 2:38 PM, Sean Christopherson wrote:
...

> 
> The other option is to use vm_type, which TDX is already planning on leveraging.
> Paolo raised the question of whether or not the TDX type could be reused for SNP.
> We should definitely sort that out before merging either series.  I'm personally
> in favor of separating TDX and SNP, it seems inevitable that common code will
> want to differentiate between the two.

Yes, I did see that and it seems much better.

> 
> https://lkml.kernel.org/r/8eb87cd52a89d957af03f93a9ece5634426a7757.1625186503.git.isaku.yamahata@intel.com
> 

...

>>
>> There is yet another reason why we can't avoid the PSMASH after doing
>> everything to ensure that NPT and RMP are in sync. e.g if NPT and RMP
>> are programmed with 2mb size but the guest tries to PVALIDATE the page
>> as a 4k. In that case, we will see #NPF with page size mismatch and have
>> to perform psmash.
> 
> Boo, there's no way to communicate to the guest that it's doing PVALIDATE wrong
> is there?
> 

If the guest chooses a smaller page size then we don't have any means to 
notify the guest; the hardware will cause an #NPF and it's up to the 
hypervisor to resolve the fault.

However, if the guest attempts to validate with a larger page level 
(e.g. the guest using 2mb while the RMP entry was 4k) then PVALIDATE will 
return a SIZEMISMATCH error to the guest.

thanks

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 05/40] x86/sev: Add RMP entry lookup helpers
  2021-07-16 17:22       ` Brijesh Singh
@ 2021-07-20 22:06         ` Sean Christopherson
  2021-07-20 23:10           ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-20 22:06 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Fri, Jul 16, 2021, Brijesh Singh wrote:
> 
> On 7/15/21 2:28 PM, Brijesh Singh wrote:
> >> Looking at the future patches, dump_rmpentry() is the only power user,
> >> e.g.  everything else mostly looks at "assigned" and "level" (and one
> >> ratelimited warn on "validated" in snp_make_page_shared(), but I suspect
> >> that particular check can and should be dropped).
> >
> > Yes, we need "assigned" and "level" and other entries are mainly for
> > the debug purposes.
> >
> For debug purposes, we would like to dump additional RMP entries. If
> we go with your proposed function then how do we get that information
> in dump_rmpentry()?

As suggested below, move dump_rmpentry() into sev.c so that it can use the
microarchitectural version.  For debug, I'm pretty sure that's what we'll want anyway,
e.g. dump the raw value along with the meaning of various bits.

> How about if we provide two functions: the first function provides the
> architectural format and the second provides the raw values, which can
> be used by the dump_rmpentry() helper.
> 
> struct rmpentry *snp_lookup_rmpentry(unsigned long paddr, int *level);
> 
> The 'struct rmpentry' uses the format defined in APM Table 15-36.
> 
> struct _rmpentry *_snp_lookup_rmpentry(unsigned long paddr, int *level);
> 
> The 'struct _rmpentry' will use the PPR definition (basically
> what we have today in this patch).
> 
> Thoughts ?

Why define an architectural "struct rmpentry"?  IIUC, there isn't a true
architectural RMP entry; the APM defines architectural fields but doesn't define
a layout.  Functionally, making up our own struct isn't a problem, I just don't
see the point since all use cases only care about Assigned and Page-Size, and
we can do them a favor by translating Page-Size to X86_PG_LEVEL.

> >> /*
> >>   * Returns 1 if the RMP entry is assigned, 0 if it exists but is not
> >>   * assigned, and -errno if there is no corresponding RMP entry.
> >>   */
> >> int snp_lookup_rmpentry(struct page *page, int *level)
> >> {
> >>     unsigned long phys = page_to_pfn(page) << PAGE_SHIFT;
> >>     struct rmpentry *entry, *large_entry;
> >>     unsigned long vaddr;
> >>
> >>     if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> >>         return -ENXIO;
> >>
> >>     vaddr = rmptable_start + rmptable_page_offset(phys);
> >>     if (unlikely(vaddr > rmptable_end))
> >>         return -ENXIO;
> >>
> >>     entry = (struct rmpentry *)vaddr;
> >>
> >>     /* Read a large RMP entry to get the correct page level used in
> >> RMP entry. */
> >>     vaddr = rmptable_start + rmptable_page_offset(phys & PMD_MASK);
> >>     large_entry = (struct rmpentry *)vaddr;
> >>     *level = RMP_TO_X86_PG_LEVEL(rmpentry_pagesize(large_entry));
> >>
> >>     return !!entry->assigned;
> >> }
> >>
> >>
> >> And then move dump_rmpentry() (or add a helper) in sev.c so that "struct
> >> rmpentry" can be declared in sev.c.

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 38/40] KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
  2021-07-20 18:21         ` Brijesh Singh
@ 2021-07-20 22:09           ` Sean Christopherson
  0 siblings, 0 replies; 176+ messages in thread
From: Sean Christopherson @ 2021-07-20 22:09 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Tue, Jul 20, 2021, Brijesh Singh wrote:
> 
> On 7/20/21 11:28 AM, Sean Christopherson wrote:
> > Out of curiosity, why 4 VMPCKs?  It seems completely arbitrary.
> > 
> 
> I believe the thought process was by providing 4 keys it can provide
> flexibility for each VMPL levels to use a different keys (if they wish). The
> firmware does not care about the vmpl level during the guest request
> handling, it just want to know which key is used for encrypting the payload
> so that he can decrypt and provide the  response for it.

Ah, I forgot about VMPLs.  That makes sense.

Thanks!

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 37/40] KVM: SVM: Add support to handle the RMP nested page fault
  2021-07-20 17:55     ` Brijesh Singh
@ 2021-07-20 22:31       ` Sean Christopherson
  2021-07-20 23:53         ` Brijesh Singh
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-20 22:31 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Tue, Jul 20, 2021, Brijesh Singh wrote:
> 
> On 7/19/21 7:10 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> > > Follow the recommendation from APM2 section 15.36.10 and 15.36.11 to
> > > resolve the RMP violation encountered during the NPT table walk.
> > 
> > Heh, please elaborate on exactly what that recommendation is.  A recommendation
> > isn't exactly architectural, i.e. is subject to change :-)
> 
> I will try to expand it :)
> 
> > 
> > And, do we have to follow the APM's recommendation?
> 
> Yes, unless we want to be very strict on what a guest can do.
> 
> > Specifically, can KVM treat #NPF RMP violations as guest errors, or is that
> > not allowed by the GHCB spec?
> 
The GHCB spec does not say anything about the #NPF RMP violation error, and
not every #NPF RMP violation is a guest error (e.g. the size mismatch cases).
> 
> > I.e. can we mandate accesses be preceded by page state change requests?
> 
This is a good question; the GHCB spec does not enforce that a guest *must*
use page state changes. If a page state change is not done by the guest then it
will cause an #NPF, and it is up to the hypervisor to decide what it wants to
do.

Drat.  Is there any hope of pushing through a GHCB change to require the guest
to use PSC?

> > It would simplify KVM (albeit not much of a simplificiation) and would also
> > make debugging easier since transitions would require an explicit guest
> > request and guest bugs would result in errors instead of random
> > corruption/weirdness.
> 
> I am good with enforcing this from KVM. But the question is, what fault
> should we inject in the guest when KVM detects that the guest has accessed a
> page without issuing the page state change.

Injecting a fault, at least from KVM, isn't an option since there's no architectural
behavior we can leverage.  E.g. a guest that isn't enlightened enough to properly
use PSC isn't going to do anything useful with a #MC or #VC.

Sadly, as is I think our only options are to either automatically convert RMP
entries as needed, or to punt the exit to userspace.  Maybe we could do both, e.g.
have a module param to control the behavior?  The problem with punting to userspace
is that KVM would also need a way for userspace to fix the issue, otherwise we're
just taking longer to kill the guest :-/

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 05/40] x86/sev: Add RMP entry lookup helpers
  2021-07-20 22:06         ` Sean Christopherson
@ 2021-07-20 23:10           ` Brijesh Singh
  0 siblings, 0 replies; 176+ messages in thread
From: Brijesh Singh @ 2021-07-20 23:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/20/21 5:06 PM, Sean Christopherson wrote:
> On Fri, Jul 16, 2021, Brijesh Singh wrote:
>> On 7/15/21 2:28 PM, Brijesh Singh wrote:
>>>> Looking at the future patches, dump_rmpentry() is the only power user,
>>>> e.g.  everything else mostly looks at "assigned" and "level" (and one
>>>> ratelimited warn on "validated" in snp_make_page_shared(), but I suspect
>>>> that particular check can and should be dropped).
>>> Yes, we need "assigned" and "level"; the other entries are mainly for
>>> debug purposes.
>>>
>> For debug purposes, we would like to dump additional RMP entries. If
>> we go with your proposed function then how do we get that information
>> in dump_rmpentry()?
> As suggested below, move dump_rmpentry() into sev.c so that it can use the
> microarchitectural version.  For debug, I'm pretty sure that's what we'll want anyways,
> e.g. dump the raw value along with the meaning of various bits.


Based on other feedback, I am not sure we still need to dump the RMP
entry. In other feedback we agreed to unmap the pages from the direct
map while adding them to the RMP table, so anyone who attempts to access
those pages will now get a page-not-present fault instead of an RMP
violation.

thanks




^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 37/40] KVM: SVM: Add support to handle the RMP nested page fault
  2021-07-20 22:31       ` Sean Christopherson
@ 2021-07-20 23:53         ` Brijesh Singh
  2021-07-21 20:15           ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Brijesh Singh @ 2021-07-20 23:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: brijesh.singh, x86, linux-kernel, kvm, linux-efi,
	platform-driver-x86, linux-coco, linux-mm, linux-crypto,
	Thomas Gleixner, Ingo Molnar, Joerg Roedel, Tom Lendacky,
	H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Andy Lutomirski, Dave Hansen,
	Sergio Lopez, Peter Gonda, Peter Zijlstra, Srinivas Pandruvada,
	David Rientjes, Dov Murik, Tobin Feldman-Fitzthum,
	Borislav Petkov, Michael Roth, Vlastimil Babka, tony.luck,
	npmccallum, brijesh.ksingh


On 7/20/21 5:31 PM, Sean Christopherson wrote:
...
>> This is a good question, the GHCB spec does not enforce that a guest *must*
>> use page state changes. If a page state change is not done by the guest then
>> it will cause a #NPF, and it's up to the hypervisor to decide what it wants
>> to do.
> Drat.  Is there any hope of pushing through a GHCB change to require the guest
> to use PSC?

Well, I am not sure if we can push it through the GHCB. Other hypervisors
also need to agree to it. We would need to define some architectural way
for the hypervisor to detect the violation and notify the guest about it.


>>> It would simplify KVM (albeit not much of a simplification) and would also
>>> make debugging easier since transitions would require an explicit guest
>>> request and guest bugs would result in errors instead of random
>>> corruption/weirdness.
>> I am good with enforcing this from KVM. But the question is, what fault
>> should we inject in the guest when KVM detects that the guest has accessed a
>> page without issuing the page state change.
> Injecting a fault, at least from KVM, isn't an option since there's no architectural
> behavior we can leverage.  E.g. a guest that isn't enlightened enough to properly
> use PSC isn't going to do anything useful with a #MC or #VC.
>
> Sadly, as is I think our only options are to either automatically convert RMP
> entries as needed, or to punt the exit to userspace.  Maybe we could do both, e.g.
> have a module param to control the behavior?  The problem with punting to userspace
> is that KVM would also need a way for userspace to fix the issue, otherwise we're
> just taking longer to kill the guest :-/
>
I think we should automatically convert the RMP entries in that case; it's
possible that a non-Linux guest may access the page without going through
the PSC.

thanks


^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 40/40] KVM: SVM: Support SEV-SNP AP Creation NAE event
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 40/40] KVM: SVM: Support SEV-SNP AP Creation NAE event Brijesh Singh
@ 2021-07-21  0:01   ` Sean Christopherson
  2021-07-21 17:47     ` Tom Lendacky
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-21  0:01 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> From: Tom Lendacky <thomas.lendacky@amd.com>
> 
> Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
> guests to create and start APs on their own.

The changelog really needs to clarify that this doesn't allow the guest to create
arbitrary vCPUs.  The GHCB uses CREATE/DESTROY terminology, but this patch and its
comments/documentation should very clearly call out that KVM's implementation is
more along the line of vCPU online/offline.

It should also be noted that KVM still onlines APs by default.  That also raises
the question of whether or not KVM should support creating an offlined vCPU.
E.g. several of the use cases I'm aware of want to do something along the lines
of creating a VM with the max number of theoretical vCPUs, but in most instances
only run a handful of vCPUs.  That's a fair amount of potential memory savings
if the max theoretical number of vCPUs is high.

> A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
> so as to avoid updating the VMSA pointer while the vCPU is running.
> 
> For CREATE
>   The guest supplies the GPA of the VMSA to be used for the vCPU with the
>   specified APIC ID. The GPA is saved in the svm struct of the target
>   vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the
>   vCPU and then the vCPU is kicked.
> 
> For CREATE_ON_INIT:
>   The guest supplies the GPA of the VMSA to be used for the vCPU with the
>   specified APIC ID the next time an INIT is performed. The GPA is saved
>   in the svm struct of the target vCPU.
> 
> For DESTROY:
>   The guest indicates it wishes to stop the vCPU. The GPA is cleared from
>   the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
>   to vCPU and then the vCPU is kicked.
> 
> 
> The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked as
> a result of the event or as a result of an INIT. The handler sets the vCPU
> to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will leave the
> vCPU as not runnable. Any previous VMSA pages that were installed as
> part of an SEV-SNP AP Creation NAE event are un-pinned. If a new VMSA is
> to be installed, the VMSA guest page is pinned and set as the VMSA in the
> vCPU VMCB and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new
> VMSA is not to be installed, the VMSA is cleared in the vCPU VMCB and the
> vCPU state is left as KVM_MP_STATE_UNINITIALIZED to prevent it from being
> run.
> 
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   3 +
>  arch/x86/include/asm/svm.h      |   3 +
>  arch/x86/kvm/svm/sev.c          | 133 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.c          |   7 +-
>  arch/x86/kvm/svm/svm.h          |  16 +++-
>  arch/x86/kvm/x86.c              |  11 ++-
>  6 files changed, 170 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 117e2e08d7ed..881e05b3f74e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -91,6 +91,7 @@
>  #define KVM_REQ_MSR_FILTER_CHANGED	KVM_ARCH_REQ(29)
>  #define KVM_REQ_UPDATE_CPU_DIRTY_LOGGING \
>  	KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> +#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(31)
>  
>  #define CR0_RESERVED_BITS                                               \
>  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -1402,6 +1403,8 @@ struct kvm_x86_ops {
>  
>  	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
>  			int level, u64 error_code);
> +
> +	void (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
>  };
>  
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 5e72faa00cf2..6634a952563e 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -220,6 +220,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>  #define SVM_SEV_FEATURES_DEBUG_SWAP		BIT(5)
>  #define SVM_SEV_FEATURES_PREVENT_HOST_IBS	BIT(6)
>  #define SVM_SEV_FEATURES_BTB_ISOLATION		BIT(7)
> +#define SVM_SEV_FEATURES_INT_INJ_MODES			\
> +	(SVM_SEV_FEATURES_RESTRICTED_INJECTION |	\
> +	 SVM_SEV_FEATURES_ALTERNATE_INJECTION)
>  
>  struct vmcb_seg {
>  	u16 selector;
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index d8ad6dd58c87..95f5d25b4f08 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -582,6 +582,7 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  
>  static int sev_es_sync_vmsa(struct vcpu_svm *svm)
>  {
> +	struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
>  	struct sev_es_save_area *save = svm->vmsa;
>  
>  	/* Check some debug related fields before encrypting the VMSA */
> @@ -625,6 +626,12 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
>  	if (sev_snp_guest(svm->vcpu.kvm))
>  		save->sev_features |= SVM_SEV_FEATURES_SNP_ACTIVE;
>  
> +	/*
> +	 * Save the VMSA synced SEV features. For now, they are the same for
> +	 * all vCPUs, so just save each time.
> +	 */
> +	sev->sev_features = save->sev_features;
> +
>  	return 0;
>  }
>  
> @@ -2682,6 +2689,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
>  		if (!ghcb_sw_scratch_is_valid(ghcb))
>  			goto vmgexit_err;
>  		break;
> +	case SVM_VMGEXIT_AP_CREATION:
> +		if (!ghcb_rax_is_valid(ghcb))
> +			goto vmgexit_err;
> +		break;
>  	case SVM_VMGEXIT_NMI_COMPLETE:
>  	case SVM_VMGEXIT_AP_HLT_LOOP:
>  	case SVM_VMGEXIT_AP_JUMP_TABLE:
> @@ -3395,6 +3406,121 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>  	return ret;
>  }
>  
> +void sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	kvm_pfn_t pfn;
> +
> +	mutex_lock(&svm->snp_vmsa_mutex);
> +
> +	vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
> +
> +	/* Clear use of the VMSA in the sev_es_init_vmcb() path */
> +	svm->vmsa_pa = 0;
> +
> +	/* Clear use of the VMSA from the VMCB */
> +	svm->vmcb->control.vmsa_pa = 0;

PA=0 is not an invalid address.  I don't care what value the GHCB uses for
"invalid GPA", KVM should always use INVALID_PAGE to track an invalid physical
address.

> +	/* Un-pin previous VMSA */
> +	if (svm->snp_vmsa_pfn) {
> +		kvm_release_pfn_dirty(svm->snp_vmsa_pfn);

Oof, I was wondering why KVM tracks three versions of VMSA.  Actually, I'm still
wondering why there are three versions.  Aren't snp_vmsa_pfn and vmsa_pa tracking
the same thing?  Ah, finally figured it out.  vmsa_pa points at svm->vmsa by
default.  Blech.

> +		svm->snp_vmsa_pfn = 0;
> +	}
> +
> +	if (svm->snp_vmsa_gpa) {

This is bogus, GPA=0 is perfectly valid.  As above, use INVALID_PAGE.  A comment
explaining that the vCPU is offline when VMSA is invalid would also be helpful.

> +		/* Validate that the GPA is page aligned */
> +		if (!PAGE_ALIGNED(svm->snp_vmsa_gpa))

This needs to be moved to the VMGEXIT, and it should use page_address_valid() so
that KVM also checks for a legal GPA.

> +			goto e_unlock;
> +
> +		/*
> +		 * The VMSA is referenced by thy hypervisor physical address,

s/thy/the, although converting to archaic English could be hilarious...

> +		 * so retrieve the PFN and pin it.
> +		 */
> +		pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(svm->snp_vmsa_gpa));
> +		if (is_error_pfn(pfn))
> +			goto e_unlock;

Silently ignoring the guest request is bad behavior, at worst KVM should exit to
userspace with an emulation error.

> +
> +		svm->snp_vmsa_pfn = pfn;
> +
> +		/* Use the new VMSA in the sev_es_init_vmcb() path */
> +		svm->vmsa_pa = pfn_to_hpa(pfn);
> +		svm->vmcb->control.vmsa_pa = svm->vmsa_pa;
> +
> +		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> +	} else {
> +		vcpu->arch.pv.pv_unhalted = false;

Shouldn't the RUNNABLE path also clear pv_unhalted?

> +		vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;

What happens if userspace calls kvm_arch_vcpu_ioctl_set_mpstate, or even worse
the guest sends INIT-SIPI?  Unless I'm mistaken, either case will cause KVM to
run the vCPU with vmcb->control.vmsa_pa==0.

My initial reaction is that the "offline" case needs a new mp_state, or maybe
just use KVM_MP_STATE_STOPPED.

That way both the SET_MPSTATE ioctl and INIT-SIPI handling would see a state
that is explicitly not eligible to transition to RUNNABLE without a valid VMSA.

> +	}
> +
> +e_unlock:
> +	mutex_unlock(&svm->snp_vmsa_mutex);
> +}
> +
> +static void sev_snp_ap_creation(struct vcpu_svm *svm)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
> +	struct kvm_vcpu *vcpu = &svm->vcpu;
> +	struct kvm_vcpu *target_vcpu;
> +	struct vcpu_svm *target_svm;
> +	unsigned int request;
> +	unsigned int apic_id;
> +	bool kick;
> +
> +	request = lower_32_bits(svm->vmcb->control.exit_info_1);
> +	apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
> +
> +	/* Validate the APIC ID */
> +	target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
> +	if (!target_vcpu)
> +		return;

KVM should not silently ignore bad requests, this needs to return an error to the
guest.

> +
> +	target_svm = to_svm(target_vcpu);
> +
> +	kick = true;

This is wrong, e.g. KVM will kick the target vCPU even if the request fails.
I suspect the correct behavior would be to:

  1. do all sanity checks
  2. take the necessary lock(s)
  3. modify target vCPU state
  4. kick target vCPU unless request==SVM_VMGEXIT_AP_CREATE_ON_INIT

> +	mutex_lock(&target_svm->snp_vmsa_mutex);

This seems like it's missing a big pile of sanity checks.  E.g. KVM should reject
SVM_VMGEXIT_AP_CREATE if the target vCPU is already "created", including the case
where it was "created_on_init" but hasn't yet received INIT-SIPI.

> +
> +	target_svm->snp_vmsa_gpa = 0;
> +	target_svm->snp_vmsa_update_on_init = false;
> +
> +	/* Interrupt injection mode shouldn't change for AP creation */
> +	if (request < SVM_VMGEXIT_AP_DESTROY) {
> +		u64 sev_features;
> +
> +		sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
> +		sev_features ^= sev->sev_features;
> +		if (sev_features & SVM_SEV_FEATURES_INT_INJ_MODES) {

Why is only INT_INJ_MODES checked?  The new comment in sev_es_sync_vmsa() explicitly
states that sev_features are the same for all vCPUs, but that's not enforced here.
At a bare minimum I would expect this to sanity check SVM_SEV_FEATURES_SNP_ACTIVE.

> +			vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
> +				    vcpu->arch.regs[VCPU_REGS_RAX]);
> +			goto out;
> +		}
> +	}
> +
> +	switch (request) {
> +	case SVM_VMGEXIT_AP_CREATE_ON_INIT:

Out of curiosity, what's the use case for this variant?  I assume the guest has
to preconfigure the VMSA and ensure the target vCPU's RIP points at something
sane anyways, otherwise the hypervisor could attack the guest by immediately
attempting to run the deferred vCPU.  At that point, a guest could simply use an
existing mechanism to put the target vCPU into a holding pattern.

> +		kick = false;
> +		target_svm->snp_vmsa_update_on_init = true;
> +		fallthrough;
> +	case SVM_VMGEXIT_AP_CREATE:
> +		target_svm->snp_vmsa_gpa = svm->vmcb->control.exit_info_2;

The incoming GPA needs to be checked for validity, at least as much possible.
E.g. the PAGE_ALIGNED() check should be done here and be morphed to a synchronous
error for the guest, not a silent "oops, didn't run your vCPU".

> +		break;
> +	case SVM_VMGEXIT_AP_DESTROY:
> +		break;
> +	default:
> +		vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
> +			    request);
> +		break;
> +	}
> +
> +out:
> +	mutex_unlock(&target_svm->snp_vmsa_mutex);
> +
> +	if (kick) {
> +		kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);
> +		kvm_vcpu_kick(target_vcpu);
> +	}
> +}
> +
>  int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -3523,6 +3649,11 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
>  		ret = 1;
>  		break;
>  	}
> +	case SVM_VMGEXIT_AP_CREATION:
> +		sev_snp_ap_creation(svm);
> +
> +		ret = 1;
> +		break;
>  	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
>  		vcpu_unimpl(vcpu,
>  			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
> @@ -3597,6 +3728,8 @@ void sev_es_create_vcpu(struct vcpu_svm *svm)
>  	set_ghcb_msr(svm, GHCB_MSR_SEV_INFO(GHCB_VERSION_MAX,
>  					    GHCB_VERSION_MIN,
>  					    sev_enc_bit));
> +
> +	mutex_init(&svm->snp_vmsa_mutex);
>  }
>  
>  void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu)
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 74bc635c9608..078a569c85a8 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1304,7 +1304,10 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	svm->spec_ctrl = 0;
>  	svm->virt_spec_ctrl = 0;
>  
> -	if (!init_event) {
> +	if (init_event && svm->snp_vmsa_update_on_init) {

This can race with sev_snp_ap_creation() since the new snp_vmsa_mutex isn't held.
There needs to be smp_rmb() and smp_wmb() barriers to ensure correct ordering
between snp_vmsa_update_on_init and consuming the new VMSA gpa.  And of course
sev_snp_ap_creation() needs to have correct ordering, e.g. as is this code can
see snp_vmsa_update_on_init=true before the new snp_vmsa_gpa is set.

> +		svm->snp_vmsa_update_on_init = false;
> +		sev_snp_update_protected_guest_state(vcpu);
> +	} else {
>  		vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
>  				       MSR_IA32_APICBASE_ENABLE;
>  		if (kvm_vcpu_is_reset_bsp(vcpu))

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 39/40] KVM: SVM: Use a VMSA physical address variable for populating VMCB
  2021-07-07 18:36 ` [PATCH Part2 RFC v4 39/40] KVM: SVM: Use a VMSA physical address variable for populating VMCB Brijesh Singh
@ 2021-07-21  0:20   ` Sean Christopherson
  2021-07-21 16:26     ` Tom Lendacky
  0 siblings, 1 reply; 176+ messages in thread
From: Sean Christopherson @ 2021-07-21  0:20 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, Tom Lendacky, H. Peter Anvin, Ard Biesheuvel,
	Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Andy Lutomirski, Dave Hansen, Sergio Lopez, Peter Gonda,
	Peter Zijlstra, Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On Wed, Jul 07, 2021, Brijesh Singh wrote:
> From: Tom Lendacky <thomas.lendacky@amd.com>
> 
> In preparation to support SEV-SNP AP Creation, use a variable that holds
> the VMSA physical address rather than converting the virtual address.
> This will allow SEV-SNP AP Creation to set the new physical address that
> will be used should the vCPU reset path be taken.

I'm pretty sure adding vmsa_pa is unnecessary.  The next patch sets svm->vmsa_pa
and vmcb->control.vmsa_pa as a pair.  And for the existing code, my proposed
patch to emulate INIT on shutdown would eliminate the one path that zeros the
VMCB[1].  That series patch also drops the init_vmcb() in svm_create_vcpu()[2].

Assuming there are no VMCB shenanigans I'm missing, sev_es_init_vmcb() can do

	if (!init_event)
		svm->vmcb->control.vmsa_pa = __pa(svm->vmsa);

And while I'm thinking of it, the next patch should ideally free svm->vmsa when
the guest configures a new VMSA for the vCPU.

[1] https://lkml.kernel.org/r/20210713163324.627647-45-seanjc@google.com
[2] https://lkml.kernel.org/r/20210713163324.627647-10-seanjc@google.com

> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/kvm/svm/sev.c | 5 ++---
>  arch/x86/kvm/svm/svm.c | 9 ++++++++-
>  arch/x86/kvm/svm/svm.h | 1 +
>  3 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 4cb4c1d7e444..d8ad6dd58c87 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3553,10 +3553,9 @@ void sev_es_init_vmcb(struct vcpu_svm *svm)
>  
>  	/*
>  	 * An SEV-ES guest requires a VMSA area that is a separate from the
> -	 * VMCB page. Do not include the encryption mask on the VMSA physical
> -	 * address since hardware will access it using the guest key.
> +	 * VMCB page.
>  	 */
> -	svm->vmcb->control.vmsa_pa = __pa(svm->vmsa);
> +	svm->vmcb->control.vmsa_pa = svm->vmsa_pa;
>  
>  	/* Can't intercept CR register access, HV can't modify CR registers */
>  	svm_clr_intercept(svm, INTERCEPT_CR0_READ);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 32e35d396508..74bc635c9608 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1379,9 +1379,16 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
>  	svm->vmcb01.ptr = page_address(vmcb01_page);
>  	svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
>  
> -	if (vmsa_page)
> +	if (vmsa_page) {
>  		svm->vmsa = page_address(vmsa_page);
>  
> +		/*
> +		 * Do not include the encryption mask on the VMSA physical
> +		 * address since hardware will access it using the guest key.
> +		 */
> +		svm->vmsa_pa = __pa(svm->vmsa);
> +	}
> +
>  	svm->guest_state_loaded = false;
>  
>  	svm_switch_vmcb(svm, &svm->vmcb01);
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 9fcfc0a51737..285d9b97b4d2 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -177,6 +177,7 @@ struct vcpu_svm {
>  
>  	/* SEV-ES support */
>  	struct sev_es_save_area *vmsa;
> +	hpa_t vmsa_pa;
>  	struct ghcb *ghcb;
>  	struct kvm_host_map ghcb_map;
>  	bool received_first_sipi;
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 39/40] KVM: SVM: Use a VMSA physical address variable for populating VMCB
  2021-07-21  0:20   ` Sean Christopherson
@ 2021-07-21 16:26     ` Tom Lendacky
  0 siblings, 0 replies; 176+ messages in thread
From: Tom Lendacky @ 2021-07-21 16:26 UTC (permalink / raw)
  To: Sean Christopherson, Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Andy Lutomirski,
	Dave Hansen, Sergio Lopez, Peter Gonda, Peter Zijlstra,
	Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/20/21 7:20 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> From: Tom Lendacky <thomas.lendacky@amd.com>
>>
>> In preparation to support SEV-SNP AP Creation, use a variable that holds
>> the VMSA physical address rather than converting the virtual address.
>> This will allow SEV-SNP AP Creation to set the new physical address that
>> will be used should the vCPU reset path be taken.
> 
> I'm pretty sure adding vmsa_pa is unnecessary.  The next patch sets svm->vmsa_pa
> and vmcb->control.vmsa_pa as a pair.  And for the existing code, my proposed
> patch to emulate INIT on shutdown would eliminate the one path that zeros the
> VMCB[1].  That series patch also drops the init_vmcb() in svm_create_vcpu()[2].
> 
> Assuming there are no VMCB shenanigans I'm missing, sev_es_init_vmcb() can do
> 
> 	if (!init_event)
> 		svm->vmcb->control.vmsa_pa = __pa(svm->vmsa);

That will require passing init_event through to init_vmcb() and successive
functions, and ensuring that there isn't a path that could leave the VMSA
address unset when it is needed. This is very simple at the moment, but
maybe it can be re-worked once all of the other changes you mention are
integrated.

Thanks,
Tom

> 
> And while I'm thinking of it, the next patch should ideally free svm->vmsa when
> the guest configures a new VMSA for the vCPU.
> 
> [1] https://lkml.kernel.org/r/20210713163324.627647-45-seanjc@google.com
> [2] https://lkml.kernel.org/r/20210713163324.627647-10-seanjc@google.com
> 
>> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> ---
>>  arch/x86/kvm/svm/sev.c | 5 ++---
>>  arch/x86/kvm/svm/svm.c | 9 ++++++++-
>>  arch/x86/kvm/svm/svm.h | 1 +
>>  3 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 4cb4c1d7e444..d8ad6dd58c87 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -3553,10 +3553,9 @@ void sev_es_init_vmcb(struct vcpu_svm *svm)
>>  
>>  	/*
>>  	 * An SEV-ES guest requires a VMSA area that is a separate from the
>> -	 * VMCB page. Do not include the encryption mask on the VMSA physical
>> -	 * address since hardware will access it using the guest key.
>> +	 * VMCB page.
>>  	 */
>> -	svm->vmcb->control.vmsa_pa = __pa(svm->vmsa);
>> +	svm->vmcb->control.vmsa_pa = svm->vmsa_pa;
>>  
>>  	/* Can't intercept CR register access, HV can't modify CR registers */
>>  	svm_clr_intercept(svm, INTERCEPT_CR0_READ);
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 32e35d396508..74bc635c9608 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -1379,9 +1379,16 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
>>  	svm->vmcb01.ptr = page_address(vmcb01_page);
>>  	svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
>>  
>> -	if (vmsa_page)
>> +	if (vmsa_page) {
>>  		svm->vmsa = page_address(vmsa_page);
>>  
>> +		/*
>> +		 * Do not include the encryption mask on the VMSA physical
>> +		 * address since hardware will access it using the guest key.
>> +		 */
>> +		svm->vmsa_pa = __pa(svm->vmsa);
>> +	}
>> +
>>  	svm->guest_state_loaded = false;
>>  
>>  	svm_switch_vmcb(svm, &svm->vmcb01);
>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>> index 9fcfc0a51737..285d9b97b4d2 100644
>> --- a/arch/x86/kvm/svm/svm.h
>> +++ b/arch/x86/kvm/svm/svm.h
>> @@ -177,6 +177,7 @@ struct vcpu_svm {
>>  
>>  	/* SEV-ES support */
>>  	struct sev_es_save_area *vmsa;
>> +	hpa_t vmsa_pa;
>>  	struct ghcb *ghcb;
>>  	struct kvm_host_map ghcb_map;
>>  	bool received_first_sipi;
>> -- 
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 176+ messages in thread

* Re: [PATCH Part2 RFC v4 40/40] KVM: SVM: Support SEV-SNP AP Creation NAE event
  2021-07-21  0:01   ` Sean Christopherson
@ 2021-07-21 17:47     ` Tom Lendacky
  2021-07-21 19:52       ` Sean Christopherson
  0 siblings, 1 reply; 176+ messages in thread
From: Tom Lendacky @ 2021-07-21 17:47 UTC (permalink / raw)
  To: Sean Christopherson, Brijesh Singh
  Cc: x86, linux-kernel, kvm, linux-efi, platform-driver-x86,
	linux-coco, linux-mm, linux-crypto, Thomas Gleixner, Ingo Molnar,
	Joerg Roedel, H. Peter Anvin, Ard Biesheuvel, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Andy Lutomirski,
	Dave Hansen, Sergio Lopez, Peter Gonda, Peter Zijlstra,
	Srinivas Pandruvada, David Rientjes, Dov Murik,
	Tobin Feldman-Fitzthum, Borislav Petkov, Michael Roth,
	Vlastimil Babka, tony.luck, npmccallum, brijesh.ksingh

On 7/20/21 7:01 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> From: Tom Lendacky <thomas.lendacky@amd.com>
>>
>> Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
>> guests to create and start APs on their own.
> 
> The changelog really needs to clarify that this doesn't allow the guest to create
> arbitrary vCPUs.  The GHCB uses CREATE/DESTROY terminology, but this patch and its
> comments/documentation should very clearly call out that KVM's implementation is
> more along the line of vCPU online/offline.

Will update.

> 
> It should also be noted that KVM still onlines APs by default.  That also raises
> the question of whether or not KVM should support creating an offlined vCPU.
> E.g. several of the use cases I'm aware of want to do something along the lines
> of creating a VM with the max number of theoretical vCPUs, but in most instances
> only run a handful of vCPUs.  That's a fair amount of potential memory savings
> if the max theoretical number of vCPUs is high.
> 
>> A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
>> so as to avoid updating the VMSA pointer while the vCPU is running.
>>
>> For CREATE
>>   The guest supplies the GPA of the VMSA to be used for the vCPU with the
>>   specified APIC ID. The GPA is saved in the svm struct of the target
>>   vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the
>>   vCPU and then the vCPU is kicked.
>>
>> For CREATE_ON_INIT:
>>   The guest supplies the GPA of the VMSA to be used for the vCPU with the
>>   specified APIC ID the next time an INIT is performed. The GPA is saved
>>   in the svm struct of the target vCPU.
>>
>> For DESTROY:
>>   The guest indicates it wishes to stop the vCPU. The GPA is cleared from
>>   the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
>>   to vCPU and then the vCPU is kicked.
>>
>>
>> The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked as
>> a result of the event or as a result of an INIT. The handler sets the vCPU
>> to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will leave the
>> vCPU as not runnable. Any prev