kvm.vger.kernel.org archive mirror
* [PATCH] KVM: SEV: Replace KVM_EXIT_VMGEXIT with KVM_EXIT_SNP_REQ_CERTS
From: Michael Roth @ 2024-05-15  1:25 UTC (permalink / raw)
  To: seanjc
  Cc: kvm, linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, vkuznets,
	jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, papaluri

It's not clear whether SNP guests will need any other SNP-specific userspace
exit types in the future, and if they do, it's not clear that they would
necessarily be related to VMGEXIT events rather than something else entirely.

So, rather than trying to anticipate future use-cases and have a single
union structure to manage the associated parameters, just use a common
KVM_EXIT_SNP_* prefix, but otherwise treat these as separate events, and
go ahead and convert the only VMGEXIT type currently defined,
KVM_USER_VMGEXIT_REQ_CERTS, over to KVM_EXIT_SNP_REQ_CERTS.

Also, formally document that kvm_run->immediate_exit is guaranteed to
handle userspace IO completion callbacks before returning to userspace
with -EINTR, and rely on that guarantee as the means for userspace to know
when an attestation request is complete and the KVM_EXIT_SNP_REQ_CERTS
event is fully finished, at which point userspace can handle any
certificate synchronization cleanup such as releasing file locks.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
Hi Sean,

Here's an attempt to address your concerns regarding
KVM_EXIT_VMGEXIT/KVM_USER_VMGEXIT_REQ_CERTS. Please let me know what
you think.

The main gist of it is that it leverages kvm_run->immediate_exit, rather
than a flag in the kvm_run exit struct, to receive notification when the
attestation has been completed and the certificate data is no longer needed.

Additionally it drops the confusing KVM_EXIT_VMGEXIT naming in favor of
the KVM_EXIT_SNP_* prefix, and rather than introduce infrastructure and
a union to handle other SNP-specific types in the future, it simply
defines KVM_EXIT_SNP_REQ_CERTS as a one-off event so we are free to
handle future SNP-specific/SNP-related userspace exits in whatever way
makes sense for those cases.

Thanks,

Mike
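
As a rough illustration (not part of this patch), the userspace side could
look something like the sketch below. It assumes the
KVM_EXIT_SNP_REQ_CERTS/snp_req_certs uapi additions from this patch;
copy_certs() is a made-up helper that writes the certificate blob into guest
memory at data_gpa, updating data_npages and returning 0 or one of the
KVM_EXIT_SNP_REQ_CERTS_ERROR_* values:

  #include <sys/file.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Hypothetical helper, see above. */
  int copy_certs(__u64 data_gpa, __u64 *data_npages);

  static void handle_snp_req_certs(int vcpu_fd, struct kvm_run *run, int certs_fd)
  {
          /* Hold a shared lock on the cert blob until the request reaches firmware. */
          flock(certs_fd, LOCK_SH);

          run->snp_req_certs.ret = copy_certs(run->snp_req_certs.data_gpa,
                                              &run->snp_req_certs.data_npages);

          /*
           * Re-enter KVM_RUN with immediate_exit set: KVM runs the IO
           * completion callback (i.e. sends the attestation request to
           * firmware) and then returns -EINTR, at which point the lock can
           * safely be dropped.
           */
          run->immediate_exit = 1;
          ioctl(vcpu_fd, KVM_RUN, 0);
          run->immediate_exit = 0;

          flock(certs_fd, LOCK_UN);
  }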

 Documentation/virt/kvm/api.rst | 64 +++++++++++-----------------------
 arch/x86/kvm/svm/sev.c         | 28 +++------------
 include/uapi/linux/kvm.h       | 31 ++++++++--------
 3 files changed, 43 insertions(+), 80 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 6ab8b5b7c64e..ea05c16f3438 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6381,6 +6381,11 @@ to avoid usage of KVM_SET_SIGNAL_MASK, which has worse scalability.
 Rather than blocking the signal outside KVM_RUN, userspace can set up
 a signal handler that sets run->immediate_exit to a non-zero value.
 
+Also note that any KVM_EXIT_* events that have associated completion
+callbacks that KVM needs to process when KVM_RUN is called will be
+processed *before* exiting again to userspace with -EINTR as a result
+of run->immediate_exit.
+
 This field is ignored if KVM_CAP_IMMEDIATE_EXIT is not available.
 
 ::
@@ -7069,50 +7074,24 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
 ::
 
-		/* KVM_EXIT_VMGEXIT */
-		struct kvm_user_vmgexit {
-  #define KVM_USER_VMGEXIT_REQ_CERTS		1
-			__u32 type; /* KVM_USER_VMGEXIT_* type */
-			union {
-				struct {
-					__u64 data_gpa;
-					__u64 data_npages;
-  #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN   1
-  #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_BUSY          2
-  #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_GENERIC       (1 << 31)
-					__u32 ret;
-  #define KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE	BIT(0)
-					__u8 flags;
-  #define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING	0
-  #define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE		1
-					__u8 status;
-				} req_certs;
-			};
-		};
-
-
-If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
-has issued a VMGEXIT instruction (as documented by the AMD Architecture
-Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
-userspace. These are generally handled by the host kernel, but in some
-cases some aspects of handling a VMGEXIT are done in userspace.
-
-A kvm_user_vmgexit structure is defined to encapsulate the data to be
-sent to or returned by userspace. The type field defines the specific type
-of exit that needs to be serviced, and that type is used as a discriminator
-to determine which union type should be used for input/output.
-
-KVM_USER_VMGEXIT_REQ_CERTS
---------------------------
+		/* KVM_EXIT_SNP_REQ_CERTS */
+		struct {
+			__u64 data_gpa;
+			__u64 data_npages;
+  #define KVM_EXIT_SNP_REQ_CERTS_ERROR_INVALID_LEN   1
+  #define KVM_EXIT_SNP_REQ_CERTS_ERROR_BUSY          2
+  #define KVM_EXIT_SNP_REQ_CERTS_ERROR_GENERIC       (1 << 31)
+			__u32 ret;
+		} snp_req_certs;
 
 When an SEV-SNP guest issues a guest request for an attestation report, it
 has the option of issuing it in the form of an *extended* guest request, where a
 certificate blob is returned alongside the attestation report so the guest
 can validate the endorsement key used by SNP firmware to sign the report.
 These certificates are managed by userspace and are requested via
-KVM_EXIT_VMGEXITs using the KVM_USER_VMGEXIT_REQ_CERTS type.
+KVM_EXIT_SNP_REQ_CERTS exits.
 
-For the KVM_USER_VMGEXIT_REQ_CERTS type, the req_certs union type
+For KVM_EXIT_SNP_REQ_CERTS exits, the 'snp_req_certs' struct
 is used. The kernel will supply in 'data_gpa' the value the guest supplies
 via the RAX field of the GHCB when issuing extended guest requests.
 'data_npages' will similarly contain the value the guest supplies in RBX
@@ -7139,12 +7118,11 @@ this is for the VMM to obtain a shared or exclusive lock on the path the
 certificate blob file resides at before reading it and returning it to KVM,
 and that it continues to hold the lock until the attestation request is
 actually sent to firmware. To facilitate this, the VMM can set the
-KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE flag before returning the
-certificate blob, in which case another KVM_EXIT_VMGEXIT of type
-KVM_USER_VMGEXIT_REQ_CERTS will be sent to userspace with
-KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE being set in the status field to
-indicate the request is fully-completed and that any associated locks can be
-released.
+run->immediate_exit flag before returning the certificate blob, in which
+case KVM is guaranteed to process all pending IO completion callbacks
+before exiting to userspace with -EINTR. At that point userspace can
+release any locks it may have taken when the KVM_EXIT_SNP_REQ_CERTS exit
+was originally received.
 
 Tools/libraries that perform updates to SNP firmware TCB values or endorsement
 keys (e.g. firmware interfaces such as SNP_COMMIT, SNP_SET_CONFIG, or
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 6cf665c410b2..e6318bbd8a6a 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -4006,21 +4006,14 @@ static int snp_complete_ext_guest_req(struct kvm_vcpu *vcpu)
 	sev_ret_code fw_err = 0;
 	int vmm_ret;
 
-	vmm_ret = vcpu->run->vmgexit.req_certs.ret;
+	vmm_ret = vcpu->run->snp_req_certs.ret;
 	if (vmm_ret) {
 		if (vmm_ret == SNP_GUEST_VMM_ERR_INVALID_LEN)
 			vcpu->arch.regs[VCPU_REGS_RBX] =
-				vcpu->run->vmgexit.req_certs.data_npages;
+				vcpu->run->snp_req_certs.data_npages;
 		goto out;
 	}
 
-	/*
-	 * The request was completed on the previous completion callback and
-	 * this completion is only for the STATUS_DONE userspace notification.
-	 */
-	if (vcpu->run->vmgexit.req_certs.status == KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE)
-		goto out_resume;
-
 	control = &svm->vmcb->control;
 
 	if (__snp_handle_guest_req(kvm, control->exit_info_1,
@@ -4029,14 +4022,6 @@ static int snp_complete_ext_guest_req(struct kvm_vcpu *vcpu)
 
 out:
 	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
-
-	if (vcpu->run->vmgexit.req_certs.flags & KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE) {
-		vcpu->run->vmgexit.req_certs.status = KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE;
-		vcpu->run->vmgexit.req_certs.flags = 0;
-		return 0; /* notify userspace of completion */
-	}
-
-out_resume:
 	return 1; /* resume guest */
 }
 
@@ -4060,12 +4045,9 @@ static int snp_begin_ext_guest_req(struct kvm_vcpu *vcpu)
 	 * Grab the certificates from userspace so that they can be bundled with
 	 * attestation/guest requests.
 	 */
-	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
-	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_REQ_CERTS;
-	vcpu->run->vmgexit.req_certs.data_gpa = data_gpa;
-	vcpu->run->vmgexit.req_certs.data_npages = data_npages;
-	vcpu->run->vmgexit.req_certs.flags = 0;
-	vcpu->run->vmgexit.req_certs.status = KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING;
+	vcpu->run->exit_reason = KVM_EXIT_SNP_REQ_CERTS;
+	vcpu->run->snp_req_certs.data_gpa = data_gpa;
+	vcpu->run->snp_req_certs.data_npages = data_npages;
 	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_req;
 
 	return 0; /* forward request to userspace */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 106367d87189..8ebfc91dc967 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -135,22 +135,17 @@ struct kvm_xen_exit {
 	} u;
 };
 
-struct kvm_user_vmgexit {
-#define KVM_USER_VMGEXIT_REQ_CERTS		1
-	__u32 type; /* KVM_USER_VMGEXIT_* type */
+struct kvm_exit_snp {
+#define KVM_EXIT_SNP_REQ_CERTS		1
+	__u32 type; /* KVM_EXIT_SNP_* type */
 	union {
 		struct {
 			__u64 data_gpa;
 			__u64 data_npages;
-#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN   1
-#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_BUSY          2
-#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_GENERIC       (1 << 31)
+#define KVM_EXIT_SNP_REQ_CERTS_ERROR_INVALID_LEN   1
+#define KVM_EXIT_SNP_REQ_CERTS_ERROR_BUSY          2
+#define KVM_EXIT_SNP_REQ_CERTS_ERROR_GENERIC       (1 << 31)
 			__u32 ret;
-#define KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE	BIT(0)
-			__u8 flags;
-#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING	0
-#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE		1
-			__u8 status;
 		} req_certs;
 	};
 };
@@ -198,7 +193,7 @@ struct kvm_user_vmgexit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
-#define KVM_EXIT_VMGEXIT          40
+#define KVM_EXIT_SNP_REQUEST_CERTS 40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -454,8 +449,16 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
-		/* KVM_EXIT_VMGEXIT */
-		struct kvm_user_vmgexit vmgexit;
+		/* KVM_EXIT_SNP_REQ_CERTS */
+		struct {
+			__u64 data_gpa;
+			__u64 data_npages;
+#define KVM_EXIT_SNP_REQ_CERTS_ERROR_INVALID_LEN   1
+#define KVM_EXIT_SNP_REQ_CERTS_ERROR_BUSY          2
+#define KVM_EXIT_SNP_REQ_CERTS_ERROR_GENERIC       (1 << 31)
+			__u32 ret;
+		} snp_req_certs;
+
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1



* [PULL 04/19] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
From: Michael Roth @ 2024-05-10 21:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Sean Christopherson, Brijesh Singh, Ashish Kalra

From: Brijesh Singh <brijesh.singh@amd.com>

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. Other commands can then at that point be
used to load/encrypt data into the guest's initial launch image.

For more information see the SEV-SNP specification.
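
For illustration only (not part of this patch), the userspace side might look
roughly like the sketch below, which assumes the kvm_sev_snp_launch_start
definition and policy-bit checks added here; vm_fd is a VM fd already
initialized as an SNP guest via KVM_SEV_INIT2 and sev_fd is an open /dev/sev
fd:

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int do_snp_launch_start(int vm_fd, int sev_fd)
  {
          struct kvm_sev_snp_launch_start start = {
                  /* Reserved-MBO (bit 17) and SMT-allowed (bit 16) must be set. */
                  .policy = (1ULL << 17) | (1ULL << 16),
          };
          struct kvm_sev_cmd cmd = {
                  .id     = KVM_SEV_SNP_LAUNCH_START,
                  .data   = (__u64)(unsigned long)&start,
                  .sev_fd = sev_fd,
          };

          return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }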

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-6-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 .../virt/kvm/x86/amd-memory-encryption.rst    |  28 ++-
 arch/x86/include/uapi/asm/kvm.h               |  11 ++
 arch/x86/kvm/svm/sev.c                        | 176 +++++++++++++++++-
 arch/x86/kvm/svm/svm.h                        |   1 +
 4 files changed, 212 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 9677a0714a39..dd179e162a87 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -466,6 +466,30 @@ issued by the hypervisor to make the guest ready for execution.
 
 Returns: 0 on success, -negative on error
 
+18. KVM_SEV_SNP_LAUNCH_START
+----------------------------
+
+The KVM_SEV_SNP_LAUNCH_START command is used for creating the memory
+encryption context for the SEV-SNP guest. It must be called prior to
+issuing KVM_SEV_SNP_LAUNCH_UPDATE or KVM_SEV_SNP_LAUNCH_FINISH.
+
+Parameters (in): struct  kvm_sev_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_sev_snp_launch_start {
+                __u64 policy;           /* Guest policy to use. */
+                __u8 gosvw[16];         /* Guest OS visible workarounds. */
+                __u16 flags;            /* Must be zero. */
+                __u8 pad0[6];
+                __u64 pad1[4];
+        };
+
+See SNP_LAUNCH_START in the SEV-SNP specification [snp-fw-abi]_ for further
+details on the input parameters in ``struct kvm_sev_snp_launch_start``.
+
 Device attribute API
 ====================
 
@@ -497,9 +521,11 @@ References
 ==========
 
 
-See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
+See [white-paper]_, [api-spec]_, [amd-apm]_, [kvm-forum]_, and [snp-fw-abi]_
+for more info.
 
 .. [white-paper] https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
 .. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
 .. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
 .. [kvm-forum]  https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+.. [snp-fw-abi] https://www.amd.com/system/files/TechDocs/56860.pdf
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index d2ae5fcc0275..693a80ffe40a 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -697,6 +697,9 @@ enum sev_cmd_id {
 	/* Second time is the charm; improved versions of the above ioctls.  */
 	KVM_SEV_INIT2,
 
+	/* SNP-specific commands */
+	KVM_SEV_SNP_LAUNCH_START = 100,
+
 	KVM_SEV_NR_MAX,
 };
 
@@ -824,6 +827,14 @@ struct kvm_sev_receive_update_data {
 	__u32 pad2;
 };
 
+struct kvm_sev_snp_launch_start {
+	__u64 policy;
+	__u8 gosvw[16];
+	__u16 flags;
+	__u8 pad0[6];
+	__u64 pad1[4];
+};
+
 #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
 #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index b3345d45b989..b372ae5c8c58 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -25,6 +25,7 @@
 #include <asm/fpu/xcr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/debugreg.h>
+#include <asm/sev.h>
 
 #include "mmu.h"
 #include "x86.h"
@@ -59,6 +60,21 @@ static u64 sev_supported_vmsa_features;
 #define AP_RESET_HOLD_NAE_EVENT		1
 #define AP_RESET_HOLD_MSR_PROTO		2
 
+/* As defined by SEV-SNP Firmware ABI, under "Guest Policy". */
+#define SNP_POLICY_MASK_API_MINOR	GENMASK_ULL(7, 0)
+#define SNP_POLICY_MASK_API_MAJOR	GENMASK_ULL(15, 8)
+#define SNP_POLICY_MASK_SMT		BIT_ULL(16)
+#define SNP_POLICY_MASK_RSVD_MBO	BIT_ULL(17)
+#define SNP_POLICY_MASK_DEBUG		BIT_ULL(19)
+#define SNP_POLICY_MASK_SINGLE_SOCKET	BIT_ULL(20)
+
+#define SNP_POLICY_MASK_VALID		(SNP_POLICY_MASK_API_MINOR	| \
+					 SNP_POLICY_MASK_API_MAJOR	| \
+					 SNP_POLICY_MASK_SMT		| \
+					 SNP_POLICY_MASK_RSVD_MBO	| \
+					 SNP_POLICY_MASK_DEBUG		| \
+					 SNP_POLICY_MASK_SINGLE_SOCKET)
+
 static u8 sev_enc_bit;
 static DECLARE_RWSEM(sev_deactivate_lock);
 static DEFINE_MUTEX(sev_bitmap_lock);
@@ -69,6 +85,8 @@ static unsigned int nr_asids;
 static unsigned long *sev_asid_bitmap;
 static unsigned long *sev_reclaim_asid_bitmap;
 
+static int snp_decommission_context(struct kvm *kvm);
+
 struct enc_region {
 	struct list_head list;
 	unsigned long npages;
@@ -95,12 +113,17 @@ static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid)
 	down_write(&sev_deactivate_lock);
 
 	wbinvd_on_all_cpus();
-	ret = sev_guest_df_flush(&error);
+
+	if (sev_snp_enabled)
+		ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
+	else
+		ret = sev_guest_df_flush(&error);
 
 	up_write(&sev_deactivate_lock);
 
 	if (ret)
-		pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
+		pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+		       sev_snp_enabled ? "-SNP" : "", ret, error);
 
 	return ret;
 }
@@ -1998,6 +2021,106 @@ int sev_dev_get_attr(u32 group, u64 attr, u64 *val)
 	}
 }
 
+/*
+ * The guest context contains all the information, keys and metadata
+ * associated with the guest that the firmware tracks to implement SEV
+ * and SNP features. The firmware stores the guest context in a
+ * hypervisor-provided page via the SNP_GCTX_CREATE command.
+ */
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct sev_data_snp_addr data = {};
+	void *context;
+	int rc;
+
+	/* Allocate memory for context page */
+	context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+	if (!context)
+		return NULL;
+
+	data.address = __psp_pa(context);
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+	if (rc) {
+		pr_warn("Failed to create SEV-SNP context, rc %d fw_error %d",
+			rc, argp->error);
+		snp_free_firmware_page(context);
+		return NULL;
+	}
+
+	return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_activate data = {0};
+
+	data.gctx_paddr = __psp_pa(sev->snp_context);
+	data.asid = sev_get_asid(kvm);
+	return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_start start = {0};
+	struct kvm_sev_snp_launch_start params;
+	int rc;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+		return -EFAULT;
+
+	/* Don't allow userspace to allocate memory for more than 1 SNP context. */
+	if (sev->snp_context)
+		return -EINVAL;
+
+	sev->snp_context = snp_context_create(kvm, argp);
+	if (!sev->snp_context)
+		return -ENOTTY;
+
+	if (params.flags)
+		return -EINVAL;
+
+	if (params.policy & ~SNP_POLICY_MASK_VALID)
+		return -EINVAL;
+
+	/* Check for policy bits that must be set */
+	if (!(params.policy & SNP_POLICY_MASK_RSVD_MBO) ||
+	    !(params.policy & SNP_POLICY_MASK_SMT))
+		return -EINVAL;
+
+	if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET)
+		return -EINVAL;
+
+	start.gctx_paddr = __psp_pa(sev->snp_context);
+	start.policy = params.policy;
+	memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+	if (rc) {
+		pr_debug("%s: SEV_CMD_SNP_LAUNCH_START firmware command failed, rc %d\n",
+			 __func__, rc);
+		goto e_free_context;
+	}
+
+	sev->fd = argp->sev_fd;
+	rc = snp_bind_asid(kvm, &argp->error);
+	if (rc) {
+		pr_debug("%s: Failed to bind ASID to SEV-SNP context, rc %d\n",
+			 __func__, rc);
+		goto e_free_context;
+	}
+
+	return 0;
+
+e_free_context:
+	snp_decommission_context(kvm);
+
+	return rc;
+}
+
 int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -2021,6 +2144,15 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 		goto out;
 	}
 
+	/*
+	 * Once KVM_SEV_INIT2 initializes a KVM instance as an SNP guest, only
+	 * allow the use of SNP-specific commands.
+	 */
+	if (sev_snp_guest(kvm) && sev_cmd.id < KVM_SEV_SNP_LAUNCH_START) {
+		r = -EPERM;
+		goto out;
+	}
+
 	switch (sev_cmd.id) {
 	case KVM_SEV_ES_INIT:
 		if (!sev_es_enabled) {
@@ -2085,6 +2217,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_RECEIVE_FINISH:
 		r = sev_receive_finish(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_START:
+		r = snp_launch_start(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -2280,6 +2415,31 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
 	return ret;
 }
 
+static int snp_decommission_context(struct kvm *kvm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_addr data = {};
+	int ret;
+
+	/* If context is not created then do nothing */
+	if (!sev->snp_context)
+		return 0;
+
+	/* Do the decommission, which will unbind the ASID from the SNP context */
+	data.address = __sme_pa(sev->snp_context);
+	down_write(&sev_deactivate_lock);
+	ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
+	up_write(&sev_deactivate_lock);
+
+	if (WARN_ONCE(ret, "Failed to release guest context, ret %d", ret))
+		return ret;
+
+	snp_free_firmware_page(sev->snp_context);
+	sev->snp_context = NULL;
+
+	return 0;
+}
+
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -2321,7 +2481,17 @@ void sev_vm_destroy(struct kvm *kvm)
 		}
 	}
 
-	sev_unbind_asid(kvm, sev->handle);
+	if (sev_snp_guest(kvm)) {
+		/*
+		 * Decommission handles unbinding of the ASID. If it fails for
+		 * some unexpected reason, just leak the ASID.
+		 */
+		if (snp_decommission_context(kvm))
+			return;
+	} else {
+		sev_unbind_asid(kvm, sev->handle);
+	}
+
 	sev_asid_free(sev);
 }
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 583e035d38f8..305772d36490 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -93,6 +93,7 @@ struct kvm_sev_info {
 	struct list_head mirror_entry; /* Use as a list entry of mirrors */
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
 	atomic_t migration_in_progress;
+	void *snp_context;      /* SNP guest context page */
 };
 
 struct kvm_svm {
-- 
2.25.1



* [PULL 18/19] KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
From: Michael Roth @ 2024-05-10 21:10 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, linux-kernel, Sean Christopherson

Version 2 of the GHCB specification added support for the SNP Extended
Guest Request Message NAE event. This event serves a nearly identical
purpose to the previously-added SNP_GUEST_REQUEST event, but additionally
allows certificate data to be returned via a guest-supplied buffer, mainly
so the guest can verify the signature of the attestation report returned
by firmware.

This certificate data is supplied by userspace, so unlike with
SNP_GUEST_REQUEST events, SNP_EXTENDED_GUEST_REQUEST events are first
forwarded to userspace via a KVM_EXIT_VMGEXIT exit structure, and then
the firmware request is made after the certificate data has been fetched
from userspace.

Since there is a potential for race conditions where the
userspace-supplied certificate data may be out-of-sync relative to the
reported TCB or VLEK that firmware will use when signing attestation
reports, a hook is also provided so that userspace can be informed once
the attestation request is actually completed. See the updates to
Documentation/ for more details on these aspects.
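
As a rough illustration (not part of this patch) of the VMM-side flow
described in the Documentation update below: fetch_certs() is a made-up
helper that copies the certificate blob into guest memory at data_gpa,
updating data_npages and returning 0 or one of the
KVM_USER_VMGEXIT_REQ_CERTS_ERROR_* values:

  #include <sys/file.h>
  #include <linux/kvm.h>

  /* Hypothetical helper, see above. */
  int fetch_certs(__u64 data_gpa, __u64 *data_npages);

  static void handle_req_certs(struct kvm_run *run, int certs_fd)
  {
          struct kvm_user_vmgexit *vmgexit = &run->vmgexit;

          if (vmgexit->req_certs.status == KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE) {
                  /* Attestation request has reached firmware; drop the lock. */
                  flock(certs_fd, LOCK_UN);
                  return;
          }

          /* Initial exit: lock the blob, copy it in, and request a DONE notification. */
          flock(certs_fd, LOCK_SH);
          vmgexit->req_certs.ret = fetch_certs(vmgexit->req_certs.data_gpa,
                                               &vmgexit->req_certs.data_npages);
          vmgexit->req_certs.flags |= KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE;
  }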

Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-20-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 Documentation/virt/kvm/api.rst | 87 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/sev.c         | 86 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h         |  3 ++
 include/uapi/linux/kvm.h       | 23 +++++++++
 4 files changed, 199 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..f3780ac98d56 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7060,6 +7060,93 @@ Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
+::
+
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit {
+  #define KVM_USER_VMGEXIT_REQ_CERTS		1
+			__u32 type; /* KVM_USER_VMGEXIT_* type */
+			union {
+				struct {
+					__u64 data_gpa;
+					__u64 data_npages;
+  #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN   1
+  #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_BUSY          2
+  #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_GENERIC       (1 << 31)
+					__u32 ret;
+  #define KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE	BIT(0)
+					__u8 flags;
+  #define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING	0
+  #define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE		1
+					__u8 status;
+				} req_certs;
+			};
+		};
+
+
+If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
+has issued a VMGEXIT instruction (as documented by the AMD Architecture
+Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
+userspace. These are generally handled by the host kernel, but in some
+cases some aspects of handling a VMGEXIT are done in userspace.
+
+A kvm_user_vmgexit structure is defined to encapsulate the data to be
+sent to or returned by userspace. The type field defines the specific type
+of exit that needs to be serviced, and that type is used as a discriminator
+to determine which union type should be used for input/output.
+
+KVM_USER_VMGEXIT_REQ_CERTS
+--------------------------
+
+When an SEV-SNP guest issues a guest request for an attestation report, it
+has the option of issuing it in the form of an *extended* guest request, where a
+certificate blob is returned alongside the attestation report so the guest
+can validate the endorsement key used by SNP firmware to sign the report.
+These certificates are managed by userspace and are requested via
+KVM_EXIT_VMGEXITs using the KVM_USER_VMGEXIT_REQ_CERTS type.
+
+For the KVM_USER_VMGEXIT_REQ_CERTS type, the req_certs union type
+is used. The kernel will supply in 'data_gpa' the value the guest supplies
+via the RAX field of the GHCB when issuing extended guest requests.
+'data_npages' will similarly contain the value the guest supplies in RBX
+denoting the number of shared pages available to write the certificate
+data into.
+
+  - If the supplied number of pages is sufficient, userspace should write
+    the certificate data blob (in the format defined by the GHCB spec) in
+    the address indicated by 'data_gpa' and set 'ret' to 0.
+
+  - If the number of pages supplied is not sufficient, userspace must write
+    the required number of pages in 'data_npages' and then set 'ret' to 1.
+
+  - If userspace is temporarily unable to handle the request, 'ret' should
+    be set to 2 to inform the guest to retry later.
+
+  - If some other error occurred, userspace should set 'ret' to a non-zero
+    value that is distinct from the specific return values mentioned above.
+
+Generally some care needs to be taken to keep the returned certificate data in
+sync with the actual endorsement key in use by firmware at the time the
+attestation request is sent to SNP firmware. The recommended scheme to do
+this is for the VMM to obtain a shared or exclusive lock on the path the
+certificate blob file resides at before reading it and returning it to KVM,
+and that it continues to hold the lock until the attestation request is
+actually sent to firmware. To facilitate this, the VMM can set the
+KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE flag before returning the
+certificate blob, in which case another KVM_EXIT_VMGEXIT of type
+KVM_USER_VMGEXIT_REQ_CERTS will be sent to userspace with
+KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE being set in the status field to
+indicate the request is fully-completed and that any associated locks can be
+released.
+
+Tools/libraries that perform updates to SNP firmware TCB values or endorsement
+keys (e.g. firmware interfaces such as SNP_COMMIT, SNP_SET_CONFIG, or
+SNP_VLEK_LOAD, see Documentation/virt/coco/sev-guest.rst for more details) in
+such a way that the certificate blob needs to be updated, should similarly
+take an exclusive lock on the certificate blob for the duration of any updates
+to firmware or the certificate blob contents to ensure that VMMs using the
+above scheme will not return certificate blob data that is out of sync with
+firmware.
 
 6. Capabilities that can be enabled on vCPUs
 ============================================
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 00d29d278f6e..398266bef2ca 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3297,6 +3297,11 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 		if (!sev_snp_guest(vcpu->kvm))
 			goto vmgexit_err;
 		break;
+	case SVM_VMGEXIT_EXT_GUEST_REQUEST:
+		if (!sev_snp_guest(vcpu->kvm) || !kvm_ghcb_rax_is_valid(svm) ||
+		    !kvm_ghcb_rbx_is_valid(svm))
+			goto vmgexit_err;
+		break;
 	default:
 		reason = GHCB_ERR_INVALID_EVENT;
 		goto vmgexit_err;
@@ -3996,6 +4001,84 @@ static void snp_handle_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp
 	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
 }
 
+static int snp_complete_ext_guest_req(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	struct vmcb_control_area *control;
+	struct kvm *kvm = vcpu->kvm;
+	sev_ret_code fw_err = 0;
+	int vmm_ret;
+
+	vmm_ret = vcpu->run->vmgexit.req_certs.ret;
+	if (vmm_ret) {
+		if (vmm_ret == SNP_GUEST_VMM_ERR_INVALID_LEN)
+			vcpu->arch.regs[VCPU_REGS_RBX] =
+				vcpu->run->vmgexit.req_certs.data_npages;
+		goto out;
+	}
+
+	/*
+	 * The request was completed on the previous completion callback and
+	 * this completion is only for the STATUS_DONE userspace notification.
+	 */
+	if (vcpu->run->vmgexit.req_certs.status == KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE)
+		goto out_resume;
+
+	control = &svm->vmcb->control;
+
+	if (__snp_handle_guest_req(kvm, control->exit_info_1,
+				   control->exit_info_2, &fw_err))
+		vmm_ret = SNP_GUEST_VMM_ERR_GENERIC;
+
+out:
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
+
+	if (vcpu->run->vmgexit.req_certs.flags & KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE) {
+		vcpu->run->vmgexit.req_certs.status = KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE;
+		vcpu->run->vmgexit.req_certs.flags = 0;
+		return 0; /* notify userspace of completion */
+	}
+
+out_resume:
+	return 1; /* resume guest */
+}
+
+static int snp_begin_ext_guest_req(struct kvm_vcpu *vcpu)
+{
+	int vmm_ret = SNP_GUEST_VMM_ERR_GENERIC;
+	struct vcpu_svm *svm = to_svm(vcpu);
+	unsigned long data_npages;
+	sev_ret_code fw_err;
+	gpa_t data_gpa;
+
+	if (!sev_snp_guest(vcpu->kvm))
+		goto abort_request;
+
+	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
+	data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
+
+	if (!IS_ALIGNED(data_gpa, PAGE_SIZE))
+		goto abort_request;
+
+	/*
+	 * Grab the certificates from userspace so that they can be bundled with
+	 * attestation/guest requests.
+	 */
+	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_REQ_CERTS;
+	vcpu->run->vmgexit.req_certs.data_gpa = data_gpa;
+	vcpu->run->vmgexit.req_certs.data_npages = data_npages;
+	vcpu->run->vmgexit.req_certs.flags = 0;
+	vcpu->run->vmgexit.req_certs.status = KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING;
+	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_req;
+
+	return 0; /* forward request to userspace */
+
+abort_request:
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
+	return 1; /* resume guest */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -4274,6 +4357,9 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 		snp_handle_guest_req(svm, control->exit_info_1, control->exit_info_2);
 		ret = 1;
 		break;
+	case SVM_VMGEXIT_EXT_GUEST_REQUEST:
+		ret = snp_begin_ext_guest_req(vcpu);
+		break;
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 555c55f50298..97b3683ea324 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -309,6 +309,9 @@ struct vcpu_svm {
 
 	/* Guest GIF value, used when vGIF is not enabled */
 	bool guest_gif;
+
+	/* Transaction ID associated with SNP config updates */
+	u64 snp_transaction_id;
 };
 
 struct svm_cpu_data {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..106367d87189 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -135,6 +135,26 @@ struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_user_vmgexit {
+#define KVM_USER_VMGEXIT_REQ_CERTS		1
+	__u32 type; /* KVM_USER_VMGEXIT_* type */
+	union {
+		struct {
+			__u64 data_gpa;
+			__u64 data_npages;
+#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN   1
+#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_BUSY          2
+#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_GENERIC       (1 << 31)
+			__u32 ret;
+#define KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE	BIT(0)
+			__u8 flags;
+#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING	0
+#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE		1
+			__u8 status;
+		} req_certs;
+	};
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -178,6 +198,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_VMGEXIT          40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -433,6 +454,8 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit vmgexit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1



* [PULL 10/19] KVM: SEV: Add support to handle RMP nested page faults
From: Michael Roth @ 2024-05-10 21:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Sean Christopherson, Brijesh Singh, Ashish Kalra

From: Brijesh Singh <brijesh.singh@amd.com>

When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When
hardware encounters RMP check failure caused by the guest memory access
it raises the #NPF. The error code contains additional information on
the access type. See the APM volume 2 for additional information.

When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-12-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/include/asm/sev.h      |   3 +
 arch/x86/kvm/mmu.h              |   2 -
 arch/x86/kvm/mmu/mmu.c          |   1 +
 arch/x86/kvm/svm/sev.c          | 109 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  14 ++--
 arch/x86/kvm/svm/svm.h          |   3 +
 arch/x86/kvm/trace.h            |  31 +++++++++
 arch/x86/kvm/x86.c              |   1 +
 9 files changed, 159 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ea992f9b6e68..5830d42232da 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1945,6 +1945,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 93ed60080cfe..14394407245c 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -91,6 +91,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 /* RMUPDATE detected 4K page and 2MB page overlap. */
 #define RMPUPDATE_FAIL_OVERLAP		4
 
+/* PSMASH failed due to concurrent access by another CPU */
+#define PSMASH_FAIL_INUSE		3
+
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
 #define RMP_PG_SIZE_2M			1
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2343c9f00e31..e3cb35b9396d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -251,8 +251,6 @@ static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
 	return __kvm_mmu_honors_guest_mtrrs(kvm_arch_has_noncoherent_dma(kvm));
 }
 
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9c7ab06ce454..83025fe2d17d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6783,6 +6783,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 
 	return need_tlb_flush;
 }
+EXPORT_SYMBOL_GPL(kvm_zap_gfn_range);
 
 static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
 					   const struct kvm_memory_slot *slot)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 46669431b53d..515bd6154a4b 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3465,6 +3465,23 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(kvm_pfn_t pfn)
+{
+	int ret;
+
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	/*
+	 * PSMASH_FAIL_INUSE indicates another processor is modifying the
+	 * entry, so retry until that's no longer the case.
+	 */
+	do {
+		ret = psmash(pfn);
+	} while (ret == PSMASH_FAIL_INUSE);
+
+	return ret;
+}
+
 static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4229,3 +4246,95 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return p;
 }
+
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = vcpu->kvm;
+	int order, rmp_level, ret;
+	bool assigned;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	gfn = gpa >> PAGE_SHIFT;
+
+	/*
+	 * The only time RMP faults occur for shared pages is when the guest is
+	 * triggering an RMP fault for an implicit page-state change from
+	 * shared->private. Implicit page-state changes are forwarded to
+	 * userspace via KVM_EXIT_MEMORY_FAULT events, however, so RMP faults
+	 * for shared pages should not end up here.
+	 */
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault for non-private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);
+	if (ret) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no backing page for private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+	if (ret || !assigned) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
+				    gpa, pfn, ret);
+		goto out_no_trace;
+	}
+
+	/*
+	 * There are 2 cases where a PSMASH may be needed to resolve an #NPF
+	 * with PFERR_GUEST_RMP_BIT set:
+	 *
+	 * 1) RMPADJUST/PVALIDATE can trigger an #NPF with PFERR_GUEST_SIZEM
+	 *    bit set if the guest issues them with a smaller granularity than
+	 *    what is indicated by the page-size bit in the 2MB RMP entry for
+	 *    the PFN that backs the GPA.
+	 *
+	 * 2) Guest access via NPT can trigger an #NPF if the NPT mapping is
+	 *    smaller than what is indicated by the 2MB RMP entry for the PFN
+	 *    that backs the GPA.
+	 *
+	 * In both these cases, the corresponding 2M RMP entry needs to
+	 * be PSMASH'd to 512 4K RMP entries.  If the RMP entry is already
+	 * split into 4K RMP entries, then this is likely a spurious case which
+	 * can occur when there are concurrent accesses by the guest to a 2MB
+	 * GPA range that is backed by a 2MB-aligned PFN whose RMP entry is in
+	 * the process of being PSMASH'd into 4K entries. These cases should
+	 * resolve automatically on subsequent accesses, so just ignore them
+	 * here.
+	 */
+	if (rmp_level == PG_LEVEL_4K)
+		goto out;
+
+	ret = snp_rmptable_psmash(pfn);
+	if (ret) {
+		/*
+		 * Look it up again. If it's 4K now then the PSMASH may have
+		 * raced with another process and the issue has already resolved
+		 * itself.
+		 */
+		if (!snp_lookup_rmpentry(pfn, &assigned, &rmp_level) &&
+		    assigned && rmp_level == PG_LEVEL_4K)
+			goto out;
+
+		pr_warn_ratelimited("SEV: Unable to split RMP entry for GPA 0x%llx PFN 0x%llx ret %d\n",
+				    gpa, pfn, ret);
+	}
+
+	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+out:
+	trace_kvm_rmp_fault(vcpu, gpa, pfn, error_code, rmp_level, ret);
+out_no_trace:
+	put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 66d5e2e46a66..bdaf39571817 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2044,6 +2044,7 @@ static int pf_interception(struct kvm_vcpu *vcpu)
 static int npf_interception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	int rc;
 
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
@@ -2061,10 +2062,15 @@ static int npf_interception(struct kvm_vcpu *vcpu)
 		error_code |= PFERR_PRIVATE_ACCESS;
 
 	trace_kvm_page_fault(vcpu, fault_address, error_code);
-	return kvm_mmu_page_fault(vcpu, fault_address, error_code,
-			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
-			svm->vmcb->control.insn_bytes : NULL,
-			svm->vmcb->control.insn_len);
+	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+				svm->vmcb->control.insn_bytes : NULL,
+				svm->vmcb->control.insn_len);
+
+	if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK)
+		sev_handle_rmp_fault(vcpu, fault_address, error_code);
+
+	return rc;
 }
 
 static int db_interception(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 32d37ef5f580..36b573427b85 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -728,6 +728,7 @@ void sev_hardware_unsetup(void);
 int sev_cpu_init(struct svm_cpu_data *sd);
 int sev_dev_get_attr(u32 group, u64 attr, u64 *val);
 extern unsigned int max_sev_asid;
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 #else
 static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu) {
 	return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
@@ -741,6 +742,8 @@ static inline void sev_hardware_unsetup(void) {}
 static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
 static inline int sev_dev_get_attr(u32 group, u64 attr, u64 *val) { return -ENXIO; }
 #define max_sev_asid 0
+static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
+
 #endif
 
 /* vmenter.S */
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index c6b4b1728006..3531a187d5d9 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1834,6 +1834,37 @@ TRACE_EVENT(kvm_vmgexit_msr_protocol_exit,
 		  __entry->vcpu_id, __entry->ghcb_gpa, __entry->result)
 );
 
+/*
+ * Tracepoint for #NPFs due to RMP faults.
+ */
+TRACE_EVENT(kvm_rmp_fault,
+	TP_PROTO(struct kvm_vcpu *vcpu, u64 gpa, u64 pfn, u64 error_code,
+		 int rmp_level, int psmash_ret),
+	TP_ARGS(vcpu, gpa, pfn, error_code, rmp_level, psmash_ret),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, vcpu_id)
+		__field(u64, gpa)
+		__field(u64, pfn)
+		__field(u64, error_code)
+		__field(int, rmp_level)
+		__field(int, psmash_ret)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id	= vcpu->vcpu_id;
+		__entry->gpa		= gpa;
+		__entry->pfn		= pfn;
+		__entry->error_code	= error_code;
+		__entry->rmp_level	= rmp_level;
+		__entry->psmash_ret	= psmash_ret;
+	),
+
+	TP_printk("vcpu %u gpa %016llx pfn 0x%llx error_code 0x%llx rmp_level %d psmash_ret %d",
+		  __entry->vcpu_id, __entry->gpa, __entry->pfn,
+		  __entry->error_code, __entry->rmp_level, __entry->psmash_ret)
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be43c77315ce..60ebe3ee9118 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14003,6 +14003,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
 
 static int __init kvm_x86_init(void)
 {
-- 
2.25.1



* [PATCH 03/17] KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
From: Paolo Bonzini @ 2024-05-07 15:58 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: Sean Christopherson

From: Sean Christopherson <seanjc@google.com>

Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
guests, as specified by the APM:

 * Bit 31 (RMP):   Set to 1 if the fault was caused due to an RMP check or a
                   VMPL check failure, 0 otherwise.
 * Bit 34 (ENC):   Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
 * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
                   PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
 * Bit 36 (VMPL):  Set to 1 if the fault was caused by a VMPL permission
                   check failure, 0 otherwise.

Note, the APM is *extremely* misleading, and strongly implies that the
above flags can _only_ be set for #NPF exits from SNP guests.  That is a
lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_
flavor of SEV guest on SNP capable hardware.
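
Purely as an illustration (the helpers below are hypothetical and not part of
this patch), the new bits are consumed like any other #NPF error code flag,
e.g.:

  /* Sketch only; assumes the PFERR_* definitions from kvm_host.h. */
  static inline bool npf_is_rmp_fault(u64 error_code)
  {
          /* RMP or VMPL check failure; only generated on SNP-capable hardware. */
          return error_code & PFERR_GUEST_RMP_MASK;
  }

  static inline bool npf_is_encrypted_access(u64 error_code)
  {
          /* Guest's effective C-bit was 1; possible for any SEV+ guest. */
          return error_code & PFERR_GUEST_ENC_MASK;
  }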

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240228024147.41573-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_host.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a047480da5af..58bbcf76ad1e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -261,8 +261,12 @@ enum x86_intercept_stage;
 #define PFERR_FETCH_MASK	BIT(4)
 #define PFERR_PK_MASK		BIT(5)
 #define PFERR_SGX_MASK		BIT(15)
+#define PFERR_GUEST_RMP_MASK	BIT_ULL(31)
 #define PFERR_GUEST_FINAL_MASK	BIT_ULL(32)
 #define PFERR_GUEST_PAGE_MASK	BIT_ULL(33)
+#define PFERR_GUEST_ENC_MASK	BIT_ULL(34)
+#define PFERR_GUEST_SIZEM_MASK	BIT_ULL(35)
+#define PFERR_GUEST_VMPL_MASK	BIT_ULL(36)
 #define PFERR_IMPLICIT_ACCESS	BIT_ULL(48)
 
 #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK |	\
-- 
2.43.0




* [PATCH 06/17] KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero
From: Paolo Bonzini @ 2024-05-07 15:58 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: Sean Christopherson, Kai Huang

From: Sean Christopherson <seanjc@google.com>

WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
as the error code for #PF is limited to 32 bits (and in practice, 16 bits
on Intel CPUs).  This behavior is architectural, is part of KVM's ABI
(see kvm_vcpu_events.error_code), and is explicitly documented as being
preserved for intercepted #PF in both the APM:

  The error code saved in EXITINFO1 is the same as would be pushed onto
  the stack by a non-intercepted #PF exception in protected mode.

and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
32-bit field.

Simply drop the upper bits if hardware provides garbage, as spurious
information should do no harm (though in all likelihood hardware is buggy
and the kernel is doomed).

Handling all upper 32 bits in the #PF path will allow moving the sanity
check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
PFERR_GUEST_ENC_MASK without running afoul of the sanity check.

Note, this is also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
bit minimizes the probability of a collision with the "other" vendor,
without needing to plumb more bits through microcode.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240228024147.41573-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 28b1e7657d7f..3609167ba30e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4501,6 +4501,13 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 	if (WARN_ON_ONCE(fault_address >> 32))
 		return -EFAULT;
 #endif
+	/*
+	 * Legacy #PF exceptions only have a 32-bit error code.  Simply drop the
+	 * upper bits as KVM doesn't use them for #PF (because they are never
+	 * set), and to ensure there are no collisions with KVM-defined bits.
+	 */
+	if (WARN_ON_ONCE(error_code >> 32))
+		error_code = lower_32_bits(error_code);
 
 	/* Ensure the above sanity check also covers KVM-defined flags. */
 	BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK));
-- 
2.43.0




* [PATCH v2 00/17] KVM: x86/mmu: Page fault and MMIO cleanups
From: Paolo Bonzini @ 2024-05-07 15:58 UTC (permalink / raw)
  To: linux-kernel, kvm

This is an updated version of the series at
https://patchew.org/linux/20240228024147.41573-1-seanjc@google.com/
which has been used for SEV-SNP development, taking into account
all review comments.  Patch 8 is the only completely new patch
(and even then it had been posted together with the various
TDX/SNP prep series).

Here is an annotated git range-diff:

==============================================================================
KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulation
- tweak commit message, add comment

    @@ Commit message
         Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
         triggers emulation of any kind, as KVM doesn't currently support emulating
         access to guest private memory.  Practically speaking, private faults and
    -    emulation are already mutually exclusive, but there are edge cases upon
    -    edge cases where KVM can return RET_PF_EMULATE, and adding one last check
    -    to harden against weird, unexpected combinations is inexpensive.
     +    emulation are already mutually exclusive, but there are many flows that
    +    can result in KVM returning RET_PF_EMULATE, and adding one last check
    +    to harden against weird, unexpected combinations and/or KVM bugs is
    +    inexpensive.
     
         Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
         Signed-off-by: Sean Christopherson <seanjc@google.com>
    @@ arch/x86/kvm/mmu/mmu_internal.h: static inline int kvm_mmu_do_page_fault(struct
      	else
      		r = vcpu->arch.mmu->page_fault(vcpu, &fault);
      
    ++	/*
    ++	 * Not sure what's happening, but punt to userspace and hope that
    ++	 * they can fix it by changing memory to shared, or they can
    ++	 * provide a better error.
    ++	 */
     +	if (r == RET_PF_EMULATE && fault.is_private) {
    ++		pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
     +		kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
     +		return -EFAULT;
     +	}


==============================================================================
KVM: x86: Remove separate "bit" defines for page fault error code masks
- do not use ilog2

    @@ Commit message
         just to see which flag corresponds to which bit is quite annoying, as is
         having to define two macros just to add recognition of a new flag.
     
    -    Use ilog2() to derive the bit in permission_fault(), the one function that
    -    actually needs the bit number (it does clever shifting to manipulate flags
    -    in order to avoid conditional branches).
    +    Use ternary operator to derive the bit in permission_fault(), the one
    +    function that actually needs the bit number as part of clever shifting
    +    to avoid conditional branches.  Generally the compiler is able to turn
    +    it into a conditional move, and if not it's not really a big deal.
     
         No functional change intended.
     
    @@ arch/x86/kvm/mmu.h: static inline u8 permission_fault(struct kvm_vcpu *vcpu, str
      	u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
      	bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
     -	int index = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
    -+	int index = (pfec + (not_smap << ilog2(PFERR_RSVD_MASK))) >> 1;
    ++	int index = (pfec | (not_smap ? PFERR_RSVD_MASK : 0)) >> 1;
      	u32 errcode = PFERR_PRESENT_MASK;
      	bool fault;
      
     @@ arch/x86/kvm/mmu.h: static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
    + 		pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
      
      		/* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
    - 		offset = (pfec & ~1) +
    +-		offset = (pfec & ~1) +
     -			((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
    -+			((pte_access & PT_USER_MASK) << (ilog2(PFERR_RSVD_MASK) - PT_USER_SHIFT));
    ++		offset = (pfec & ~1) | ((pte_access & PT_USER_MASK) ? PFERR_RSVD_MASK : 0);
      
      		pkru_bits &= mmu->pkru_mask >> offset;
      		errcode |= -pkru_bits & PFERR_PK_MASK;
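
To see why the rewrite above is equivalent: PFERR_RSVD_MASK is
BIT(PFERR_RSVD_BIT), and pfec never has that bit set at this point, so
OR-ing the mask in matches adding the shifted flag.  A standalone sketch
(illustration only, constants redefined locally, not kernel code):

#include <assert.h>

#define PFERR_RSVD_BIT   3
#define PFERR_RSVD_MASK  (1ULL << PFERR_RSVD_BIT)

int main(void)
{
	/* pfec may carry P/W/U/F bits, but never the RSVD bit here. */
	for (unsigned int pfec = 0; pfec < 32; pfec++) {
		if (pfec & PFERR_RSVD_MASK)
			continue;
		for (int not_smap = 0; not_smap <= 1; not_smap++) {
			int idx_add = (pfec + (not_smap << PFERR_RSVD_BIT)) >> 1;
			int idx_or  = (int)((pfec | (not_smap ? PFERR_RSVD_MASK : 0)) >> 1);

			assert(idx_add == idx_or);
		}
	}
	return 0;
}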

==============================================================================
KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
- match commit message to bits defined in header file

    @@ Commit message
         Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
         guests, as specified by the APM:
     
    +     * Bit 31 (RMP):   Set to 1 if the fault was caused due to an RMP check or a
    +                       VMPL check failure, 0 otherwise.
          * Bit 34 (ENC):   Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
          * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
                            PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
          * Bit 36 (VMPL):  Set to 1 if the fault was caused by a VMPL permission
                            check failure, 0 otherwise.
    -     * Bit 37 (SSS):   Set to VMPL permission mask SSS (bit 4) value if VmplSSS is
    -                       enabled.
     
         Note, the APM is *extremely* misleading, and strongly implies that the
         above flags can _only_ be set for #NPF exits from SNP guests.  That is a


==============================================================================
KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
- add a description of PFERR_IMPLICIT_ACCESS

    @@ Commit message
         Add a compile-time assert in the legacy #PF handler to make sure that KVM-
         define flags are covered by its existing sanity check on the upper bits.
     
    +    Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we
    +    are removing the comment that defined it.
    +
         Signed-off-by: Sean Christopherson <seanjc@google.com>
    +    Reviewed-by: Kai Huang <kai.huang@intel.com>
    +    Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
         Message-ID: <20240228024147.41573-8-seanjc@google.com>
    +    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    +
    + ## arch/x86/include/asm/kvm_host.h ##
    +@@ arch/x86/include/asm/kvm_host.h: enum x86_intercept_stage;
    + #define PFERR_GUEST_ENC_MASK	BIT_ULL(34)
    + #define PFERR_GUEST_SIZEM_MASK	BIT_ULL(35)
    + #define PFERR_GUEST_VMPL_MASK	BIT_ULL(36)
    ++
    ++/*
    ++ * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP checks
     ++ * when emulating instructions that trigger implicit access.
    ++ */
    + #define PFERR_IMPLICIT_ACCESS	BIT_ULL(48)
    ++#define PFERR_SYNTHETIC_MASK	(PFERR_IMPLICIT_ACCESS)
    + 
    + #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK |	\
    + 				 PFERR_WRITE_MASK |		\
     
      ## arch/x86/kvm/mmu/mmu.c ##
     @@ arch/x86/kvm/mmu/mmu.c: int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
    - 	if (WARN_ON_ONCE(error_code >> 32))
    - 		error_code = lower_32_bits(error_code);
    + 		return -EFAULT;
    + #endif
      
     +	/* Ensure the above sanity check also covers KVM-defined flags. */
     +	BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK));

- move in front of the other synthetic page fault error code patches

==============================================================================
KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero
- commit message copy editing

    @@ Commit message
         and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
         32-bit field.
     
    -    Simply drop the upper bits of hardware provides garbage, as spurious
    +    Simply drop the upper bits if hardware provides garbage, as spurious
         information should do no harm (though in all likelihood hardware is buggy
         and the kernel is doomed).
     
    @@ Commit message
         which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
         PFERR_GUEST_ENC_MASK without running afoul of the sanity check.
     
    -    Note, this also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
    +    Note, this is also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
         and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
         bit minimizes the probability of a collision with the "other" vendor,
         without needing to plumb more bits through microcode.


==============================================================================
KVM: x86/mmu: Use synthetic page fault error code to indicate private faults
- patch reordering, no other changes

==============================================================================
KVM: x86/mmu: check for invalid async page faults involving private memory
- new patch coming from TDX/SNP prep series; test PFERR_PRIVATE_ACCESS, set arch.error_code

    @@ arch/x86/kvm/mmu/mmu.c: static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
      	arch.token = alloc_apf_token(vcpu);
     -	arch.gfn = gfn;
     +	arch.gfn = fault->gfn;
    ++	arch.error_code = fault->error_code;
      	arch.direct_map = vcpu->arch.mmu->root_role.direct;
      	arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
      
    @@ arch/x86/kvm/mmu/mmu.c: static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
      {
      	int r;
      
    -+	if (WARN_ON_ONCE(work->arch.error_code & PFERR_GUEST_ENC_MASK))
    ++	if (WARN_ON_ONCE(work->arch.error_code & PFERR_PRIVATE_ACCESS))
     +		return;
     +
      	if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) ||


==============================================================================
KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults
- exit to userspace if the wrong case happens, test PFERR_PRIVATE_ACCESS

    @@ Commit message
     
      ## arch/x86/kvm/mmu/mmu.c ##
     @@ arch/x86/kvm/mmu/mmu.c: int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
    - 		error_code |= PFERR_PRIVATE_ACCESS;
      
      	r = RET_PF_INVALID;
    --	if (unlikely(error_code & PFERR_RSVD_MASK)) {
    -+	if (unlikely((error_code & PFERR_RSVD_MASK) &&
    -+		     !WARN_ON_ONCE(error_code & PFERR_GUEST_ENC_MASK))) {
    + 	if (unlikely(error_code & PFERR_RSVD_MASK)) {
    ++		if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
    ++			return -EFAULT;
    ++
      		r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
      		if (r == RET_PF_EMULATE)
      			goto emulate;

==============================================================================
KVM: x86/mmu: Move private vs. shared check above slot validity checks
- add comment about use of mmu_invalidate_seq

    @@ arch/x86/kvm/mmu/mmu.c: static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, stru
      		return kvm_faultin_pfn_private(vcpu, fault);
      
     @@ arch/x86/kvm/mmu/mmu.c: static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
    + {
    + 	int ret;
    + 
    ++	/*
    ++	 * Note that the mmu_invalidate_seq also serves to detect a concurrent
    ++	 * change in attributes.  is_page_fault_stale() will detect an
     ++	 * invalidation related to fault->gfn and resume the guest without
    ++	 * installing a mapping in the page tables.
    ++	 */
      	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
      	smp_rmb();
      
     +	/*
    -+	 * Check for a private vs. shared mismatch *after* taking a snapshot of
    -+	 * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
    -+	 * invalidation notifier.
    ++	 * Now that we have a snapshot of mmu_invalidate_seq we can check for a
    ++	 * private vs. shared mismatch.
     +	 */
     +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
     +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);



==============================================================================
KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()
- differences in moved code, range-diff is unreadable


==============================================================================
KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()
- remove unnecessary change

    @@ arch/x86/kvm/mmu/mmu.c: static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct
      		return kvm_handle_noslot_fault(vcpu, fault, access);
      
      	/*
    -
    - ## arch/x86/kvm/mmu/mmu_internal.h ##
    -@@ arch/x86/kvm/mmu/mmu_internal.h: struct kvm_page_fault {
    - 	/* The memslot containing gfn. May be NULL. */
    - 	struct kvm_memory_slot *slot;
    - 
    --	/* Outputs of kvm_faultin_pfn.  */
    -+	/* Outputs of kvm_faultin_pfn. */
    - 	unsigned long mmu_seq;
    - 	kvm_pfn_t pfn;
    - 	hva_t hva;


Isaku Yamahata (1):
  KVM: x86/mmu: Pass full 64-bit error code when handling page faults

Paolo Bonzini (1):
  KVM: x86/mmu: check for invalid async page faults involving private
    memory

Sean Christopherson (15):
  KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits
    emulation
  KVM: x86: Remove separate "bit" defines for page fault error code
    masks
  KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
  KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler
  KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are
    non-zero
  KVM: x86/mmu: Use synthetic page fault error code to indicate private
    faults
  KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page
    faults
  KVM: x86/mmu: Move private vs. shared check above slot validity checks
  KVM: x86/mmu: Don't force emulation of L2 accesses to non-APIC
    internal slots
  KVM: x86/mmu: Explicitly disallow private accesses to emulated MMIO
  KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to
    kvm_faultin_pfn()
  KVM: x86/mmu: Handle no-slot faults at the beginning of
    kvm_faultin_pfn()
  KVM: x86/mmu: Set kvm_page_fault.hva to KVM_HVA_ERR_BAD for "no slot"
    faults
  KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
  KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create
    noslot pfns

 arch/x86/include/asm/kvm_host.h |  46 ++++----
 arch/x86/kvm/mmu.h              |   5 +-
 arch/x86/kvm/mmu/mmu.c          | 182 ++++++++++++++++++++------------
 arch/x86/kvm/mmu/mmu_internal.h |  28 ++++-
 arch/x86/kvm/mmu/mmutrace.h     |   2 +-
 arch/x86/kvm/svm/svm.c          |   9 ++
 6 files changed, 174 insertions(+), 98 deletions(-)

-- 
2.43.0


^ permalink raw reply	[relevance 3%]

* Re: [PATCH 2/3] x86/bus_lock: Add support for AMD
  2024-05-06 16:24  0%   ` Jim Mattson
@ 2024-05-07  3:57  0%     ` Ravi Bangoria
  0 siblings, 0 replies; 200+ results
From: Ravi Bangoria @ 2024-05-07  3:57 UTC (permalink / raw)
  To: Jim Mattson
  Cc: tglx, mingo, bp, dave.hansen, seanjc, pbonzini, thomas.lendacky,
	hpa, rmk+kernel, peterz, james.morse, lukas.bulwahn, arjan,
	j.granados, sibs, nik.borisov, michael.roth, nikunj.dadhania,
	babu.moger, x86, kvm, linux-kernel, santosh.shukla,
	ananth.narayan, sandipan.das, Ravi Bangoria

On 06-May-24 9:54 PM, Jim Mattson wrote:
> On Sun, Apr 28, 2024 at 11:08 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>>
>> Upcoming AMD uarch will support Bus Lock Detect (called Bus Lock Trap
>> in AMD docs). Add support for the same in Linux. Bus Lock Detect is
>> enumerated with cpuid CPUID Fn0000_0007_ECX_x0 bit [24 / BUSLOCKTRAP].
>> It can be enabled through MSR_IA32_DEBUGCTLMSR. When enabled, hardware
>> clears DR6[11] and raises a #DB exception on occurrence of Bus Lock if
>> CPL > 0. More detail about the feature can be found in AMD APM[1].
>>
>> [1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
>>      2023, Vol 2, 13.1.3.6 Bus Lock Trap
>>      https://bugzilla.kernel.org/attachment.cgi?id=304653
> 
> Is there any chance of getting something similar to Intel's "VMM
> bus-lock detection," which causes a trap-style VM-exit on a bus lock
> event?

You are probably asking about "Bus Lock Threshold". Please see
"15.14.5 Bus Lock Threshold" in the same doc. fwiw, Bus Lock Threshold
is of fault style.

Thanks,
Ravi

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 2/3] x86/bus_lock: Add support for AMD
  2024-04-29  6:06  4% ` [PATCH 2/3] x86/bus_lock: Add " Ravi Bangoria
  2024-05-06 16:05  0%   ` Tom Lendacky
@ 2024-05-06 16:24  0%   ` Jim Mattson
  2024-05-07  3:57  0%     ` Ravi Bangoria
  1 sibling, 1 reply; 200+ results
From: Jim Mattson @ 2024-05-06 16:24 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: tglx, mingo, bp, dave.hansen, seanjc, pbonzini, thomas.lendacky,
	hpa, rmk+kernel, peterz, james.morse, lukas.bulwahn, arjan,
	j.granados, sibs, nik.borisov, michael.roth, nikunj.dadhania,
	babu.moger, x86, kvm, linux-kernel, santosh.shukla,
	ananth.narayan, sandipan.das

On Sun, Apr 28, 2024 at 11:08 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
>
> Upcoming AMD uarch will support Bus Lock Detect (called Bus Lock Trap
> in AMD docs). Add support for the same in Linux. Bus Lock Detect is
> enumerated with cpuid CPUID Fn0000_0007_ECX_x0 bit [24 / BUSLOCKTRAP].
> It can be enabled through MSR_IA32_DEBUGCTLMSR. When enabled, hardware
> clears DR6[11] and raises a #DB exception on occurrence of Bus Lock if
> CPL > 0. More detail about the feature can be found in AMD APM[1].
>
> [1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
>      2023, Vol 2, 13.1.3.6 Bus Lock Trap
>      https://bugzilla.kernel.org/attachment.cgi?id=304653

Is there any chance of getting something similar to Intel's "VMM
bus-lock detection," which causes a trap-style VM-exit on a bus lock
event?

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 2/3] x86/bus_lock: Add support for AMD
  2024-04-29  6:06  4% ` [PATCH 2/3] x86/bus_lock: Add " Ravi Bangoria
@ 2024-05-06 16:05  0%   ` Tom Lendacky
  2024-05-06 16:24  0%   ` Jim Mattson
  1 sibling, 0 replies; 200+ results
From: Tom Lendacky @ 2024-05-06 16:05 UTC (permalink / raw)
  To: Ravi Bangoria, tglx, mingo, bp, dave.hansen, seanjc, pbonzini
  Cc: hpa, rmk+kernel, peterz, james.morse, lukas.bulwahn, arjan,
	j.granados, sibs, nik.borisov, michael.roth, nikunj.dadhania,
	babu.moger, x86, kvm, linux-kernel, santosh.shukla,
	ananth.narayan, sandipan.das

On 4/29/24 01:06, Ravi Bangoria wrote:
> Upcoming AMD uarch will support Bus Lock Detect (called Bus Lock Trap
> in AMD docs). Add support for the same in Linux. Bus Lock Detect is
> enumerated with cpuid CPUID Fn0000_0007_ECX_x0 bit [24 / BUSLOCKTRAP].
> It can be enabled through MSR_IA32_DEBUGCTLMSR. When enabled, hardware
> clears DR6[11] and raises a #DB exception on occurrence of Bus Lock if
> CPL > 0. More detail about the feature can be found in AMD APM[1].
> 
> [1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
>       2023, Vol 2, 13.1.3.6 Bus Lock Trap
>       https://bugzilla.kernel.org/attachment.cgi?id=304653
> 
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> ---
>   arch/x86/kernel/cpu/amd.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 39f316d50ae4..013d16479a24 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -1058,6 +1058,8 @@ static void init_amd(struct cpuinfo_x86 *c)
>   
>   	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
>   	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
> +
> +	bus_lock_init();

Can this call and the one in the intel.c be moved to common.c?

Thanks,
Tom

>   }
>   
>   #ifdef CONFIG_X86_32

^ permalink raw reply	[relevance 0%]

* [PATCH v2 2/5] KVM: SVM: Add Idle HLT intercept support
  @ 2024-05-01 14:54  3% ` Manali Shukla
  0 siblings, 0 replies; 200+ results
From: Manali Shukla @ 2024-05-01 14:54 UTC (permalink / raw)
  To: kvm, linux-kselftest
  Cc: pbonzini, seanjc, shuah, nikunj, thomas.lendacky, vkuznets,
	manali.shukla, bp

From: Manali Shukla <Manali.Shukla@amd.com>

Execution of the HLT instruction by a vCPU can be intercepted by the
hypervisor by setting the HLT-Intercept Bit in VMCB, thus resulting in
a VMEXIT. It is possible that, soon after the VMEXIT, the hypervisor
observes pending V_INTR and V_NMI events for the vCPU and performs a
VMRUN to service those events. In that case the VMEXIT is wasteful.

The Idle HLT intercept feature allows HLT execution by a vCPU to be
intercepted by the hypervisor only if there are no pending V_INTR and
V_NMI events for the vCPU. The Idle HLT intercept is not triggered when
the vCPU is expected to have pending events (V_INTR or V_NMI).

The feature lets the hypervisor determine whether the vCPU is actually
idle, and thereby reduces wasteful VMEXITs.

Details about Idle HLT intercept can be found in AMD APM [1].

[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April
     2024, Vol 2, 15.9 Instruction Intercepts (Table 15-7: IDLE_HLT).
     https://bugzilla.kernel.org/attachment.cgi?id=306250
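
As a conceptual sketch only (illustrative names, not actual KVM or
hardware interfaces), the difference between the legacy HLT intercept
and the Idle HLT intercept amounts to:

#include <stdbool.h>

/* Sketch of the intercept behavior described above, not real code. */
static bool hlt_exits_to_hypervisor(bool idle_hlt_intercept, bool hlt_intercept,
				    bool v_intr_pending, bool v_nmi_pending)
{
	if (hlt_intercept)
		return true;	/* legacy HLT intercept: always VMEXIT */

	if (idle_hlt_intercept)
		/* Idle HLT intercept: VMEXIT only when the vCPU is truly idle */
		return !v_intr_pending && !v_nmi_pending;

	return false;		/* no intercept: HLT handled inside the guest */
}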

Signed-off-by: Manali Shukla <Manali.Shukla@amd.com>
---
 arch/x86/include/asm/svm.h      |  1 +
 arch/x86/include/uapi/asm/svm.h |  2 ++
 arch/x86/kvm/svm/svm.c          | 11 ++++++++---
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 728c98175b9c..3a91928a4060 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -116,6 +116,7 @@ enum {
 	INTERCEPT_INVPCID,
 	INTERCEPT_MCOMMIT,
 	INTERCEPT_TLBSYNC,
+	INTERCEPT_IDLE_HLT = 166,
 };
 
 
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 80e1df482337..9910f86a2cef 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -95,6 +95,7 @@
 #define SVM_EXIT_CR14_WRITE_TRAP		0x09e
 #define SVM_EXIT_CR15_WRITE_TRAP		0x09f
 #define SVM_EXIT_INVPCID       0x0a2
+#define SVM_EXIT_IDLE_HLT      0x0a6
 #define SVM_EXIT_NPF           0x400
 #define SVM_EXIT_AVIC_INCOMPLETE_IPI		0x401
 #define SVM_EXIT_AVIC_UNACCELERATED_ACCESS	0x402
@@ -223,6 +224,7 @@
 	{ SVM_EXIT_CR4_WRITE_TRAP,	"write_cr4_trap" }, \
 	{ SVM_EXIT_CR8_WRITE_TRAP,	"write_cr8_trap" }, \
 	{ SVM_EXIT_INVPCID,     "invpcid" }, \
+	{ SVM_EXIT_IDLE_HLT,     "idle-halt" }, \
 	{ SVM_EXIT_NPF,         "npf" }, \
 	{ SVM_EXIT_AVIC_INCOMPLETE_IPI,		"avic_incomplete_ipi" }, \
 	{ SVM_EXIT_AVIC_UNACCELERATED_ACCESS,   "avic_unaccelerated_access" }, \
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0f3b59da0d4a..223c670bf986 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1289,8 +1289,12 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 		svm_set_intercept(svm, INTERCEPT_MWAIT);
 	}
 
-	if (!kvm_hlt_in_guest(vcpu->kvm))
-		svm_set_intercept(svm, INTERCEPT_HLT);
+	if (!kvm_hlt_in_guest(vcpu->kvm)) {
+		if (cpu_feature_enabled(X86_FEATURE_IDLE_HLT))
+			svm_set_intercept(svm, INTERCEPT_IDLE_HLT);
+		else
+			svm_set_intercept(svm, INTERCEPT_HLT);
+	}
 
 	control->iopm_base_pa = __sme_set(iopm_base);
 	control->msrpm_base_pa = __sme_set(__pa(svm->msrpm));
@@ -3291,6 +3295,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[SVM_EXIT_CR4_WRITE_TRAP]		= cr_trap,
 	[SVM_EXIT_CR8_WRITE_TRAP]		= cr_trap,
 	[SVM_EXIT_INVPCID]                      = invpcid_interception,
+	[SVM_EXIT_IDLE_HLT]			= kvm_emulate_halt,
 	[SVM_EXIT_NPF]				= npf_interception,
 	[SVM_EXIT_RSM]                          = rsm_interception,
 	[SVM_EXIT_AVIC_INCOMPLETE_IPI]		= avic_incomplete_ipi_interception,
@@ -3453,7 +3458,7 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
 		return interrupt_window_interception(vcpu);
 	else if (exit_code == SVM_EXIT_INTR)
 		return intr_interception(vcpu);
-	else if (exit_code == SVM_EXIT_HLT)
+	else if (exit_code == SVM_EXIT_HLT || exit_code == SVM_EXIT_IDLE_HLT)
 		return kvm_emulate_halt(vcpu);
 	else if (exit_code == SVM_EXIT_NPF)
 		return npf_interception(vcpu);
-- 
2.34.1


^ permalink raw reply related	[relevance 3%]

* [PATCH v15 05/20] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
  @ 2024-05-01  8:51  9% ` Michael Roth
  2024-05-01  8:52  2% ` [PATCH v15 11/20] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  2024-05-01  8:52  2% ` [PATCH v15 19/20] KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event Michael Roth
  2 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-05-01  8:51 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. Other commands can then at that point be
used to load/encrypt data into the guest's initial launch image.

For more information see the SEV-SNP specification.
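
As a rough usage sketch (assuming headers from this series, a VM already
initialized as an SNP guest via KVM_SEV_INIT2, and an open /dev/sev
descriptor in sev_fd; error handling omitted), a VMM would issue the
command along these lines:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Guest policy bits per the SEV-SNP firmware ABI ("Guest Policy"). */
#define SNP_POLICY_SMT		(1ULL << 16)
#define SNP_POLICY_RSVD_MBO	(1ULL << 17)

static int snp_launch_start(int vm_fd, int sev_fd)
{
	struct kvm_sev_snp_launch_start start = {
		/* This patch requires RSVD_MBO and SMT; SINGLE_SOCKET must be 0. */
		.policy = SNP_POLICY_RSVD_MBO | SNP_POLICY_SMT,
		/* .gosvw and .flags left zeroed */
	};
	struct kvm_sev_cmd cmd = {
		.id     = KVM_SEV_SNP_LAUNCH_START,
		.data   = (unsigned long)&start,
		.sev_fd = sev_fd,
	};

	return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}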

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 .../virt/kvm/x86/amd-memory-encryption.rst    |  28 ++-
 arch/x86/include/uapi/asm/kvm.h               |  11 ++
 arch/x86/kvm/svm/sev.c                        | 176 +++++++++++++++++-
 arch/x86/kvm/svm/svm.h                        |   1 +
 4 files changed, 212 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 9677a0714a39..dd179e162a87 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -466,6 +466,30 @@ issued by the hypervisor to make the guest ready for execution.
 
 Returns: 0 on success, -negative on error
 
+18. KVM_SEV_SNP_LAUNCH_START
+----------------------------
+
+The KVM_SEV_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest. It must be called prior to issuing
+KVM_SEV_SNP_LAUNCH_UPDATE or KVM_SEV_SNP_LAUNCH_FINISH.
+
+Parameters (in): struct  kvm_sev_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_sev_snp_launch_start {
+                __u64 policy;           /* Guest policy to use. */
+                __u8 gosvw[16];         /* Guest OS visible workarounds. */
+                __u16 flags;            /* Must be zero. */
+                __u8 pad0[6];
+                __u64 pad1[4];
+        };
+
+See SNP_LAUNCH_START in the SEV-SNP specification [snp-fw-abi]_ for further
+details on the input parameters in ``struct kvm_sev_snp_launch_start``.
+
 Device attribute API
 ====================
 
@@ -497,9 +521,11 @@ References
 ==========
 
 
-See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
+See [white-paper]_, [api-spec]_, [amd-apm]_, [kvm-forum]_, and [snp-fw-abi]_
+for more info.
 
 .. [white-paper] https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
 .. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
 .. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
 .. [kvm-forum]  https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+.. [snp-fw-abi] https://www.amd.com/system/files/TechDocs/56860.pdf
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index d2ae5fcc0275..693a80ffe40a 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -697,6 +697,9 @@ enum sev_cmd_id {
 	/* Second time is the charm; improved versions of the above ioctls.  */
 	KVM_SEV_INIT2,
 
+	/* SNP-specific commands */
+	KVM_SEV_SNP_LAUNCH_START = 100,
+
 	KVM_SEV_NR_MAX,
 };
 
@@ -824,6 +827,14 @@ struct kvm_sev_receive_update_data {
 	__u32 pad2;
 };
 
+struct kvm_sev_snp_launch_start {
+	__u64 policy;
+	__u8 gosvw[16];
+	__u16 flags;
+	__u8 pad0[6];
+	__u64 pad1[4];
+};
+
 #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
 #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index be831e2c06eb..4676ce171aaa 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -25,6 +25,7 @@
 #include <asm/fpu/xcr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/debugreg.h>
+#include <asm/sev.h>
 
 #include "mmu.h"
 #include "x86.h"
@@ -59,6 +60,21 @@ static u64 sev_supported_vmsa_features;
 #define AP_RESET_HOLD_NAE_EVENT		1
 #define AP_RESET_HOLD_MSR_PROTO		2
 
+/* As defined by SEV-SNP Firmware ABI, under "Guest Policy". */
+#define SNP_POLICY_MASK_API_MINOR	GENMASK_ULL(7, 0)
+#define SNP_POLICY_MASK_API_MAJOR	GENMASK_ULL(15, 8)
+#define SNP_POLICY_MASK_SMT		BIT_ULL(16)
+#define SNP_POLICY_MASK_RSVD_MBO	BIT_ULL(17)
+#define SNP_POLICY_MASK_DEBUG		BIT_ULL(19)
+#define SNP_POLICY_MASK_SINGLE_SOCKET	BIT_ULL(20)
+
+#define SNP_POLICY_MASK_VALID		(SNP_POLICY_MASK_API_MINOR	| \
+					 SNP_POLICY_MASK_API_MAJOR	| \
+					 SNP_POLICY_MASK_SMT		| \
+					 SNP_POLICY_MASK_RSVD_MBO	| \
+					 SNP_POLICY_MASK_DEBUG		| \
+					 SNP_POLICY_MASK_SINGLE_SOCKET)
+
 static u8 sev_enc_bit;
 static DECLARE_RWSEM(sev_deactivate_lock);
 static DEFINE_MUTEX(sev_bitmap_lock);
@@ -69,6 +85,8 @@ static unsigned int nr_asids;
 static unsigned long *sev_asid_bitmap;
 static unsigned long *sev_reclaim_asid_bitmap;
 
+static int snp_decommission_context(struct kvm *kvm);
+
 struct enc_region {
 	struct list_head list;
 	unsigned long npages;
@@ -95,12 +113,17 @@ static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid)
 	down_write(&sev_deactivate_lock);
 
 	wbinvd_on_all_cpus();
-	ret = sev_guest_df_flush(&error);
+
+	if (sev_snp_enabled)
+		ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
+	else
+		ret = sev_guest_df_flush(&error);
 
 	up_write(&sev_deactivate_lock);
 
 	if (ret)
-		pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
+		pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+		       sev_snp_enabled ? "-SNP" : "", ret, error);
 
 	return ret;
 }
@@ -1998,6 +2021,106 @@ int sev_dev_get_attr(u32 group, u64 attr, u64 *val)
 	}
 }
 
+/*
+ * The guest context contains all the information, keys and metadata
+ * associated with the guest that the firmware tracks to implement SEV
+ * and SNP features. The firmware stores the guest context in a
+ * hypervisor-provided page via the SNP_GCTX_CREATE command.
+ */
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct sev_data_snp_addr data = {};
+	void *context;
+	int rc;
+
+	/* Allocate memory for context page */
+	context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+	if (!context)
+		return NULL;
+
+	data.address = __psp_pa(context);
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+	if (rc) {
+		pr_warn("Failed to create SEV-SNP context, rc %d fw_error %d",
+			rc, argp->error);
+		snp_free_firmware_page(context);
+		return NULL;
+	}
+
+	return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_activate data = {0};
+
+	data.gctx_paddr = __psp_pa(sev->snp_context);
+	data.asid = sev_get_asid(kvm);
+	return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_start start = {0};
+	struct kvm_sev_snp_launch_start params;
+	int rc;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+		return -EFAULT;
+
+	/* Don't allow userspace to allocate memory for more than 1 SNP context. */
+	if (sev->snp_context)
+		return -EINVAL;
+
+	sev->snp_context = snp_context_create(kvm, argp);
+	if (!sev->snp_context)
+		return -ENOTTY;
+
+	if (params.flags)
+		return -EINVAL;
+
+	if (params.policy & ~SNP_POLICY_MASK_VALID)
+		return -EINVAL;
+
+	/* Check for policy bits that must be set */
+	if (!(params.policy & SNP_POLICY_MASK_RSVD_MBO) ||
+	    !(params.policy & SNP_POLICY_MASK_SMT))
+		return -EINVAL;
+
+	if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET)
+		return -EINVAL;
+
+	start.gctx_paddr = __psp_pa(sev->snp_context);
+	start.policy = params.policy;
+	memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+	if (rc) {
+		pr_debug("%s: SEV_CMD_SNP_LAUNCH_START firmware command failed, rc %d\n",
+			 __func__, rc);
+		goto e_free_context;
+	}
+
+	sev->fd = argp->sev_fd;
+	rc = snp_bind_asid(kvm, &argp->error);
+	if (rc) {
+		pr_debug("%s: Failed to bind ASID to SEV-SNP context, rc %d\n",
+			 __func__, rc);
+		goto e_free_context;
+	}
+
+	return 0;
+
+e_free_context:
+	snp_decommission_context(kvm);
+
+	return rc;
+}
+
 int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -2021,6 +2144,15 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 		goto out;
 	}
 
+	/*
+	 * Once KVM_SEV_INIT2 initializes a KVM instance as an SNP guest, only
+	 * allow the use of SNP-specific commands.
+	 */
+	if (sev_snp_guest(kvm) && sev_cmd.id < KVM_SEV_SNP_LAUNCH_START) {
+		r = -EPERM;
+		goto out;
+	}
+
 	switch (sev_cmd.id) {
 	case KVM_SEV_ES_INIT:
 		if (!sev_es_enabled) {
@@ -2085,6 +2217,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_RECEIVE_FINISH:
 		r = sev_receive_finish(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_START:
+		r = snp_launch_start(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -2280,6 +2415,31 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
 	return ret;
 }
 
+static int snp_decommission_context(struct kvm *kvm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_addr data = {};
+	int ret;
+
+	/* If context is not created then do nothing */
+	if (!sev->snp_context)
+		return 0;
+
+	/* Do the decommission, which will unbind the ASID from the SNP context */
+	data.address = __sme_pa(sev->snp_context);
+	down_write(&sev_deactivate_lock);
+	ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
+	up_write(&sev_deactivate_lock);
+
+	if (WARN_ONCE(ret, "Failed to release guest context, ret %d", ret))
+		return ret;
+
+	snp_free_firmware_page(sev->snp_context);
+	sev->snp_context = NULL;
+
+	return 0;
+}
+
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -2321,7 +2481,17 @@ void sev_vm_destroy(struct kvm *kvm)
 		}
 	}
 
-	sev_unbind_asid(kvm, sev->handle);
+	if (sev_snp_guest(kvm)) {
+		/*
+		 * Decommission handles unbinding of the ASID. If it fails for
+		 * some unexpected reason, just leak the ASID.
+		 */
+		if (snp_decommission_context(kvm))
+			return;
+	} else {
+		sev_unbind_asid(kvm, sev->handle);
+	}
+
 	sev_asid_free(sev);
 }
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 1407acf45a23..d175059fa7c8 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -93,6 +93,7 @@ struct kvm_sev_info {
 	struct list_head mirror_entry; /* Use as a list entry of mirrors */
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
 	atomic_t migration_in_progress;
+	void *snp_context;      /* SNP guest context page */
 };
 
 struct kvm_svm {
-- 
2.25.1


^ permalink raw reply related	[relevance 9%]

* [PATCH v15 19/20] KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
    2024-05-01  8:51  9% ` [PATCH v15 05/20] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
  2024-05-01  8:52  2% ` [PATCH v15 11/20] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
@ 2024-05-01  8:52  2% ` Michael Roth
  2 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-05-01  8:52 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick

Version 2 of the GHCB specification added support for the SNP Extended
Guest Request Message NAE event. This event serves a nearly identical
purpose to the previously-added SNP_GUEST_REQUEST event, but additionally
allows certificate data to be supplied via a guest-provided buffer, used
mainly for verifying the signature of an attestation report returned by
firmware.

This certificate data is supplied by userspace, so unlike with
SNP_GUEST_REQUEST events, SNP_EXTENDED_GUEST_REQUEST events are first
forwarded to userspace via a KVM_EXIT_VMGEXIT exit structure, and then
the firmware request is made after the certificate data has been fetched
from userspace.

Since there is a potential for race conditions where the
userspace-supplied certificate data may be out-of-sync relative to the
reported TCB or VLEK that firmware will use when signing attestation
reports, a hook is also provided so that userspace can be informed once
the attestation request is actually completed. See the updates to
Documentation/ for more details on these aspects.
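
A rough sketch of the userspace side of this flow, using the structures
and constants added by this patch; certs_lock()/certs_unlock(), the
certs_blob*() accessors, and gpa_to_hva() are hypothetical VMM helpers,
and the caller is assumed to have checked exit_reason/type already:

/* Sketch only: called for KVM_EXIT_VMGEXIT / KVM_USER_VMGEXIT_REQ_CERTS. */
static void handle_req_certs(struct kvm_run *run)
{
	struct kvm_user_vmgexit *vmg = &run->vmgexit;

	if (vmg->req_certs.status == KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE) {
		/* Second exit: the attestation request has completed. */
		certs_unlock();					/* hypothetical */
		return;
	}

	certs_lock();						/* hypothetical */

	if (certs_blob_npages() > vmg->req_certs.data_npages) {	/* hypothetical */
		vmg->req_certs.data_npages = certs_blob_npages();
		vmg->req_certs.ret = KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN;
		certs_unlock();
		return;
	}

	/* Copy the certificate blob into the guest-shared buffer. */
	memcpy(gpa_to_hva(vmg->req_certs.data_gpa),		/* hypothetical */
	       certs_blob(), certs_blob_size());		/* hypothetical */
	vmg->req_certs.ret = 0;
	/* Ask for a STATUS_DONE notification so the lock can be dropped then. */
	vmg->req_certs.flags |= KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE;
}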

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 Documentation/virt/kvm/api.rst | 87 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/sev.c         | 86 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h         |  3 ++
 include/uapi/linux/kvm.h       | 23 +++++++++
 4 files changed, 199 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..f3780ac98d56 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7060,6 +7060,93 @@ Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
+::
+
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit {
+  #define KVM_USER_VMGEXIT_REQ_CERTS		1
+			__u32 type; /* KVM_USER_VMGEXIT_* type */
+			union {
+				struct {
+					__u64 data_gpa;
+					__u64 data_npages;
+  #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN   1
+  #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_BUSY          2
+  #define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_GENERIC       (1 << 31)
+					__u32 ret;
+  #define KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE	BIT(0)
+					__u8 flags;
+  #define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING	0
+  #define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE		1
+					__u8 status;
+				} req_certs;
+			};
+		};
+
+
+If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
+has issued a VMGEXIT instruction (as documented by the AMD Architecture
+Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
+userspace. These are generally handled by the host kernel, but in some
+cases some aspects of handling a VMGEXIT are done in userspace.
+
+A kvm_user_vmgexit structure is defined to encapsulate the data to be
+sent to or returned by userspace. The type field defines the specific type
+of exit that needs to be serviced, and that type is used as a discriminator
+to determine which union type should be used for input/output.
+
+KVM_USER_VMGEXIT_REQ_CERTS
+--------------------------
+
+When an SEV-SNP guest issues a request for an attestation report, it has the
+option of issuing it in the form of an *extended* guest request, in which a
+certificate blob is returned alongside the attestation report so the guest
+can validate the endorsement key used by SNP firmware to sign the report.
+These certificates are managed by userspace and are requested via
+KVM_EXIT_VMGEXITs using the KVM_USER_VMGEXIT_REQ_CERTS type.
+
+For the KVM_USER_VMGEXIT_REQ_CERTS type, the req_certs union type
+is used. The kernel will supply in 'data_gpa' the value the guest supplies
+via the RAX field of the GHCB when issuing extended guest requests.
+'data_npages' will similarly contain the value the guest supplies in RBX
+denoting the number of shared pages available to write the certificate
+data into.
+
+  - If the supplied number of pages is sufficient, userspace should write
+    the certificate data blob (in the format defined by the GHCB spec) in
+    the address indicated by 'data_gpa' and set 'ret' to 0.
+
+  - If the number of pages supplied is not sufficient, userspace must write
+    the required number of pages in 'data_npages' and then set 'ret' to 1.
+
+  - If userspace is temporarily unable to handle the request, 'ret' should
+    be set to 2 to inform the guest to retry later.
+
+  - If some other error occurred, userspace should set 'ret' to a non-zero
+    value that is distinct from the specific return values mentioned above.
+
+Generally some care needs to be taken to keep the returned certificate data in
+sync with the actual endorsement key in use by firmware at the time the
+attestation request is sent to SNP firmware. The recommended scheme to do
+this is for the VMM to obtain a shared or exclusive lock on the path the
+certificate blob file resides at before reading it and returning it to KVM,
+and that it continues to hold the lock until the attestation request is
+actually sent to firmware. To facilitate this, the VMM can set the
+KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE flag before returning the
+certificate blob, in which case another KVM_EXIT_VMGEXIT of type
+KVM_USER_VMGEXIT_REQ_CERTS will be sent to userspace with
+KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE being set in the status field to
+indicate the request is fully-completed and that any associated locks can be
+released.
+
+Tools/libraries that perform updates to SNP firmware TCB values or endorsement
+keys (e.g. firmware interfaces such as SNP_COMMIT, SNP_SET_CONFIG, or
+SNP_VLEK_LOAD, see Documentation/virt/coco/sev-guest.rst for more details) in
+such a way that the certificate blob needs to be updated, should similarly
+take an exclusive lock on the certificate blob for the duration of any updates
+to firmware or the certificate blob contents to ensure that VMMs using the
+above scheme will not return certificate blob data that is out of sync with
+firmware.
 
 6. Capabilities that can be enabled on vCPUs
 ============================================
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 5c6262f3232f..35f0bd91f92e 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3297,6 +3297,11 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 		if (!sev_snp_guest(vcpu->kvm))
 			goto vmgexit_err;
 		break;
+	case SVM_VMGEXIT_EXT_GUEST_REQUEST:
+		if (!sev_snp_guest(vcpu->kvm) || !kvm_ghcb_rax_is_valid(svm) ||
+		    !kvm_ghcb_rbx_is_valid(svm))
+			goto vmgexit_err;
+		break;
 	default:
 		reason = GHCB_ERR_INVALID_EVENT;
 		goto vmgexit_err;
@@ -3988,6 +3993,84 @@ static void snp_handle_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp
 	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
 }
 
+static int snp_complete_ext_guest_req(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	struct vmcb_control_area *control;
+	struct kvm *kvm = vcpu->kvm;
+	sev_ret_code fw_err = 0;
+	int vmm_ret;
+
+	vmm_ret = vcpu->run->vmgexit.req_certs.ret;
+	if (vmm_ret) {
+		if (vmm_ret == SNP_GUEST_VMM_ERR_INVALID_LEN)
+			vcpu->arch.regs[VCPU_REGS_RBX] =
+				vcpu->run->vmgexit.req_certs.data_npages;
+		goto out;
+	}
+
+	/*
+	 * The request was completed on the previous completion callback and
+	 * this completion is only for the STATUS_DONE userspace notification.
+	 */
+	if (vcpu->run->vmgexit.req_certs.status == KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE)
+		goto out_resume;
+
+	control = &svm->vmcb->control;
+
+	if (__snp_handle_guest_req(kvm, control->exit_info_1,
+				   control->exit_info_2, &fw_err))
+		vmm_ret = SNP_GUEST_VMM_ERR_GENERIC;
+
+out:
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
+
+	if (vcpu->run->vmgexit.req_certs.flags & KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE) {
+		vcpu->run->vmgexit.req_certs.status = KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE;
+		vcpu->run->vmgexit.req_certs.flags = 0;
+		return 0; /* notify userspace of completion */
+	}
+
+out_resume:
+	return 1; /* resume guest */
+}
+
+static int snp_begin_ext_guest_req(struct kvm_vcpu *vcpu)
+{
+	int vmm_ret = SNP_GUEST_VMM_ERR_GENERIC;
+	struct vcpu_svm *svm = to_svm(vcpu);
+	unsigned long data_npages;
+	sev_ret_code fw_err;
+	gpa_t data_gpa;
+
+	if (!sev_snp_guest(vcpu->kvm))
+		goto abort_request;
+
+	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
+	data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
+
+	if (!IS_ALIGNED(data_gpa, PAGE_SIZE))
+		goto abort_request;
+
+	/*
+	 * Grab the certificates from userspace so they can be bundled with
+	 * attestation/guest requests.
+	 */
+	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_REQ_CERTS;
+	vcpu->run->vmgexit.req_certs.data_gpa = data_gpa;
+	vcpu->run->vmgexit.req_certs.data_npages = data_npages;
+	vcpu->run->vmgexit.req_certs.flags = 0;
+	vcpu->run->vmgexit.req_certs.status = KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING;
+	vcpu->arch.complete_userspace_io = snp_complete_ext_guest_req;
+
+	return 0; /* forward request to userspace */
+
+abort_request:
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(vmm_ret, fw_err));
+	return 1; /* resume guest */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -4266,6 +4349,9 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 		snp_handle_guest_req(svm, control->exit_info_1, control->exit_info_2);
 		ret = 1;
 		break;
+	case SVM_VMGEXIT_EXT_GUEST_REQUEST:
+		ret = snp_begin_ext_guest_req(vcpu);
+		break;
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e325ede0f463..83c562b4712a 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -309,6 +309,9 @@ struct vcpu_svm {
 
 	/* Guest GIF value, used when vGIF is not enabled */
 	bool guest_gif;
+
+	/* Transaction ID associated with SNP config updates */
+	u64 snp_transaction_id;
 };
 
 struct svm_cpu_data {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..106367d87189 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -135,6 +135,26 @@ struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_user_vmgexit {
+#define KVM_USER_VMGEXIT_REQ_CERTS		1
+	__u32 type; /* KVM_USER_VMGEXIT_* type */
+	union {
+		struct {
+			__u64 data_gpa;
+			__u64 data_npages;
+#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_INVALID_LEN   1
+#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_BUSY          2
+#define KVM_USER_VMGEXIT_REQ_CERTS_ERROR_GENERIC       (1 << 31)
+			__u32 ret;
+#define KVM_USER_VMGEXIT_REQ_CERTS_FLAGS_NOTIFY_DONE	BIT(0)
+			__u8 flags;
+#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_PENDING	0
+#define KVM_USER_VMGEXIT_REQ_CERTS_STATUS_DONE		1
+			__u8 status;
+		} req_certs;
+	};
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -178,6 +198,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_VMGEXIT          40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -433,6 +454,8 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit vmgexit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1


^ permalink raw reply related	[relevance 2%]

* [PATCH v15 11/20] KVM: SEV: Add support to handle RMP nested page faults
    2024-05-01  8:51  9% ` [PATCH v15 05/20] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
@ 2024-05-01  8:52  2% ` Michael Roth
  2024-05-01  8:52  2% ` [PATCH v15 19/20] KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event Michael Roth
  2 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-05-01  8:52 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When
hardware encounters an RMP check failure caused by a guest memory access,
it raises a #NPF. The error code contains additional information on
the access type. See the APM volume 2 for additional information.

When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.
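
A condensed sketch of that in-kernel handling policy, using the helpers
this series relies on; the complete version, with error and race
handling, is sev_handle_rmp_fault() in this patch:

/* Sketch only: split a 2MB RMP entry when the mapping granularity is smaller. */
static void rmp_fault_sketch(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
{
	bool assigned;
	int rmp_level;

	if (snp_lookup_rmpentry(pfn, &assigned, &rmp_level) || !assigned)
		return;			/* nothing for the kernel to fix up */

	if (rmp_level == PG_LEVEL_4K)
		return;			/* already split: spurious/racing fault */

	/* Split the 2MB RMP entry, then zap the stale 2MB NPT range. */
	psmash(ALIGN_DOWN(pfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)));
	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
}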

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/include/asm/sev.h      |   3 +
 arch/x86/kvm/mmu.h              |   2 -
 arch/x86/kvm/mmu/mmu.c          |   1 +
 arch/x86/kvm/svm/sev.c          | 109 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  21 ++++--
 arch/x86/kvm/svm/svm.h          |   3 +
 arch/x86/kvm/trace.h            |  31 +++++++++
 arch/x86/kvm/x86.c              |   1 +
 9 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 87265b73906a..8a414fc972f8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1944,6 +1944,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 7f57382afee4..3a06f06b847a 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -91,6 +91,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 /* RMUPDATE detected 4K page and 2MB page overlap. */
 #define RMPUPDATE_FAIL_OVERLAP		4
 
+/* PSMASH failed due to concurrent access by another CPU */
+#define PSMASH_FAIL_INUSE		3
+
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
 #define RMP_PG_SIZE_2M			1
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2343c9f00e31..e3cb35b9396d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -251,8 +251,6 @@ static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
 	return __kvm_mmu_honors_guest_mtrrs(kvm_arch_has_noncoherent_dma(kvm));
 }
 
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0d556da052f6..de35dee25bf6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6758,6 +6758,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 
 	return need_tlb_flush;
 }
+EXPORT_SYMBOL_GPL(kvm_zap_gfn_range);
 
 static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
 					   const struct kvm_memory_slot *slot)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 70b8f4cd1b03..7f5ddd92113e 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3465,6 +3465,23 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(kvm_pfn_t pfn)
+{
+	int ret;
+
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	/*
+	 * PSMASH_FAIL_INUSE indicates another processor is modifying the
+	 * entry, so retry until that's no longer the case.
+	 */
+	do {
+		ret = psmash(pfn);
+	} while (ret == PSMASH_FAIL_INUSE);
+
+	return ret;
+}
+
 static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4221,3 +4238,95 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return p;
 }
+
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = vcpu->kvm;
+	int order, rmp_level, ret;
+	bool assigned;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	gfn = gpa >> PAGE_SHIFT;
+
+	/*
+	 * The only time RMP faults occur for shared pages is when the guest is
+	 * triggering an RMP fault for an implicit page-state change from
+	 * shared->private. Implicit page-state changes are forwarded to
+	 * userspace via KVM_EXIT_MEMORY_FAULT events, however, so RMP faults
+	 * for shared pages should not end up here.
+	 */
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault for non-private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);
+	if (ret) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no backing page for private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+	if (ret || !assigned) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
+				    gpa, pfn, ret);
+		goto out_no_trace;
+	}
+
+	/*
+	 * There are 2 cases where a PSMASH may be needed to resolve an #NPF
+	 * with PFERR_GUEST_RMP_BIT set:
+	 *
+	 * 1) RMPADJUST/PVALIDATE can trigger an #NPF with PFERR_GUEST_SIZEM
+	 *    bit set if the guest issues them with a smaller granularity than
+	 *    what is indicated by the page-size bit in the 2MB RMP entry for
+	 *    the PFN that backs the GPA.
+	 *
+	 * 2) Guest access via NPT can trigger an #NPF if the NPT mapping is
+	 *    smaller than what is indicated by the 2MB RMP entry for the PFN
+	 *    that backs the GPA.
+	 *
+	 * In both these cases, the corresponding 2M RMP entry needs to
+	 * be PSMASH'd to 512 4K RMP entries.  If the RMP entry is already
+	 * split into 4K RMP entries, then this is likely a spurious case which
+	 * can occur when there are concurrent accesses by the guest to a 2MB
+	 * GPA range that is backed by a 2MB-aligned PFN whose RMP entry is in
+	 * the process of being PSMASH'd into 4K entries. These cases should
+	 * resolve automatically on subsequent accesses, so just ignore them
+	 * here.
+	 */
+	if (rmp_level == PG_LEVEL_4K)
+		goto out;
+
+	ret = snp_rmptable_psmash(pfn);
+	if (ret) {
+		/*
+		 * Look it up again. If it's 4K now then the PSMASH may have
+		 * raced with another process and the issue has already resolved
+		 * itself.
+		 */
+		if (!snp_lookup_rmpentry(pfn, &assigned, &rmp_level) &&
+		    assigned && rmp_level == PG_LEVEL_4K)
+			goto out;
+
+		pr_warn_ratelimited("SEV: Unable to split RMP entry for GPA 0x%llx PFN 0x%llx ret %d\n",
+				    gpa, pfn, ret);
+	}
+
+	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+out:
+	trace_kvm_rmp_fault(vcpu, gpa, pfn, error_code, rmp_level, ret);
+out_no_trace:
+	put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 422b452fbc3b..7c9807fdafc3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2043,6 +2043,7 @@ static int pf_interception(struct kvm_vcpu *vcpu)
 static int npf_interception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	int rc;
 
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
@@ -2060,10 +2061,22 @@ static int npf_interception(struct kvm_vcpu *vcpu)
 		error_code |= PFERR_PRIVATE_ACCESS;
 
 	trace_kvm_page_fault(vcpu, fault_address, error_code);
-	return kvm_mmu_page_fault(vcpu, fault_address, error_code,
-			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
-			svm->vmcb->control.insn_bytes : NULL,
-			svm->vmcb->control.insn_len);
+	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+				svm->vmcb->control.insn_bytes : NULL,
+				svm->vmcb->control.insn_len);
+
+	/*
+	 * rc == 0 indicates a userspace exit is needed to handle page
+	 * transitions, so do that first before updating the RMP table.
+	 */
+	if (error_code & PFERR_GUEST_RMP_MASK) {
+		if (rc == 0)
+			return rc;
+		sev_handle_rmp_fault(vcpu, fault_address, error_code);
+	}
+
+	return rc;
 }
 
 static int db_interception(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 438cad6c9421..d779f1f431af 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -728,6 +728,7 @@ void sev_hardware_unsetup(void);
 int sev_cpu_init(struct svm_cpu_data *sd);
 int sev_dev_get_attr(u32 group, u64 attr, u64 *val);
 extern unsigned int max_sev_asid;
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 #else
 static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu) {
 	return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
@@ -741,6 +742,8 @@ static inline void sev_hardware_unsetup(void) {}
 static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
 static inline int sev_dev_get_attr(u32 group, u64 attr, u64 *val) { return -ENXIO; }
 #define max_sev_asid 0
+static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
+
 #endif
 
 /* vmenter.S */
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index c6b4b1728006..3531a187d5d9 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1834,6 +1834,37 @@ TRACE_EVENT(kvm_vmgexit_msr_protocol_exit,
 		  __entry->vcpu_id, __entry->ghcb_gpa, __entry->result)
 );
 
+/*
+ * Tracepoint for #NPFs due to RMP faults.
+ */
+TRACE_EVENT(kvm_rmp_fault,
+	TP_PROTO(struct kvm_vcpu *vcpu, u64 gpa, u64 pfn, u64 error_code,
+		 int rmp_level, int psmash_ret),
+	TP_ARGS(vcpu, gpa, pfn, error_code, rmp_level, psmash_ret),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, vcpu_id)
+		__field(u64, gpa)
+		__field(u64, pfn)
+		__field(u64, error_code)
+		__field(int, rmp_level)
+		__field(int, psmash_ret)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id	= vcpu->vcpu_id;
+		__entry->gpa		= gpa;
+		__entry->pfn		= pfn;
+		__entry->error_code	= error_code;
+		__entry->rmp_level	= rmp_level;
+		__entry->psmash_ret	= psmash_ret;
+	),
+
+	TP_printk("vcpu %u gpa %016llx pfn 0x%llx error_code 0x%llx rmp_level %d psmash_ret %d",
+		  __entry->vcpu_id, __entry->gpa, __entry->pfn,
+		  __entry->error_code, __entry->rmp_level, __entry->psmash_ret)
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83b8260443a3..14693effec6b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13996,6 +13996,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
 
 static int __init kvm_x86_init(void)
 {
-- 
2.25.1


^ permalink raw reply related	[relevance 2%]

* [PATCH 2/3] x86/bus_lock: Add support for AMD
  2024-04-29  6:06  3% [PATCH 0/3] x86/cpu: Add Bus Lock Detect support for AMD Ravi Bangoria
@ 2024-04-29  6:06  4% ` Ravi Bangoria
  2024-05-06 16:05  0%   ` Tom Lendacky
  2024-05-06 16:24  0%   ` Jim Mattson
  0 siblings, 2 replies; 200+ results
From: Ravi Bangoria @ 2024-04-29  6:06 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, seanjc, pbonzini, thomas.lendacky
  Cc: ravi.bangoria, hpa, rmk+kernel, peterz, james.morse,
	lukas.bulwahn, arjan, j.granados, sibs, nik.borisov,
	michael.roth, nikunj.dadhania, babu.moger, x86, kvm,
	linux-kernel, santosh.shukla, ananth.narayan, sandipan.das

Upcoming AMD uarch will support Bus Lock Detect (called Bus Lock Trap
in AMD docs). Add support for the same in Linux. Bus Lock Detect is
enumerated with CPUID Fn0000_0007_ECX_x0 bit [24 / BUSLOCKTRAP].
It can be enabled through MSR_IA32_DEBUGCTLMSR. When enabled, hardware
clears DR6[11] and raises a #DB exception on occurrence of Bus Lock if
CPL > 0. More detail about the feature can be found in AMD APM[1].

[1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
     2023, Vol 2, 13.1.3.6 Bus Lock Trap
     https://bugzilla.kernel.org/attachment.cgi?id=304653
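
For illustration only, here is a minimal sketch of the enumerate-and-enable
path. This is not the patch's bus_lock_init(); it assumes the kernel's
existing X86_FEATURE_BUS_LOCK_DETECT flag and DEBUGCTLMSR_BUS_LOCK_DETECT
(DEBUGCTL bit 2), both already defined for the Intel variant of the feature:

  #include <asm/cpufeature.h>
  #include <asm/msr.h>
  #include <asm/msr-index.h>

  /* Sketch: enable Bus Lock Detect on the current CPU if enumerated. */
  static void example_enable_bus_lock_detect(void)
  {
      u64 debugctl;

      if (!boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT))
          return;

      rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
      debugctl |= DEBUGCTLMSR_BUS_LOCK_DETECT;
      wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
  }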

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 arch/x86/kernel/cpu/amd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 39f316d50ae4..013d16479a24 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1058,6 +1058,8 @@ static void init_amd(struct cpuinfo_x86 *c)
 
 	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
 	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+
+	bus_lock_init();
 }
 
 #ifdef CONFIG_X86_32
-- 
2.44.0


^ permalink raw reply related	[relevance 4%]

* [PATCH 0/3] x86/cpu: Add Bus Lock Detect support for AMD
@ 2024-04-29  6:06  3% Ravi Bangoria
  2024-04-29  6:06  4% ` [PATCH 2/3] x86/bus_lock: Add " Ravi Bangoria
  0 siblings, 1 reply; 200+ results
From: Ravi Bangoria @ 2024-04-29  6:06 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, seanjc, pbonzini, thomas.lendacky
  Cc: ravi.bangoria, hpa, rmk+kernel, peterz, james.morse,
	lukas.bulwahn, arjan, j.granados, sibs, nik.borisov,
	michael.roth, nikunj.dadhania, babu.moger, x86, kvm,
	linux-kernel, santosh.shukla, ananth.narayan, sandipan.das

Upcoming AMD uarch will support Bus Lock Detect (called Bus Lock Trap
in AMD docs). Add support for the same in Linux. Bus Lock Detect is
enumerated with CPUID Fn0000_0007_ECX_x0 bit [24 / BUSLOCKTRAP].
It can be enabled through MSR_IA32_DEBUGCTLMSR. When enabled, hardware
clears DR6[11] and raises a #DB exception on occurrence of Bus Lock if
CPL > 0. More detail about the feature can be found in AMD APM[1].

Patches are prepared on tip/x86/cpu (e063b531d4e8)

Patch #3 depends on SEV-ES LBRV fix:
https://lore.kernel.org/r/20240416050338.517-1-ravi.bangoria@amd.com

[1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
     2023, Vol 2, 13.1.3.6 Bus Lock Trap
     https://bugzilla.kernel.org/attachment.cgi?id=304653
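
For reference, a rough sketch of the detection side in a #DB handler (not
part of this series; the bit constant is defined locally for the example).
Note the inverted polarity: hardware reports the trap by clearing DR6[11],
so the raw register value has that bit at 0 when a bus lock was trapped:

  #include <linux/bits.h>

  /* Sketch only: recognize a Bus Lock Trap from a raw DR6 value. */
  #define EXAMPLE_DR6_BUS_LOCK   BIT(11)

  static bool example_is_bus_lock_db(unsigned long dr6_raw)
  {
      /* Active-low in the raw register value. */
      return !(dr6_raw & EXAMPLE_DR6_BUS_LOCK);
  }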

Ravi Bangoria (3):
  x86/split_lock: Move Split and Bus lock code to a dedicated file
  x86/bus_lock: Add support for AMD
  KVM SVM: Add Bus Lock Detect support

 arch/x86/include/asm/cpu.h           |   4 +
 arch/x86/kernel/cpu/Makefile         |   1 +
 arch/x86/kernel/cpu/amd.c            |   2 +
 arch/x86/kernel/cpu/intel.c          | 407 ---------------------------
 arch/x86/kernel/cpu/split-bus-lock.c | 406 ++++++++++++++++++++++++++
 arch/x86/kvm/svm/nested.c            |   3 +-
 arch/x86/kvm/svm/svm.c               |  16 +-
 7 files changed, 430 insertions(+), 409 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/split-bus-lock.c

-- 
2.44.0


^ permalink raw reply	[relevance 3%]

* [PATCH v11 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  @ 2024-04-24 15:49  4% ` Zhao Liu
  0 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2024-04-24 15:49 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	Daniel P . Berrangé
  Cc: qemu-devel, kvm, Zhenyu Wang, Zhuocheng Ding, Babu Moger,
	Xiaoyao Li, Dapeng Mi, Yongwei Ma, Zhao Liu

The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
for cpuid 0x8000001D") adds the cache topology for AMD CPUs by encoding
the number of sharing threads directly.

From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

From the description above, the calculation of this field should be the same
as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
APIC ID to calculate this field.

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
     Information
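
As a standalone worked example of the ShareId derivation quoted above (not
QEMU code; the helper names are local to this sketch):

  #include <stdint.h>
  #include <stdio.h>

  /* Round up to the next power of two, as the APM text requires. */
  static uint32_t next_pow2(uint32_t v)
  {
      uint32_t p = 1;

      while (p < v)
          p <<= 1;
      return p;
  }

  /* ShareId = LocalApicId >> log2(round_up_pow2(NumSharingCache + 1)) */
  static uint32_t share_id(uint32_t local_apic_id, uint32_t num_sharing_cache)
  {
      uint32_t n = next_pow2(num_sharing_cache + 1);

      return local_apic_id >> __builtin_ctz(n);
  }

  int main(void)
  {
      /* EAX[25:14] = 15 means 16 logical processors share the cache. */
      printf("ShareId for APIC ID 21: %u\n", share_id(21, 15));
      return 0;
  }

Logical processors whose APIC IDs yield the same ShareId share the cache,
which is why the patch derives the field from the APIC ID offsets.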

Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes since v7:
 * Moved this patch after CPUID[4]'s similar change ("i386/cpu: Use APIC
   ID offset to encode cache topo in CPUID[4]"). (Xiaoyao)
 * Dropped Michael/Babu's Acked/Reviewed/Tested tags since the code
   change due to the rebase.
 * Re-added Yongwei's Tested tag For his re-testing (compilation on
   Intel platforms).

Changes since v3:
 * Rewrote the subject. (Babu)
 * Deleted the original "comment/help" expression, as this behavior is
   confirmed for AMD CPUs. (Babu)
 * Renamed "num_apic_ids" (v3) to "num_sharing_cache" to match spec
   definition. (Babu)

Changes since v1:
 * Renamed "l3_threads" to "num_apic_ids" in
   encode_cache_cpuid8000001d(). (Yanan)
 * Added the description of the original commit and add Cc.
---
 target/i386/cpu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 93c2f56780f5..c13bfed28878 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -331,7 +331,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
                                        uint32_t *eax, uint32_t *ebx,
                                        uint32_t *ecx, uint32_t *edx)
 {
-    uint32_t l3_threads;
+    uint32_t num_sharing_cache;
     assert(cache->size == cache->line_size * cache->associativity *
                           cache->partitions * cache->sets);
 
@@ -340,11 +340,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
 
     /* L3 is shared among multiple cores */
     if (cache->level == 3) {
-        l3_threads = topo_info->cores_per_die * topo_info->threads_per_core;
-        *eax |= (l3_threads - 1) << 14;
+        num_sharing_cache = 1 << apicid_die_offset(topo_info);
     } else {
-        *eax |= ((topo_info->threads_per_core - 1) << 14);
+        num_sharing_cache = 1 << apicid_core_offset(topo_info);
     }
+    *eax |= (num_sharing_cache - 1) << 14;
 
     assert(cache->line_size > 0);
     assert(cache->partitions > 0);
-- 
2.34.1


^ permalink raw reply related	[relevance 4%]

* [PATCH v14 05/22] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
  @ 2024-04-21 18:01  9% ` Michael Roth
  2024-04-21 18:01  3% ` [PATCH v14 09/22] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-04-21 18:01 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. Other commands can then at that point be
used to load/encrypt data into the guest's initial launch image.

For more information see the SEV-SNP specification.
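
As a hedged illustration of the userspace call sequence (not part of this
patch; it assumes uapi headers that already contain the definitions added
below, and that vm_fd/sev_fd setup has been done elsewhere, e.g. via
KVM_SEV_INIT2 for the VM and /dev/sev for the firmware fd):

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: issue KVM_SEV_SNP_LAUNCH_START on an SNP-initialized VM. */
  static int example_snp_launch_start(int vm_fd, int sev_fd)
  {
      struct kvm_sev_snp_launch_start start = {
          /* Bit 17 is reserved-must-be-one, bit 16 allows SMT. */
          .policy = (1ULL << 17) | (1ULL << 16),
      };
      struct kvm_sev_cmd cmd = {
          .id     = KVM_SEV_SNP_LAUNCH_START,
          .data   = (unsigned long)&start,
          .sev_fd = sev_fd,
      };

      return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }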

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 .../virt/kvm/x86/amd-memory-encryption.rst    |  28 ++-
 arch/x86/include/uapi/asm/kvm.h               |  11 +
 arch/x86/kvm/svm/sev.c                        | 195 +++++++++++++++++-
 arch/x86/kvm/svm/svm.h                        |   1 +
 4 files changed, 231 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 3381556d596d..d4c4a0b90bc9 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -459,6 +459,30 @@ issued by the hypervisor to make the guest ready for execution.
 
 Returns: 0 on success, -negative on error
 
+18. KVM_SEV_SNP_LAUNCH_START
+----------------------------
+
+The KVM_SEV_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest. It must be called prior to issuing
+KVM_SEV_SNP_LAUNCH_UPDATE or KVM_SEV_SNP_LAUNCH_FINISH.
+
+Parameters (in): struct  kvm_sev_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_sev_snp_launch_start {
+                __u64 policy;           /* Guest policy to use. */
+                __u8 gosvw[16];         /* Guest OS visible workarounds. */
+                __u16 flags;            /* Must be zero. */
+                __u8 pad0[6];
+                __u64 pad1[4];
+        };
+
+See SNP_LAUNCH_START in the SEV-SNP specification [snp-fw-abi]_ for further
+details on the input parameters in ``struct kvm_sev_snp_launch_start``.
+
 Device attribute API
 ====================
 
@@ -490,9 +514,11 @@ References
 ==========
 
 
-See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
+See [white-paper]_, [api-spec]_, [amd-apm]_, [kvm-forum]_, and [snp-fw-abi]_
+for more info.
 
 .. [white-paper] https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
 .. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
 .. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
 .. [kvm-forum]  https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+.. [snp-fw-abi] https://www.amd.com/system/files/TechDocs/56860.pdf
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9a8b81d20314..5765391f0fdb 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -697,6 +697,9 @@ enum sev_cmd_id {
 	/* Second time is the charm; improved versions of the above ioctls.  */
 	KVM_SEV_INIT2,
 
+	/* SNP-specific commands */
+	KVM_SEV_SNP_LAUNCH_START = 100,
+
 	KVM_SEV_NR_MAX,
 };
 
@@ -822,6 +825,14 @@ struct kvm_sev_receive_update_data {
 	__u32 pad2;
 };
 
+struct kvm_sev_snp_launch_start {
+	__u64 policy;
+	__u8 gosvw[16];
+	__u16 flags;
+	__u8 pad0[6];
+	__u64 pad1[4];
+};
+
 #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
 #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c41cc73a1efe..9d08d1202544 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -25,6 +25,7 @@
 #include <asm/fpu/xcr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/debugreg.h>
+#include <asm/sev.h>
 
 #include "mmu.h"
 #include "x86.h"
@@ -58,6 +59,21 @@ static u64 sev_supported_vmsa_features;
 #define AP_RESET_HOLD_NAE_EVENT		1
 #define AP_RESET_HOLD_MSR_PROTO		2
 
+/* As defined by SEV-SNP Firmware ABI, under "Guest Policy". */
+#define SNP_POLICY_MASK_API_MINOR	GENMASK_ULL(7, 0)
+#define SNP_POLICY_MASK_API_MAJOR	GENMASK_ULL(15, 8)
+#define SNP_POLICY_MASK_SMT		BIT_ULL(16)
+#define SNP_POLICY_MASK_RSVD_MBO	BIT_ULL(17)
+#define SNP_POLICY_MASK_DEBUG		BIT_ULL(19)
+#define SNP_POLICY_MASK_SINGLE_SOCKET	BIT_ULL(20)
+
+#define SNP_POLICY_MASK_VALID		(SNP_POLICY_MASK_API_MINOR	| \
+					 SNP_POLICY_MASK_API_MAJOR	| \
+					 SNP_POLICY_MASK_SMT		| \
+					 SNP_POLICY_MASK_RSVD_MBO	| \
+					 SNP_POLICY_MASK_DEBUG		| \
+					 SNP_POLICY_MASK_SINGLE_SOCKET)
+
 static u8 sev_enc_bit;
 static DECLARE_RWSEM(sev_deactivate_lock);
 static DEFINE_MUTEX(sev_bitmap_lock);
@@ -68,6 +84,8 @@ static unsigned int nr_asids;
 static unsigned long *sev_asid_bitmap;
 static unsigned long *sev_reclaim_asid_bitmap;
 
+static int snp_decommission_context(struct kvm *kvm);
+
 struct enc_region {
 	struct list_head list;
 	unsigned long npages;
@@ -94,12 +112,17 @@ static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid)
 	down_write(&sev_deactivate_lock);
 
 	wbinvd_on_all_cpus();
-	ret = sev_guest_df_flush(&error);
+
+	if (sev_snp_enabled)
+		ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
+	else
+		ret = sev_guest_df_flush(&error);
 
 	up_write(&sev_deactivate_lock);
 
 	if (ret)
-		pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
+		pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+		       sev_snp_enabled ? "-SNP" : "", ret, error);
 
 	return ret;
 }
@@ -1976,6 +1999,125 @@ int sev_dev_get_attr(u32 group, u64 attr, u64 *val)
 	}
 }
 
+/*
+ * The guest context contains all the information, keys and metadata
+ * associated with the guest that the firmware tracks to implement SEV
+ * and SNP features. The firmware stores the guest context in hypervisor
+ * provide page via the SNP_GCTX_CREATE command.
+ */
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct sev_data_snp_addr data = {};
+	void *context;
+	int rc;
+
+	/* Allocate memory for context page */
+	context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+	if (!context)
+		return NULL;
+
+	data.address = __psp_pa(context);
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+	if (rc) {
+		pr_warn("Failed to create SEV-SNP context, rc %d fw_error %d",
+			rc, argp->error);
+		snp_free_firmware_page(context);
+		return NULL;
+	}
+
+	return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_activate data = {0};
+
+	data.gctx_paddr = __psp_pa(sev->snp_context);
+	data.asid = sev_get_asid(kvm);
+	return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_start start = {0};
+	struct kvm_sev_snp_launch_start params;
+	int rc;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+		return -EFAULT;
+
+	/* Don't allow userspace to allocate memory for more than 1 SNP context. */
+	if (sev->snp_context) {
+		pr_debug("%s: SEV-SNP context already exists. Refusing to allocate an additional one.\n",
+			 __func__);
+		return -EINVAL;
+	}
+
+	sev->snp_context = snp_context_create(kvm, argp);
+	if (!sev->snp_context)
+		return -ENOTTY;
+
+	if (params.flags) {
+		pr_debug("%s: SEV-SNP hypervisor does not support requested flags 0x%x\n",
+			 __func__, params.flags);
+		return -EINVAL;
+	}
+
+	if (params.policy & ~SNP_POLICY_MASK_VALID) {
+		pr_debug("%s: SEV-SNP hypervisor does not support requested policy 0x%llx (supported 0x%llx).\n",
+			 __func__, params.policy, SNP_POLICY_MASK_VALID);
+		return -EINVAL;
+	}
+
+	if (!(params.policy & SNP_POLICY_MASK_RSVD_MBO)) {
+		pr_debug("%s: SEV-SNP hypervisor does not support requested policy 0x%llx (must be set 0x%llx).\n",
+			 __func__, params.policy, SNP_POLICY_MASK_RSVD_MBO);
+		return -EINVAL;
+	}
+
+	if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET) {
+		pr_debug("%s: SEV-SNP hypervisor does not support limiting guests to a single socket.\n",
+			 __func__);
+		return -EINVAL;
+	}
+
+	if (!(params.policy & SNP_POLICY_MASK_SMT)) {
+		pr_debug("%s: SEV-SNP hypervisor does not support limiting guests to a single SMT thread.\n",
+			 __func__);
+		return -EINVAL;
+	}
+
+	start.gctx_paddr = __psp_pa(sev->snp_context);
+	start.policy = params.policy;
+	memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+	if (rc) {
+		pr_debug("%s: SEV_CMD_SNP_LAUNCH_START firmware command failed, rc %d\n",
+			 __func__, rc);
+		goto e_free_context;
+	}
+
+	sev->fd = argp->sev_fd;
+	rc = snp_bind_asid(kvm, &argp->error);
+	if (rc) {
+		pr_debug("%s: Failed to bind ASID to SEV-SNP context, rc %d\n",
+			 __func__, rc);
+		goto e_free_context;
+	}
+
+	return 0;
+
+e_free_context:
+	snp_decommission_context(kvm);
+
+	return rc;
+}
+
 int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -1999,6 +2141,15 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 		goto out;
 	}
 
+	/*
+	 * Once KVM_SEV_INIT2 initializes a KVM instance as an SNP guest, only
+	 * allow the use of SNP-specific commands.
+	 */
+	if (sev_snp_guest(kvm) && sev_cmd.id < KVM_SEV_SNP_LAUNCH_START) {
+		r = -EPERM;
+		goto out;
+	}
+
 	switch (sev_cmd.id) {
 	case KVM_SEV_ES_INIT:
 		if (!sev_es_enabled) {
@@ -2063,6 +2214,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_RECEIVE_FINISH:
 		r = sev_receive_finish(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_START:
+		r = snp_launch_start(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -2258,6 +2412,33 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
 	return ret;
 }
 
+static int snp_decommission_context(struct kvm *kvm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_addr data = {};
+	int ret;
+
+	/* If context is not created then do nothing */
+	if (!sev->snp_context)
+		return 0;
+
+	data.address = __sme_pa(sev->snp_context);
+	down_write(&sev_deactivate_lock);
+	ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
+	if (WARN_ONCE(ret, "failed to release guest context")) {
+		up_write(&sev_deactivate_lock);
+		return ret;
+	}
+
+	up_write(&sev_deactivate_lock);
+
+	/* free the context page now */
+	snp_free_firmware_page(sev->snp_context);
+	sev->snp_context = NULL;
+
+	return 0;
+}
+
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -2299,7 +2480,15 @@ void sev_vm_destroy(struct kvm *kvm)
 		}
 	}
 
-	sev_unbind_asid(kvm, sev->handle);
+	if (sev_snp_guest(kvm)) {
+		if (snp_decommission_context(kvm)) {
+			WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
+			return;
+		}
+	} else {
+		sev_unbind_asid(kvm, sev->handle);
+	}
+
 	sev_asid_free(sev);
 }
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7f2e9c7fc4ca..0654fc91d4db 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -92,6 +92,7 @@ struct kvm_sev_info {
 	struct list_head mirror_entry; /* Use as a list entry of mirrors */
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
 	atomic_t migration_in_progress;
+	void *snp_context;      /* SNP guest context page */
 };
 
 struct kvm_svm {
-- 
2.25.1


^ permalink raw reply related	[relevance 9%]

* [PATCH v14 11/22] KVM: SEV: Add support to handle RMP nested page faults
                     ` (2 preceding siblings ...)
  2024-04-21 18:01  4% ` [PATCH v14 10/22] KVM: SEV: Add support to handle " Michael Roth
@ 2024-04-21 18:01  2% ` Michael Roth
  3 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-04-21 18:01 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When
hardware encounters RMP check failure caused by the guest memory access
it raises the #NPF. The error code contains additional information on
the access type. See the APM volume 2 for additional information.

When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/include/asm/sev.h      |   3 +
 arch/x86/kvm/mmu.h              |   2 -
 arch/x86/kvm/mmu/mmu.c          |   1 +
 arch/x86/kvm/svm/sev.c          | 109 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  21 ++++--
 arch/x86/kvm/svm/svm.h          |   3 +
 arch/x86/kvm/trace.h            |  31 +++++++++
 arch/x86/kvm/x86.c              |   1 +
 9 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4c9d8a22840a..90f0de2b8645 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1945,6 +1945,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 7f57382afee4..3a06f06b847a 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -91,6 +91,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 /* RMUPDATE detected 4K page and 2MB page overlap. */
 #define RMPUPDATE_FAIL_OVERLAP		4
 
+/* PSMASH failed due to concurrent access by another CPU */
+#define PSMASH_FAIL_INUSE		3
+
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
 #define RMP_PG_SIZE_2M			1
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2343c9f00e31..e3cb35b9396d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -251,8 +251,6 @@ static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
 	return __kvm_mmu_honors_guest_mtrrs(kvm_arch_has_noncoherent_dma(kvm));
 }
 
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eebb1562c5bc..b6d0aa18b72b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6752,6 +6752,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 
 	return need_tlb_flush;
 }
+EXPORT_SYMBOL_GPL(kvm_zap_gfn_range);
 
 static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
 					   const struct kvm_memory_slot *slot)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e9519af8f14c..65882033a82f 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3464,6 +3464,23 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(kvm_pfn_t pfn)
+{
+	int ret;
+
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	/*
+	 * PSMASH_FAIL_INUSE indicates another processor is modifying the
+	 * entry, so retry until that's no longer the case.
+	 */
+	do {
+		ret = psmash(pfn);
+	} while (ret == PSMASH_FAIL_INUSE);
+
+	return ret;
+}
+
 static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4023,3 +4040,95 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return p;
 }
+
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = vcpu->kvm;
+	int order, rmp_level, ret;
+	bool assigned;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	gfn = gpa >> PAGE_SHIFT;
+
+	/*
+	 * The only time RMP faults occur for shared pages is when the guest is
+	 * triggering an RMP fault for an implicit page-state change from
+	 * shared->private. Implicit page-state changes are forwarded to
+	 * userspace via KVM_EXIT_MEMORY_FAULT events, however, so RMP faults
+	 * for shared pages should not end up here.
+	 */
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault for non-private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);
+	if (ret) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no backing page for private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+	if (ret || !assigned) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
+				    gpa, pfn, ret);
+		goto out_no_trace;
+	}
+
+	/*
+	 * There are 2 cases where a PSMASH may be needed to resolve an #NPF
+	 * with PFERR_GUEST_RMP_BIT set:
+	 *
+	 * 1) RMPADJUST/PVALIDATE can trigger an #NPF with PFERR_GUEST_SIZEM
+	 *    bit set if the guest issues them with a smaller granularity than
+	 *    what is indicated by the page-size bit in the 2MB RMP entry for
+	 *    the PFN that backs the GPA.
+	 *
+	 * 2) Guest access via NPT can trigger an #NPF if the NPT mapping is
+	 *    smaller than what is indicated by the 2MB RMP entry for the PFN
+	 *    that backs the GPA.
+	 *
+	 * In both these cases, the corresponding 2M RMP entry needs to
+	 * be PSMASH'd to 512 4K RMP entries.  If the RMP entry is already
+	 * split into 4K RMP entries, then this is likely a spurious case which
+	 * can occur when there are concurrent accesses by the guest to a 2MB
+	 * GPA range that is backed by a 2MB-aligned PFN whose RMP entry is in
+	 * the process of being PSMASH'd into 4K entries. These cases should
+	 * resolve automatically on subsequent accesses, so just ignore them
+	 * here.
+	 */
+	if (rmp_level == PG_LEVEL_4K)
+		goto out;
+
+	ret = snp_rmptable_psmash(pfn);
+	if (ret) {
+		/*
+		 * Look it up again. If it's 4K now then the PSMASH may have
+		 * raced with another process and the issue has already resolved
+		 * itself.
+		 */
+		if (!snp_lookup_rmpentry(pfn, &assigned, &rmp_level) &&
+		    assigned && rmp_level == PG_LEVEL_4K)
+			goto out;
+
+		pr_warn_ratelimited("SEV: Unable to split RMP entry for GPA 0x%llx PFN 0x%llx ret %d\n",
+				    gpa, pfn, ret);
+	}
+
+	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+out:
+	trace_kvm_rmp_fault(vcpu, gpa, pfn, error_code, rmp_level, ret);
+out_no_trace:
+	put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 422b452fbc3b..7c9807fdafc3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2043,6 +2043,7 @@ static int pf_interception(struct kvm_vcpu *vcpu)
 static int npf_interception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	int rc;
 
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
@@ -2060,10 +2061,22 @@ static int npf_interception(struct kvm_vcpu *vcpu)
 		error_code |= PFERR_PRIVATE_ACCESS;
 
 	trace_kvm_page_fault(vcpu, fault_address, error_code);
-	return kvm_mmu_page_fault(vcpu, fault_address, error_code,
-			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
-			svm->vmcb->control.insn_bytes : NULL,
-			svm->vmcb->control.insn_len);
+	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+				svm->vmcb->control.insn_bytes : NULL,
+				svm->vmcb->control.insn_len);
+
+	/*
+	 * rc == 0 indicates a userspace exit is needed to handle page
+	 * transitions, so do that first before updating the RMP table.
+	 */
+	if (error_code & PFERR_GUEST_RMP_MASK) {
+		if (rc == 0)
+			return rc;
+		sev_handle_rmp_fault(vcpu, fault_address, error_code);
+	}
+
+	return rc;
 }
 
 static int db_interception(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 730f5ced2a2e..d2b0ec27d4fe 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -722,6 +722,7 @@ void sev_hardware_unsetup(void);
 int sev_cpu_init(struct svm_cpu_data *sd);
 int sev_dev_get_attr(u32 group, u64 attr, u64 *val);
 extern unsigned int max_sev_asid;
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 #else
 static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu) {
 	return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
@@ -735,6 +736,8 @@ static inline void sev_hardware_unsetup(void) {}
 static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
 static inline int sev_dev_get_attr(u32 group, u64 attr, u64 *val) { return -ENXIO; }
 #define max_sev_asid 0
+static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
+
 #endif
 
 /* vmenter.S */
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index c6b4b1728006..3531a187d5d9 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1834,6 +1834,37 @@ TRACE_EVENT(kvm_vmgexit_msr_protocol_exit,
 		  __entry->vcpu_id, __entry->ghcb_gpa, __entry->result)
 );
 
+/*
+ * Tracepoint for #NPFs due to RMP faults.
+ */
+TRACE_EVENT(kvm_rmp_fault,
+	TP_PROTO(struct kvm_vcpu *vcpu, u64 gpa, u64 pfn, u64 error_code,
+		 int rmp_level, int psmash_ret),
+	TP_ARGS(vcpu, gpa, pfn, error_code, rmp_level, psmash_ret),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, vcpu_id)
+		__field(u64, gpa)
+		__field(u64, pfn)
+		__field(u64, error_code)
+		__field(int, rmp_level)
+		__field(int, psmash_ret)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id	= vcpu->vcpu_id;
+		__entry->gpa		= gpa;
+		__entry->pfn		= pfn;
+		__entry->error_code	= error_code;
+		__entry->rmp_level	= rmp_level;
+		__entry->psmash_ret	= psmash_ret;
+	),
+
+	TP_printk("vcpu %u gpa %016llx pfn 0x%llx error_code 0x%llx rmp_level %d psmash_ret %d",
+		  __entry->vcpu_id, __entry->gpa, __entry->pfn,
+		  __entry->error_code, __entry->rmp_level, __entry->psmash_ret)
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83b8260443a3..14693effec6b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13996,6 +13996,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
 
 static int __init kvm_x86_init(void)
 {
-- 
2.25.1


^ permalink raw reply related	[relevance 2%]

* [PATCH v14 10/22] KVM: SEV: Add support to handle Page State Change VMGEXIT
    2024-04-21 18:01  9% ` [PATCH v14 05/22] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
  2024-04-21 18:01  3% ` [PATCH v14 09/22] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
@ 2024-04-21 18:01  4% ` Michael Roth
  2024-04-21 18:01  2% ` [PATCH v14 11/22] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  3 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-04-21 18:01 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change NAE event
as defined in the GHCB specification version 2.

Forward these requests to userspace as KVM_EXIT_VMGEXITs, similar to how
it is done for requests that don't use a GHCB page.
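
A hedged sketch of the userspace side for a single parsed entry follows
(not part of this patch; walking the GHCB-defined structure found at
run->vmgexit.psc.shared_gpa is elided, and the VMM is assumed to set
run->vmgexit.psc.ret and re-enter KVM_RUN once all entries are handled):

  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: apply one PSC entry via KVM_SET_MEMORY_ATTRIBUTES. */
  static int example_apply_psc_entry(int vm_fd, __u64 gpa, __u64 size,
                                     bool make_private)
  {
      struct kvm_memory_attributes attrs = {
          .address    = gpa,
          .size       = size,
          .attributes = make_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
      };

      return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
  }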

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 Documentation/virt/kvm/api.rst | 14 ++++++++++++++
 arch/x86/kvm/svm/sev.c         | 16 ++++++++++++++++
 include/uapi/linux/kvm.h       |  5 +++++
 3 files changed, 35 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 4a7a2945bc78..85099198a10f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7065,6 +7065,7 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 		/* KVM_EXIT_VMGEXIT */
 		struct kvm_user_vmgexit {
 		#define KVM_USER_VMGEXIT_PSC_MSR	1
+		#define KVM_USER_VMGEXIT_PSC		2
 			__u32 type; /* KVM_USER_VMGEXIT_* type */
 			union {
 				struct {
@@ -7074,9 +7075,14 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 					__u8 op;
 					__u32 ret;
 				} psc_msr;
+				struct {
+					__u64 shared_gpa;
+					__u64 ret;
+				} psc;
 			};
 		};
 
+
 If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
 has issued a VMGEXIT instruction (as documented by the AMD Architecture
 Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
@@ -7094,6 +7100,14 @@ update the private/shared state of the GPA using the corresponding
 KVM_SET_MEMORY_ATTRIBUTES ioctl. The 'ret' field is to be set to 0 by
 userspace on success, or some non-zero value on failure.
 
+For the KVM_USER_VMGEXIT_PSC type, the psc union type is used. The kernel
+will supply the GPA of the Page State Structure defined in the GHCB spec.
+Userspace will process this structure as defined by the GHCB, and issue
+KVM_SET_MEMORY_ATTRIBUTES ioctls to set the GPAs therein to the expected
+private/shared state. Userspace will return a value in 'ret' that is in
+agreement with the GHCB-defined return values that the guest will expect
+in the SW_EXITINFO2 field of the GHCB in response to these requests.
+
 6. Capabilities that can be enabled on vCPUs
 ============================================
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index f6f54a889fde..e9519af8f14c 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3275,6 +3275,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 	case SVM_VMGEXIT_AP_JUMP_TABLE:
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 	case SVM_VMGEXIT_HV_FEATURES:
+	case SVM_VMGEXIT_PSC:
 		break;
 	default:
 		reason = GHCB_ERR_INVALID_EVENT;
@@ -3493,6 +3494,15 @@ static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
 	return 0; /* forward request to userspace */
 }
 
+static int snp_complete_psc(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, vcpu->run->vmgexit.psc.ret);
+
+	return 1; /* resume guest */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3730,6 +3740,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 
 		ret = 1;
 		break;
+	case SVM_VMGEXIT_PSC:
+		vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+		vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC;
+		vcpu->run->vmgexit.psc.shared_gpa = svm->sev_es.sw_scratch;
+		vcpu->arch.complete_userspace_io = snp_complete_psc;
+		break;
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 54b81e46a9fa..e33c48bfbd67 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -137,6 +137,7 @@ struct kvm_xen_exit {
 
 struct kvm_user_vmgexit {
 #define KVM_USER_VMGEXIT_PSC_MSR	1
+#define KVM_USER_VMGEXIT_PSC		2
 	__u32 type; /* KVM_USER_VMGEXIT_* type */
 	union {
 		struct {
@@ -146,6 +147,10 @@ struct kvm_user_vmgexit {
 			__u8 op;
 			__u32 ret;
 		} psc_msr;
+		struct {
+			__u64 shared_gpa;
+			__u64 ret;
+		} psc;
 	};
 };
 
-- 
2.25.1


^ permalink raw reply related	[relevance 4%]

* [PATCH v14 09/22] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT
    2024-04-21 18:01  9% ` [PATCH v14 05/22] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
@ 2024-04-21 18:01  3% ` Michael Roth
  2024-04-21 18:01  4% ` [PATCH v14 10/22] KVM: SEV: Add support to handle " Michael Roth
  2024-04-21 18:01  2% ` [PATCH v14 11/22] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  3 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-04-21 18:01 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change MSR protocol
as defined in the GHCB specification.

When using gmem, private/shared memory is allocated through separate
pools, and KVM relies on userspace issuing a KVM_SET_MEMORY_ATTRIBUTES
KVM ioctl to tell the KVM MMU whether or not a particular GFN should be
backed by private memory or not.

Forward these page state change requests to userspace so that it can
issue the expected KVM ioctls. The KVM MMU will handle updating the RMP
entries when it is ready to map a private page into a guest.

Define a new KVM_EXIT_VMGEXIT for exits of this type, and structure it
so that it can be extended for other cases where VMGEXITs need some
level of handling in userspace.
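
For illustration, a hedged sketch of how a VMM might service this exit
(not part of this patch; the kvm_run layout is the one added below, and
the non-zero failure value is arbitrary from KVM's point of view):

  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: handle a KVM_EXIT_VMGEXIT of type KVM_USER_VMGEXIT_PSC_MSR. */
  static void example_handle_psc_msr(int vm_fd, struct kvm_run *run)
  {
      bool make_private =
          run->vmgexit.psc_msr.op == KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE;
      struct kvm_memory_attributes attrs = {
          .address    = run->vmgexit.psc_msr.gpa,
          .size       = 4096,
          .attributes = make_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
      };

      /* Non-zero 'ret' is reported back to the guest as a failure. */
      run->vmgexit.psc_msr.ret =
          ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) ? 1 : 0;
  }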

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 Documentation/virt/kvm/api.rst    | 33 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/sev-common.h |  6 ++++++
 arch/x86/kvm/svm/sev.c            | 33 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h          | 17 ++++++++++++++++
 4 files changed, 89 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..4a7a2945bc78 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7060,6 +7060,39 @@ Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
+::
+
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit {
+		#define KVM_USER_VMGEXIT_PSC_MSR	1
+			__u32 type; /* KVM_USER_VMGEXIT_* type */
+			union {
+				struct {
+					__u64 gpa;
+		#define KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE	1
+		#define KVM_USER_VMGEXIT_PSC_MSR_OP_SHARED	2
+					__u8 op;
+					__u32 ret;
+				} psc_msr;
+			};
+		};
+
+If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
+has issued a VMGEXIT instruction (as documented by the AMD Architecture
+Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
+userspace. These are generally handled by the host kernel, but in some
+cases some aspects handling a VMGEXIT are handled by userspace.
+
+A kvm_user_vmgexit structure is defined to encapsulate the data to be
+sent to or returned by userspace. The type field defines the specific type
+of exit that needs to be serviced, and that type is used as a discriminator
+to determine which union type should be used for input/output.
+
+For the KVM_USER_VMGEXIT_PSC_MSR type, the psc_msr union type is used. The
+kernel will supply the 'gpa' and 'op' fields, and userspace is expected to
+update the private/shared state of the GPA using the corresponding
+KVM_SET_MEMORY_ATTRIBUTES ioctl. The 'ret' field is to be set to 0 by
+userspace on success, or some non-zero value on failure.
 
 6. Capabilities that can be enabled on vCPUs
 ============================================
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 1006bfffe07a..6d68db812de1 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -101,11 +101,17 @@ enum psc_op {
 	/* GHCBData[11:0] */				\
 	GHCB_MSR_PSC_REQ)
 
+#define GHCB_MSR_PSC_REQ_TO_GFN(msr) (((msr) & GENMASK_ULL(51, 12)) >> 12)
+#define GHCB_MSR_PSC_REQ_TO_OP(msr) (((msr) & GENMASK_ULL(55, 52)) >> 52)
+
 #define GHCB_MSR_PSC_RESP		0x015
 #define GHCB_MSR_PSC_RESP_VAL(val)			\
 	/* GHCBData[63:32] */				\
 	(((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
 
+/* Set highest bit as a generic error response */
+#define GHCB_MSR_PSC_RESP_ERROR (BIT_ULL(63) | GHCB_MSR_PSC_RESP)
+
 /* GHCB Hypervisor Feature Request/Response */
 #define GHCB_MSR_HV_FT_REQ		0x080
 #define GHCB_MSR_HV_FT_RESP		0x081
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 76084e109f66..f6f54a889fde 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3463,6 +3463,36 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	u64 vmm_ret = vcpu->run->vmgexit.psc_msr.ret;
+
+	set_ghcb_msr(svm, (vmm_ret << 32) | GHCB_MSR_PSC_RESP);
+
+	return 1; /* resume guest */
+}
+
+static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
+{
+	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
+	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
+		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
+		return 1; /* resume guest */
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
+	vcpu->run->vmgexit.psc_msr.gpa = gpa;
+	vcpu->run->vmgexit.psc_msr.op = op;
+	vcpu->arch.complete_userspace_io = snp_complete_psc_msr;
+
+	return 0; /* forward request to userspace */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3561,6 +3591,9 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				  GHCB_MSR_INFO_POS);
 		break;
 	}
+	case GHCB_MSR_PSC_REQ:
+		ret = snp_begin_psc_msr(vcpu, control->ghcb_gpa);
+		break;
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..54b81e46a9fa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -135,6 +135,20 @@ struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_user_vmgexit {
+#define KVM_USER_VMGEXIT_PSC_MSR	1
+	__u32 type; /* KVM_USER_VMGEXIT_* type */
+	union {
+		struct {
+			__u64 gpa;
+#define KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE	1
+#define KVM_USER_VMGEXIT_PSC_MSR_OP_SHARED	2
+			__u8 op;
+			__u32 ret;
+		} psc_msr;
+	};
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -178,6 +192,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_VMGEXIT          40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -433,6 +448,8 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit vmgexit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1


^ permalink raw reply related	[relevance 3%]

* [kvm-unit-tests RFC PATCH 09/13] x86 AMD SEV-SNP: Test Shared->Private Page State Changes using GHCB MSR
  @ 2024-04-19 12:57  3% ` Pavan Kumar Paluri
  0 siblings, 0 replies; 200+ results
From: Pavan Kumar Paluri @ 2024-04-19 12:57 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Michael Roth, Tom Lendacky,
	Pavan Kumar Paluri, Kim Phillips, Vasant Karasulli

The SEV-SNP guest VM issues page state change requests to the hypervisor to
convert hypervisor-owned 4K shared pages back to private (guest-owned)
using the GHCB MSR protocol. The guest then issues a 'pvalidate' instruction to
validate the pages after the conversions.

The purpose of this test is to determine whether the hypervisor changes
the page state to private. After the conversion test, issue a re-validation
('pvalidate' with validated bit set) on one of the converted 4K pages to
ensure the page state is actually private.

It is important to note that the re-validation test cannot be performed
on a shared page ('pvalidate' with the validated bit unset), as the pvalidate
instruction will raise an undefined #PF exception because the page's C-bit
will be 0 during the guest page table walk, as mentioned in APM Vol-3,
PVALIDATE. Therefore, perform writes to the shared pages (with C-bit unset)
to ensure the state of the pages is shared.
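
For reference, a hedged sketch of a PVALIDATE wrapper (encoding and
operands per APM Vol-3, "PVALIDATE"; the helper actually used by this
series may differ):

  /*
   * Sketch only: rAX = vaddr, rCX = RMP page size (0 = 4K, 1 = 2M),
   * rDX = validated bit. Status is returned in EAX; CF=1 means the RMP
   * entry was already in the requested state. 'bool' as provided by
   * libcflat.h in kvm-unit-tests is assumed.
   */
  static inline int example_pvalidate(unsigned long vaddr, int rmp_psize,
                                      int validate, bool *no_update)
  {
      unsigned char cf;
      int rc;

      asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF\n\t"
                   "setc %1\n\t"
                   : "=a"(rc), "=qm"(cf)
                   : "a"(vaddr), "c"(rmp_psize), "d"(validate)
                   : "memory", "cc");

      *no_update = cf;
      return rc;
  }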

Signed-off-by: Pavan Kumar Paluri <papaluri@amd.com>
---
 x86/amd_sev.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/x86/amd_sev.c b/x86/amd_sev.c
index 71d1ee1cef91..31d15b49fc7a 100644
--- a/x86/amd_sev.c
+++ b/x86/amd_sev.c
@@ -206,8 +206,28 @@ static void set_pte_decrypted(unsigned long vaddr, int npages)
 	flush_tlb();
 }
 
-static efi_status_t sev_set_pages_state_msr_proto(unsigned long vaddr, int npages,
-						  int operation)
+static void set_pte_encrypted(unsigned long vaddr, int npages)
+{
+	pteval_t *pte;
+	unsigned long vaddr_end = vaddr + (npages * PAGE_SIZE);
+
+	while (vaddr < vaddr_end) {
+		pte = get_pte((pgd_t *)read_cr3(), (void *)vaddr);
+
+		if (!pte)
+			assert_msg(pte, "No pte found for vaddr 0x%lx", vaddr);
+
+		/* Set C-bit */
+		*pte |= get_amd_sev_c_bit_mask();
+
+		vaddr += PAGE_SIZE;
+	}
+
+	flush_tlb();
+}
+
+static efi_status_t sev_set_pages_state_msr_proto(unsigned long vaddr,
+						  int npages, int operation)
 {
 	efi_status_t status;
 
@@ -226,6 +246,16 @@ static efi_status_t sev_set_pages_state_msr_proto(unsigned long vaddr, int npage
 		}
 
 		set_pte_decrypted(vaddr, npages);
+
+	} else {
+		set_pte_encrypted(vaddr, npages);
+
+		status = __sev_set_pages_state_msr_proto(vaddr, npages,
+							 operation);
+		if (status != ES_OK) {
+			printf("Page state change (Shared->Private) failure.\n");
+			return status;
+		}
 	}
 
 	return ES_OK;
@@ -358,6 +388,21 @@ static void test_sev_psc_ghcb_msr(void)
 		       1 << SNP_PSC_ALLOC_ORDER);
 	}
 
+	report_info("Shared->Private conversion test using GHCB MSR");
+	status = sev_set_pages_state_msr_proto((unsigned long)vaddr,
+					       1 << SNP_PSC_ALLOC_ORDER,
+					       SNP_PAGE_STATE_PRIVATE);
+
+	report(status == ES_OK, "Shared->Private Page State Change");
+
+	/*
+	 * After performing shared->private test, ensure the page is in
+	 * private state by issuing a pvalidate on a 4K page.
+	 */
+	report(is_validated_private_page((unsigned long)vaddr,
+					 RMP_PG_SIZE_4K, true),
+	       "Expected page state: Private");
+
 	/* Cleanup */
 	free_pages_by_order(vaddr, SNP_PSC_ALLOC_ORDER);
 }
-- 
2.34.1


^ permalink raw reply related	[relevance 3%]

* [PATCH v13 15/26] KVM: SEV: Add support to handle RMP nested page faults
                     ` (2 preceding siblings ...)
  2024-04-18 19:41  4% ` [PATCH v13 14/26] KVM: SEV: Add support to handle " Michael Roth
@ 2024-04-18 19:41  2% ` Michael Roth
  3 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-04-18 19:41 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When
hardware encounters an RMP check failure caused by a guest memory access,
it raises a #NPF. The error code contains additional information on
the access type. See the APM volume 2 for additional information.

When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/include/asm/sev.h      |   3 +
 arch/x86/kvm/mmu.h              |   2 -
 arch/x86/kvm/mmu/mmu.c          |   1 +
 arch/x86/kvm/svm/sev.c          | 109 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  21 ++++--
 arch/x86/kvm/svm/svm.h          |   3 +
 arch/x86/kvm/trace.h            |  31 +++++++++
 arch/x86/kvm/x86.c              |   1 +
 9 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 744f8c920952..6f03e7649780 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1940,6 +1940,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 780182cda3ab..234a998e2d2d 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -91,6 +91,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 /* RMUPDATE detected 4K page and 2MB page overlap. */
 #define RMPUPDATE_FAIL_OVERLAP		4
 
+/* PSMASH failed due to concurrent access by another CPU */
+#define PSMASH_FAIL_INUSE		3
+
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
 #define RMP_PG_SIZE_2M			1
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e8b620a85627..3317711540cd 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -252,8 +252,6 @@ static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
 	return __kvm_mmu_honors_guest_mtrrs(kvm_arch_has_noncoherent_dma(kvm));
 }
 
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 748b9064567e..03b98c14cee1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6744,6 +6744,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 
 	return need_tlb_flush;
 }
+EXPORT_SYMBOL_GPL(kvm_zap_gfn_range);
 
 static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
 					   const struct kvm_memory_slot *slot)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 96e24a1e34e3..0f70b057bfb8 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3455,6 +3455,23 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(kvm_pfn_t pfn)
+{
+	int ret;
+
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	/*
+	 * PSMASH_FAIL_INUSE indicates another processor is modifying the
+	 * entry, so retry until that's no longer the case.
+	 */
+	do {
+		ret = psmash(pfn);
+	} while (ret == PSMASH_FAIL_INUSE);
+
+	return ret;
+}
+
 static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4014,3 +4031,95 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return p;
 }
+
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = vcpu->kvm;
+	int order, rmp_level, ret;
+	bool assigned;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	gfn = gpa >> PAGE_SHIFT;
+
+	/*
+	 * The only time RMP faults occur for shared pages is when the guest is
+	 * triggering an RMP fault for an implicit page-state change from
+	 * shared->private. Implicit page-state changes are forwarded to
+	 * userspace via KVM_EXIT_MEMORY_FAULT events, however, so RMP faults
+	 * for shared pages should not end up here.
+	 */
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault for non-private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);
+	if (ret) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no backing page for private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+	if (ret || !assigned) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
+				    gpa, pfn, ret);
+		goto out_no_trace;
+	}
+
+	/*
+	 * There are 2 cases where a PSMASH may be needed to resolve an #NPF
+	 * with PFERR_GUEST_RMP_BIT set:
+	 *
+	 * 1) RMPADJUST/PVALIDATE can trigger an #NPF with PFERR_GUEST_SIZEM
+	 *    bit set if the guest issues them with a smaller granularity than
+	 *    what is indicated by the page-size bit in the 2MB RMP entry for
+	 *    the PFN that backs the GPA.
+	 *
+	 * 2) Guest access via NPT can trigger an #NPF if the NPT mapping is
+	 *    smaller than what is indicated by the 2MB RMP entry for the PFN
+	 *    that backs the GPA.
+	 *
+	 * In both these cases, the corresponding 2M RMP entry needs to
+	 * be PSMASH'd to 512 4K RMP entries.  If the RMP entry is already
+	 * split into 4K RMP entries, then this is likely a spurious case which
+	 * can occur when there are concurrent accesses by the guest to a 2MB
+	 * GPA range that is backed by a 2MB-aligned PFN whose RMP entry is in
+	 * the process of being PSMASH'd into 4K entries. These cases should
+	 * resolve automatically on subsequent accesses, so just ignore them
+	 * here.
+	 */
+	if (rmp_level == PG_LEVEL_4K)
+		goto out;
+
+	ret = snp_rmptable_psmash(pfn);
+	if (ret) {
+		/*
+		 * Look it up again. If it's 4K now then the PSMASH may have
+		 * raced with another process and the issue has already resolved
+		 * itself.
+		 */
+		if (!snp_lookup_rmpentry(pfn, &assigned, &rmp_level) &&
+		    assigned && rmp_level == PG_LEVEL_4K)
+			goto out;
+
+		pr_warn_ratelimited("SEV: Unable to split RMP entry for GPA 0x%llx PFN 0x%llx ret %d\n",
+				    gpa, pfn, ret);
+	}
+
+	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+out:
+	trace_kvm_rmp_fault(vcpu, gpa, pfn, error_code, rmp_level, ret);
+out_no_trace:
+	put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d31404953bf1..1cddf7a2aec1 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2043,6 +2043,7 @@ static int pf_interception(struct kvm_vcpu *vcpu)
 static int npf_interception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	int rc;
 
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
@@ -2057,10 +2058,22 @@ static int npf_interception(struct kvm_vcpu *vcpu)
 		error_code &= ~PFERR_SYNTHETIC_MASK;
 
 	trace_kvm_page_fault(vcpu, fault_address, error_code);
-	return kvm_mmu_page_fault(vcpu, fault_address, error_code,
-			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
-			svm->vmcb->control.insn_bytes : NULL,
-			svm->vmcb->control.insn_len);
+	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+				svm->vmcb->control.insn_bytes : NULL,
+				svm->vmcb->control.insn_len);
+
+	/*
+	 * rc == 0 indicates a userspace exit is needed to handle page
+	 * transitions, so do that first before updating the RMP table.
+	 */
+	if (error_code & PFERR_GUEST_RMP_MASK) {
+		if (rc == 0)
+			return rc;
+		sev_handle_rmp_fault(vcpu, fault_address, error_code);
+	}
+
+	return rc;
 }
 
 static int db_interception(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 730f5ced2a2e..d2b0ec27d4fe 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -722,6 +722,7 @@ void sev_hardware_unsetup(void);
 int sev_cpu_init(struct svm_cpu_data *sd);
 int sev_dev_get_attr(u32 group, u64 attr, u64 *val);
 extern unsigned int max_sev_asid;
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 #else
 static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu) {
 	return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
@@ -735,6 +736,8 @@ static inline void sev_hardware_unsetup(void) {}
 static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
 static inline int sev_dev_get_attr(u32 group, u64 attr, u64 *val) { return -ENXIO; }
 #define max_sev_asid 0
+static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
+
 #endif
 
 /* vmenter.S */
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index c6b4b1728006..3531a187d5d9 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1834,6 +1834,37 @@ TRACE_EVENT(kvm_vmgexit_msr_protocol_exit,
 		  __entry->vcpu_id, __entry->ghcb_gpa, __entry->result)
 );
 
+/*
+ * Tracepoint for #NPFs due to RMP faults.
+ */
+TRACE_EVENT(kvm_rmp_fault,
+	TP_PROTO(struct kvm_vcpu *vcpu, u64 gpa, u64 pfn, u64 error_code,
+		 int rmp_level, int psmash_ret),
+	TP_ARGS(vcpu, gpa, pfn, error_code, rmp_level, psmash_ret),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, vcpu_id)
+		__field(u64, gpa)
+		__field(u64, pfn)
+		__field(u64, error_code)
+		__field(int, rmp_level)
+		__field(int, psmash_ret)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id	= vcpu->vcpu_id;
+		__entry->gpa		= gpa;
+		__entry->pfn		= pfn;
+		__entry->error_code	= error_code;
+		__entry->rmp_level	= rmp_level;
+		__entry->psmash_ret	= psmash_ret;
+	),
+
+	TP_printk("vcpu %u gpa %016llx pfn 0x%llx error_code 0x%llx rmp_level %d psmash_ret %d",
+		  __entry->vcpu_id, __entry->gpa, __entry->pfn,
+		  __entry->error_code, __entry->rmp_level, __entry->psmash_ret)
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9923921904a2..a9d014961d2b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13996,6 +13996,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
 
 static int __init kvm_x86_init(void)
 {
-- 
2.25.1


^ permalink raw reply related	[relevance 2%]

* [PATCH v13 14/26] KVM: SEV: Add support to handle Page State Change VMGEXIT
    2024-04-18 19:41  9% ` [PATCH v13 09/26] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
  2024-04-18 19:41  3% ` [PATCH v13 13/26] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
@ 2024-04-18 19:41  4% ` Michael Roth
  2024-04-18 19:41  2% ` [PATCH v13 15/26] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  3 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-04-18 19:41 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change NAE event
as defined in the GHCB specification version 2.

Forward these requests to userspace as KVM_EXIT_VMGEXITs, similar to how
it is done for requests that don't use a GHCB page.

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
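(Not part of the patch, just for illustration: a VMM's run loop might service
this exit roughly as sketched below. The kvm_run field names match the uapi
addition in this patch, but gpa_to_hva(), the psc_desc layout, and the
psc_entry_*() accessors are hypothetical stand-ins for the VMM's own GHCB
Page State Change parsing; the usual <sys/ioctl.h>/<linux/kvm.h> headers with
this series' uapi changes are assumed.)

    static void handle_vmgexit_psc(int vm_fd, struct kvm_run *run)
    {
            /* Map the guest's shared PSC buffer into the VMM's address space. */
            struct psc_desc *desc = gpa_to_hva(run->vmgexit.psc.shared_gpa);
            int i;

            for (i = desc->hdr.cur_entry; i <= desc->hdr.end_entry; i++) {
                    struct kvm_memory_attributes attrs = {
                            .address    = psc_entry_gpa(desc, i),
                            .size       = psc_entry_size(desc, i),
                            .attributes = psc_entry_is_private(desc, i) ?
                                          KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
                    };

                    /* Flip the private/shared attribute for this GPA range. */
                    if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs)) {
                            /* Placeholder for a non-zero, GHCB-defined error code. */
                            run->vmgexit.psc.ret = 1;
                            return;
                    }
                    desc->hdr.cur_entry++;
            }

            run->vmgexit.psc.ret = 0;
    }
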
 Documentation/virt/kvm/api.rst | 14 ++++++++++++++
 arch/x86/kvm/svm/sev.c         | 16 ++++++++++++++++
 include/uapi/linux/kvm.h       |  5 +++++
 3 files changed, 35 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 4a7a2945bc78..85099198a10f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7065,6 +7065,7 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 		/* KVM_EXIT_VMGEXIT */
 		struct kvm_user_vmgexit {
 		#define KVM_USER_VMGEXIT_PSC_MSR	1
+		#define KVM_USER_VMGEXIT_PSC		2
 			__u32 type; /* KVM_USER_VMGEXIT_* type */
 			union {
 				struct {
@@ -7074,9 +7075,14 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 					__u8 op;
 					__u32 ret;
 				} psc_msr;
+				struct {
+					__u64 shared_gpa;
+					__u64 ret;
+				} psc;
 			};
 		};
 
+
 If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
 has issued a VMGEXIT instruction (as documented by the AMD Architecture
 Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
@@ -7094,6 +7100,14 @@ update the private/shared state of the GPA using the corresponding
 KVM_SET_MEMORY_ATTRIBUTES ioctl. The 'ret' field is to be set to 0 by
 userspace on success, or some non-zero value on failure.
 
+For the KVM_USER_VMGEXIT_PSC type, the psc union type is used. The kernel
+will supply the GPA of the Page State Structure defined in the GHCB spec.
+Userspace will process this structure as defined by the GHCB, and issue
+KVM_SET_MEMORY_ATTRIBUTES ioctls to set the GPAs therein to the expected
+private/shared state. Userspace will return a value in 'ret' that is in
+agreement with the GHCB-defined return values that the guest will expect
+in the SW_EXITINFO2 field of the GHCB in response to these requests.
+
 6. Capabilities that can be enabled on vCPUs
 ============================================
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e982468554cb..96e24a1e34e3 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3266,6 +3266,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 	case SVM_VMGEXIT_AP_JUMP_TABLE:
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 	case SVM_VMGEXIT_HV_FEATURES:
+	case SVM_VMGEXIT_PSC:
 		break;
 	default:
 		reason = GHCB_ERR_INVALID_EVENT;
@@ -3484,6 +3485,15 @@ static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
 	return 0; /* forward request to userspace */
 }
 
+static int snp_complete_psc(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, vcpu->run->vmgexit.psc.ret);
+
+	return 1; /* resume guest */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3721,6 +3731,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 
 		ret = 1;
 		break;
+	case SVM_VMGEXIT_PSC:
+		vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+		vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC;
+		vcpu->run->vmgexit.psc.shared_gpa = svm->sev_es.sw_scratch;
+		vcpu->arch.complete_userspace_io = snp_complete_psc;
+		break;
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 54b81e46a9fa..e33c48bfbd67 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -137,6 +137,7 @@ struct kvm_xen_exit {
 
 struct kvm_user_vmgexit {
 #define KVM_USER_VMGEXIT_PSC_MSR	1
+#define KVM_USER_VMGEXIT_PSC		2
 	__u32 type; /* KVM_USER_VMGEXIT_* type */
 	union {
 		struct {
@@ -146,6 +147,10 @@ struct kvm_user_vmgexit {
 			__u8 op;
 			__u32 ret;
 		} psc_msr;
+		struct {
+			__u64 shared_gpa;
+			__u64 ret;
+		} psc;
 	};
 };
 
-- 
2.25.1


^ permalink raw reply related	[relevance 4%]

* [PATCH v13 13/26] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT
    2024-04-18 19:41  9% ` [PATCH v13 09/26] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
@ 2024-04-18 19:41  3% ` Michael Roth
  2024-04-18 19:41  4% ` [PATCH v13 14/26] KVM: SEV: Add support to handle " Michael Roth
  2024-04-18 19:41  2% ` [PATCH v13 15/26] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  3 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-04-18 19:41 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change MSR protocol
as defined in the GHCB specification.

When using gmem, private/shared memory is allocated through separate
pools, and KVM relies on userspace issuing a KVM_SET_MEMORY_ATTRIBUTES
KVM ioctl to tell the KVM MMU whether or not a particular GFN should be
backed by private memory.

Forward these page state change requests to userspace so that it can
issue the expected KVM ioctls. The KVM MMU will handle updating the RMP
entries when it is ready to map a private page into a guest.

Define a new KVM_EXIT_VMGEXIT for exits of this type, and structure it
so that it can be extended for other cases where VMGEXITs need some
level of handling in userspace.

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
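(Illustrative only, not part of the patch: the userspace side of this exit
can be serviced with a single KVM_SET_MEMORY_ATTRIBUTES call, since the MSR
protocol covers one 4KiB page per request. The standard <sys/ioctl.h> and
<linux/kvm.h> headers, with this series' uapi changes, are assumed.)

    static void handle_vmgexit_psc_msr(int vm_fd, struct kvm_run *run)
    {
            struct kvm_memory_attributes attrs = {
                    .address    = run->vmgexit.psc_msr.gpa,
                    .size       = 4096,
                    .attributes = run->vmgexit.psc_msr.op ==
                                  KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE ?
                                  KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
            };

            /* 0 tells the guest the request succeeded, non-zero that it failed. */
            run->vmgexit.psc_msr.ret =
                    !!ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
    }
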
 Documentation/virt/kvm/api.rst    | 33 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/sev-common.h |  6 ++++++
 arch/x86/kvm/svm/sev.c            | 33 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h          | 17 ++++++++++++++++
 4 files changed, 89 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..4a7a2945bc78 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7060,6 +7060,39 @@ Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
+::
+
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit {
+		#define KVM_USER_VMGEXIT_PSC_MSR	1
+			__u32 type; /* KVM_USER_VMGEXIT_* type */
+			union {
+				struct {
+					__u64 gpa;
+		#define KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE	1
+		#define KVM_USER_VMGEXIT_PSC_MSR_OP_SHARED	2
+					__u8 op;
+					__u32 ret;
+				} psc_msr;
+			};
+		};
+
+If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
+has issued a VMGEXIT instruction (as documented by the AMD Architecture
+Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
+userspace. These are generally handled by the host kernel, but in some
+cases some aspects of handling a VMGEXIT are handled by userspace.
+
+A kvm_user_vmgexit structure is defined to encapsulate the data to be
+sent to or returned by userspace. The type field defines the specific type
+of exit that needs to be serviced, and that type is used as a discriminator
+to determine which union type should be used for input/output.
+
+For the KVM_USER_VMGEXIT_PSC_MSR type, the psc_msr union type is used. The
+kernel will supply the 'gpa' and 'op' fields, and userspace is expected to
+update the private/shared state of the GPA using the corresponding
+KVM_SET_MEMORY_ATTRIBUTES ioctl. The 'ret' field is to be set to 0 by
+userspace on success, or some non-zero value on failure.
 
 6. Capabilities that can be enabled on vCPUs
 ============================================
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 1006bfffe07a..6d68db812de1 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -101,11 +101,17 @@ enum psc_op {
 	/* GHCBData[11:0] */				\
 	GHCB_MSR_PSC_REQ)
 
+#define GHCB_MSR_PSC_REQ_TO_GFN(msr) (((msr) & GENMASK_ULL(51, 12)) >> 12)
+#define GHCB_MSR_PSC_REQ_TO_OP(msr) (((msr) & GENMASK_ULL(55, 52)) >> 52)
+
 #define GHCB_MSR_PSC_RESP		0x015
 #define GHCB_MSR_PSC_RESP_VAL(val)			\
 	/* GHCBData[63:32] */				\
 	(((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
 
+/* Set highest bit as a generic error response */
+#define GHCB_MSR_PSC_RESP_ERROR (BIT_ULL(63) | GHCB_MSR_PSC_RESP)
+
 /* GHCB Hypervisor Feature Request/Response */
 #define GHCB_MSR_HV_FT_REQ		0x080
 #define GHCB_MSR_HV_FT_RESP		0x081
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index bd7f46c61c64..e982468554cb 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3454,6 +3454,36 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	u64 vmm_ret = vcpu->run->vmgexit.psc_msr.ret;
+
+	set_ghcb_msr(svm, (vmm_ret << 32) | GHCB_MSR_PSC_RESP);
+
+	return 1; /* resume guest */
+}
+
+static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
+{
+	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
+	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
+		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
+		return 1; /* resume guest */
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
+	vcpu->run->vmgexit.psc_msr.gpa = gpa;
+	vcpu->run->vmgexit.psc_msr.op = op;
+	vcpu->arch.complete_userspace_io = snp_complete_psc_msr;
+
+	return 0; /* forward request to userspace */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3552,6 +3582,9 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				  GHCB_MSR_INFO_POS);
 		break;
 	}
+	case GHCB_MSR_PSC_REQ:
+		ret = snp_begin_psc_msr(vcpu, control->ghcb_gpa);
+		break;
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..54b81e46a9fa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -135,6 +135,20 @@ struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_user_vmgexit {
+#define KVM_USER_VMGEXIT_PSC_MSR	1
+	__u32 type; /* KVM_USER_VMGEXIT_* type */
+	union {
+		struct {
+			__u64 gpa;
+#define KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE	1
+#define KVM_USER_VMGEXIT_PSC_MSR_OP_SHARED	2
+			__u8 op;
+			__u32 ret;
+		} psc_msr;
+	};
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -178,6 +192,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_VMGEXIT          40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -433,6 +448,8 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit vmgexit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1


^ permalink raw reply related	[relevance 3%]

* [PATCH v13 09/26] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
  @ 2024-04-18 19:41  9% ` Michael Roth
  2024-04-18 19:41  3% ` [PATCH v13 13/26] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-04-18 19:41 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. Other commands can then be used to
load/encrypt data into the guest's initial launch image.

For more information see the SEV-SNP specification.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
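(For reference only, not part of the patch: like the other SEV commands, this
one is issued through KVM_MEMORY_ENCRYPT_OP. A minimal userspace sketch,
assuming the VM was already initialized as an SNP guest via KVM_SEV_INIT2 and
that sev_fd is an open handle to /dev/sev, might look like the following; the
policy value and the wrapper name are illustrative.)

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int vmm_snp_launch_start(int vm_fd, int sev_fd)
    {
            struct kvm_sev_snp_launch_start start = {
                    /* Bit 17 is reserved-must-be-one, bit 16 permits SMT. */
                    .policy = (1ULL << 17) | (1ULL << 16),
            };
            struct kvm_sev_cmd cmd = {
                    .id     = KVM_SEV_SNP_LAUNCH_START,
                    .data   = (unsigned long)&start,
                    .sev_fd = sev_fd,
            };

            /* On failure, cmd.error holds the firmware error code. */
            return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
    }
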
 .../virt/kvm/x86/amd-memory-encryption.rst    |  23 +-
 arch/x86/include/uapi/asm/kvm.h               |   8 +
 arch/x86/kvm/svm/sev.c                        | 208 +++++++++++++++++-
 arch/x86/kvm/svm/svm.h                        |   1 +
 4 files changed, 236 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 3381556d596d..1b042f827eab 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -459,6 +459,25 @@ issued by the hypervisor to make the guest ready for execution.
 
 Returns: 0 on success, -negative on error
 
+18. KVM_SEV_SNP_LAUNCH_START
+----------------------------
+
+The KVM_SEV_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest.
+
+Parameters (in): struct  kvm_sev_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_sev_snp_launch_start {
+                __u64 policy;           /* Guest policy to use. */
+                __u8 gosvw[16];         /* Guest OS visible workarounds. */
+        };
+
+See the SEV-SNP spec [snp-fw-abi]_ for further detail on the launch input.
+
 Device attribute API
 ====================
 
@@ -490,9 +509,11 @@ References
 ==========
 
 
-See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
+See [white-paper]_, [api-spec]_, [amd-apm]_, [kvm-forum]_, and [snp-fw-abi]_
+for more info.
 
 .. [white-paper] https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
 .. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
 .. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
 .. [kvm-forum]  https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+.. [snp-fw-abi] https://www.amd.com/system/files/TechDocs/56860.pdf
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9a8b81d20314..bdf8c5461a36 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -697,6 +697,9 @@ enum sev_cmd_id {
 	/* Second time is the charm; improved versions of the above ioctls.  */
 	KVM_SEV_INIT2,
 
+	/* SNP-specific commands */
+	KVM_SEV_SNP_LAUNCH_START = 100,
+
 	KVM_SEV_NR_MAX,
 };
 
@@ -822,6 +825,11 @@ struct kvm_sev_receive_update_data {
 	__u32 pad2;
 };
 
+struct kvm_sev_snp_launch_start {
+	__u64 policy;
+	__u8 gosvw[16];
+};
+
 #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
 #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c41cc73a1efe..4c5abc0e7806 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -25,6 +25,7 @@
 #include <asm/fpu/xcr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/debugreg.h>
+#include <asm/sev.h>
 
 #include "mmu.h"
 #include "x86.h"
@@ -58,6 +59,25 @@ static u64 sev_supported_vmsa_features;
 #define AP_RESET_HOLD_NAE_EVENT		1
 #define AP_RESET_HOLD_MSR_PROTO		2
 
+/* As defined by SEV-SNP Firmware ABI, under "Guest Policy". */
+#define SNP_POLICY_MASK_SMT		BIT_ULL(16)
+#define SNP_POLICY_MASK_RSVD_MBO	BIT_ULL(17)
+#define SNP_POLICY_MASK_DEBUG		BIT_ULL(19)
+#define SNP_POLICY_MASK_SINGLE_SOCKET	BIT_ULL(20)
+#define SNP_POLICY_MASK_API_MAJOR	GENMASK_ULL(15, 8)
+#define SNP_POLICY_MASK_API_MINOR	GENMASK_ULL(7, 0)
+
+#define SNP_POLICY_MASK_VALID		(SNP_POLICY_MASK_SMT		| \
+					 SNP_POLICY_MASK_RSVD_MBO	| \
+					 SNP_POLICY_MASK_DEBUG		| \
+					 SNP_POLICY_MASK_SINGLE_SOCKET	| \
+					 SNP_POLICY_MASK_API_MAJOR	| \
+					 SNP_POLICY_MASK_API_MINOR)
+
+/* KVM's SNP support is compatible with 1.51 of the SEV-SNP Firmware ABI. */
+#define SNP_POLICY_API_MAJOR		1
+#define SNP_POLICY_API_MINOR		51
+
 static u8 sev_enc_bit;
 static DECLARE_RWSEM(sev_deactivate_lock);
 static DEFINE_MUTEX(sev_bitmap_lock);
@@ -68,6 +88,8 @@ static unsigned int nr_asids;
 static unsigned long *sev_asid_bitmap;
 static unsigned long *sev_reclaim_asid_bitmap;
 
+static int snp_decommission_context(struct kvm *kvm);
+
 struct enc_region {
 	struct list_head list;
 	unsigned long npages;
@@ -94,12 +116,17 @@ static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid)
 	down_write(&sev_deactivate_lock);
 
 	wbinvd_on_all_cpus();
-	ret = sev_guest_df_flush(&error);
+
+	if (sev_snp_enabled)
+		ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
+	else
+		ret = sev_guest_df_flush(&error);
 
 	up_write(&sev_deactivate_lock);
 
 	if (ret)
-		pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
+		pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+		       sev_snp_enabled ? "-SNP" : "", ret, error);
 
 	return ret;
 }
@@ -1976,6 +2003,134 @@ int sev_dev_get_attr(u32 group, u64 attr, u64 *val)
 	}
 }
 
+/*
+ * The guest context contains all the information, keys and metadata
+ * associated with the guest that the firmware tracks to implement SEV
+ * and SNP features. The firmware stores the guest context in a
+ * hypervisor-provided page via the SNP_GCTX_CREATE command.
+ */
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct sev_data_snp_addr data = {};
+	void *context;
+	int rc;
+
+	/* Allocate memory for context page */
+	context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+	if (!context)
+		return NULL;
+
+	data.address = __psp_pa(context);
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+	if (rc) {
+		pr_warn("Failed to create SEV-SNP context, rc %d fw_error %d",
+			rc, argp->error);
+		snp_free_firmware_page(context);
+		return NULL;
+	}
+
+	return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_activate data = {0};
+
+	data.gctx_paddr = __psp_pa(sev->snp_context);
+	data.asid = sev_get_asid(kvm);
+	return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static inline bool sev_version_greater_or_equal(u8 major, u8 minor)
+{
+	if (major < SNP_POLICY_API_MAJOR)
+		return true;
+
+	if (major == SNP_POLICY_API_MAJOR && minor <= SNP_POLICY_API_MINOR)
+		return true;
+
+	return false;
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_start start = {0};
+	struct kvm_sev_snp_launch_start params;
+	u8 major, minor;
+	int rc;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+		return -EFAULT;
+
+	/* Don't allow userspace to allocate memory for more than 1 SNP context. */
+	if (sev->snp_context) {
+		pr_debug("SEV-SNP context already exists. Refusing to allocate an additional one.\n");
+		return -EINVAL;
+	}
+
+	sev->snp_context = snp_context_create(kvm, argp);
+	if (!sev->snp_context)
+		return -ENOTTY;
+
+	if (params.policy & ~SNP_POLICY_MASK_VALID) {
+		pr_debug("SEV-SNP hypervisor does not support requested policy %llx (supported %llx).\n",
+			 params.policy, SNP_POLICY_MASK_VALID);
+		return -EINVAL;
+	}
+
+	if (!(params.policy & SNP_POLICY_MASK_RSVD_MBO)) {
+		pr_debug("SEV-SNP hypervisor does not support requested policy %llx (must be set %llx).\n",
+			 params.policy, SNP_POLICY_MASK_RSVD_MBO);
+		return -EINVAL;
+	}
+
+	if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET) {
+		pr_debug("SEV-SNP hypervisor does not support limiting guests to a single socket.\n");
+		return -EINVAL;
+	}
+
+	if (!(params.policy & SNP_POLICY_MASK_SMT)) {
+		pr_debug("SEV-SNP hypervisor does not support limiting guests to a single SMT thread.\n");
+		return -EINVAL;
+	}
+
+	major = (params.policy & SNP_POLICY_MASK_API_MAJOR);
+	minor = (params.policy & SNP_POLICY_MASK_API_MINOR);
+	if (!sev_version_greater_or_equal(major, minor)) {
+		pr_debug("SEV-SNP hypervisor does not support requested version %d.%d (have %d,%d).\n",
+			 major, minor, SNP_POLICY_API_MAJOR, SNP_POLICY_API_MINOR);
+		return -EINVAL;
+	}
+
+	start.gctx_paddr = __psp_pa(sev->snp_context);
+	start.policy = params.policy;
+	memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+	if (rc) {
+		pr_debug("SEV_CMD_SNP_LAUNCH_START firmware command failed, rc %d\n", rc);
+		goto e_free_context;
+	}
+
+	sev->fd = argp->sev_fd;
+	rc = snp_bind_asid(kvm, &argp->error);
+	if (rc) {
+		pr_debug("Failed to bind ASID to SEV-SNP context, rc %d\n", rc);
+		goto e_free_context;
+	}
+
+	return 0;
+
+e_free_context:
+	snp_decommission_context(kvm);
+
+	return rc;
+}
+
 int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -1999,6 +2154,15 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 		goto out;
 	}
 
+	/*
+	 * Once KVM_SEV_INIT2 initializes a KVM instance as an SNP guest, only
+	 * allow the use of SNP-specific commands.
+	 */
+	if (sev_snp_guest(kvm) && sev_cmd.id < KVM_SEV_SNP_LAUNCH_START) {
+		r = -EPERM;
+		goto out;
+	}
+
 	switch (sev_cmd.id) {
 	case KVM_SEV_ES_INIT:
 		if (!sev_es_enabled) {
@@ -2063,6 +2227,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_RECEIVE_FINISH:
 		r = sev_receive_finish(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_START:
+		r = snp_launch_start(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -2258,6 +2425,33 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
 	return ret;
 }
 
+static int snp_decommission_context(struct kvm *kvm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_addr data = {};
+	int ret;
+
+	/* If context is not created then do nothing */
+	if (!sev->snp_context)
+		return 0;
+
+	data.address = __sme_pa(sev->snp_context);
+	down_write(&sev_deactivate_lock);
+	ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
+	if (WARN_ONCE(ret, "failed to release guest context")) {
+		up_write(&sev_deactivate_lock);
+		return ret;
+	}
+
+	up_write(&sev_deactivate_lock);
+
+	/* free the context page now */
+	snp_free_firmware_page(sev->snp_context);
+	sev->snp_context = NULL;
+
+	return 0;
+}
+
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -2299,7 +2493,15 @@ void sev_vm_destroy(struct kvm *kvm)
 		}
 	}
 
-	sev_unbind_asid(kvm, sev->handle);
+	if (sev_snp_guest(kvm)) {
+		if (snp_decommission_context(kvm)) {
+			WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
+			return;
+		}
+	} else {
+		sev_unbind_asid(kvm, sev->handle);
+	}
+
 	sev_asid_free(sev);
 }
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7f2e9c7fc4ca..0654fc91d4db 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -92,6 +92,7 @@ struct kvm_sev_info {
 	struct list_head mirror_entry; /* Use as a list entry of mirrors */
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
 	atomic_t migration_in_progress;
+	void *snp_context;      /* SNP guest context page */
 };
 
 struct kvm_svm {
-- 
2.25.1


^ permalink raw reply related	[relevance 9%]

* Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
  2024-04-17 12:40  0%             ` Igor Mammedov
@ 2024-04-17 13:58  0%               ` boris.ostrovsky
  0 siblings, 0 replies; 200+ results
From: boris.ostrovsky @ 2024-04-17 13:58 UTC (permalink / raw)
  To: Igor Mammedov; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel



On 4/17/24 8:40 AM, Igor Mammedov wrote:
> On Tue, 16 Apr 2024 19:37:09 -0400
> boris.ostrovsky@oracle.com wrote:
> 
>> On 4/16/24 7:17 PM, Sean Christopherson wrote:
>>> On Tue, Apr 16, 2024, boris.ostrovsky@oracle.com wrote:
>>>> (Sorry, need to resend)
>>>>
>>>> On 4/16/24 6:03 PM, Paolo Bonzini wrote:
>>>>> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
>>>>>> On 4/16/24 4:53 PM, Paolo Bonzini wrote:
>>>>>>> On 4/16/24 22:47, Boris Ostrovsky wrote:
>>>>>>>> Keeping the SIPI pending avoids this scenario.
>>>>>>>
>>>>>>> This is incorrect - it's yet another ugly legacy facet of x86, but we
>>>>>>> have to live with it.  SIPI is discarded because the code is supposed
>>>>>>> to retry it if needed ("INIT-SIPI-SIPI").
>>>>>>
>>>>>> I couldn't find in the SDM/APM a definitive statement about whether SIPI
>>>>>> is supposed to be dropped.
>>>>>
>>>>> I think the manual is pretty consistent that SIPIs are never latched,
>>>>> they're only ever used in wait-for-SIPI state.
>>>>>   
>>>>>>> The sender should set a flag as early as possible in the SIPI code so
>>>>>>> that it's clear that it was not received; and an extra SIPI is not a
>>>>>>> problem, it will be ignored anyway and will not cause trouble if
>>>>>>> there's a race.
>>>>>>>
>>>>>>> What is the reproducer for this?
>>>>>>
>>>>>> Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
>>>>>> the guest, will get you there in 10-15 minutes.
>>>>>>
>>>>>> Typically (although I think not always) this is happening when OVMF is
>>>>>> trying to rendezvous and a processor is missing and is sent an extra SMI.
>>>>>
>>>>> Can you go into more detail? I wasn't even aware that OVMF's SMM
>>>>> supported hotplug - on real hardware I think there's extra work from
>>>>> the BMC to coordinate all SMIs across both existing and hotplugged
>>>>> packages(*)
>>>>
>>>>
>>>> It's been supported by OVMF for a couple of years (in fact, IIRC you were
>>>> part of at least initial conversations about this, at least for the unplug
>>>> part).
>>>>
>>>> During hotplug QEMU gathers all cpus in OVMF from (I think)
>>>> ich9_apm_ctrl_changed() and they are all waited for in
>>>> SmmCpuRendezvous()->SmmWaitForApArrival(). Occasionally it may so happen
>>>> that the SMI from QEMU is not delivered to a processor that was *just*
>>>> successfully hotplugged and so it is pinged again (https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304).
>>>>
>>>>
>>>> At the same time this processor is now being brought up by kernel and is
>>>> being sent INIT-SIPI-SIPI. If these (or at least the SIPIs) arrive after the
>>>> SMI reaches the processor then that processor is not going to have a good
>>>> day.
> 
> Do you use qemu/firmware combo that negotiated ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT/
> ICH9_LPC_SMI_F_CPU_HOT_UNPLUG_BIT features?

Yes.

> 
>>>
>>> It's specifically SIPI that's problematic.  INIT is blocked by SMM, but latched,
>>> and SMIs are blocked by WFS, but latched.  And AFAICT, KVM emulates all of those
>>> combinations correctly.
>>>
>>> Why is the SMI from QEMU not delivered?  That seems like the smoking gun.
>>
>> I haven't actually traced this but it seems that what happens is that
>> the newly-added processor is about to leave SMM and the count of in-SMM
>> processors is decremented. At the same time, since the processor is
>> still in SMM the QEMU's SMM is not taken.
>>
>> And so when the count is looked at again in SmmWaitForApArrival() one
>> processor is missing.
> 
> The current QEMU CPU hotplug workflow with SMM enabled should be the following:
> 
>    1. OSPM gets list(N) of hotplugged cpus
>    2. OSPM hands over control to firmware (SMM callback leading to SMI broadcast)
>    3. Firmware at this point shall initialize all new CPUs (incl. relocating SMBASE for new ones)
>       it shall pull in all CPUs that are present at the moment
>    4. Firmware returns control to OSPM
>    5. OSPM sends Notify to the list(N) CPUs triggering INIT-SIPI-SIPI _only_ on
>       those CPUs that it collected in step 1
> 
> above steps will repeat until all hotplugged CPUs are handled.
> 
> In a nutshell, INIT-SIPI-SIPI shall not be sent to a freshly hotplugged CPU
> that OSPM hasn't seen at (1) yet, _and_ firmware should have initialized it at (3).
> 
> The CPUs enumerated at (3) shall at least include the CPUs present at (1)
> and may include newer CPUs that arrived in between (1-3).
> 
> The CPUs collected at (1) shall all enter SMM; if that doesn't happen,
> then the hotplug workflow won't work as expected.
> In that case we need to figure out why the SMI is not delivered
> or why the firmware isn't waiting for the hotplugged CPU.

I noticed that I was using a few months old qemu bits and now I am 
having trouble reproducing this on latest bits. Let me see if I can get 
this to fail with latest first and then try to trace why the processor 
is in this unexpected state.

-boris

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
  2024-04-16 23:37  0%           ` boris.ostrovsky
@ 2024-04-17 12:40  0%             ` Igor Mammedov
  2024-04-17 13:58  0%               ` boris.ostrovsky
  0 siblings, 1 reply; 200+ results
From: Igor Mammedov @ 2024-04-17 12:40 UTC (permalink / raw)
  To: boris.ostrovsky; +Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel

On Tue, 16 Apr 2024 19:37:09 -0400
boris.ostrovsky@oracle.com wrote:

> On 4/16/24 7:17 PM, Sean Christopherson wrote:
> > On Tue, Apr 16, 2024, boris.ostrovsky@oracle.com wrote:  
> >> (Sorry, need to resend)
> >>
> >> On 4/16/24 6:03 PM, Paolo Bonzini wrote:  
> >>> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:  
> >>>> On 4/16/24 4:53 PM, Paolo Bonzini wrote:  
> >>>>> On 4/16/24 22:47, Boris Ostrovsky wrote:  
> >>>>>> Keeping the SIPI pending avoids this scenario.  
> >>>>>
> >>>>> This is incorrect - it's yet another ugly legacy facet of x86, but we
> >>>>> have to live with it.  SIPI is discarded because the code is supposed
> >>>>> to retry it if needed ("INIT-SIPI-SIPI").  
> >>>>
> >>>> I couldn't find in the SDM/APM a definitive statement about whether SIPI
> >>>> is supposed to be dropped.  
> >>>
> >>> I think the manual is pretty consistent that SIPIs are never latched,
> >>> they're only ever used in wait-for-SIPI state.
> >>>  
> >>>>> The sender should set a flag as early as possible in the SIPI code so
> >>>>> that it's clear that it was not received; and an extra SIPI is not a
> >>>>> problem, it will be ignored anyway and will not cause trouble if
> >>>>> there's a race.
> >>>>>
> >>>>> What is the reproducer for this?  
> >>>>
> >>>> Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
> >>>> the guest, will get you there in 10-15 minutes.
> >>>>
> >>>> Typically (although I think not always) this is happening when OVMF is
> >>>> trying to rendezvous and a processor is missing and is sent an extra SMI.  
> >>>
> >>> Can you go into more detail? I wasn't even aware that OVMF's SMM
> >>> supported hotplug - on real hardware I think there's extra work from
> >>> the BMC to coordinate all SMIs across both existing and hotplugged
> >>> packages(*)  
> >>
> >>
> >> It's been supported by OVMF for a couple of years (in fact, IIRC you were
> >> part of at least initial conversations about this, at least for the unplug
> >> part).
> >>
> >> During hotplug QEMU gathers all cpus in OVMF from (I think)
> >> ich9_apm_ctrl_changed() and they are all waited for in
> >> SmmCpuRendezvous()->SmmWaitForApArrival(). Occasionally it may so happen
> >> that the SMI from QEMU is not delivered to a processor that was *just*
> >> successfully hotplugged and so it is pinged again (https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304).
> >>
> >>
> >> At the same time this processor is now being brought up by kernel and is
> >> being sent INIT-SIPI-SIPI. If these (or at least the SIPIs) arrive after the
> >> SMI reaches the processor then that processor is not going to have a good
> >> day. 

Do you use qemu/firmware combo that negotiated ICH9_LPC_SMI_F_CPU_HOTPLUG_BIT/
ICH9_LPC_SMI_F_CPU_HOT_UNPLUG_BIT features?

> > 
> > It's specifically SIPI that's problematic.  INIT is blocked by SMM, but latched,
> > and SMIs are blocked by WFS, but latched.  And AFAICT, KVM emulates all of those
> > combinations correctly.
> > 
> > Why is the SMI from QEMU not delivered?  That seems like the smoking gun.  
> 
> I haven't actually traced this but it seems that what happens is that
> the newly-added processor is about to leave SMM and the count of in-SMM 
> processors is decremented. At the same time, since the processor is 
> still in SMM the QEMU's SMM is not taken.
> 
> And so when the count is looked at again in SmmWaitForApArrival() one 
> processor is missing.

The current QEMU CPU hotplug workflow with SMM enabled should be the following:

  1. OSPM gets list(N) of hotplugged cpus 
  2. OSPM hands over control to firmware (SMM callback leading to SMI broadcast)
  3. Firmware at this point shall initialize all new CPUs (incl. relocating SMBASE for new ones)
     it shall pull in all CPUs that are present at the moment
  4. Firmware returns control to OSPM
  5. OSPM sends Notify to the list(N) CPUs triggering INIT-SIPI-SIPI _only_ on
     those CPUs that it collected in step 1

above steps will repeat until all hotplugged CPUs are handled.

In a nutshell, INIT-SIPI-SIPI shall not be sent to a freshly hotplugged CPU
that OSPM hasn't seen at (1) yet, _and_ firmware should have initialized it at (3).

The CPUs enumerated at (3) shall at least include the CPUs present at (1)
and may include newer CPUs that arrived in between (1-3).

The CPUs collected at (1) shall all enter SMM; if that doesn't happen,
then the hotplug workflow won't work as expected.
In that case we need to figure out why the SMI is not delivered
or why the firmware isn't waiting for the hotplugged CPU.

> 
> -boris
> 


^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
  2024-04-16 23:17  0%         ` Sean Christopherson
@ 2024-04-16 23:37  0%           ` boris.ostrovsky
  2024-04-17 12:40  0%             ` Igor Mammedov
  0 siblings, 1 reply; 200+ results
From: boris.ostrovsky @ 2024-04-16 23:37 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel



On 4/16/24 7:17 PM, Sean Christopherson wrote:
> On Tue, Apr 16, 2024, boris.ostrovsky@oracle.com wrote:
>> (Sorry, need to resend)
>>
>> On 4/16/24 6:03 PM, Paolo Bonzini wrote:
>>> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
>>>> On 4/16/24 4:53 PM, Paolo Bonzini wrote:
>>>>> On 4/16/24 22:47, Boris Ostrovsky wrote:
>>>>>> Keeping the SIPI pending avoids this scenario.
>>>>>
>>>>> This is incorrect - it's yet another ugly legacy facet of x86, but we
>>>>> have to live with it.  SIPI is discarded because the code is supposed
>>>>> to retry it if needed ("INIT-SIPI-SIPI").
>>>>
>>>> I couldn't find in the SDM/APM a definitive statement about whether SIPI
>>>> is supposed to be dropped.
>>>
>>> I think the manual is pretty consistent that SIPIs are never latched,
>>> they're only ever used in wait-for-SIPI state.
>>>
>>>>> The sender should set a flag as early as possible in the SIPI code so
>>>>> that it's clear that it was not received; and an extra SIPI is not a
>>>>> problem, it will be ignored anyway and will not cause trouble if
>>>>> there's a race.
>>>>>
>>>>> What is the reproducer for this?
>>>>
>>>> Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
>>>> the guest, will get you there in 10-15 minutes.
>>>>
>>>> Typically (although I think not always) this is happening when OVMF is
>>>> trying to rendezvous and a processor is missing and is sent an extra SMI.
>>>
>>> Can you go into more detail? I wasn't even aware that OVMF's SMM
>>> supported hotplug - on real hardware I think there's extra work from
>>> the BMC to coordinate all SMIs across both existing and hotplugged
>>> packages(*)
>>
>>
>> It's been supported by OVMF for a couple of years (in fact, IIRC you were
>> part of at least initial conversations about this, at least for the unplug
>> part).
>>
>> During hotplug QEMU gathers all cpus in OVMF from (I think)
>> ich9_apm_ctrl_changed() and they are all waited for in
>> SmmCpuRendezvous()->SmmWaitForApArrival(). Occasionally it may so happen
>> that the SMI from QEMU is not delivered to a processor that was *just*
>> successfully hotplugged and so it is pinged again (https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304).
>>
>>
>> At the same time this processor is now being brought up by kernel and is
>> being sent INIT-SIPI-SIPI. If these (or at least the SIPIs) arrive after the
>> SMI reaches the processor then that processor is not going to have a good
>> day.
> 
> It's specifically SIPI that's problematic.  INIT is blocked by SMM, but latched,
> and SMIs are blocked by WFS, but latched.  And AFAICT, KVM emulates all of those
> combinations correctly.
> 
> Why is the SMI from QEMU not delivered?  That seems like the smoking gun.

I haven't actually traced this but it seems that what happens is that 
the newly-added processor is about to leave SMM and the count of in-SMM 
processors is decremented. At the same time, since the processor is 
still in SMM the QEMU's SMM is not taken.

And so when the count is looked at again in SmmWaitForApArrival() one 
processor is missing.


-boris

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
  2024-04-16 22:56  0%       ` boris.ostrovsky
@ 2024-04-16 23:17  0%         ` Sean Christopherson
  2024-04-16 23:37  0%           ` boris.ostrovsky
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2024-04-16 23:17 UTC (permalink / raw)
  To: boris.ostrovsky; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Apr 16, 2024, boris.ostrovsky@oracle.com wrote:
> (Sorry, need to resend)
> 
> On 4/16/24 6:03 PM, Paolo Bonzini wrote:
> > On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
> > > On 4/16/24 4:53 PM, Paolo Bonzini wrote:
> > > > On 4/16/24 22:47, Boris Ostrovsky wrote:
> > > > > Keeping the SIPI pending avoids this scenario.
> > > > 
> > > > This is incorrect - it's yet another ugly legacy facet of x86, but we
> > > > have to live with it.  SIPI is discarded because the code is supposed
> > > > to retry it if needed ("INIT-SIPI-SIPI").
> > > 
> > > I couldn't find in the SDM/APM a definitive statement about whether SIPI
> > > is supposed to be dropped.
> > 
> > I think the manual is pretty consistent that SIPIs are never latched,
> > they're only ever used in wait-for-SIPI state.
> > 
> > > > The sender should set a flag as early as possible in the SIPI code so
> > > > that it's clear that it was not received; and an extra SIPI is not a
> > > > problem, it will be ignored anyway and will not cause trouble if
> > > > there's a race.
> > > > 
> > > > What is the reproducer for this?
> > > 
> > > Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
> > > the guest, will get you there in 10-15 minutes.
> > > 
> > > Typically (although I think not always) this is happening when OVMF is
> > > trying to rendezvous and a processor is missing and is sent an extra SMI.
> > 
> > Can you go into more detail? I wasn't even aware that OVMF's SMM
> > supported hotplug - on real hardware I think there's extra work from
> > the BMC to coordinate all SMIs across both existing and hotplugged
> > packages(*)
> 
> 
> It's been supported by OVMF for a couple of years (in fact, IIRC you were
> part of at least initial conversations about this, at least for the unplug
> part).
> 
> During hotplug QEMU gathers all cpus in OVMF from (I think)
> ich9_apm_ctrl_changed() and they are all waited for in
> SmmCpuRendezvous()->SmmWaitForApArrival(). Occasionally it may so happen
> that the SMI from QEMU is not delivered to a processor that was *just*
> successfully hotplugged and so it is pinged again (https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304).
> 
> 
> At the same time this processor is now being brought up by kernel and is
> being sent INIT-SIPI-SIPI. If these (or at least the SIPIs) arrive after the
> SMI reaches the processor then that processor is not going to have a good
> day.

It's specifically SIPI that's problematic.  INIT is blocked by SMM, but latched,
and SMIs are blocked by WFS, but latched.  And AFAICT, KVM emulates all of those
combinations correctly.

Why is the SMI from QEMU not delivered?  That seems like the smoking gun.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
  2024-04-16 22:14  0%       ` Sean Christopherson
@ 2024-04-16 23:02  0%         ` boris.ostrovsky
  0 siblings, 0 replies; 200+ results
From: boris.ostrovsky @ 2024-04-16 23:02 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel



On 4/16/24 6:14 PM, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Paolo Bonzini wrote:
>> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
>>> On 4/16/24 4:53 PM, Paolo Bonzini wrote:
>>>> On 4/16/24 22:47, Boris Ostrovsky wrote:
>>>>> Keeping the SIPI pending avoids this scenario.
>>>>
>>>> This is incorrect - it's yet another ugly legacy facet of x86, but we
>>>> have to live with it.  SIPI is discarded because the code is supposed
>>>> to retry it if needed ("INIT-SIPI-SIPI").
>>>
>>> I couldn't find in the SDM/APM a definitive statement about whether SIPI
>>> is supposed to be dropped.
>>
>> I think the manual is pretty consistent that SIPIs are never latched,
>> they're only ever used in wait-for-SIPI state.
> 
> Ya, the "Interrupt Command Register (ICR)" section for "110 (Start-Up)" explicitly
> says it's software's responsibility to detect whether or not the SIPI was delivered,
> and to resend SIPI(s) if needed.
> 
>    IPIs sent with this delivery mode are not automatically retried if the source
>    APIC is unable to deliver it. It is up to the software to determine if the
>    SIPI was not successfully delivered and to reissue the SIPI if necessary.


Right, I saw that. I was hoping to see something about SIPI being 
dropped. IOW my question was what happens to a SIPI that was delivered 
to a processor in SMM, not what I should do if it wasn't.

-boris

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
  2024-04-16 22:03  0%     ` Paolo Bonzini
  2024-04-16 22:14  0%       ` Sean Christopherson
@ 2024-04-16 22:56  0%       ` boris.ostrovsky
  2024-04-16 23:17  0%         ` Sean Christopherson
  1 sibling, 1 reply; 200+ results
From: boris.ostrovsky @ 2024-04-16 22:56 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, seanjc, linux-kernel

(Sorry, need to resend)

On 4/16/24 6:03 PM, Paolo Bonzini wrote:
> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
>> On 4/16/24 4:53 PM, Paolo Bonzini wrote:
>>> On 4/16/24 22:47, Boris Ostrovsky wrote:
>>>> Keeping the SIPI pending avoids this scenario.
>>>
>>> This is incorrect - it's yet another ugly legacy facet of x86, but we
>>> have to live with it.  SIPI is discarded because the code is supposed
>>> to retry it if needed ("INIT-SIPI-SIPI").
>>
>> I couldn't find in the SDM/APM a definitive statement about whether SIPI
>> is supposed to be dropped.
> 
> I think the manual is pretty consistent that SIPIs are never latched,
> they're only ever used in wait-for-SIPI state.
> 
>>> The sender should set a flag as early as possible in the SIPI code so
>>> that it's clear that it was not received; and an extra SIPI is not a
>>> problem, it will be ignored anyway and will not cause trouble if
>>> there's a race.
>>>
>>> What is the reproducer for this?
>>
>> Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
>> the guest, will get you there in 10-15 minutes.
>>
>> Typically (although I think not always) this is happening when OVMF if
>> trying to rendezvous and a processor is missing and is sent an extra SMI.
> 
> Can you go into more detail? I wasn't even aware that OVMF's SMM
> supported hotplug - on real hardware I think there's extra work from
> the BMC to coordinate all SMIs across both existing and hotplugged
> packages(*)


It's been supported by OVMF for a couple of years (in fact, IIRC you 
were part of at least initial conversations about this, at least for the 
unplug part).

During hotplug QEMU gathers all cpus in OVMF from (I think) 
ich9_apm_ctrl_changed() and they are all waited for in 
SmmCpuRendezvous()->SmmWaitForApArrival(). Occasionally it may so happen 
that the SMI from QEMU is not delivered to a processor that was *just* 
successfully hotplugged and so it is pinged again 
(https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c#L304). 


At the same time this processor is now being brought up by kernel and is 
being sent INIT-SIPI-SIPI. If these (or at least the SIPIs) arrive after 
the SMI reaches the processor then that processor is not going to have a 
good day.


> 
> What should happen is that SMIs are blocked on the new CPUs, so that
> only existing CPUs answer. These restore the 0x30000 segment to
> prepare for the SMI on the new CPUs, and send an INIT-SIPI to start
> the SMI on the new CPUs. Does OVMF do anything like that?
You mean this: 
https://github.com/tianocore/edk2/blob/fcfdbe29874320e9f876baa7afebc3fca8f4a7df/OvmfPkg/CpuHotplugSmm/Smbase.c#L272 
?


-boris

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
  2024-04-16 22:03  0%     ` Paolo Bonzini
@ 2024-04-16 22:14  0%       ` Sean Christopherson
  2024-04-16 23:02  0%         ` boris.ostrovsky
  2024-04-16 22:56  0%       ` boris.ostrovsky
  1 sibling, 1 reply; 200+ results
From: Sean Christopherson @ 2024-04-16 22:14 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: boris.ostrovsky, kvm, linux-kernel

On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
> > On 4/16/24 4:53 PM, Paolo Bonzini wrote:
> > > On 4/16/24 22:47, Boris Ostrovsky wrote:
> > >> Keeping the SIPI pending avoids this scenario.
> > >
> > > This is incorrect - it's yet another ugly legacy facet of x86, but we
> > > have to live with it.  SIPI is discarded because the code is supposed
> > > to retry it if needed ("INIT-SIPI-SIPI").
> >
> > I couldn't find in the SDM/APM a definitive statement about whether SIPI
> > is supposed to be dropped.
> 
> I think the manual is pretty consistent that SIPIs are never latched,
> they're only ever used in wait-for-SIPI state.

Ya, the "Interrupt Command Register (ICR)" section for "110 (Start-Up)" explicitly
says it's software's responsibility to detect whether or not the SIPI was delivered,
and to resend SIPI(s) if needed.

  IPIs sent with this delivery mode are not automatically retried if the source
  APIC is unable to deliver it. It is up to the software to determine if the
  SIPI was not successfully delivered and to reissue the SIPI if necessary.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
  2024-04-16 20:57  4%   ` boris.ostrovsky
@ 2024-04-16 22:03  0%     ` Paolo Bonzini
  2024-04-16 22:14  0%       ` Sean Christopherson
  2024-04-16 22:56  0%       ` boris.ostrovsky
  0 siblings, 2 replies; 200+ results
From: Paolo Bonzini @ 2024-04-16 22:03 UTC (permalink / raw)
  To: boris.ostrovsky; +Cc: kvm, seanjc, linux-kernel

On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@oracle.com> wrote:
> On 4/16/24 4:53 PM, Paolo Bonzini wrote:
> > On 4/16/24 22:47, Boris Ostrovsky wrote:
> >> Keeping the SIPI pending avoids this scenario.
> >
> > This is incorrect - it's yet another ugly legacy facet of x86, but we
> > have to live with it.  SIPI is discarded because the code is supposed
> > to retry it if needed ("INIT-SIPI-SIPI").
>
> I couldn't find in the SDM/APM a definitive statement about whether SIPI
> is supposed to be dropped.

I think the manual is pretty consistent that SIPIs are never latched,
they're only ever used in wait-for-SIPI state.

> > The sender should set a flag as early as possible in the SIPI code so
> > that it's clear that it was not received; and an extra SIPI is not a
> > problem, it will be ignored anyway and will not cause trouble if
> > there's a race.
> >
> > What is the reproducer for this?
>
> Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
> the guest, will get you there in 10-15 minutes.
>
> Typically (although I think not always) this is happening when OVMF is
> trying to rendezvous and a processor is missing and is sent an extra SMI.

Can you go into more detail? I wasn't even aware that OVMF's SMM
supported hotplug - on real hardware I think there's extra work from
the BMC to coordinate all SMIs across both existing and hotplugged
packages(*)

What should happen is that SMIs are blocked on the new CPUs, so that
only existing CPUs answer. These restore the 0x30000 segment to
prepare for the SMI on the new CPUs, and send an INIT-SIPI to start
the SMI on the new CPUs. Does OVMF do anything like that?

Paolo


^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
  @ 2024-04-16 20:57  4%   ` boris.ostrovsky
  2024-04-16 22:03  0%     ` Paolo Bonzini
  0 siblings, 1 reply; 200+ results
From: boris.ostrovsky @ 2024-04-16 20:57 UTC (permalink / raw)
  To: Paolo Bonzini, kvm; +Cc: seanjc, linux-kernel


On 4/16/24 4:53 PM, Paolo Bonzini wrote:
> On 4/16/24 22:47, Boris Ostrovsky wrote:
>> When a processor is running in SMM and receives an INIT message, the
>> interrupt is left pending until SMM is exited. On the other hand, SIPI,
>> which typically follows INIT, is discarded. This presents a problem since
>> the sender has no way of knowing that its SIPI has been dropped, which
>> results in the processor failing to come up.
>>
>> Keeping the SIPI pending avoids this scenario.
>
> This is incorrect - it's yet another ugly legacy facet of x86, but we 
> have to live with it.  SIPI is discarded because the code is supposed 
> to retry it if needed ("INIT-SIPI-SIPI").


I couldn't find in the SDM/APM a definitive statement about whether SIPI 
is supposed to be dropped.


>
> The sender should set a flag as early as possible in the SIPI code so 
> that it's clear that it was not received; and an extra SIPI is not a 
> problem, it will be ignored anyway and will not cause trouble if 
> there's a race.
>
> What is the reproducer for this?
>

Hotplugging/unplugging cpus in a loop, especially if you oversubscribe 
the guest, will get you there in 10-15 minutes.

Typically (although I think not always) this is happening when OVMF is
trying to rendezvous and a processor is missing and is sent an extra SMI.


-boris


^ permalink raw reply	[relevance 4%]

* Re: [PATCH for-9.1 v1 0/3] Add SEV/SEV-ES machine compat options for KVM_SEV_INIT2
  2024-04-09 23:07  4% [PATCH for-9.1 v1 0/3] Add SEV/SEV-ES machine compat options for KVM_SEV_INIT2 Michael Roth
@ 2024-04-11 17:35  0% ` Paolo Bonzini
  0 siblings, 0 replies; 200+ results
From: Paolo Bonzini @ 2024-04-11 17:35 UTC (permalink / raw)
  To: Michael Roth
  Cc: qemu-devel, kvm, Tom Lendacky, Pankaj Gupta, Larry Dewey, Roy Hopkins

On Wed, Apr 10, 2024 at 1:08 AM Michael Roth <michael.roth@amd.com> wrote:
>
> These patches are also available at:
>
>   https://github.com/amdese/qemu/commits/sev-init-legacy-v1
>
> and are based on top of Paolo's qemu-coco-queue branch containing the
> following patches:

A more complete version of patch 2 was already on the list, so I
queued 1 and 3 to qemu-coco-queue.

Thanks!

Paolo

>
>   [PATCH for-9.1 00/26] x86, kvm: common confidential computing subset
>   https://lore.kernel.org/all/20240322181116.1228416-1-pbonzini@redhat.com/T/
>
> Overview
> --------
>
> With the following patches applied from qemu-coco-queue:
>
>   https://lore.kernel.org/all/20240319140000.1014247-1-pbonzini@redhat.com/
>
> QEMU version 9.1+ will begin automatically making use of the new
> KVM_SEV_INIT2 API for initializing SEV and SEV-ES (and eventually, SEV-SNP)
> guests versus the older KVM_SEV_INIT/KVM_SEV_ES_INIT interfaces.
>
> However, the older interfaces would silently avoid sync'ing FPU/XSAVE state
> set by QEMU to each vCPU's VMSA prior to encryption. With KVM_SEV_INIT2,
> this state will now be synced into the VMSA, resulting in measurement
> changes and, theoretically, behavioral changes, though the latter are
> unlikely to be seen in practice. The specific VMSA changes are documented
> in the section below for reference.
>
> This series implements machine compatibility options for SEV/SEV-ES so that
> only VMs created with QEMU 9.1+ will make use of KVM_SEV_INIT2 so that VMSA
> differences can be accounted for beforehand, and older machine types will
> continue using the older interfaces to avoid unexpected measurement
> changes.
>
> Specific VMSA changes
> ---------------------
>
> With KVM_SEV_INIT2, rather than 0, QEMU/KVM will instead begin setting the
> following fields in the VMSA before measurement/encryption:
>
>   VMSA byte offset [1032:1033] = 80 1f (MXCSR, Multimedia Control Status
>                                         Register)
>   VMSA byte offset [1040:1041] = 7f 03 (FCW, FPU/x86 Control Word)
>
> Setting FCW (FPU/x86 Control Word) to 0x37f is consistent with 11.5.7 of
> APM Volume 2. MXCSR reset state is not defined for XSAVE, but QEMU's 0x1f80
> value is consistent with machine reset state documented in APM Volume 2
> 4.2.2. As such, it is reasonable to begin including these in the VMSA
> measurement calculations.
>
> NOTE: section 11.5.7 also documents that FTW should be all 1's, whereas
>       QEMU currently sets all zeroes. Should that be changed as part of
>       this, or are there other reasons for setting 0?
>
> Thanks,
>
> Mike
>
> ----------------------------------------------------------------
> Michael Roth (3):
>       i386/sev: Add 'legacy-vm-type' parameter for SEV guest objects
>       hw/i386: Add 9.1 machine types for i440fx/q35
>       hw/i386/sev: Use legacy SEV VM types for older machine types
>
>  hw/i386/pc.c         |  5 +++++
>  hw/i386/pc_piix.c    | 13 ++++++++++++-
>  hw/i386/pc_q35.c     | 12 +++++++++++-
>  include/hw/i386/pc.h |  3 +++
>  qapi/qom.json        | 11 ++++++++++-
>  target/i386/sev.c    | 19 ++++++++++++++++++-
>  6 files changed, 59 insertions(+), 4 deletions(-)
>
>
>


^ permalink raw reply	[relevance 0%]

* [PATCH for-9.1 v1 0/3] Add SEV/SEV-ES machine compat options for KVM_SEV_INIT2
@ 2024-04-09 23:07  4% Michael Roth
  2024-04-11 17:35  0% ` Paolo Bonzini
  0 siblings, 1 reply; 200+ results
From: Michael Roth @ 2024-04-09 23:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, Paolo Bonzini, Tom Lendacky, Pankaj Gupta, Larry Dewey, Roy Hopkins

These patches are also available at:

  https://github.com/amdese/qemu/commits/sev-init-legacy-v1

and are based on top of Paolo's qemu-coco-queue branch containing the
following patches:

  [PATCH for-9.1 00/26] x86, kvm: common confidential computing subset
  https://lore.kernel.org/all/20240322181116.1228416-1-pbonzini@redhat.com/T/

Overview
--------

With the following patches applied from qemu-coco-queue:

  https://lore.kernel.org/all/20240319140000.1014247-1-pbonzini@redhat.com/

QEMU version 9.1+ will begin automatically making use of the new
KVM_SEV_INIT2 API for initializing SEV and SEV-ES (and eventually, SEV-SNP)
guests versus the older KVM_SEV_INIT/KVM_SEV_ES_INIT interfaces.

However, the older interfaces would silently avoid sync'ing FPU/XSAVE state
set by QEMU to each vCPU's VMSA prior to encryption. With KVM_SEV_INIT2,
this state will now be synced into the VMSA, resulting in measurement
changes and, theoretically, behavioral changes, though the latter are
unlikely to be seen in practice. The specific VMSA changes are documented
in the section below for reference.

This series implements machine compatibility options for SEV/SEV-ES so that
only VMs created with QEMU 9.1+ will make use of KVM_SEV_INIT2 so that VMSA
differences can be accounted for beforehand, and older machine types will
continue using the older interfaces to avoid unexpected measurement
changes.
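
As a rough sketch of the mechanism (not the series' actual code), a machine
compat entry along the following lines could pin pre-9.1 machine types to the
legacy path; the array name and "on" value are assumptions, while the
'legacy-vm-type' property name comes from patch 1:

  /* Conceptually, in hw/i386/pc.c: older machine types keep the legacy
   * KVM_SEV_INIT/KVM_SEV_ES_INIT behavior for "sev-guest" objects. */
  GlobalProperty pc_compat_9_0[] = {
      { "sev-guest", "legacy-vm-type", "on" },
  };
  const size_t pc_compat_9_0_len = G_N_ELEMENTS(pc_compat_9_0);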

Specific VMSA changes
---------------------

With KVM_SEV_INIT2, rather than 0, QEMU/KVM will instead begin setting the
following fields in the VMSA before measurement/encryption:

  VMSA byte offset [1032:1033] = 80 1f (MXCSR, Multimedia Control Status
                                        Register)
  VMSA byte offset [1040:1041] = 7f 03 (FCW, FPU/x86 Control Word)

Setting FCW (FPU/x86 Control Word) to 0x37f is consistent with 11.5.7 of
APM Volume 2. MXCSR reset state is not defined for XSAVE, but QEMU's 0x1f80
value is consistent with machine reset state documented in APM Volume 2
4.2.2. As such, it is reasonable to begin including these in the VMSA
measurement calculations.
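
Spelling out those little-endian byte pairs as register values:

  bytes [1032:1033] = 80 1f  ->  MXCSR = 0x1f80
  bytes [1040:1041] = 7f 03  ->  FCW   = 0x037f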

NOTE: section 11.5.7 also documents that FTW should be all 1's, whereas
      QEMU currently sets all zeroes. Should that be changed as part of
      this, or are there other reasons for setting 0?

Thanks,

Mike

----------------------------------------------------------------
Michael Roth (3):
      i386/sev: Add 'legacy-vm-type' parameter for SEV guest objects
      hw/i386: Add 9.1 machine types for i440fx/q35
      hw/i386/sev: Use legacy SEV VM types for older machine types

 hw/i386/pc.c         |  5 +++++
 hw/i386/pc_piix.c    | 13 ++++++++++++-
 hw/i386/pc_q35.c     | 12 +++++++++++-
 include/hw/i386/pc.h |  3 +++
 qapi/qom.json        | 11 ++++++++++-
 target/i386/sev.c    | 19 ++++++++++++++++++-
 6 files changed, 59 insertions(+), 4 deletions(-)




^ permalink raw reply	[relevance 4%]

* [PATCH v12 17/29] KVM: SEV: Add support to handle RMP nested page faults
  2024-03-29 22:58  2% ` [PATCH v12 17/29] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  2024-03-29 22:58  2%   ` Michael Roth
@ 2024-03-29 22:58  2%   ` Michael Roth
  1 sibling, 0 replies; 200+ results
From: Michael Roth @ 2024-03-29 22:58 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When the
hardware encounters an RMP check failure caused by a guest memory access,
it raises a #NPF. The error code contains additional information on the
access type. See APM Volume 2 for additional information.

When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/sev.h |   3 ++
 arch/x86/kvm/svm/sev.c     | 103 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c     |  21 ++++++--
 arch/x86/kvm/svm/svm.h     |   3 ++
 4 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 780182cda3ab..234a998e2d2d 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -91,6 +91,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 /* RMUPDATE detected 4K page and 2MB page overlap. */
 #define RMPUPDATE_FAIL_OVERLAP		4
 
+/* PSMASH failed due to concurrent access by another CPU */
+#define PSMASH_FAIL_INUSE		3
+
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
 #define RMP_PG_SIZE_2M			1
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c35ed9d91c89..a0a88471f9ab 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3397,6 +3397,13 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(kvm_pfn_t pfn)
+{
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	return psmash(pfn);
+}
+
 static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -3956,3 +3963,99 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return p;
 }
+
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = vcpu->kvm;
+	int order, rmp_level, ret;
+	bool assigned;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	gfn = gpa >> PAGE_SHIFT;
+
+	/*
+	 * The only time RMP faults occur for shared pages is when the guest is
+	 * triggering an RMP fault for an implicit page-state change from
+	 * shared->private. Implicit page-state changes are forwarded to
+	 * userspace via KVM_EXIT_MEMORY_FAULT events, however, so RMP faults
+	 * for shared pages should not end up here.
+	 */
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault for non-private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);
+	if (ret) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no backing page for private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+	if (ret || !assigned) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
+				    gpa, pfn, ret);
+		goto out;
+	}
+
+	/*
+	 * There are 2 cases where a PSMASH may be needed to resolve an #NPF
+	 * with PFERR_GUEST_RMP_BIT set:
+	 *
+	 * 1) RMPADJUST/PVALIDATE can trigger an #NPF with PFERR_GUEST_SIZEM
+	 *    bit set if the guest issues them with a smaller granularity than
+	 *    what is indicated by the page-size bit in the 2MB RMP entry for
+	 *    the PFN that backs the GPA.
+	 *
+	 * 2) Guest access via NPT can trigger an #NPF if the NPT mapping is
+	 *    smaller than what is indicated by the 2MB RMP entry for the PFN
+	 *    that backs the GPA.
+	 *
+	 * In both these cases, the corresponding 2M RMP entry needs to
+	 * be PSMASH'd to 512 4K RMP entries.  If the RMP entry is already
+	 * split into 4K RMP entries, then this is likely a spurious case which
+	 * can occur when there are concurrent accesses by the guest to a 2MB
+	 * GPA range that is backed by a 2MB-aligned PFN whose RMP entry is in
+	 * the process of being PSMASH'd into 4K entries. These cases should
+	 * resolve automatically on subsequent accesses, so just ignore them
+	 * here.
+	 */
+	if (rmp_level == PG_LEVEL_4K) {
+		pr_debug_ratelimited("%s: Spurious RMP fault for GPA 0x%llx, error_code 0x%llx",
+				     __func__, gpa, error_code);
+		goto out;
+	}
+
+	pr_debug_ratelimited("%s: Splitting 2M RMP entry for GPA 0x%llx, error_code 0x%llx",
+			     __func__, gpa, error_code);
+	ret = snp_rmptable_psmash(pfn);
+	if (ret && ret != PSMASH_FAIL_INUSE) {
+		/*
+		 * Look it up again. If it's 4K now then the PSMASH may have raced with
+		 * another process and the issue has already resolved itself.
+		 */
+		if (!snp_lookup_rmpentry(pfn, &assigned, &rmp_level) && assigned &&
+		    rmp_level == PG_LEVEL_4K) {
+			pr_debug_ratelimited("%s: PSMASH for GPA 0x%llx failed with ret %d due to potential race",
+					     __func__, gpa, ret);
+			goto out;
+		}
+		pr_err_ratelimited("SEV: Unable to split RMP entry for GPA 0x%llx PFN 0x%llx ret %d\n",
+				   gpa, pfn, ret);
+	}
+
+	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+out:
+	put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2c162f6a1d78..648a05ca53fc 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2043,15 +2043,28 @@ static int pf_interception(struct kvm_vcpu *vcpu)
 static int npf_interception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	int rc;
 
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
 
 	trace_kvm_page_fault(vcpu, fault_address, error_code);
-	return kvm_mmu_page_fault(vcpu, fault_address, error_code,
-			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
-			svm->vmcb->control.insn_bytes : NULL,
-			svm->vmcb->control.insn_len);
+	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+				svm->vmcb->control.insn_bytes : NULL,
+				svm->vmcb->control.insn_len);
+
+	/*
+	 * rc == 0 indicates a userspace exit is needed to handle page
+	 * transitions, so do that first before updating the RMP table.
+	 */
+	if (error_code & PFERR_GUEST_RMP_MASK) {
+		if (rc == 0)
+			return rc;
+		sev_handle_rmp_fault(vcpu, fault_address, error_code);
+	}
+
+	return rc;
 }
 
 static int db_interception(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index bb04d63012b4..c0675ff2d8a2 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -722,6 +722,7 @@ void sev_hardware_unsetup(void);
 int sev_cpu_init(struct svm_cpu_data *sd);
 int sev_dev_get_attr(u64 attr, u64 *val);
 extern unsigned int max_sev_asid;
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 #else
 static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu) {
 	return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
@@ -735,6 +736,8 @@ static inline void sev_hardware_unsetup(void) {}
 static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
 static inline int sev_dev_get_attr(u64 attr, u64 *val) { return -ENXIO; }
 #define max_sev_asid 0
+static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
+
 #endif
 
 /* vmenter.S */
-- 
2.25.1
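
As a quick standalone illustration of the 2M alignment done by
snp_rmptable_psmash() above (KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) is 512 4K
pages), using made-up example values:

  unsigned long pfn = 0x12345;                       /* faulting PFN */
  unsigned long pfn_2m_base = pfn & ~(512UL - 1);    /* 0x12200, base of its 2M region */
  /* psmash(pfn_2m_base) then splits that one 2M RMP entry into 512 4K entries. */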


^ permalink raw reply related	[relevance 2%]

* [PATCH v12 10/29] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
  2024-03-29 22:58 10% ` [PATCH v12 10/29] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
@ 2024-03-29 22:58  8%   ` Michael Roth
  0 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-03-29 22:58 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. Other commands can then be used to
load/encrypt data into the guest's initial launch image.

For more information see the SEV-SNP specification.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
[mdr: hold sev_deactivate_lock when calling SEV_CMD_SNP_DECOMMISSION]
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 .../virt/kvm/x86/amd-memory-encryption.rst    |  23 ++-
 arch/x86/include/uapi/asm/kvm.h               |   8 +
 arch/x86/kvm/svm/sev.c                        | 152 +++++++++++++++++-
 arch/x86/kvm/svm/svm.h                        |   1 +
 4 files changed, 180 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index f7c007d34114..a10b817c162d 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -459,6 +459,25 @@ issued by the hypervisor to make the guest ready for execution.
 
 Returns: 0 on success, -negative on error
 
+18. KVM_SEV_SNP_LAUNCH_START
+----------------------------
+
+The KVM_SEV_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest.
+
+Parameters (in): struct  kvm_sev_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_sev_snp_launch_start {
+                __u64 policy;           /* Guest policy to use. */
+                __u8 gosvw[16];         /* Guest OS visible workarounds. */
+        };
+
+See the SEV-SNP spec [snp-fw-abi]_ for further detail on the launch input.
+
 Device attribute API
 ====================
 
@@ -490,9 +509,11 @@ References
 ==========
 
 
-See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
+See [white-paper]_, [api-spec]_, [amd-apm]_, [kvm-forum]_, and [snp-fw-abi]_
+for more info.
 
 .. [white-paper] https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
 .. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
 .. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
 .. [kvm-forum]  https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+.. [snp-fw-abi] https://www.amd.com/system/files/TechDocs/56860.pdf
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 725b75cfe9ff..350ddd5264ea 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -693,6 +693,9 @@ enum sev_cmd_id {
 	/* Second time is the charm; improved versions of the above ioctls.  */
 	KVM_SEV_INIT2,
 
+	/* SNP-specific commands */
+	KVM_SEV_SNP_LAUNCH_START,
+
 	KVM_SEV_NR_MAX,
 };
 
@@ -818,6 +821,11 @@ struct kvm_sev_receive_update_data {
 	__u32 pad2;
 };
 
+struct kvm_sev_snp_launch_start {
+	__u64 policy;
+	__u8 gosvw[16];
+};
+
 #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
 #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 3d9771163562..6c7c77e33e62 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -25,6 +25,7 @@
 #include <asm/fpu/xcr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/debugreg.h>
+#include <asm/sev.h>
 
 #include "mmu.h"
 #include "x86.h"
@@ -58,6 +59,10 @@ static u64 sev_supported_vmsa_features;
 #define AP_RESET_HOLD_NAE_EVENT		1
 #define AP_RESET_HOLD_MSR_PROTO		2
 
+/* As defined by SEV-SNP Firmware ABI, under "Guest Policy". */
+#define SNP_POLICY_MASK_SMT		BIT_ULL(16)
+#define SNP_POLICY_MASK_SINGLE_SOCKET	BIT_ULL(20)
+
 static u8 sev_enc_bit;
 static DECLARE_RWSEM(sev_deactivate_lock);
 static DEFINE_MUTEX(sev_bitmap_lock);
@@ -68,6 +73,8 @@ static unsigned int nr_asids;
 static unsigned long *sev_asid_bitmap;
 static unsigned long *sev_reclaim_asid_bitmap;
 
+static int snp_decommission_context(struct kvm *kvm);
+
 struct enc_region {
 	struct list_head list;
 	unsigned long npages;
@@ -94,12 +101,17 @@ static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid)
 	down_write(&sev_deactivate_lock);
 
 	wbinvd_on_all_cpus();
-	ret = sev_guest_df_flush(&error);
+
+	if (sev_snp_enabled)
+		ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
+	else
+		ret = sev_guest_df_flush(&error);
 
 	up_write(&sev_deactivate_lock);
 
 	if (ret)
-		pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
+		pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+		       sev_snp_enabled ? "-SNP" : "", ret, error);
 
 	return ret;
 }
@@ -1967,6 +1979,102 @@ int sev_dev_get_attr(u64 attr, u64 *val)
 	}
 }
 
+/*
+ * The guest context contains all the information, keys and metadata
+ * associated with the guest that the firmware tracks to implement SEV
+ * and SNP features. The firmware stores the guest context in a
+ * hypervisor-provided page via the SNP_GCTX_CREATE command.
+ */
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct sev_data_snp_addr data = {};
+	void *context;
+	int rc;
+
+	/* Allocate memory for context page */
+	context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+	if (!context)
+		return NULL;
+
+	data.address = __psp_pa(context);
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+	if (rc) {
+		pr_warn("Failed to create SEV-SNP context, rc %d fw_error %d",
+			rc, argp->error);
+		snp_free_firmware_page(context);
+		return NULL;
+	}
+
+	return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_activate data = {0};
+
+	data.gctx_paddr = __psp_pa(sev->snp_context);
+	data.asid = sev_get_asid(kvm);
+	return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_start start = {0};
+	struct kvm_sev_snp_launch_start params;
+	int rc;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+		return -EFAULT;
+
+	/* Don't allow userspace to allocate memory for more than 1 SNP context. */
+	if (sev->snp_context) {
+		pr_debug("SEV-SNP context already exists. Refusing to allocate an additional one.");
+		return -EINVAL;
+	}
+
+	sev->snp_context = snp_context_create(kvm, argp);
+	if (!sev->snp_context)
+		return -ENOTTY;
+
+	if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET) {
+		pr_debug("SEV-SNP hypervisor does not support limiting guests to a single socket.");
+		return -EINVAL;
+	}
+
+	if (!(params.policy & SNP_POLICY_MASK_SMT)) {
+		pr_debug("SEV-SNP hypervisor does not support limiting guests to a single SMT thread.");
+		return -EINVAL;
+	}
+
+	start.gctx_paddr = __psp_pa(sev->snp_context);
+	start.policy = params.policy;
+	memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+	if (rc) {
+		pr_debug("SEV_CMD_SNP_LAUNCH_START command failed, rc %d\n", rc);
+		goto e_free_context;
+	}
+
+	sev->fd = argp->sev_fd;
+	rc = snp_bind_asid(kvm, &argp->error);
+	if (rc) {
+		pr_debug("Failed to bind ASID to SEV-SNP context, rc %d\n", rc);
+		goto e_free_context;
+	}
+
+	return 0;
+
+e_free_context:
+	snp_decommission_context(kvm);
+
+	return rc;
+}
+
 int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -2054,6 +2162,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_RECEIVE_FINISH:
 		r = sev_receive_finish(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_START:
+		r = snp_launch_start(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -2249,6 +2360,33 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
 	return ret;
 }
 
+static int snp_decommission_context(struct kvm *kvm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_addr data = {};
+	int ret;
+
+	/* If context is not created then do nothing */
+	if (!sev->snp_context)
+		return 0;
+
+	data.address = __sme_pa(sev->snp_context);
+	down_write(&sev_deactivate_lock);
+	ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
+	if (WARN_ONCE(ret, "failed to release guest context")) {
+		up_write(&sev_deactivate_lock);
+		return ret;
+	}
+
+	up_write(&sev_deactivate_lock);
+
+	/* free the context page now */
+	snp_free_firmware_page(sev->snp_context);
+	sev->snp_context = NULL;
+
+	return 0;
+}
+
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -2290,7 +2428,15 @@ void sev_vm_destroy(struct kvm *kvm)
 		}
 	}
 
-	sev_unbind_asid(kvm, sev->handle);
+	if (sev_snp_guest(kvm)) {
+		if (snp_decommission_context(kvm)) {
+			WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
+			return;
+		}
+	} else {
+		sev_unbind_asid(kvm, sev->handle);
+	}
+
 	sev_asid_free(sev);
 }
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 4a01a81dd9b9..a3c190642c57 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -92,6 +92,7 @@ struct kvm_sev_info {
 	struct list_head mirror_entry; /* Use as a list entry of mirrors */
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
 	atomic_t migration_in_progress;
+	void *snp_context;      /* SNP guest context page */
 };
 
 struct kvm_svm {
-- 
2.25.1
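
A minimal userspace sketch (assuming the uapi headers from this series) of how
a VMM could issue the new command through the existing KVM_MEMORY_ENCRYPT_OP
ioctl; the fd handling and the policy value shown are illustrative assumptions:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int do_snp_launch_start(int vm_fd, int sev_fd)
  {
          struct kvm_sev_snp_launch_start start = {
                  /* SMT allowed (bit 16); any further policy bits required
                   * by the SNP firmware ABI are omitted here. */
                  .policy = 1ULL << 16,
          };
          struct kvm_sev_cmd cmd = {
                  .id     = KVM_SEV_SNP_LAUNCH_START,
                  .data   = (uint64_t)(unsigned long)&start,
                  .sev_fd = (uint32_t)sev_fd,
          };

          return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }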


X-sender: <linux-kernel+bounces-125487-steffen.klassert=secunet.com@vger.kernel.org>
X-Receiver: <steffen.klassert@secunet.com> ORCPT=rfc822;steffen.klassert@secunet.com NOTIFY=NEVER; X-ExtendedProps=BQAVABYAAgAAAAUAFAARAPDFCS25BAlDktII2g02frgPADUAAABNaWNyb3NvZnQuRXhjaGFuZ2UuVHJhbnNwb3J0LkRpcmVjdG9yeURhdGEuSXNSZXNvdXJjZQIAAAUAagAJAAEAAAAAAAAABQAWAAIAAAUAQwACAAAFAEYABwADAAAABQBHAAIAAAUAEgAPAGIAAAAvbz1zZWN1bmV0L291PUV4Y2hhbmdlIEFkbWluaXN0cmF0aXZlIEdyb3VwIChGWURJQk9IRjIzU1BETFQpL2NuPVJlY2lwaWVudHMvY249U3RlZmZlbiBLbGFzc2VydDY4YwUACwAXAL4AAACheZxkHSGBRqAcAp3ukbifQ049REI2LENOPURhdGFiYXNlcyxDTj1FeGNoYW5nZSBBZG1pbmlzdHJhdGl2ZSBHcm91cCAoRllESUJPSEYyM1NQRExUKSxDTj1BZG1pbmlzdHJhdGl2ZSBHcm91cHMsQ049c2VjdW5ldCxDTj1NaWNyb3NvZnQgRXhjaGFuZ2UsQ049U2VydmljZXMsQ049Q29uZmlndXJhdGlvbixEQz1zZWN1bmV0LERDPWRlBQAOABEABiAS9uuMOkqzwmEZDvWNNQUAHQAPAAwAAABtYngtZXNzZW4tMDIFADwAAgAADwA2AAAATWljcm9zb2Z0LkV4Y2hhbmdlLlRyYW5zcG9ydC5NYWlsUmVjaXBpZW50LkRpc3BsYXlOYW1lDwARAAAAS2xhc3NlcnQsIFN0ZWZmZW4FAAwAAgAABQBsAAIAAAUAWAAXAEoAAADwxQktuQQJQ5LSCNoNNn64Q049S2xhc3NlcnQgU3RlZmZlbixPVT1Vc2VycyxPVT1NaWdyYXRpb24sREM9c2VjdW5ldCxEQz1kZQUAJgACAAEFACIADwAxAAAAQXV0b1Jlc3BvbnNlU3VwcHJlc3M6IDANClRyYW5zbWl0SGlzdG9yeTogRmFsc2UNCg8ALwAAAE1pY3Jvc29mdC5FeGNoYW5nZS5UcmFuc3BvcnQuRXhwYW5zaW9uR3JvdXBUeXBlDwAVAAAATWVtYmVyc0dyb3VwRXhwYW5zaW9uBQAjAAIAAQ==
X-CreatedBy: MSExchange15
X-HeloDomain: b.mx.secunet.com
X-ExtendedProps: BQBjAAoAWUmmlidQ3AgFAGEACAABAAAABQA3AAIAAA8APAAAAE1pY3Jvc29mdC5FeGNoYW5nZS5UcmFuc3BvcnQuTWFpbFJlY2lwaWVudC5Pcmdhbml6YXRpb25TY29wZREAAAAAAAAAAAAAAAAAAAAAAAUASQACAAEFAAQAFCABAAAAHAAAAHN0ZWZmZW4ua2xhc3NlcnRAc2VjdW5ldC5jb20FAAYAAgABBQApAAIAAQ8ACQAAAENJQXVkaXRlZAIAAQUAAgAHAAEAAAAFAAMABwAAAAAABQAFAAIAAQUAYgAKAIEAAADNigAABQBkAA8AAwAAAEh1Yg==
X-Source: SMTP:Default MBX-ESSEN-02
X-SourceIPAddress: 62.96.220.37
X-EndOfInjectedXHeaders: 33241
Received: from cas-essen-01.secunet.de (10.53.40.201) by
 mbx-essen-02.secunet.de (10.53.40.198) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2507.37; Fri, 29 Mar 2024 23:59:53 +0100
Received: from b.mx.secunet.com (62.96.220.37) by cas-essen-01.secunet.de
 (10.53.40.201) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35 via Frontend
 Transport; Fri, 29 Mar 2024 23:59:53 +0100
Received: from localhost (localhost [127.0.0.1])
	by b.mx.secunet.com (Postfix) with ESMTP id A97F12032C
	for <steffen.klassert@secunet.com>; Fri, 29 Mar 2024 23:59:53 +0100 (CET)
X-Virus-Scanned: by secunet
X-Spam-Flag: NO
X-Spam-Score: -5.15
X-Spam-Level:
X-Spam-Status: No, score=-5.15 tagged_above=-999 required=2.1
	tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.099, DKIM_SIGNED=0.1,
	DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
	HEADER_FROM_DIFFERENT_DOMAINS=0.249, MAILING_LIST_MULTI=-1,
	RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001]
	autolearn=unavailable autolearn_force=no
Authentication-Results: a.mx.secunet.com (amavisd-new);
	dkim=pass (1024-bit key) header.d=amd.com
Received: from b.mx.secunet.com ([127.0.0.1])
	by localhost (a.mx.secunet.com [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id UKNPQtZCaYj3 for <steffen.klassert@secunet.com>;
	Fri, 29 Mar 2024 23:59:52 +0100 (CET)
Received-SPF: Pass (sender SPF authorized) identity=mailfrom; client-ip=139.178.88.99; helo=sv.mirrors.kernel.org; envelope-from=linux-kernel+bounces-125487-steffen.klassert=secunet.com@vger.kernel.org; receiver=steffen.klassert@secunet.com 
DKIM-Filter: OpenDKIM Filter v2.11.0 b.mx.secunet.com 46151200BB
Authentication-Results: b.mx.secunet.com;
	dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="ZP/7DMTG"
Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org [139.178.88.99])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by b.mx.secunet.com (Postfix) with ESMTPS id 46151200BB
	for <steffen.klassert@secunet.com>; Fri, 29 Mar 2024 23:59:52 +0100 (CET)
Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by sv.mirrors.kernel.org (Postfix) with ESMTPS id 5AABE284496
	for <steffen.klassert@secunet.com>; Fri, 29 Mar 2024 22:59:50 +0000 (UTC)
Received: from localhost.localdomain (localhost.localdomain [127.0.0.1])
	by smtp.subspace.kernel.org (Postfix) with ESMTP id A601513E6A0;
	Fri, 29 Mar 2024 22:59:35 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="ZP/7DMTG"
Received: from NAM11-BN8-obe.outbound.protection.outlook.com (mail-bn8nam11on2040.outbound.protection.outlook.com [40.107.236.40])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
From: Michael Roth <michael.roth@amd.com>
To: <kvm@vger.kernel.org>
CC: <linux-coco@lists.linux.dev>, <linux-mm@kvack.org>,
	<linux-crypto@vger.kernel.org>, <x86@kernel.org>,
	<linux-kernel@vger.kernel.org>, <tglx@linutronix.de>, <mingo@redhat.com>,
	<jroedel@suse.de>, <thomas.lendacky@amd.com>, <hpa@zytor.com>,
	<ardb@kernel.org>, <pbonzini@redhat.com>, <seanjc@google.com>,
	<vkuznets@redhat.com>, <jmattson@google.com>, <luto@kernel.org>,
	<dave.hansen@linux.intel.com>, <slp@redhat.com>, <pgonda@google.com>,
	<peterz@infradead.org>, <srinivas.pandruvada@linux.intel.com>,
	<rientjes@google.com>, <dovmurik@linux.ibm.com>, <tobin@ibm.com>,
	<bp@alien8.de>, <vbabka@suse.cz>, <kirill@shutemov.name>,
	<ak@linux.intel.com>, <tony.luck@intel.com>,
	<sathyanarayanan.kuppuswamy@linux.intel.com>, <alpergun@google.com>,
	<jarkko@kernel.org>, <ashish.kalra@amd.com>, <nikunj.dadhania@amd.com>,
	<pankaj.gupta@amd.com>, <liam.merwick@oracle.com>, Brijesh Singh
	<brijesh.singh@amd.com>
Subject: [PATCH v12 10/29] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
Date: Fri, 29 Mar 2024 17:58:16 -0500
Message-ID: <20240329225835.400662-11-michael.roth@amd.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20240329225835.400662-1-michael.roth@amd.com>
References: <20240329225835.400662-1-michael.roth@amd.com>

From: Brijesh Singh <brijesh.singh@amd.com>

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. Other commands can then be
used to load/encrypt data into the guest's initial launch image.

For more information see the SEV-SNP specification.
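
As a point of reference, a minimal userspace sketch for issuing this
command via the KVM_MEMORY_ENCRYPT_OP ioctl might look like the
following. The function name, the already-opened vm_fd/sev_fd
descriptors, and the example policy value (reserved bit 17 set, SMT
allowed) are illustrative assumptions, not part of this patch:

    #include <err.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>      /* assumes uapi headers from this series */

    /* Sketch: begin SNP launch for an SEV-SNP VM (vm_fd) using sev_fd. */
    static void snp_launch_start_sketch(int vm_fd, int sev_fd)
    {
            struct kvm_sev_snp_launch_start ls = {
                    .policy = 0x30000,      /* illustrative guest policy */
            };
            struct kvm_sev_cmd cmd = {
                    .id = KVM_SEV_SNP_LAUNCH_START,
                    .data = (uint64_t)(unsigned long)&ls,
                    .sev_fd = (uint32_t)sev_fd,
            };

            if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
                    err(1, "KVM_SEV_SNP_LAUNCH_START, fw_error %u", cmd.error);
    }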

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
[mdr: hold sev_deactivate_lock when calling SEV_CMD_SNP_DECOMMISSION]
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 .../virt/kvm/x86/amd-memory-encryption.rst    |  23 ++-
 arch/x86/include/uapi/asm/kvm.h               |   8 +
 arch/x86/kvm/svm/sev.c                        | 152 +++++++++++++++++-
 arch/x86/kvm/svm/svm.h                        |   1 +
 4 files changed, 180 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index f7c007d34114..a10b817c162d 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -459,6 +459,25 @@ issued by the hypervisor to make the guest ready for execution.
 
 Returns: 0 on success, -negative on error
 
+18. KVM_SEV_SNP_LAUNCH_START
+----------------------------
+
+The KVM_SEV_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest.
+
+Parameters (in): struct kvm_sev_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_sev_snp_launch_start {
+                __u64 policy;           /* Guest policy to use. */
+                __u8 gosvw[16];         /* Guest OS visible workarounds. */
+        };
+
+See the SEV-SNP spec [snp-fw-abi]_ for further detail on the launch input.
+
 Device attribute API
 ====================
 
@@ -490,9 +509,11 @@ References
 ==========
 
 
-See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
+See [white-paper]_, [api-spec]_, [amd-apm]_, [kvm-forum]_, and [snp-fw-abi]_
+for more info.
 
 .. [white-paper] https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
 .. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
 .. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
 .. [kvm-forum]  https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+.. [snp-fw-abi] https://www.amd.com/system/files/TechDocs/56860.pdf
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 725b75cfe9ff..350ddd5264ea 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -693,6 +693,9 @@ enum sev_cmd_id {
 	/* Second time is the charm; improved versions of the above ioctls.  */
 	KVM_SEV_INIT2,
 
+	/* SNP-specific commands */
+	KVM_SEV_SNP_LAUNCH_START,
+
 	KVM_SEV_NR_MAX,
 };
 
@@ -818,6 +821,11 @@ struct kvm_sev_receive_update_data {
 	__u32 pad2;
 };
 
+struct kvm_sev_snp_launch_start {
+	__u64 policy;
+	__u8 gosvw[16];
+};
+
 #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
 #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 3d9771163562..6c7c77e33e62 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -25,6 +25,7 @@
 #include <asm/fpu/xcr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/debugreg.h>
+#include <asm/sev.h>
 
 #include "mmu.h"
 #include "x86.h"
@@ -58,6 +59,10 @@ static u64 sev_supported_vmsa_features;
 #define AP_RESET_HOLD_NAE_EVENT		1
 #define AP_RESET_HOLD_MSR_PROTO		2
 
+/* As defined by SEV-SNP Firmware ABI, under "Guest Policy". */
+#define SNP_POLICY_MASK_SMT		BIT_ULL(16)
+#define SNP_POLICY_MASK_SINGLE_SOCKET	BIT_ULL(20)
+
 static u8 sev_enc_bit;
 static DECLARE_RWSEM(sev_deactivate_lock);
 static DEFINE_MUTEX(sev_bitmap_lock);
@@ -68,6 +73,8 @@ static unsigned int nr_asids;
 static unsigned long *sev_asid_bitmap;
 static unsigned long *sev_reclaim_asid_bitmap;
 
+static int snp_decommission_context(struct kvm *kvm);
+
 struct enc_region {
 	struct list_head list;
 	unsigned long npages;
@@ -94,12 +101,17 @@ static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid)
 	down_write(&sev_deactivate_lock);
 
 	wbinvd_on_all_cpus();
-	ret = sev_guest_df_flush(&error);
+
+	if (sev_snp_enabled)
+		ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
+	else
+		ret = sev_guest_df_flush(&error);
 
 	up_write(&sev_deactivate_lock);
 
 	if (ret)
-		pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
+		pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+		       sev_snp_enabled ? "-SNP" : "", ret, error);
 
 	return ret;
 }
@@ -1967,6 +1979,102 @@ int sev_dev_get_attr(u64 attr, u64 *val)
 	}
 }
 
+/*
+ * The guest context contains all the information, keys and metadata
+ * associated with the guest that the firmware tracks to implement SEV
+ * and SNP features. The firmware stores the guest context in a
+ * hypervisor-provided page via the SNP_GCTX_CREATE command.
+ */
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct sev_data_snp_addr data = {};
+	void *context;
+	int rc;
+
+	/* Allocate memory for context page */
+	context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+	if (!context)
+		return NULL;
+
+	data.address = __psp_pa(context);
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+	if (rc) {
+		pr_warn("Failed to create SEV-SNP context, rc %d fw_error %d",
+			rc, argp->error);
+		snp_free_firmware_page(context);
+		return NULL;
+	}
+
+	return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_activate data = {0};
+
+	data.gctx_paddr = __psp_pa(sev->snp_context);
+	data.asid = sev_get_asid(kvm);
+	return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_start start = {0};
+	struct kvm_sev_snp_launch_start params;
+	int rc;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+		return -EFAULT;
+
+	/* Don't allow userspace to allocate memory for more than 1 SNP context. */
+	if (sev->snp_context) {
+		pr_debug("SEV-SNP context already exists. Refusing to allocate an additional one.");
+		return -EINVAL;
+	}
+
+	sev->snp_context = snp_context_create(kvm, argp);
+	if (!sev->snp_context)
+		return -ENOTTY;
+
+	if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET) {
+		pr_debug("SEV-SNP hypervisor does not support limiting guests to a single socket.");
+		return -EINVAL;
+	}
+
+	if (!(params.policy & SNP_POLICY_MASK_SMT)) {
+		pr_debug("SEV-SNP hypervisor does not support limiting guests to a single SMT thread.");
+		return -EINVAL;
+	}
+
+	start.gctx_paddr = __psp_pa(sev->snp_context);
+	start.policy = params.policy;
+	memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+	if (rc) {
+		pr_debug("SEV_CMD_SNP_LAUNCH_START command failed, rc %d\n", rc);
+		goto e_free_context;
+	}
+
+	sev->fd = argp->sev_fd;
+	rc = snp_bind_asid(kvm, &argp->error);
+	if (rc) {
+		pr_debug("Failed to bind ASID to SEV-SNP context, rc %d\n", rc);
+		goto e_free_context;
+	}
+
+	return 0;
+
+e_free_context:
+	snp_decommission_context(kvm);
+
+	return rc;
+}
+
 int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -2054,6 +2162,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_RECEIVE_FINISH:
 		r = sev_receive_finish(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_START:
+		r = snp_launch_start(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -2249,6 +2360,33 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
 	return ret;
 }
 
+static int snp_decommission_context(struct kvm *kvm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_addr data = {};
+	int ret;
+
+	/* If context is not created then do nothing */
+	if (!sev->snp_context)
+		return 0;
+
+	data.address = __sme_pa(sev->snp_context);
+	down_write(&sev_deactivate_lock);
+	ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
+	if (WARN_ONCE(ret, "failed to release guest context")) {
+		up_write(&sev_deactivate_lock);
+		return ret;
+	}
+
+	up_write(&sev_deactivate_lock);
+
+	/* free the context page now */
+	snp_free_firmware_page(sev->snp_context);
+	sev->snp_context = NULL;
+
+	return 0;
+}
+
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -2290,7 +2428,15 @@ void sev_vm_destroy(struct kvm *kvm)
 		}
 	}
 
-	sev_unbind_asid(kvm, sev->handle);
+	if (sev_snp_guest(kvm)) {
+		if (snp_decommission_context(kvm)) {
+			WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
+			return;
+		}
+	} else {
+		sev_unbind_asid(kvm, sev->handle);
+	}
+
 	sev_asid_free(sev);
 }
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 4a01a81dd9b9..a3c190642c57 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -92,6 +92,7 @@ struct kvm_sev_info {
 	struct list_head mirror_entry; /* Use as a list entry of mirrors */
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
 	atomic_t migration_in_progress;
+	void *snp_context;      /* SNP guest context page */
 };
 
 struct kvm_svm {
-- 
2.25.1



^ permalink raw reply related	[relevance 8%]

* [PATCH v12 17/29] KVM: SEV: Add support to handle RMP nested page faults
  2024-03-29 22:58  2% ` [PATCH v12 17/29] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
@ 2024-03-29 22:58  2%   ` Michael Roth
  2024-03-29 22:58  2%   ` Michael Roth
  1 sibling, 0 replies; 200+ results
From: Michael Roth @ 2024-03-29 22:58 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When the
hardware encounters an RMP check failure caused by a guest memory access,
it raises a #NPF. The error code contains additional information on the
access type; see APM Volume 2 for details.

When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.
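
As a condensed sketch of that flow (illustrative only; the real
implementation is sev_handle_rmp_fault() in the diff below, and the
function and parameter names here are made up for the example):

    static void rmp_npf_flow_sketch(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                                    bool assigned, int rmp_level)
    {
            if (!assigned || rmp_level == PG_LEVEL_4K)
                    return;         /* already 4K: spurious fault, self-resolves */

            /* PSMASH operates on the 2MB-aligned base PFN of the large entry. */
            psmash(pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1));

            /* Zap the NPT range so it is re-faulted with 4K mappings. */
            kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
    }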

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/sev.h |   3 ++
 arch/x86/kvm/svm/sev.c     | 103 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c     |  21 ++++++--
 arch/x86/kvm/svm/svm.h     |   3 ++
 4 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 780182cda3ab..234a998e2d2d 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -91,6 +91,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 /* RMUPDATE detected 4K page and 2MB page overlap. */
 #define RMPUPDATE_FAIL_OVERLAP		4
 
+/* PSMASH failed due to concurrent access by another CPU */
+#define PSMASH_FAIL_INUSE		3
+
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
 #define RMP_PG_SIZE_2M			1
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c35ed9d91c89..a0a88471f9ab 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3397,6 +3397,13 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(kvm_pfn_t pfn)
+{
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	return psmash(pfn);
+}
+
 static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -3956,3 +3963,99 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return p;
 }
+
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = vcpu->kvm;
+	int order, rmp_level, ret;
+	bool assigned;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	gfn = gpa >> PAGE_SHIFT;
+
+	/*
+	 * The only time RMP faults occur for shared pages is when the guest is
+	 * triggering an RMP fault for an implicit page-state change from
+	 * shared->private. Implicit page-state changes are forwarded to
+	 * userspace via KVM_EXIT_MEMORY_FAULT events, however, so RMP faults
+	 * for shared pages should not end up here.
+	 */
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault for non-private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);
+	if (ret) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no backing page for private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+	if (ret || !assigned) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
+				    gpa, pfn, ret);
+		goto out;
+	}
+
+	/*
+	 * There are 2 cases where a PSMASH may be needed to resolve an #NPF
+	 * with PFERR_GUEST_RMP_BIT set:
+	 *
+	 * 1) RMPADJUST/PVALIDATE can trigger an #NPF with PFERR_GUEST_SIZEM
+	 *    bit set if the guest issues them with a smaller granularity than
+	 *    what is indicated by the page-size bit in the 2MB RMP entry for
+	 *    the PFN that backs the GPA.
+	 *
+	 * 2) Guest access via NPT can trigger an #NPF if the NPT mapping is
+	 *    smaller than what is indicated by the 2MB RMP entry for the PFN
+	 *    that backs the GPA.
+	 *
+	 * In both these cases, the corresponding 2M RMP entry needs to
+	 * be PSMASH'd to 512 4K RMP entries.  If the RMP entry is already
+	 * split into 4K RMP entries, then this is likely a spurious case which
+	 * can occur when there are concurrent accesses by the guest to a 2MB
+	 * GPA range that is backed by a 2MB-aligned PFN whose RMP entry is in
+	 * the process of being PSMASH'd into 4K entries. These cases should
+	 * resolve automatically on subsequent accesses, so just ignore them
+	 * here.
+	 */
+	if (rmp_level == PG_LEVEL_4K) {
+		pr_debug_ratelimited("%s: Spurious RMP fault for GPA 0x%llx, error_code 0x%llx",
+				     __func__, gpa, error_code);
+		goto out;
+	}
+
+	pr_debug_ratelimited("%s: Splitting 2M RMP entry for GPA 0x%llx, error_code 0x%llx",
+			     __func__, gpa, error_code);
+	ret = snp_rmptable_psmash(pfn);
+	if (ret && ret != PSMASH_FAIL_INUSE) {
+		/*
+		 * Look it up again. If it's 4K now then the PSMASH may have raced with
+		 * another process and the issue has already resolved itself.
+		 */
+		if (!snp_lookup_rmpentry(pfn, &assigned, &rmp_level) && assigned &&
+		    rmp_level == PG_LEVEL_4K) {
+			pr_debug_ratelimited("%s: PSMASH for GPA 0x%llx failed with ret %d due to potential race",
+					     __func__, gpa, ret);
+			goto out;
+		}
+		pr_err_ratelimited("SEV: Unable to split RMP entry for GPA 0x%llx PFN 0x%llx ret %d\n",
+				   gpa, pfn, ret);
+	}
+
+	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+out:
+	put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2c162f6a1d78..648a05ca53fc 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2043,15 +2043,28 @@ static int pf_interception(struct kvm_vcpu *vcpu)
 static int npf_interception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	int rc;
 
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
 
 	trace_kvm_page_fault(vcpu, fault_address, error_code);
-	return kvm_mmu_page_fault(vcpu, fault_address, error_code,
-			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
-			svm->vmcb->control.insn_bytes : NULL,
-			svm->vmcb->control.insn_len);
+	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+				svm->vmcb->control.insn_bytes : NULL,
+				svm->vmcb->control.insn_len);
+
+	/*
+	 * rc == 0 indicates a userspace exit is needed to handle page
+	 * transitions, so do that first before updating the RMP table.
+	 */
+	if (error_code & PFERR_GUEST_RMP_MASK) {
+		if (rc == 0)
+			return rc;
+		sev_handle_rmp_fault(vcpu, fault_address, error_code);
+	}
+
+	return rc;
 }
 
 static int db_interception(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index bb04d63012b4..c0675ff2d8a2 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -722,6 +722,7 @@ void sev_hardware_unsetup(void);
 int sev_cpu_init(struct svm_cpu_data *sd);
 int sev_dev_get_attr(u64 attr, u64 *val);
 extern unsigned int max_sev_asid;
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 #else
 static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu) {
 	return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
@@ -735,6 +736,8 @@ static inline void sev_hardware_unsetup(void) {}
 static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
 static inline int sev_dev_get_attr(u64 attr, u64 *val) { return -ENXIO; }
 #define max_sev_asid 0
+static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
+
 #endif
 
 /* vmenter.S */
-- 
2.25.1


X-sender: <linux-kernel+bounces-125497-steffen.klassert=secunet.com@vger.kernel.org>
X-Receiver: <steffen.klassert@secunet.com> ORCPT=rfc822;steffen.klassert@secunet.com NOTIFY=NEVER; X-ExtendedProps=BQAVABYAAgAAAAUAFAARAPDFCS25BAlDktII2g02frgPADUAAABNaWNyb3NvZnQuRXhjaGFuZ2UuVHJhbnNwb3J0LkRpcmVjdG9yeURhdGEuSXNSZXNvdXJjZQIAAAUAagAJAAEAAAAAAAAABQAWAAIAAAUAQwACAAAFAEYABwADAAAABQBHAAIAAAUAEgAPAGIAAAAvbz1zZWN1bmV0L291PUV4Y2hhbmdlIEFkbWluaXN0cmF0aXZlIEdyb3VwIChGWURJQk9IRjIzU1BETFQpL2NuPVJlY2lwaWVudHMvY249U3RlZmZlbiBLbGFzc2VydDY4YwUACwAXAL4AAACheZxkHSGBRqAcAp3ukbifQ049REI2LENOPURhdGFiYXNlcyxDTj1FeGNoYW5nZSBBZG1pbmlzdHJhdGl2ZSBHcm91cCAoRllESUJPSEYyM1NQRExUKSxDTj1BZG1pbmlzdHJhdGl2ZSBHcm91cHMsQ049c2VjdW5ldCxDTj1NaWNyb3NvZnQgRXhjaGFuZ2UsQ049U2VydmljZXMsQ049Q29uZmlndXJhdGlvbixEQz1zZWN1bmV0LERDPWRlBQAOABEABiAS9uuMOkqzwmEZDvWNNQUAHQAPAAwAAABtYngtZXNzZW4tMDIFADwAAgAADwA2AAAATWljcm9zb2Z0LkV4Y2hhbmdlLlRyYW5zcG9ydC5NYWlsUmVjaXBpZW50LkRpc3BsYXlOYW1lDwARAAAAS2xhc3NlcnQsIFN0ZWZmZW4FAAwAAgAABQBsAAIAAAUAWAAXAEoAAADwxQktuQQJQ5LSCNoNNn64Q049S2xhc3NlcnQgU3RlZmZlbixPVT1Vc2VycyxPVT1NaWdyYXRpb24sREM9c2VjdW5ldCxEQz1kZQUAJgACAAEFACIADwAxAAAAQXV0b1Jlc3BvbnNlU3VwcHJlc3M6IDANClRyYW5zbWl0SGlzdG9yeTogRmFsc2UNCg8ALwAAAE1pY3Jvc29mdC5FeGNoYW5nZS5UcmFuc3BvcnQuRXhwYW5zaW9uR3JvdXBUeXBlDwAVAAAATWVtYmVyc0dyb3VwRXhwYW5zaW9uBQAjAAIAAQ==
X-CreatedBy: MSExchange15
X-HeloDomain: a.mx.secunet.com
X-ExtendedProps: BQBjAAoAEJTp8x1Q3AgFAGEACAABAAAABQA3AAIAAA8APAAAAE1pY3Jvc29mdC5FeGNoYW5nZS5UcmFuc3BvcnQuTWFpbFJlY2lwaWVudC5Pcmdhbml6YXRpb25TY29wZREAAAAAAAAAAAAAAAAAAAAAAAUASQACAAEFAGIACgAfAAAAjYoAAAUABAAUIAEAAAAcAAAAc3RlZmZlbi5rbGFzc2VydEBzZWN1bmV0LmNvbQUABgACAAEFACkAAgABDwAJAAAAQ0lBdWRpdGVkAgABBQACAAcAAQAAAAUAAwAHAAAAAAAFAAUAAgABBQBkAA8AAwAAAEh1Yg==
X-Source: SMTP:Default MBX-DRESDEN-01
X-SourceIPAddress: 62.96.220.36
X-EndOfInjectedXHeaders: 29656
Received: from cas-essen-01.secunet.de (10.53.40.201) by
 mbx-dresden-01.secunet.de (10.53.40.199) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2507.37; Sat, 30 Mar 2024 00:02:52 +0100
Received: from a.mx.secunet.com (62.96.220.36) by cas-essen-01.secunet.de
 (10.53.40.201) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35 via Frontend
 Transport; Sat, 30 Mar 2024 00:02:52 +0100
Received: from localhost (localhost [127.0.0.1])
	by a.mx.secunet.com (Postfix) with ESMTP id 89E3B20882
	for <steffen.klassert@secunet.com>; Sat, 30 Mar 2024 00:02:52 +0100 (CET)
X-Virus-Scanned: by secunet
X-Spam-Flag: NO
X-Spam-Score: -5.15
X-Spam-Level:
X-Spam-Status: No, score=-5.15 tagged_above=-999 required=2.1
	tests=[BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.099, DKIM_SIGNED=0.1,
	DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
	HEADER_FROM_DIFFERENT_DOMAINS=0.249, MAILING_LIST_MULTI=-1,
	RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001]
	autolearn=unavailable autolearn_force=no
Authentication-Results: a.mx.secunet.com (amavisd-new);
	dkim=pass (1024-bit key) header.d=amd.com
Received: from a.mx.secunet.com ([127.0.0.1])
	by localhost (a.mx.secunet.com [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id wHNyzjsIKDhp for <steffen.klassert@secunet.com>;
	Sat, 30 Mar 2024 00:02:51 +0100 (CET)
Received-SPF: Pass (sender SPF authorized) identity=mailfrom; client-ip=139.178.88.99; helo=sv.mirrors.kernel.org; envelope-from=linux-kernel+bounces-125497-steffen.klassert=secunet.com@vger.kernel.org; receiver=steffen.klassert@secunet.com 
DKIM-Filter: OpenDKIM Filter v2.11.0 a.mx.secunet.com 793422087D
Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org [139.178.88.99])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by a.mx.secunet.com (Postfix) with ESMTPS id 793422087D
	for <steffen.klassert@secunet.com>; Sat, 30 Mar 2024 00:02:51 +0100 (CET)
Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by sv.mirrors.kernel.org (Postfix) with ESMTPS id 6A6352832F1
	for <steffen.klassert@secunet.com>; Fri, 29 Mar 2024 23:02:49 +0000 (UTC)
Received: from localhost.localdomain (localhost.localdomain [127.0.0.1])
	by smtp.subspace.kernel.org (Postfix) with ESMTP id ECE2213F006;
	Fri, 29 Mar 2024 23:02:19 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="zw5+RhL/"
Received: from NAM12-DM6-obe.outbound.protection.outlook.com (mail-dm6nam12on2048.outbound.protection.outlook.com [40.107.243.48])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2A08713E414;
	Fri, 29 Mar 2024 23:02:13 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=40.107.243.48
ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1711753336; cv=fail; b=LzVM2oPKxcDMPhTGrN1EjkcJHNwS3bh+14wE3eIZAlJcRpZ7fViydGmITkkNSe8XdfvC3xvChzx8OBTDynOtxHCmdWezcu5S7Dq9sVn5pZrUfVrtHU8hLP2DTEkow3G+9GQeOf7uuaruamqj7HblM3eLI9JBtPEJe6L6IfchT5k=
ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1711753336; c=relaxed/simple;
	bh=ve0Q/9IkEDuMVauLWMMuYGROPkz8VDXCebbb2IciUSs=;
	h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version:Content-Type; b=m4D9/d/1CgNnpUIXnIDmC2BOqXfLvwWQE5n+ZA+DbK77nW1C3Gq/zLoGVjYJD1/X1NedCKLg6IqZSPipXMyBrCxPdo4/HVpFBHPSJYkuDjnrmZY9Wuca7bQBJBYJ7HwfvE9hBP4nosUGj9Hm+UQqsjhqtaqFMPvbH6J9Dzl87c4=
ARC-Authentication-Results: i=2; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com; spf=fail smtp.mailfrom=amd.com; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b=zw5+RhL/; arc=fail smtp.client-ip=40.107.243.48
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com
Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=amd.com
ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none;
 b=Rpv0NaEcWWmOacEWTJ4zXiLzuFIvhF2pgNzp6IXt+9RIVisgGJK/84XT195gmZgaB6bW/Jelueaazeq5ZGNQkOcWEt0QZJMBz4ceYMBXPXx8aNGhDdcx2RThLdqEanGR4/Y5HLyV0tROWvkbeHUURtdLSthwd30o6EGkEWi2FEe4dUvKI8tifAgUN0MD4EMCmAF5qzBHcM+XCHaCKXu9W8HK7hljQIZ/SGX1fvtWmjFpzTDsxWYWtV1pNl4UU4/L27x57bT7+tfALgs8/bUGVRdCboU/1nMCQEUARiZkwltyFuUkDPvqxy7C9kRSUG6EbtyBC2Uw0sHyoSNa+r3YzA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com;
 s=arcselector9901;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1;
 bh=aREpckIK1aTtdsTDTyZ9MXdMoxILhHALRKMNW8mq2U4=;
 b=JyShgHcZa2/wma0PqUHAezNHvPONlHryO8/XJB1I50gnnfxl57oFiWU9/wHoVVmqKtNAmbMEqDws0sNbUmQdLKdYvDX8KOXqiwgGZ5ItdSaTdW/hVRFsmTBSoNbqdPnj1B8AdltPC1n+HdqzfZzgurDzO0CylqwZk75MdK4+xiUUjoMv8PsYAbh0RISnlEuZKdeEYhyqnKtAJ+kWpJFukP8S0JfNY50G4S5e1V5VJMJRpzixURLISWViF222MI8R6S+WQg938MqizQF/+d4OBkjUK+Zb54xcLAcgaB5WCpFDRnSe2RhlrwTlHbJ4lbVwKWCIFLoNtRKfXsXhUVRH4Q==
ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is
 165.204.84.17) smtp.rcpttodomain=vger.kernel.org smtp.mailfrom=amd.com;
 dmarc=pass (p=quarantine sp=quarantine pct=100) action=none
 header.from=amd.com; dkim=none (message not signed); arc=none (0)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=aREpckIK1aTtdsTDTyZ9MXdMoxILhHALRKMNW8mq2U4=;
 b=zw5+RhL/bwPs4XPGs8H1awi574VvDZOke4fiosae+nVYgXxK3ZQB45aDqrxVN5DHDKXl7Shji5iXQVxjNKXJod67K1kmhzWGQ5lEqQQidigjKYIoL7zsO9fG5TZk8w1DmfuO5IEzGLcDKiiPO513qLKjuoFmKnmLnkv2EOGR9Xc=
Received: from SJ0PR03CA0276.namprd03.prod.outlook.com (2603:10b6:a03:39e::11)
 by DM4PR12MB7719.namprd12.prod.outlook.com (2603:10b6:8:101::13) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7409.32; Fri, 29 Mar
 2024 23:02:11 +0000
Received: from SJ1PEPF00001CE0.namprd05.prod.outlook.com
 (2603:10b6:a03:39e:cafe::9a) by SJ0PR03CA0276.outlook.office365.com
 (2603:10b6:a03:39e::11) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7409.41 via Frontend
 Transport; Fri, 29 Mar 2024 23:02:11 +0000
X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17)
 smtp.mailfrom=amd.com; dkim=none (message not signed)
 header.d=none;dmarc=pass action=none header.from=amd.com;
Received-SPF: Pass (protection.outlook.com: domain of amd.com designates
 165.204.84.17 as permitted sender) receiver=protection.outlook.com;
 client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C
Received: from SATLEXMB04.amd.com (165.204.84.17) by
 SJ1PEPF00001CE0.mail.protection.outlook.com (10.167.242.8) with Microsoft
 SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.20.7409.10 via Frontend Transport; Fri, 29 Mar 2024 23:02:11 +0000
Received: from localhost (10.180.168.240) by SATLEXMB04.amd.com
 (10.181.40.145) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35; Fri, 29 Mar
 2024 18:02:09 -0500
From: Michael Roth <michael.roth@amd.com>
To: <kvm@vger.kernel.org>
CC: <linux-coco@lists.linux.dev>, <linux-mm@kvack.org>,
	<linux-crypto@vger.kernel.org>, <x86@kernel.org>,
	<linux-kernel@vger.kernel.org>, <tglx@linutronix.de>, <mingo@redhat.com>,
	<jroedel@suse.de>, <thomas.lendacky@amd.com>, <hpa@zytor.com>,
	<ardb@kernel.org>, <pbonzini@redhat.com>, <seanjc@google.com>,
	<vkuznets@redhat.com>, <jmattson@google.com>, <luto@kernel.org>,
	<dave.hansen@linux.intel.com>, <slp@redhat.com>, <pgonda@google.com>,
	<peterz@infradead.org>, <srinivas.pandruvada@linux.intel.com>,
	<rientjes@google.com>, <dovmurik@linux.ibm.com>, <tobin@ibm.com>,
	<bp@alien8.de>, <vbabka@suse.cz>, <kirill@shutemov.name>,
	<ak@linux.intel.com>, <tony.luck@intel.com>,
	<sathyanarayanan.kuppuswamy@linux.intel.com>, <alpergun@google.com>,
	<jarkko@kernel.org>, <ashish.kalra@amd.com>, <nikunj.dadhania@amd.com>,
	<pankaj.gupta@amd.com>, <liam.merwick@oracle.com>, Brijesh Singh
	<brijesh.singh@amd.com>
Subject: [PATCH v12 17/29] KVM: SEV: Add support to handle RMP nested page faults
Date: Fri, 29 Mar 2024 17:58:23 -0500
Message-ID: <20240329225835.400662-18-michael.roth@amd.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20240329225835.400662-1-michael.roth@amd.com>
References: <20240329225835.400662-1-michael.roth@amd.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
X-ClientProxiedBy: SATLEXMB03.amd.com (10.181.40.144) To SATLEXMB04.amd.com
 (10.181.40.145)
X-EOPAttributedMessage: 0
X-MS-PublicTrafficType: Email
X-MS-TrafficTypeDiagnostic: SJ1PEPF00001CE0:EE_|DM4PR12MB7719:EE_
X-MS-Office365-Filtering-Correlation-Id: 28caeec9-eaaa-4720-7f97-08dc5044445d
X-MS-Exchange-SenderADCheck: 1
X-MS-Exchange-AntiSpam-Relay: 0
X-Microsoft-Antispam: BCL:0;
X-Microsoft-Antispam-Message-Info: JXTCISndS2/r5x6NMJ2t51Vfu/p6ZmdNWROhuaETSgas9t4XpeEnKggbccuOh7Q+j+6ibsqRa10T6H+RNY/f7HovlDKdRM/wNfHnuxFCPsZ1gh6AiyjHtHZN4EZ2PLlUmQ90dvSUfdVy+1kAAusLmfVCZf59F4BZNReSwwjSihPdwVwSFllZSq582E2WT80rAqlcmuio+b2fZleaW5lP+scaBjnoY1wi3IC4CvKrwM2OQ5U2tfSboz1tzbhquIVdcUgyeulo5Iub4yHrEdlDmJPb1K0Edzo25153XHHRyqHeAaDOOEJt2x5/aDcRU9U38Qx9KSQRY/Djcy9kc7RzY7IRKLjH3XRsmWiLh/27noGI41o9DlAy3/3lidY9/kmDiMdbdDz1pwncBITDWkznU2rBxDb4b5ogSI+Hrvfuy3o2MzNdiIfswSUM/Iw+a/nxja45nLrfM+NyMztnK2extIPkV9UZaGw6yVTBQ21iYqDlCT2txUgbQEPJ+PftUWsXsTjF1hBEfmu07FvQdvieUKOgYsyG4LCBifEr8aJBh4ETKlfuMYVIUbOSRnYYFvALURQQwP6VG5Vv/7UiZrnVaeS8teVws3JcH6k5U7eg42YN4aD9Ucv1ImJHU8BtYr/QLLY/8/H/+PThTFCmnKDgkd2ySOnuz2XqMkYTTZ8QKQ/hxcGDPYcEJRCSpXamJXssFD0CU4VYsq9C3230r2w28ZYLzVfsQv1V00F+qLQ1NQK64I/hVdR//ANKXyKG/AAP
X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:SATLEXMB04.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230031)(7416005)(36860700004)(82310400014)(376005)(1800799015);DIR:OUT;SFP:1101;
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 29 Mar 2024 23:02:11.3650
 (UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 28caeec9-eaaa-4720-7f97-08dc5044445d
X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d
X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[SATLEXMB04.amd.com]
X-MS-Exchange-CrossTenant-AuthSource: SJ1PEPF00001CE0.namprd05.prod.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Anonymous
X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem
X-MS-Exchange-Transport-CrossTenantHeadersStamped: DM4PR12MB7719
Return-Path: linux-kernel+bounces-125497-steffen.klassert=secunet.com@vger.kernel.org
X-MS-Exchange-Organization-OriginalArrivalTime: 29 Mar 2024 23:02:52.5860
 (UTC)
X-MS-Exchange-Organization-Network-Message-Id: bd609fe2-c0bb-4657-f0c0-08dc50445cde
X-MS-Exchange-Organization-OriginalClientIPAddress: 62.96.220.36
X-MS-Exchange-Organization-OriginalServerIPAddress: 10.53.40.201
X-MS-Exchange-Organization-Cross-Premises-Headers-Processed: cas-essen-01.secunet.de
X-MS-Exchange-Organization-OrderedPrecisionLatencyInProgress: LSRV=mbx-dresden-01.secunet.de:TOTAL-HUB=0.455|SMR=0.363(SMRDE=0.035|SMRC=0.328(SMRCL=0.104|X-SMRCR=0.328))|CAT=0.090(CATOS=0.017
 (CATSM=0.017(CATSM-Malware
 Agent=0.016))|CATRESL=0.041(CATRESLP2R=0.019)|CATORES=0.030
 (CATRS=0.029(CATRS-Index Routing Agent=0.028)));2024-03-29T23:02:53.041Z
X-MS-Exchange-Forest-ArrivalHubServer: mbx-dresden-01.secunet.de
X-MS-Exchange-Organization-AuthSource: cas-essen-01.secunet.de
X-MS-Exchange-Organization-AuthAs: Anonymous
X-MS-Exchange-Organization-FromEntityHeader: Internet
X-MS-Exchange-Organization-OriginalSize: 20168
X-MS-Exchange-Organization-HygienePolicy: Standard
X-MS-Exchange-Organization-MessageLatency: SRV=cas-essen-01.secunet.de:TOTAL-FE=0.007|SMR=0.006(SMRPI=0.004(SMRPI-FrontendProxyAgent=0.004))
X-MS-Exchange-Organization-AVStamp-Enterprise: 1.0
X-MS-Exchange-Organization-Recipient-Limit-Verified: True
X-MS-Exchange-Organization-TotalRecipientCount: 1
X-MS-Exchange-Organization-Rules-Execution-History: 0b0cf904-14ac-4724-8bdf-482ee6223cf2%%%fd34672d-751c-45ae-a963-ed177fcabe23%%%d8080257-b0c3-47b4-b0db-23bc0c8ddb3c%%%95e591a2-5d7d-4afa-b1d0-7573d6c0a5d9%%%f7d0f6bc-4dcc-4876-8c5d-b3d6ddbb3d55%%%16355082-c50b-4214-9c7d-d39575f9f79b
X-MS-Exchange-Forest-RulesExecuted: mbx-dresden-01
X-MS-Exchange-Organization-RulesExecuted: mbx-dresden-01
X-MS-Exchange-Forest-IndexAgent-0: AQ0CZW4AAQISAAAPAAADH4sIAAAAAAAEAL06a3PbRpLgW6Iky46dy2
 YrOY+TiqMHSYmULEv22hfFph2dLZslSr7c1lWhQAAksQYBHkDS1sbe
 X74fth8YcMCHJHtTYTnQYKanp9/d08g/j54Ffu+B+Dlw/maHXdF0vE
 5X/KXFr5UQX38yelbF9HuPV4orxf/p2p5o1t+Um68awgmF7Rkt17aE
 44lB1xadoR0OSjTsGoH1zghs0XcN0w5FACuBYw4c3wtXir4nDNcVPb
 vnB+fCMAEiBKCWEQIyn5GZvjewvUEo/Da9nxw3xACPqwgkY6UYH2F7
 pj8E4CAkILNrm29F23DcISyaxhCRts7HFCbPXSk6AxEYDhKAIN+/aj
 yriFMY2UHgB0CHxcQYjhcKw7IcZMJwgem2H/QMfAOaV4q4mVGKwXkf
 yGzaNmE8bByLke8Oe7aoCdgzB0kllvAQJS86QGaJOGobQ3dAMoS/uN
 QGtYmeE8JG4BYEZw/e2TbLLRwYAxuY8pJSE6OwIt51jYEiB/t93zYB
 8cgxhAN/+0bHjqD5LELz4s2xXv/16FQ/rh+/Pvlf/dnh2ctTPJMo8Q
 PQgQUSHvhAN+igD/rGl67hWais064DMu7Zhsfy9T33PDoZdpmgcpgG
 sjzbtkLc2LKjvbFdvbUDzwZZAZ533YhNojV0/m5L+wBbAZ1OsQ1W6h
 pBxw7wFFoEYox+H4mPgD2QBRw25r6EK++6jtll+gzRaB4fNn+BabDi
 IVnxSlEl2AnDIcsg7LsOy5iOJUokabDseJYzcqwhqH73xUoRVxxQYO
 izEMJhK7T/fwjTY68wgexwCG+2RSbSdDqebZX9drvcOr+y8z7xy5Y9
 sl2/D3tx3zGwZ4BUT/wBbOvxWyWAN2XXxFmfs+cwBAPoiheGGxjiLw
 a9Vd7im7KnXC6vFIURmN2t9/t7W45nukPL3jLC3lZojypd8UEIsSM2
 N1WotyNYHTGEKfD3QVS3EeoKv5mIRr0YkahVI0RJ0lTYroSVpO2Ktu
 OiysDUOrZVEtXaHhqNHVDgW9tcLwGMZbs2v5fXUaGW026LcrkDdmNs
 XSCD1gWL6KmW/V7c39+u7tdMy9gxWpVKbWfXODjYt2tWzQLRbO/t7p
 KsLzxnpQhMX3LYTz+J8kG1tCc24Xkg4NV+DwHYEy3fdyP31UemDq8D
 vdM1W2vsOKI/0AO7E4oNfK4/BJltbYCPnDWeHp7WQTIDDgu7L9gdAZ
 GoHf/ML/7IDlyjXxEbW7Dve8tuOx45GO+GwHT0Un/9pn7y8rAh5G8X
 JbwJh0QujHkBDrCGFKQgrpvDIBj7GyYKwwODhoDxpHFGR23KoxgFn3
 P06qxZF4nfDoBKhhpKgJokV28815tHf63rwOXkb3sOaO14CrQ6z3KS
 ftGasyAtxty5Z1sH1kHV3D+oVIxtY39/9361fWC05lnMBJoJa5lYRU
 vZ2Tm4j7ZCf6s7aC6YpRwT0qJjidBmG9F7YSDtZGT2hzqgERvwKInh
 3q4YGe7QBoeR/MNC+fGoZ7bKjzE9B75bISydviEeMTTa10cygOg8iM
 Ei9Pp60OtTpNf7kEPD7hoQrffbng4G2vbgjM3f4L/oHJgBfPi8K/6x
 htmwcfi83tQb9RP9FxyugZJe1t/UX4KW1kVZVNGuN8cIAnswBNeIjs
 IDcP0jW8sEYRAO+xAdkDBTlQcSiDIRG/hEKfymSGKGzIDkgY9va7Th
 YaSJg3t7JQhWOwd7O6WDA9YEOyaa6waSEBptW4fyzDd1nJxLwfh8yW
 AkbmQ9UuxIj2IBCFynMmY2upIArYH04cnKptpLx9prQhvKdi7j9ND1
 B0A5PB/OAhMbb0kaeEz5MYwVKJS6D9VLUBJIoIsJsoTsKDAU0YwwpK
 ymzCdMRpnv0FyH5xLTQATa5uPHAq1Gb/5y9Ow0CbS1MR6LDSpCqVwa
 OD1brQR9E6IWlZIhVMFR7RJirRPXR1zhYcmkIIRiowOlENY+UFTECL
 ko9YQDtueYDhtDmQrJKJVRxZlAxQeXH/cDZwRwFXE0dzNUzVCIq5Vi
 AtO4asQ6dGa1KUAxcBUoia7/DoagLiiYxvJIoJuSStj1h64lIKhDIW
 aJYV9AcLcr6qYtxSTaYu1OZF26E+oRfxghSqjE9XWhWKP89QMdmPP0
 AEBdp+dADlv7Di5KD8SZF9e6SXl7vleOkIvnjUOx/f4H133/f953pW
 n0s35gSusPp0HZF5WFjwkLI2d5xEbqI484MWbu4QxBIIQORajesmNp
 4NzvIYlSQgxEHMrmj5MHLIM4kE28bukdSEXgziwQJIfEUhJ3+/SkWD
 EpJEDxO0kCLsDmW3ROCsYoiD/eQFggmAdc3387pFRJ95c1FoGMhDCM
 Q+YMiYgPH8QdCft7iUfiUy5Vbbj5WxM2IxrPXskh3+F/sD5JbCVBvK
 JiZwiw40Ph6A8Hc0U4HcUh/GEIrNF1koJ0oNwqe8Y53iHxPsl3SLh8
 ++4Iq1/qRiSwvXPg6tV4Vj850Z+f1ZunOlaJPx+dYhX1QIVM7Kquo8
 wOn/73WfN0q/Hm8OUR1dt4tYxygjxs+gCsP48T2ODXgkAPJ6K61WwD
 t2C65fcYjSGg5nFdwN4JDG8IN2JncE4X8Ul81JmA/IX3Y9MYjDs2nE
 2wjMYjoxs7XglUGwgm0SEQWgFdqtGruPcANlKZK6PaunhObEQ3AcxG
 rxqnM4UUsY3LcTMhnCRC8k6Nh7kMTvEiiZ/m6eq8HOFNDBQAgKHNVl
 eKumpw2wn7PlABNMOtYny07GckELXkredHssx71RrezeQmxw4rQhyN
 G3RRpwMyvhvYhnWeLBmoNUJtkCQOIg0161AN4zpvbSh6wHj6w8Dxhy
 F3Yaghk0CImuFSSJY9kaNNXevsMNkBBBIMlHwCHUaQgKqdQaQslDZr
 iqDLhsvxB03rXdf/MUwy7STNmsw38MmY/HbUMGtEwpRiiMV4OlZVVL
 IkkMUxYTjwsV9ogm2dY690RsuIyqO/DdEjO54fUBcyWb1dVgHFoV08
 eiTi683ui3mR3LJbw04ylP8QPhBNqcBk6TMO1SWl0I+mrhqnha63h5
 6p6yUO2cqN4XOC9kVMgNkOptzl32fkSgyM8/HknbU/Xa8h8N27tOfO
 o+l2xUztJbKV/G2Il5D9BbgrlMtGx3C8Crq5MwCTB6P1/HfSZW01i3
 UNsNAASnmL4v9MxLK7Ij0DWzyIhpIHYIhjhzR5C3vTttuuzEK3NT1J
 hesnFjAotbi6uHt3tuI+ySnkb75dyYZUsnaJ+lOUPlGPP8S9qr6P30
 McwyUJX9lN5G/C2uYUN/I3w1/k7+PMCADWO7OUoyZ83Bef5z5qycZc
 X7lgm1uvJf0bS/y/G30dbz4U5ONLDz3EpmicnnBTp3H8lPAA+0pF1R
 8OuB/S57uTHK/Lbs4lzTjqLc/qk+GCbMbVzOperb1nVK37+5XK3u6+


* [PATCH v12 17/29] KVM: SEV: Add support to handle RMP nested page faults
                     ` (2 preceding siblings ...)
  2024-03-29 22:58  4% ` [PATCH v12 15/29] KVM: SEV: Add support to handle " Michael Roth
@ 2024-03-29 22:58  2% ` Michael Roth
  2024-03-29 22:58  2%   ` Michael Roth
  2024-03-29 22:58  2%   ` Michael Roth
  3 siblings, 2 replies; 200+ results
From: Michael Roth @ 2024-03-29 22:58 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When the
hardware encounters an RMP check failure caused by a guest memory access,
it raises a #NPF. The error code contains additional information on the
access type. See APM Volume 2 for further details.

When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.
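
As a rough illustration (not part of this patch), the alignment arithmetic
behind PSMASH is simple: a 2MB RMP entry covers 512 contiguous 4K PFNs, so
the PFN handed to PSMASH is rounded down to a 2MB boundary, which is what
snp_rmptable_psmash() below does via KVM_PAGES_PER_HPAGE(PG_LEVEL_2M). A
standalone sketch of the same computation:

  #include <stdint.h>

  #define PFNS_PER_2M 512ULL  /* 2MB / 4KB, i.e. KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) */

  /* Round a 4K PFN down to the base PFN of the 2MB RMP entry covering it. */
  static inline uint64_t rmp_2m_base_pfn(uint64_t pfn)
  {
          return pfn & ~(PFNS_PER_2M - 1);
  }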

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/sev.h |   3 ++
 arch/x86/kvm/svm/sev.c     | 103 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c     |  21 ++++++--
 arch/x86/kvm/svm/svm.h     |   3 ++
 4 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 780182cda3ab..234a998e2d2d 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -91,6 +91,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 /* RMUPDATE detected 4K page and 2MB page overlap. */
 #define RMPUPDATE_FAIL_OVERLAP		4
 
+/* PSMASH failed due to concurrent access by another CPU */
+#define PSMASH_FAIL_INUSE		3
+
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
 #define RMP_PG_SIZE_2M			1
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c35ed9d91c89..a0a88471f9ab 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3397,6 +3397,13 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(kvm_pfn_t pfn)
+{
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	return psmash(pfn);
+}
+
 static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -3956,3 +3963,99 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return p;
 }
+
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = vcpu->kvm;
+	int order, rmp_level, ret;
+	bool assigned;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	gfn = gpa >> PAGE_SHIFT;
+
+	/*
+	 * The only time RMP faults occur for shared pages is when the guest is
+	 * triggering an RMP fault for an implicit page-state change from
+	 * shared->private. Implicit page-state changes are forwarded to
+	 * userspace via KVM_EXIT_MEMORY_FAULT events, however, so RMP faults
+	 * for shared pages should not end up here.
+	 */
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault for non-private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);
+	if (ret) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no backing page for private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+	if (ret || !assigned) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
+				    gpa, pfn, ret);
+		goto out;
+	}
+
+	/*
+	 * There are 2 cases where a PSMASH may be needed to resolve an #NPF
+	 * with PFERR_GUEST_RMP_BIT set:
+	 *
+	 * 1) RMPADJUST/PVALIDATE can trigger an #NPF with PFERR_GUEST_SIZEM
+	 *    bit set if the guest issues them with a smaller granularity than
+	 *    what is indicated by the page-size bit in the 2MB RMP entry for
+	 *    the PFN that backs the GPA.
+	 *
+	 * 2) Guest access via NPT can trigger an #NPF if the NPT mapping is
+	 *    smaller than what is indicated by the 2MB RMP entry for the PFN
+	 *    that backs the GPA.
+	 *
+	 * In both these cases, the corresponding 2M RMP entry needs to
+	 * be PSMASH'd to 512 4K RMP entries.  If the RMP entry is already
+	 * split into 4K RMP entries, then this is likely a spurious case which
+	 * can occur when there are concurrent accesses by the guest to a 2MB
+	 * GPA range that is backed by a 2MB-aligned PFN whose RMP entry is in
+	 * the process of being PSMASH'd into 4K entries. These cases should
+	 * resolve automatically on subsequent accesses, so just ignore them
+	 * here.
+	 */
+	if (rmp_level == PG_LEVEL_4K) {
+		pr_debug_ratelimited("%s: Spurious RMP fault for GPA 0x%llx, error_code 0x%llx",
+				     __func__, gpa, error_code);
+		goto out;
+	}
+
+	pr_debug_ratelimited("%s: Splitting 2M RMP entry for GPA 0x%llx, error_code 0x%llx",
+			     __func__, gpa, error_code);
+	ret = snp_rmptable_psmash(pfn);
+	if (ret && ret != PSMASH_FAIL_INUSE) {
+		/*
+		 * Look it up again. If it's 4K now then the PSMASH may have raced with
+		 * another process and the issue has already resolved itself.
+		 */
+		if (!snp_lookup_rmpentry(pfn, &assigned, &rmp_level) && assigned &&
+		    rmp_level == PG_LEVEL_4K) {
+			pr_debug_ratelimited("%s: PSMASH for GPA 0x%llx failed with ret %d due to potential race",
+					     __func__, gpa, ret);
+			goto out;
+		}
+		pr_err_ratelimited("SEV: Unable to split RMP entry for GPA 0x%llx PFN 0x%llx ret %d\n",
+				   gpa, pfn, ret);
+	}
+
+	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+out:
+	put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2c162f6a1d78..648a05ca53fc 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2043,15 +2043,28 @@ static int pf_interception(struct kvm_vcpu *vcpu)
 static int npf_interception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	int rc;
 
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
 
 	trace_kvm_page_fault(vcpu, fault_address, error_code);
-	return kvm_mmu_page_fault(vcpu, fault_address, error_code,
-			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
-			svm->vmcb->control.insn_bytes : NULL,
-			svm->vmcb->control.insn_len);
+	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+				svm->vmcb->control.insn_bytes : NULL,
+				svm->vmcb->control.insn_len);
+
+	/*
+	 * rc == 0 indicates a userspace exit is needed to handle page
+	 * transitions, so do that first before updating the RMP table.
+	 */
+	if (error_code & PFERR_GUEST_RMP_MASK) {
+		if (rc == 0)
+			return rc;
+		sev_handle_rmp_fault(vcpu, fault_address, error_code);
+	}
+
+	return rc;
 }
 
 static int db_interception(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index bb04d63012b4..c0675ff2d8a2 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -722,6 +722,7 @@ void sev_hardware_unsetup(void);
 int sev_cpu_init(struct svm_cpu_data *sd);
 int sev_dev_get_attr(u64 attr, u64 *val);
 extern unsigned int max_sev_asid;
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 #else
 static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu) {
 	return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
@@ -735,6 +736,8 @@ static inline void sev_hardware_unsetup(void) {}
 static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
 static inline int sev_dev_get_attr(u64 attr, u64 *val) { return -ENXIO; }
 #define max_sev_asid 0
+static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
+
 #endif
 
 /* vmenter.S */
-- 
2.25.1


^ permalink raw reply related	[relevance 2%]

* [PATCH v12 15/29] KVM: SEV: Add support to handle Page State Change VMGEXIT
    2024-03-29 22:58 10% ` [PATCH v12 10/29] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
  2024-03-29 22:58  3% ` [PATCH v12 14/29] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
@ 2024-03-29 22:58  4% ` Michael Roth
  2024-03-29 22:58  2% ` [PATCH v12 17/29] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  3 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-03-29 22:58 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change NAE event
as defined in the GHCB specification version 2.

Forward these requests to userspace as KVM_EXIT_VMGEXITs, similar to how
it is done for requests that don't use a GHCB page.
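
A hedged sketch of the corresponding userspace dispatch (assuming kernel
headers with this series applied; process_psc() is a placeholder for the
VMM's own walker of the GHCB Page State Change buffer, not an existing API):

  #include <stdint.h>
  #include <linux/kvm.h>

  uint64_t process_psc(uint64_t shared_gpa);  /* VMM-specific, not shown */

  static void handle_vmgexit(struct kvm_run *run)
  {
          if (run->vmgexit.type == KVM_USER_VMGEXIT_PSC) {
                  /*
                   * Walk the PSC buffer at shared_gpa, flip attributes via
                   * KVM_SET_MEMORY_ATTRIBUTES as needed, and report a
                   * GHCB-defined status back to the guest via 'ret'.
                   */
                  run->vmgexit.psc.ret = process_psc(run->vmgexit.psc.shared_gpa);
          }
  }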

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 Documentation/virt/kvm/api.rst | 14 ++++++++++++++
 arch/x86/kvm/svm/sev.c         | 16 ++++++++++++++++
 include/uapi/linux/kvm.h       |  5 +++++
 3 files changed, 35 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 4a7a2945bc78..85099198a10f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7065,6 +7065,7 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 		/* KVM_EXIT_VMGEXIT */
 		struct kvm_user_vmgexit {
 		#define KVM_USER_VMGEXIT_PSC_MSR	1
+		#define KVM_USER_VMGEXIT_PSC		2
 			__u32 type; /* KVM_USER_VMGEXIT_* type */
 			union {
 				struct {
@@ -7074,9 +7075,14 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 					__u8 op;
 					__u32 ret;
 				} psc_msr;
+				struct {
+					__u64 shared_gpa;
+					__u64 ret;
+				} psc;
 			};
 		};
 
+
 If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
 has issued a VMGEXIT instruction (as documented by the AMD Architecture
 Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
@@ -7094,6 +7100,14 @@ update the private/shared state of the GPA using the corresponding
 KVM_SET_MEMORY_ATTRIBUTES ioctl. The 'ret' field is to be set to 0 by
 userspace on success, or some non-zero value on failure.
 
+For the KVM_USER_VMGEXIT_PSC type, the psc union type is used. The kernel
+will supply the GPA of the Page State Structure defined in the GHCB spec.
+Userspace will process this structure as defined by the GHCB, and issue
+KVM_SET_MEMORY_ATTRIBUTES ioctls to set the GPAs therein to the expected
+private/shared state. Userspace will return a value in 'ret' that is in
+agreement with the GHCB-defined return values that the guest will expect
+in the SW_EXITINFO2 field of the GHCB in response to these requests.
+
 6. Capabilities that can be enabled on vCPUs
 ============================================
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 1464edac2304..c35ed9d91c89 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3208,6 +3208,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 	case SVM_VMGEXIT_AP_JUMP_TABLE:
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 	case SVM_VMGEXIT_HV_FEATURES:
+	case SVM_VMGEXIT_PSC:
 		break;
 	default:
 		reason = GHCB_ERR_INVALID_EVENT;
@@ -3426,6 +3427,15 @@ static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
 	return 0; /* forward request to userspace */
 }
 
+static int snp_complete_psc(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, vcpu->run->vmgexit.psc.ret);
+
+	return 1; /* resume guest */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3663,6 +3673,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 
 		ret = 1;
 		break;
+	case SVM_VMGEXIT_PSC:
+		vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+		vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC;
+		vcpu->run->vmgexit.psc.shared_gpa = svm->sev_es.sw_scratch;
+		vcpu->arch.complete_userspace_io = snp_complete_psc;
+		break;
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 54b81e46a9fa..e33c48bfbd67 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -137,6 +137,7 @@ struct kvm_xen_exit {
 
 struct kvm_user_vmgexit {
 #define KVM_USER_VMGEXIT_PSC_MSR	1
+#define KVM_USER_VMGEXIT_PSC		2
 	__u32 type; /* KVM_USER_VMGEXIT_* type */
 	union {
 		struct {
@@ -146,6 +147,10 @@ struct kvm_user_vmgexit {
 			__u8 op;
 			__u32 ret;
 		} psc_msr;
+		struct {
+			__u64 shared_gpa;
+			__u64 ret;
+		} psc;
 	};
 };
 
-- 
2.25.1


^ permalink raw reply related	[relevance 4%]

* [PATCH v12 14/29] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT
    2024-03-29 22:58 10% ` [PATCH v12 10/29] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
@ 2024-03-29 22:58  3% ` Michael Roth
  2024-03-29 22:58  4% ` [PATCH v12 15/29] KVM: SEV: Add support to handle " Michael Roth
  2024-03-29 22:58  2% ` [PATCH v12 17/29] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  3 siblings, 0 replies; 200+ results
From: Michael Roth @ 2024-03-29 22:58 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change MSR protocol
as defined in the GHCB specification.

When using gmem, private/shared memory is allocated through separate
pools, and KVM relies on userspace issuing a KVM_SET_MEMORY_ATTRIBUTES
KVM ioctl to tell the KVM MMU whether or not a particular GFN should be
backed by private memory or not.

Forward these page state change requests to userspace so that it can
issue the expected KVM ioctls. The KVM MMU will handle updating the RMP
entries when it is ready to map a private page into a guest.

Define a new KVM_EXIT_VMGEXIT for exits of this type, and structure it
so that it can be extended for other cases where VMGEXITs need some
level of handling in userspace.
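
For reference, a minimal sketch (not part of this patch) of how the
MSR-protocol request value is decoded, mirroring the
GHCB_MSR_PSC_REQ_TO_GFN()/GHCB_MSR_PSC_REQ_TO_OP() macros added below; the
bit layout (GFN in bits 51:12, operation in bits 55:52) is taken from those
macros:

  #include <stdint.h>

  /* GFN is carried in GHCBData[51:12], the requested operation in [55:52]. */
  static inline uint64_t psc_req_gfn(uint64_t ghcb_msr)
  {
          return (ghcb_msr & 0x000ffffffffff000ULL) >> 12;
  }

  static inline uint8_t psc_req_op(uint64_t ghcb_msr)
  {
          return (ghcb_msr & 0x00f0000000000000ULL) >> 52;
  }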

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 Documentation/virt/kvm/api.rst    | 33 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/sev-common.h |  6 ++++++
 arch/x86/kvm/svm/sev.c            | 33 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h          | 17 ++++++++++++++++
 4 files changed, 89 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..4a7a2945bc78 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7060,6 +7060,39 @@ Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
+::
+
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit {
+		#define KVM_USER_VMGEXIT_PSC_MSR	1
+			__u32 type; /* KVM_USER_VMGEXIT_* type */
+			union {
+				struct {
+					__u64 gpa;
+		#define KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE	1
+		#define KVM_USER_VMGEXIT_PSC_MSR_OP_SHARED	2
+					__u8 op;
+					__u32 ret;
+				} psc_msr;
+			};
+		};
+
+If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
+has issued a VMGEXIT instruction (as documented by the AMD Architecture
+Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
+userspace. These are generally handled by the host kernel, but in some
+cases some aspects handling a VMGEXIT are handled by userspace.
+
+A kvm_user_vmgexit structure is defined to encapsulate the data to be
+sent to or returned by userspace. The type field defines the specific type
+of exit that needs to be serviced, and that type is used as a discriminator
+to determine which union type should be used for input/output.
+
+For the KVM_USER_VMGEXIT_PSC_MSR type, the psc_msr union type is used. The
+kernel will supply the 'gpa' and 'op' fields, and userspace is expected to
+update the private/shared state of the GPA using the corresponding
+KVM_SET_MEMORY_ATTRIBUTES ioctl. The 'ret' field is to be set to 0 by
+userspace on success, or some non-zero value on failure.
 
 6. Capabilities that can be enabled on vCPUs
 ============================================
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 1006bfffe07a..6d68db812de1 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -101,11 +101,17 @@ enum psc_op {
 	/* GHCBData[11:0] */				\
 	GHCB_MSR_PSC_REQ)
 
+#define GHCB_MSR_PSC_REQ_TO_GFN(msr) (((msr) & GENMASK_ULL(51, 12)) >> 12)
+#define GHCB_MSR_PSC_REQ_TO_OP(msr) (((msr) & GENMASK_ULL(55, 52)) >> 52)
+
 #define GHCB_MSR_PSC_RESP		0x015
 #define GHCB_MSR_PSC_RESP_VAL(val)			\
 	/* GHCBData[63:32] */				\
 	(((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
 
+/* Set highest bit as a generic error response */
+#define GHCB_MSR_PSC_RESP_ERROR (BIT_ULL(63) | GHCB_MSR_PSC_RESP)
+
 /* GHCB Hypervisor Feature Request/Response */
 #define GHCB_MSR_HV_FT_REQ		0x080
 #define GHCB_MSR_HV_FT_RESP		0x081
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index b882f72a940a..1464edac2304 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3396,6 +3396,36 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	u64 vmm_ret = vcpu->run->vmgexit.psc_msr.ret;
+
+	set_ghcb_msr(svm, (vmm_ret << 32) | GHCB_MSR_PSC_RESP);
+
+	return 1; /* resume guest */
+}
+
+static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
+{
+	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
+	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
+		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
+		return 1; /* resume guest */
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
+	vcpu->run->vmgexit.psc_msr.gpa = gpa;
+	vcpu->run->vmgexit.psc_msr.op = op;
+	vcpu->arch.complete_userspace_io = snp_complete_psc_msr;
+
+	return 0; /* forward request to userspace */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3494,6 +3524,9 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				  GHCB_MSR_INFO_POS);
 		break;
 	}
+	case GHCB_MSR_PSC_REQ:
+		ret = snp_begin_psc_msr(vcpu, control->ghcb_gpa);
+		break;
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..54b81e46a9fa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -135,6 +135,20 @@ struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_user_vmgexit {
+#define KVM_USER_VMGEXIT_PSC_MSR	1
+	__u32 type; /* KVM_USER_VMGEXIT_* type */
+	union {
+		struct {
+			__u64 gpa;
+#define KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE	1
+#define KVM_USER_VMGEXIT_PSC_MSR_OP_SHARED	2
+			__u8 op;
+			__u32 ret;
+		} psc_msr;
+	};
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -178,6 +192,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_VMGEXIT          40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -433,6 +448,8 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit vmgexit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1


^ permalink raw reply related	[relevance 3%]

* [PATCH v12 10/29] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
  @ 2024-03-29 22:58 10% ` Michael Roth
  2024-03-29 22:58  8%   ` Michael Roth
  2024-03-29 22:58  3% ` [PATCH v12 14/29] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 200+ results
From: Michael Roth @ 2024-03-29 22:58 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. Other commands can then at that point be
used to load/encrypt data into the guest's initial launch image.

For more information see the SEV-SNP specification.
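
A hedged userspace sketch of issuing the new command (assuming kernel
headers with this series applied; vm_fd and sev_fd come from the VMM's own
setup and are not shown). Note that per this patch the policy must have the
SMT bit (bit 16) set and must not set the single-socket bit (bit 20):

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int snp_launch_start(int vm_fd, int sev_fd, __u64 policy)
  {
          struct kvm_sev_snp_launch_start start = { .policy = policy };
          struct kvm_sev_cmd cmd = {
                  .id     = KVM_SEV_SNP_LAUNCH_START,
                  .data   = (__u64)(unsigned long)&start,
                  .sev_fd = sev_fd,
          };

          /* KVM_MEMORY_ENCRYPT_OP is the existing SEV command entry point. */
          return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }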

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
[mdr: hold sev_deactivate_lock when calling SEV_CMD_SNP_DECOMMISSION]
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 .../virt/kvm/x86/amd-memory-encryption.rst    |  23 ++-
 arch/x86/include/uapi/asm/kvm.h               |   8 +
 arch/x86/kvm/svm/sev.c                        | 152 +++++++++++++++++-
 arch/x86/kvm/svm/svm.h                        |   1 +
 4 files changed, 180 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index f7c007d34114..a10b817c162d 100644
--- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
@@ -459,6 +459,25 @@ issued by the hypervisor to make the guest ready for execution.
 
 Returns: 0 on success, -negative on error
 
+18. KVM_SEV_SNP_LAUNCH_START
+----------------------------
+
+The KVM_SNP_LAUNCH_START command is used for creating the memory encryption
+context for the SEV-SNP guest.
+
+Parameters (in): struct  kvm_sev_snp_launch_start
+
+Returns: 0 on success, -negative on error
+
+::
+
+        struct kvm_sev_snp_launch_start {
+                __u64 policy;           /* Guest policy to use. */
+                __u8 gosvw[16];         /* Guest OS visible workarounds. */
+        };
+
+See the SEV-SNP spec [snp-fw-abi]_ for further detail on the launch input.
+
 Device attribute API
 ====================
 
@@ -490,9 +509,11 @@ References
 ==========
 
 
-See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info.
+See [white-paper]_, [api-spec]_, [amd-apm]_, [kvm-forum]_, and [snp-fw-abi]_
+for more info.
 
 .. [white-paper] https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
 .. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf
 .. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34)
 .. [kvm-forum]  https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+.. [snp-fw-abi] https://www.amd.com/system/files/TechDocs/56860.pdf
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 725b75cfe9ff..350ddd5264ea 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -693,6 +693,9 @@ enum sev_cmd_id {
 	/* Second time is the charm; improved versions of the above ioctls.  */
 	KVM_SEV_INIT2,
 
+	/* SNP-specific commands */
+	KVM_SEV_SNP_LAUNCH_START,
+
 	KVM_SEV_NR_MAX,
 };
 
@@ -818,6 +821,11 @@ struct kvm_sev_receive_update_data {
 	__u32 pad2;
 };
 
+struct kvm_sev_snp_launch_start {
+	__u64 policy;
+	__u8 gosvw[16];
+};
+
 #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
 #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 3d9771163562..6c7c77e33e62 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -25,6 +25,7 @@
 #include <asm/fpu/xcr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/debugreg.h>
+#include <asm/sev.h>
 
 #include "mmu.h"
 #include "x86.h"
@@ -58,6 +59,10 @@ static u64 sev_supported_vmsa_features;
 #define AP_RESET_HOLD_NAE_EVENT		1
 #define AP_RESET_HOLD_MSR_PROTO		2
 
+/* As defined by SEV-SNP Firmware ABI, under "Guest Policy". */
+#define SNP_POLICY_MASK_SMT		BIT_ULL(16)
+#define SNP_POLICY_MASK_SINGLE_SOCKET	BIT_ULL(20)
+
 static u8 sev_enc_bit;
 static DECLARE_RWSEM(sev_deactivate_lock);
 static DEFINE_MUTEX(sev_bitmap_lock);
@@ -68,6 +73,8 @@ static unsigned int nr_asids;
 static unsigned long *sev_asid_bitmap;
 static unsigned long *sev_reclaim_asid_bitmap;
 
+static int snp_decommission_context(struct kvm *kvm);
+
 struct enc_region {
 	struct list_head list;
 	unsigned long npages;
@@ -94,12 +101,17 @@ static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid)
 	down_write(&sev_deactivate_lock);
 
 	wbinvd_on_all_cpus();
-	ret = sev_guest_df_flush(&error);
+
+	if (sev_snp_enabled)
+		ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
+	else
+		ret = sev_guest_df_flush(&error);
 
 	up_write(&sev_deactivate_lock);
 
 	if (ret)
-		pr_err("SEV: DF_FLUSH failed, ret=%d, error=%#x\n", ret, error);
+		pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+		       sev_snp_enabled ? "-SNP" : "", ret, error);
 
 	return ret;
 }
@@ -1967,6 +1979,102 @@ int sev_dev_get_attr(u64 attr, u64 *val)
 	}
 }
 
+/*
+ * The guest context contains all the information, keys and metadata
+ * associated with the guest that the firmware tracks to implement SEV
+ * and SNP features. The firmware stores the guest context in hypervisor
+ * provide page via the SNP_GCTX_CREATE command.
+ */
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct sev_data_snp_addr data = {};
+	void *context;
+	int rc;
+
+	/* Allocate memory for context page */
+	context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+	if (!context)
+		return NULL;
+
+	data.address = __psp_pa(context);
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+	if (rc) {
+		pr_warn("Failed to create SEV-SNP context, rc %d fw_error %d",
+			rc, argp->error);
+		snp_free_firmware_page(context);
+		return NULL;
+	}
+
+	return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_activate data = {0};
+
+	data.gctx_paddr = __psp_pa(sev->snp_context);
+	data.asid = sev_get_asid(kvm);
+	return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_launch_start start = {0};
+	struct kvm_sev_snp_launch_start params;
+	int rc;
+
+	if (!sev_snp_guest(kvm))
+		return -ENOTTY;
+
+	if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+		return -EFAULT;
+
+	/* Don't allow userspace to allocate memory for more than 1 SNP context. */
+	if (sev->snp_context) {
+		pr_debug("SEV-SNP context already exists. Refusing to allocate an additional one.");
+		return -EINVAL;
+	}
+
+	sev->snp_context = snp_context_create(kvm, argp);
+	if (!sev->snp_context)
+		return -ENOTTY;
+
+	if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET) {
+		pr_debug("SEV-SNP hypervisor does not support limiting guests to a single socket.");
+		return -EINVAL;
+	}
+
+	if (!(params.policy & SNP_POLICY_MASK_SMT)) {
+		pr_debug("SEV-SNP hypervisor does not support limiting guests to a single SMT thread.");
+		return -EINVAL;
+	}
+
+	start.gctx_paddr = __psp_pa(sev->snp_context);
+	start.policy = params.policy;
+	memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+	rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+	if (rc) {
+		pr_debug("SEV_CMD_SNP_LAUNCH_START command failed, rc %d\n", rc);
+		goto e_free_context;
+	}
+
+	sev->fd = argp->sev_fd;
+	rc = snp_bind_asid(kvm, &argp->error);
+	if (rc) {
+		pr_debug("Failed to bind ASID to SEV-SNP context, rc %d\n", rc);
+		goto e_free_context;
+	}
+
+	return 0;
+
+e_free_context:
+	snp_decommission_context(kvm);
+
+	return rc;
+}
+
 int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;
@@ -2054,6 +2162,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_SEV_RECEIVE_FINISH:
 		r = sev_receive_finish(kvm, &sev_cmd);
 		break;
+	case KVM_SEV_SNP_LAUNCH_START:
+		r = snp_launch_start(kvm, &sev_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
@@ -2249,6 +2360,33 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
 	return ret;
 }
 
+static int snp_decommission_context(struct kvm *kvm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct sev_data_snp_addr data = {};
+	int ret;
+
+	/* If context is not created then do nothing */
+	if (!sev->snp_context)
+		return 0;
+
+	data.address = __sme_pa(sev->snp_context);
+	down_write(&sev_deactivate_lock);
+	ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
+	if (WARN_ONCE(ret, "failed to release guest context")) {
+		up_write(&sev_deactivate_lock);
+		return ret;
+	}
+
+	up_write(&sev_deactivate_lock);
+
+	/* free the context page now */
+	snp_free_firmware_page(sev->snp_context);
+	sev->snp_context = NULL;
+
+	return 0;
+}
+
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -2290,7 +2428,15 @@ void sev_vm_destroy(struct kvm *kvm)
 		}
 	}
 
-	sev_unbind_asid(kvm, sev->handle);
+	if (sev_snp_guest(kvm)) {
+		if (snp_decommission_context(kvm)) {
+			WARN_ONCE(1, "Failed to free SNP guest context, leaking asid!\n");
+			return;
+		}
+	} else {
+		sev_unbind_asid(kvm, sev->handle);
+	}
+
 	sev_asid_free(sev);
 }
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 4a01a81dd9b9..a3c190642c57 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -92,6 +92,7 @@ struct kvm_sev_info {
 	struct list_head mirror_entry; /* Use as a list entry of mirrors */
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
 	atomic_t migration_in_progress;
+	void *snp_context;      /* SNP guest context page */
 };
 
 struct kvm_svm {
-- 
2.25.1


^ permalink raw reply related	[relevance 10%]

* [PATCH v10 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  @ 2024-03-21 14:40  4% ` Zhao Liu
  0 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2024-03-21 14:40 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	Daniel P . Berrangé,
	Xiaoyao Li
  Cc: qemu-devel, kvm, Zhenyu Wang, Zhuocheng Ding, Babu Moger,
	Yongwei Ma, Zhao Liu

From: Zhao Liu <zhao1.liu@intel.com>

The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
the number of sharing threads directly.

From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

From the description above, the calculation of this field should be the same
as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
APIC ID to calculate this field.

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
     Information
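
To make the quoted formula concrete, here is a small sketch of the ShareId
computation (illustration only, not part of this patch): with 16 logical
processors sharing an L3, ShareId = LocalApicId >> log2(16) = LocalApicId >> 4.

  #include <stdint.h>

  /*
   * ShareId = LocalApicId >> log2(NumSharingCache + 1), with
   * NumSharingCache + 1 rounded up to the next power of two,
   * as quoted from the APM above.
   */
  static inline uint32_t share_id(uint32_t local_apic_id, uint32_t num_sharing_cache)
  {
          uint32_t n = num_sharing_cache + 1;
          uint32_t shift = 0;

          while ((1u << shift) < n)
                  shift++;

          return local_apic_id >> shift;
  }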

Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes since v7:
 * Moved this patch after CPUID[4]'s similar change ("i386/cpu: Use APIC
   ID offset to encode cache topo in CPUID[4]"). (Xiaoyao)
 * Dropped Michael/Babu's Acked/Reviewed/Tested tags since the code
   change due to the rebase.
 * Re-added Yongwei's Tested tag For his re-testing (compilation on
   Intel platforms).

Changes since v3:
 * Rewrote the subject. (Babu)
 * Deleted the original "comment/help" expression, as this behavior is
   confirmed for AMD CPUs. (Babu)
 * Renamed "num_apic_ids" (v3) to "num_sharing_cache" to match spec
   definition. (Babu)

Changes since v1:
 * Renamed "l3_threads" to "num_apic_ids" in
   encode_cache_cpuid8000001d(). (Yanan)
 * Added the description of the original commit and add Cc.
---
 target/i386/cpu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 0ebacacf2aad..8f559b42f706 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -331,7 +331,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
                                        uint32_t *eax, uint32_t *ebx,
                                        uint32_t *ecx, uint32_t *edx)
 {
-    uint32_t l3_threads;
+    uint32_t num_sharing_cache;
     assert(cache->size == cache->line_size * cache->associativity *
                           cache->partitions * cache->sets);
 
@@ -340,11 +340,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
 
     /* L3 is shared among multiple cores */
     if (cache->level == 3) {
-        l3_threads = topo_info->cores_per_die * topo_info->threads_per_core;
-        *eax |= (l3_threads - 1) << 14;
+        num_sharing_cache = 1 << apicid_die_offset(topo_info);
     } else {
-        *eax |= ((topo_info->threads_per_core - 1) << 14);
+        num_sharing_cache = 1 << apicid_core_offset(topo_info);
     }
+    *eax |= (num_sharing_cache - 1) << 14;
 
     assert(cache->line_size > 0);
     assert(cache->partitions > 0);
-- 
2.34.1


^ permalink raw reply related	[relevance 4%]

* [PATCH RFC not-to-be-merged] KVM: SVM: Workaround overly strict CR3 check by Hyper-V
@ 2024-03-19 16:34  3% Vitaly Kuznetsov
  0 siblings, 0 replies; 200+ results
From: Vitaly Kuznetsov @ 2024-03-19 16:34 UTC (permalink / raw)
  To: kvm, Paolo Bonzini, Sean Christopherson; +Cc: Daan De Meyer, linux-kernel

Failing VMRUNs (immediate #VMEXIT with error code VMEXIT_INVALID) for KVM
guests on top of Hyper-V are observed when KVM does SMM emulation. The root
cause of the problem appears to be an overly strict CR3 VMCB check done by
Hyper-V. Here's an example of a CR state which triggers the failure:

 kvm_amd: vmpl: 0   cpl: 0   efer: 0000000000001000
 kvm_amd: cr0: 0000000000050032 cr2: ffff92dcf8601000
 kvm_amd: cr3: 0000000100232003 cr4: 0000000000000040

The CR3 value may look a bit weird, as it has non-zero PCID bits set as well
as non-zero bits in the upper half even though the processor is not in long
mode. This, however, is a valid state upon entering SMM from a long mode
context with PCID enabled and should not cause VMEXIT_INVALID. The APM
says that VMEXIT_INVALID is triggered when "Any MBZ bit of CR3 is
set.". In the CR3 format the only MBZ bits are those above MAXPHYADDR; the
rest are just "Reserved".

Place a temporary workaround in KVM to avoid putting problematic CR3
values into VMCB when KVM runs on top of Hyper-V. Enable CR3 READ/WRITE
intercepts to make sure guest is not observing side-effects of the
mangling. Also, do not overwrite 'vcpu->arch.cr3' with mangled 'save.cr3'
value when CR3 intercepts are enabled (and thus a possible CR3 update from
the guest would change 'vcpu->arch.cr3' instantly).

The workaround is only needed until Hyper-V gets fixed.
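
For illustration (assuming MAXPHYADDR = 48), this is the normalization the
workaround applies to the example CR3 above: bits 47:32 are cleared outside
long mode and the PCID bits are cleared when CR4.PCIDE is 0.

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint64_t cr3 = 0x0000000100232003ULL;

          cr3 &= ~0x0000ffff00000000ULL;  /* bits 47:32, reserved in !long mode */
          cr3 &= ~0x0000000000000fffULL;  /* PCID bits, CR4.PCIDE = 0 */

          printf("0x%llx\n", (unsigned long long)cr3);  /* prints 0x232000 */
          return 0;
  }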

Reported-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
- The patch serves mostly documentation purposes; I don't expect it to
be merged into the mainline. Hyper-V *is* supposed to get fixed, but the
timeline is unclear at this point. As Azure is a fairly popular platform
for running nested KVM, it is possible that the bug will get discovered
again (running an OVMF-based guest is a good starting point!).
---
 arch/x86/kvm/svm/svm.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 272d5ed37ce7..6ff7cbcb5cac 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -41,6 +41,7 @@
 #include <asm/traps.h>
 #include <asm/reboot.h>
 #include <asm/fpu/api.h>
+#include <asm/hypervisor.h>
 
 #include <trace/events/ipi.h>
 
@@ -3497,7 +3498,7 @@ static int svm_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	if (!sev_es_guest(vcpu->kvm)) {
 		if (!svm_is_intercept(svm, INTERCEPT_CR0_WRITE))
 			vcpu->arch.cr0 = svm->vmcb->save.cr0;
-		if (npt_enabled)
+		if (npt_enabled && !svm_is_intercept(svm, INTERCEPT_CR3_WRITE))
 			vcpu->arch.cr3 = svm->vmcb->save.cr3;
 	}
 
@@ -4264,6 +4265,33 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 		cr3 = root_hpa;
 	}
 
+#if IS_ENABLED(CONFIG_HYPERV)
+	/*
+	 * Workaround an issue in Hyper-V hypervisor where 'reserved' bits are treated
+	 * as MBZ failing VMRUN.
+	 */
+	if (hypervisor_is_type(X86_HYPER_MS_HYPERV) && likely(npt_enabled)) {
+		unsigned long cr3_unmod = cr3;
+
+		/*
+		 * Bits MAXPHYADDR:63 are MBZ but bits 32:MAXPHYADDR-1 are just 'reserved'
+		 * in !long mode.
+		 */
+		if (!is_long_mode(vcpu))
+			cr3 &= ~rsvd_bits(32, cpuid_maxphyaddr(vcpu) - 1);
+
+		if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE))
+			cr3 &= ~X86_CR3_PCID_MASK;
+
+		if (cr3 != cr3_unmod && !svm_is_intercept(svm, INTERCEPT_CR3_READ)) {
+			svm_set_intercept(svm, INTERCEPT_CR3_READ);
+			svm_set_intercept(svm, INTERCEPT_CR3_WRITE);
+		} else if (cr3 == cr3_unmod && svm_is_intercept(svm, INTERCEPT_CR3_READ)) {
+			svm_clr_intercept(svm, INTERCEPT_CR3_READ);
+			svm_clr_intercept(svm, INTERCEPT_CR3_WRITE);
+		}
+	}
+#endif
 	svm->vmcb->save.cr3 = cr3;
 	vmcb_mark_dirty(svm->vmcb, VMCB_CR);
 }
-- 
2.44.0


^ permalink raw reply related	[relevance 3%]

* [PATCH v4 2/2] kvm/cpuid: set proper GuestPhysBits in CPUID.0x80000008
  @ 2024-03-13 12:58  3% ` Gerd Hoffmann
  0 siblings, 0 replies; 200+ results
From: Gerd Hoffmann @ 2024-03-13 12:58 UTC (permalink / raw)
  To: kvm
  Cc: Tom Lendacky, Gerd Hoffmann, Xiaoyao Li, Sean Christopherson,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)

The AMD APM (3.35) defines GuestPhysBits (EAX[23:16]) as:

  Maximum guest physical address size in bits.  This number applies
  only to guests using nested paging.  When this field is zero, refer
  to the PhysAddrSize field for the maximum guest physical address size.

Tom Lendacky confirmed that the purpose of GuestPhysBits is software use
and KVM can use it as described below.  Hardware always returns zero
here.

Use the GuestPhysBits field to communicate the max addressable GPA to
the guest.  Typically this is identical to the max effective GPA, except
when the CPU supports MAXPHYADDR > 48 but does not support 5-level
TDP.

GuestPhysBits is set only in case TDP is enabled, otherwise it is left
at zero.

GuestPhysBits will be used by the guest firmware to make sure resources
like PCI bars are mapped into the addressable GPA.
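
A hedged sketch of how guest firmware might consume the field, following the
APM semantics quoted above (fall back to PhysAddrSize when GuestPhysBits is
zero):

  #include <cpuid.h>

  static unsigned int guest_addressable_phys_bits(void)
  {
          unsigned int eax, ebx, ecx, edx;

          __cpuid(0x80000008, eax, ebx, ecx, edx);

          unsigned int phys_bits  = eax & 0xff;          /* PhysAddrSize */
          unsigned int guest_bits = (eax >> 16) & 0xff;  /* GuestPhysBits */

          return guest_bits ? guest_bits : phys_bits;
  }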

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 arch/x86/kvm/mmu.h     |  2 ++
 arch/x86/kvm/cpuid.c   | 28 +++++++++++++++++++++++++---
 arch/x86/kvm/mmu/mmu.c |  5 +++++
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..b410a227c601 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
 	return boot_cpu_data.x86_phys_bits;
 }
 
+u8 kvm_mmu_get_max_tdp_level(void);
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 3235254724cf..b54a85d7387e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1221,8 +1221,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		entry->eax = entry->ebx = entry->ecx = 0;
 		break;
 	case 0x80000008: {
+		/*
+		 * GuestPhysAddrSize (EAX[23:16]) is intended for software
+		 * use.
+		 *
+		 * KVM's ABI is to report the effective MAXPHYADDR for the
+		 * guest in PhysAddrSize (phys_as), and the maximum
+		 * *addressable* GPA in GuestPhysAddrSize (g_phys_as).
+		 *
+		 * GuestPhysAddrSize is valid if and only if TDP is enabled,
+		 * in which case the max GPA that can be addressed by KVM may
+		 * be less than the max GPA that can be legally generated by
+		 * the guest, e.g. if MAXPHYADDR>48 but the CPU doesn't
+		 * support 5-level TDP.
+		 */
 		unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U);
-		unsigned int phys_as;
+		unsigned int phys_as, g_phys_as;
 
 		/*
 		 * If TDP (NPT) is disabled use the adjusted host MAXPHYADDR as
@@ -1231,15 +1245,23 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		 * paging, too.
 		 *
 		 * If TDP is enabled, use the raw bare metal MAXPHYADDR as
-		 * reductions to the HPAs do not affect GPAs.
+		 * reductions to the HPAs do not affect GPAs.  The max
+		 * addressable GPA is the same as the max effective GPA, except
+		 * that it's capped at 48 bits if 5-level TDP isn't supported
+		 * (hardware processes bits 51:48 only when walking the fifth
+		 * level page table).
 		 */
 		if (!tdp_enabled) {
 			phys_as = boot_cpu_data.x86_phys_bits;
+			g_phys_as = 0;
 		} else {
 			phys_as = entry->eax & 0xff;
+			g_phys_as = phys_as;
+			if (kvm_mmu_get_max_tdp_level() < 5)
+				g_phys_as = min(g_phys_as, 48);
 		}
 
-		entry->eax = phys_as | (virt_as << 8);
+		entry->eax = phys_as | (virt_as << 8) | (g_phys_as << 16);
 		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
 		entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0008_EBX);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2b515acd8e72..74b9d0354bff 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5309,6 +5309,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
 	return max_tdp_level;
 }
 
+u8 kvm_mmu_get_max_tdp_level(void)
+{
+	return tdp_root_level ? tdp_root_level : max_tdp_level;
+}
+
 static union kvm_mmu_page_role
 kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 				union kvm_cpu_role cpu_role)
-- 
2.44.0


^ permalink raw reply related	[relevance 3%]

* Re: [PATCH v3 2/2] kvm/cpuid: set proper GuestPhysBits in CPUID.0x80000008
  2024-03-13  1:06  0%   ` Tao Su
@ 2024-03-13  1:14  0%     ` Xiaoyao Li
  0 siblings, 0 replies; 200+ results
From: Xiaoyao Li @ 2024-03-13  1:14 UTC (permalink / raw)
  To: Tao Su, Gerd Hoffmann
  Cc: kvm, Tom Lendacky, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On 3/13/2024 9:06 AM, Tao Su wrote:
> On Mon, Mar 11, 2024 at 11:41:17AM +0100, Gerd Hoffmann wrote:
>> The AMD APM (3.35) defines GuestPhysBits (EAX[23:16]) as:
>>
>>    Maximum guest physical address size in bits.  This number applies
>>    only to guests using nested paging.  When this field is zero, refer
>>    to the PhysAddrSize field for the maximum guest physical address size.
>>
>> Tom Lendacky confirmed that the purpose of GuestPhysBits is software use
>> and KVM can use it as described below.  Hardware always returns zero
>> here.
>>
>> Use the GuestPhysBits field to communicate the max addressable GPA to
>> the guest.  Typically this is identical to the max effective GPA, except
>> in case the CPU supports MAXPHYADDR > 48 but does not support 5-level
>> TDP.
>>
>> GuestPhysBits is set only in case TDP is enabled, otherwise it is left
>> at zero.
>>
>> GuestPhysBits will be used by the guest firmware to make sure resources
>> like PCI bars are mapped into the addressable GPA.
>>
>> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
>> ---
>>   arch/x86/kvm/mmu.h     |  2 ++
>>   arch/x86/kvm/cpuid.c   | 31 +++++++++++++++++++++++++++++--
>>   arch/x86/kvm/mmu/mmu.c |  5 +++++
>>   3 files changed, 36 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index 60f21bb4c27b..b410a227c601 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>>   	return boot_cpu_data.x86_phys_bits;
>>   }
>>   
>> +u8 kvm_mmu_get_max_tdp_level(void);
>> +
>>   void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>>   void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>>   void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index c638b5fb2144..cd627dead9ce 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -1221,8 +1221,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>>   		entry->eax = entry->ebx = entry->ecx = 0;
>>   		break;
>>   	case 0x80000008: {
>> +		/*
>> +		 * GuestPhysAddrSize (EAX[23:16]) is intended for software
>> +		 * use.
>> +		 *
>> +		 * KVM's ABI is to report the effective MAXPHYADDR for the
>> +		 * guest in PhysAddrSize (phys_as), and the maximum
>> +		 * *addressable* GPA in GuestPhysAddrSize (g_phys_as).
>> +		 *
>> +		 * GuestPhysAddrSize is valid if and only if TDP is enabled,
>> +		 * in which case the max GPA that can be addressed by KVM may
>> +		 * be less than the max GPA that can be legally generated by
>> +		 * the guest, e.g. if MAXPHYADDR>48 but the CPU doesn't
>> +		 * support 5-level TDP.
>> +		 */
>>   		unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U);
>> -		unsigned int phys_as;
>> +		unsigned int phys_as, g_phys_as;
>>   
>>   		if (!tdp_enabled) {
>>   			/*
>> @@ -1232,11 +1246,24 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>>   			 * for memory encryption affect shadow paging, too.
>>   			 */
>>   			phys_as = boot_cpu_data.x86_phys_bits;
>> +			g_phys_as = 0;
>>   		} else {
>> +			/*
>> +			 * If TDP is enabled, the effective guest MAXPHYADDR
>> +			 * is the same as the raw bare metal MAXPHYADDR, as
>> +			 * reductions to HPAs don't affect GPAs.  The max
>> +			 * addressable GPA is the same as the max effective
>> +			 * GPA, except that it's capped at 48 bits if 5-level
>> +			 * TDP isn't supported (hardware processes bits 51:48
>> +			 * only when walking the fifth level page table).
>> +			 */
>>   			phys_as = entry->eax & 0xff;
>> +			g_phys_as = phys_as;
>> +			if (kvm_mmu_get_max_tdp_level() < 5)
>> +				g_phys_as = min(g_phys_as, 48);
>>   		}
>>   
>> -		entry->eax = phys_as | (virt_as << 8);
>> +		entry->eax = phys_as | (virt_as << 8) | (g_phys_as << 16);
> 
> When g_phys_as==phys_as, I would suggest advertising g_phys_as==0;
> otherwise an application can easily tell whether it is running in a
> VM, and I'm concerned this could be abused by applications.

IMO, this should be handled by the userspace VMM, e.g. QEMU, which sets the
actual g_phys_as. On the KVM side, KVM only reports the capability to userspace.

>>   		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
>>   		entry->edx = 0;
>>   		cpuid_entry_override(entry, CPUID_8000_0008_EBX);
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 2d6cdeab1f8a..ffd32400fd8c 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>>   	return max_tdp_level;
>>   }
>>   
>> +u8 kvm_mmu_get_max_tdp_level(void)
>> +{
>> +	return tdp_root_level ? tdp_root_level : max_tdp_level;
>> +}
>> +
>>   static union kvm_mmu_page_role
>>   kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>>   				union kvm_cpu_role cpu_role)
>> -- 
>> 2.44.0
>>
>>
> 


^ permalink raw reply	[relevance 0%]

* Re: [PATCH v3 2/2] kvm/cpuid: set proper GuestPhysBits in CPUID.0x80000008
  2024-03-11 10:41  3% ` [PATCH v3 2/2] " Gerd Hoffmann
  2024-03-12  2:44  0%   ` Xiaoyao Li
@ 2024-03-13  1:06  0%   ` Tao Su
  2024-03-13  1:14  0%     ` Xiaoyao Li
  1 sibling, 1 reply; 200+ results
From: Tao Su @ 2024-03-13  1:06 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: kvm, Tom Lendacky, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Mon, Mar 11, 2024 at 11:41:17AM +0100, Gerd Hoffmann wrote:
> The AMD APM (3.35) defines GuestPhysBits (EAX[23:16]) as:
> 
>   Maximum guest physical address size in bits.  This number applies
>   only to guests using nested paging.  When this field is zero, refer
>   to the PhysAddrSize field for the maximum guest physical address size.
> 
> Tom Lendacky confirmed that the purpose of GuestPhysBits is software use
> and KVM can use it as described below.  Hardware always returns zero
> here.
> 
> Use the GuestPhysBits field to communicate the max addressable GPA to
> the guest.  Typically this is identical to the max effective GPA, except
> in case the CPU supports MAXPHYADDR > 48 but does not support 5-level
> TDP.
> 
> GuestPhysBits is set only in case TDP is enabled, otherwise it is left
> at zero.
> 
> GuestPhysBits will be used by the guest firmware to make sure resources
> like PCI bars are mapped into the addressable GPA.
> 
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  arch/x86/kvm/mmu.h     |  2 ++
>  arch/x86/kvm/cpuid.c   | 31 +++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/mmu.c |  5 +++++
>  3 files changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..b410a227c601 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>  	return boot_cpu_data.x86_phys_bits;
>  }
>  
> +u8 kvm_mmu_get_max_tdp_level(void);
> +
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>  void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>  void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c638b5fb2144..cd627dead9ce 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1221,8 +1221,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  		entry->eax = entry->ebx = entry->ecx = 0;
>  		break;
>  	case 0x80000008: {
> +		/*
> +		 * GuestPhysAddrSize (EAX[23:16]) is intended for software
> +		 * use.
> +		 *
> +		 * KVM's ABI is to report the effective MAXPHYADDR for the
> +		 * guest in PhysAddrSize (phys_as), and the maximum
> +		 * *addressable* GPA in GuestPhysAddrSize (g_phys_as).
> +		 *
> +		 * GuestPhysAddrSize is valid if and only if TDP is enabled,
> +		 * in which case the max GPA that can be addressed by KVM may
> +		 * be less than the max GPA that can be legally generated by
> +		 * the guest, e.g. if MAXPHYADDR>48 but the CPU doesn't
> +		 * support 5-level TDP.
> +		 */
>  		unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U);
> -		unsigned int phys_as;
> +		unsigned int phys_as, g_phys_as;
>  
>  		if (!tdp_enabled) {
>  			/*
> @@ -1232,11 +1246,24 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  			 * for memory encryption affect shadow paging, too.
>  			 */
>  			phys_as = boot_cpu_data.x86_phys_bits;
> +			g_phys_as = 0;
>  		} else {
> +			/*
> +			 * If TDP is enabled, the effective guest MAXPHYADDR
> +			 * is the same as the raw bare metal MAXPHYADDR, as
> +			 * reductions to HPAs don't affect GPAs.  The max
> +			 * addressable GPA is the same as the max effective
> +			 * GPA, except that it's capped at 48 bits if 5-level
> +			 * TDP isn't supported (hardware processes bits 51:48
> +			 * only when walking the fifth level page table).
> +			 */
>  			phys_as = entry->eax & 0xff;
> +			g_phys_as = phys_as;
> +			if (kvm_mmu_get_max_tdp_level() < 5)
> +				g_phys_as = min(g_phys_as, 48);
>  		}
>  
> -		entry->eax = phys_as | (virt_as << 8);
> +		entry->eax = phys_as | (virt_as << 8) | (g_phys_as << 16);

When g_phys_as==phys_as, I would suggest advertising g_phys_as==0;
otherwise an application can easily tell whether it is running in a
VM, and I'm concerned this could be abused by applications.

>  		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
>  		entry->edx = 0;
>  		cpuid_entry_override(entry, CPUID_8000_0008_EBX);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2d6cdeab1f8a..ffd32400fd8c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>  	return max_tdp_level;
>  }
>  
> +u8 kvm_mmu_get_max_tdp_level(void)
> +{
> +	return tdp_root_level ? tdp_root_level : max_tdp_level;
> +}
> +
>  static union kvm_mmu_page_role
>  kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>  				union kvm_cpu_role cpu_role)
> -- 
> 2.44.0
> 
> 

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 06/16] KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero
  2024-02-29 23:07  0%     ` Sean Christopherson
@ 2024-03-12  5:44  0%       ` Binbin Wu
  0 siblings, 0 replies; 200+ results
From: Binbin Wu @ 2024-03-12  5:44 UTC (permalink / raw)
  To: Sean Christopherson, Kai Huang
  Cc: Paolo Bonzini, kvm, linux-kernel, Yan Zhao, Isaku Yamahata,
	Michael Roth, Yu Zhang, Chao Peng, Fuad Tabba, David Matlack



On 3/1/2024 7:07 AM, Sean Christopherson wrote:
> On Fri, Mar 01, 2024, Kai Huang wrote:
>>
>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>> WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
>> I found "legacy #PF" is a little bit confusing but I couldn't figure out a
>> better name either :-)

Me too.

>>
>>> as the error code for #PF is limited to 32 bits (and in practice, 16 bits
>>> on Intel CPUS).  This behavior is architectural, is part of KVM's ABI
>>> (see kvm_vcpu_events.error_code), and is explicitly documented as being
>>> preserved for intecerpted #PF in both the APM:

"intecerpted" -> "intercepted"

>>>
>>>     The error code saved in EXITINFO1 is the same as would be pushed onto
>>>     the stack by a non-intercepted #PF exception in protected mode.
>>>
>>> and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
>>> 32-bit field.
>>>
>>> Simply drop the upper bits of hardware provides garbage, as spurious
>> "of" -> "if" ?
>>
>>> information should do no harm (though in all likelihood hardware is buggy
>>> and the kernel is doomed).
>>>
>>> Handling all upper 32 bits in the #PF path will allow moving the sanity
>>> check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
>>> which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
>>> PFERR_GUEST_ENC_MASK without running afoul of the sanity check.
>>>
>>> Note, this also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
>> "this" -> "this is" ?
>>
>>> and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
>>> bit minimizes the probability of a collision with the "other" vendor,
>>> without needing to plumb more bits through microcode.
>>>
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> ---
>>>    arch/x86/kvm/mmu/mmu.c | 7 +++++++
>>>    1 file changed, 7 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 7807bdcd87e8..5d892bd59c97 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -4553,6 +4553,13 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>>>    	if (WARN_ON_ONCE(fault_address >> 32))
>>>    		return -EFAULT;
>>>    #endif
>>> +	/*
>>> +	 * Legacy #PF exception only have a 32-bit error code.  Simply drop the
>> "have" -> "has" ?
> This one I'll fix by making "exception" plural.
>
> Thanks much for the reviews!
>
>>> +	 * upper bits as KVM doesn't use them for #PF (because they are never
>>> +	 * set), and to ensure there are no collisions with KVM-defined bits.
>>> +	 */
>>> +	if (WARN_ON_ONCE(error_code >> 32))
>>> +		error_code = lower_32_bits(error_code);
>>>    	vcpu->arch.l1tf_flush_l1d = true;
>>>    	if (!flags) {
>> Reviewed-by: Kai Huang <kai.huang@intel.com>


^ permalink raw reply	[relevance 0%]

* Re: [PATCH v3 2/2] kvm/cpuid: set proper GuestPhysBits in CPUID.0x80000008
  2024-03-11 10:41  3% ` [PATCH v3 2/2] " Gerd Hoffmann
@ 2024-03-12  2:44  0%   ` Xiaoyao Li
  2024-03-13  1:06  0%   ` Tao Su
  1 sibling, 0 replies; 200+ results
From: Xiaoyao Li @ 2024-03-12  2:44 UTC (permalink / raw)
  To: Gerd Hoffmann, kvm
  Cc: Tom Lendacky, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On 3/11/2024 6:41 PM, Gerd Hoffmann wrote:
> The AMD APM (3.35) defines GuestPhysBits (EAX[23:16]) as:
> 
>    Maximum guest physical address size in bits.  This number applies
>    only to guests using nested paging.  When this field is zero, refer
>    to the PhysAddrSize field for the maximum guest physical address size.
> 
> Tom Lendacky confirmed that the purpose of GuestPhysBits is software use
> and KVM can use it as described below.  Hardware always returns zero
> here.
> 
> Use the GuestPhysBits field to communicate the max addressable GPA to
> the guest.  Typically this is identical to the max effective GPA, except
> in case the CPU supports MAXPHYADDR > 48 but does not support 5-level
> TDP.
> 
> GuestPhysBits is set only in case TDP is enabled, otherwise it is left
> at zero.
> 
> GuestPhysBits will be used by the guest firmware to make sure resources
> like PCI bars are mapped into the addressable GPA.
> 
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>   arch/x86/kvm/mmu.h     |  2 ++
>   arch/x86/kvm/cpuid.c   | 31 +++++++++++++++++++++++++++++--
>   arch/x86/kvm/mmu/mmu.c |  5 +++++
>   3 files changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..b410a227c601 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>   	return boot_cpu_data.x86_phys_bits;
>   }
>   
> +u8 kvm_mmu_get_max_tdp_level(void);
> +
>   void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>   void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>   void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c638b5fb2144..cd627dead9ce 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1221,8 +1221,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>   		entry->eax = entry->ebx = entry->ecx = 0;
>   		break;
>   	case 0x80000008: {
> +		/*
> +		 * GuestPhysAddrSize (EAX[23:16]) is intended for software
> +		 * use.
> +		 *
> +		 * KVM's ABI is to report the effective MAXPHYADDR for the
> +		 * guest in PhysAddrSize (phys_as), and the maximum
> +		 * *addressable* GPA in GuestPhysAddrSize (g_phys_as).
> +		 *
> +		 * GuestPhysAddrSize is valid if and only if TDP is enabled,
> +		 * in which case the max GPA that can be addressed by KVM may
> +		 * be less than the max GPA that can be legally generated by
> +		 * the guest, e.g. if MAXPHYADDR>48 but the CPU doesn't
> +		 * support 5-level TDP.
> +		 */
>   		unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U);
> -		unsigned int phys_as;
> +		unsigned int phys_as, g_phys_as;
>   
>   		if (!tdp_enabled) {
>   			/*
> @@ -1232,11 +1246,24 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>   			 * for memory encryption affect shadow paging, too.
>   			 */
>   			phys_as = boot_cpu_data.x86_phys_bits;
> +			g_phys_as = 0;
>   		} else {
> +			/*
> +			 * If TDP is enabled, the effective guest MAXPHYADDR
> +			 * is the same as the raw bare metal MAXPHYADDR, as
> +			 * reductions to HPAs don't affect GPAs.  The max
> +			 * addressable GPA is the same as the max effective
> +			 * GPA, except that it's capped at 48 bits if 5-level
> +			 * TDP isn't supported (hardware processes bits 51:48
> +			 * only when walking the fifth level page table).
> +			 */

If the comment in previous patch gets changed as I suggested, this one 
needs to be updated as well.

Anyway, the patch itself looks good.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

>   			phys_as = entry->eax & 0xff;
> +			g_phys_as = phys_as;
> +			if (kvm_mmu_get_max_tdp_level() < 5)
> +				g_phys_as = min(g_phys_as, 48);
>   		}
>   
> -		entry->eax = phys_as | (virt_as << 8);
> +		entry->eax = phys_as | (virt_as << 8) | (g_phys_as << 16);
>   		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
>   		entry->edx = 0;
>   		cpuid_entry_override(entry, CPUID_8000_0008_EBX);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2d6cdeab1f8a..ffd32400fd8c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>   	return max_tdp_level;
>   }
>   
> +u8 kvm_mmu_get_max_tdp_level(void)
> +{
> +	return tdp_root_level ? tdp_root_level : max_tdp_level;
> +}
> +
>   static union kvm_mmu_page_role
>   kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>   				union kvm_cpu_role cpu_role)


^ permalink raw reply	[relevance 0%]

* [PATCH v3 2/2] kvm/cpuid: set proper GuestPhysBits in CPUID.0x80000008
  @ 2024-03-11 10:41  3% ` Gerd Hoffmann
  2024-03-12  2:44  0%   ` Xiaoyao Li
  2024-03-13  1:06  0%   ` Tao Su
  0 siblings, 2 replies; 200+ results
From: Gerd Hoffmann @ 2024-03-11 10:41 UTC (permalink / raw)
  To: kvm
  Cc: Tom Lendacky, Gerd Hoffmann, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)

The AMD APM (3.35) defines GuestPhysBits (EAX[23:16]) as:

  Maximum guest physical address size in bits.  This number applies
  only to guests using nested paging.  When this field is zero, refer
  to the PhysAddrSize field for the maximum guest physical address size.

Tom Lendacky confirmed that the purpose of GuestPhysBits is software use
and KVM can use it as described below.  Hardware always returns zero
here.

Use the GuestPhysBits field to communicate the max addressable GPA to
the guest.  Typically this is identical to the max effective GPA, except
in case the CPU supports MAXPHYADDR > 48 but does not support 5-level
TDP.

GuestPhysBits is set only in case TDP is enabled, otherwise it is left
at zero.

GuestPhysBits will be used by the guest firmware to make sure resources
like PCI bars are mapped into the addressable GPA.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 arch/x86/kvm/mmu.h     |  2 ++
 arch/x86/kvm/cpuid.c   | 31 +++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/mmu.c |  5 +++++
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..b410a227c601 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
 	return boot_cpu_data.x86_phys_bits;
 }
 
+u8 kvm_mmu_get_max_tdp_level(void);
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c638b5fb2144..cd627dead9ce 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1221,8 +1221,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		entry->eax = entry->ebx = entry->ecx = 0;
 		break;
 	case 0x80000008: {
+		/*
+		 * GuestPhysAddrSize (EAX[23:16]) is intended for software
+		 * use.
+		 *
+		 * KVM's ABI is to report the effective MAXPHYADDR for the
+		 * guest in PhysAddrSize (phys_as), and the maximum
+		 * *addressable* GPA in GuestPhysAddrSize (g_phys_as).
+		 *
+		 * GuestPhysAddrSize is valid if and only if TDP is enabled,
+		 * in which case the max GPA that can be addressed by KVM may
+		 * be less than the max GPA that can be legally generated by
+		 * the guest, e.g. if MAXPHYADDR>48 but the CPU doesn't
+		 * support 5-level TDP.
+		 */
 		unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U);
-		unsigned int phys_as;
+		unsigned int phys_as, g_phys_as;
 
 		if (!tdp_enabled) {
 			/*
@@ -1232,11 +1246,24 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			 * for memory encryption affect shadow paging, too.
 			 */
 			phys_as = boot_cpu_data.x86_phys_bits;
+			g_phys_as = 0;
 		} else {
+			/*
+			 * If TDP is enabled, the effective guest MAXPHYADDR
+			 * is the same as the raw bare metal MAXPHYADDR, as
+			 * reductions to HPAs don't affect GPAs.  The max
+			 * addressable GPA is the same as the max effective
+			 * GPA, except that it's capped at 48 bits if 5-level
+			 * TDP isn't supported (hardware processes bits 51:48
+			 * only when walking the fifth level page table).
+			 */
 			phys_as = entry->eax & 0xff;
+			g_phys_as = phys_as;
+			if (kvm_mmu_get_max_tdp_level() < 5)
+				g_phys_as = min(g_phys_as, 48);
 		}
 
-		entry->eax = phys_as | (virt_as << 8);
+		entry->eax = phys_as | (virt_as << 8) | (g_phys_as << 16);
 		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
 		entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0008_EBX);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d6cdeab1f8a..ffd32400fd8c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
 	return max_tdp_level;
 }
 
+u8 kvm_mmu_get_max_tdp_level(void)
+{
+	return tdp_root_level ? tdp_root_level : max_tdp_level;
+}
+
 static union kvm_mmu_page_role
 kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 				union kvm_cpu_role cpu_role)
-- 
2.44.0
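
A minimal, self-contained sketch of how guest firmware could consume these two
CPUID.0x80000008 fields, following the fallback rule quoted from the APM in the
commit message above.  This is not edk2/OVMF code; the function name and the
idea of passing in the raw EAX value are illustrative assumptions.

#include <stdint.h>

/*
 * Decode CPUID.0x80000008:EAX as described in the commit message.  The
 * caller is assumed to have executed CPUID leaf 0x80000008 and passes
 * in the raw EAX value.
 */
static uint8_t addressable_gpa_bits(uint32_t cpuid_80000008_eax)
{
        uint8_t phys_bits  = cpuid_80000008_eax & 0xff;         /* PhysAddrSize      */
        uint8_t guest_bits = (cpuid_80000008_eax >> 16) & 0xff; /* GuestPhysAddrSize */

        /*
         * Per the APM: when GuestPhysAddrSize is zero, fall back to
         * PhysAddrSize.  Otherwise it is the maximum *addressable* GPA
         * width, e.g. 48 on a MAXPHYADDR=52 host without 5-level TDP,
         * so PCI BARs and other resources must be placed below
         * 1ULL << guest_bits.
         */
        return guest_bits ? guest_bits : phys_bits;
}

With this patch, GuestPhysBits is non-zero only when TDP is enabled, so the
fallback path keeps working for shadow-paging hosts and for hardware or
hypervisors that leave the field at zero.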


^ permalink raw reply related	[relevance 3%]

* Re: [PATCH v9 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  2024-02-27 10:32  4% ` [PATCH v9 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
  2024-02-29 15:13  0%   ` Moger, Babu
@ 2024-03-09 13:41  0%   ` Xiaoyao Li
  1 sibling, 0 replies; 200+ results
From: Xiaoyao Li @ 2024-03-09 13:41 UTC (permalink / raw)
  To: Zhao Liu, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	Daniel P . Berrangé
  Cc: qemu-devel, kvm, Zhenyu Wang, Zhuocheng Ding, Babu Moger,
	Yongwei Ma, Zhao Liu

On 2/27/2024 6:32 PM, Zhao Liu wrote:
> From: Zhao Liu <zhao1.liu@intel.com>
> 
> The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
> for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
> the number of sharing threads directly.
> 
>  From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
> means [1]:
> 
> The number of logical processors sharing this cache is the value of
> this field incremented by 1. To determine which logical processors are
> sharing a cache, determine a Share Id for each processor as follows:
> 
> ShareId = LocalApicId >> log2(NumSharingCache+1)
> 
> Logical processors with the same ShareId then share a cache. If
> NumSharingCache+1 is not a power of two, round it up to the next power
> of two.
> 
>  From the description above, the calculation of this field should be same
> as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
> APIC ID to calculate this field.
> 
> [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
>       Information
> 
> Cc: Babu Moger <babu.moger@amd.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> Signed-off-by: Zhao Liu <zhao1.liu@intel.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

> ---
> Changes since v7:
>   * Moved this patch after CPUID[4]'s similar change ("i386/cpu: Use APIC
>     ID offset to encode cache topo in CPUID[4]"). (Xiaoyao)
>   * Dropped Michael/Babu's Acked/Reviewed/Tested tags since the code
>     change due to the rebase.
>   * Re-added Yongwei's Tested tag For his re-testing (compilation on
>     Intel platforms).
> 
> Changes since v3:
>   * Rewrote the subject. (Babu)
>   * Deleted the original "comment/help" expression, as this behavior is
>     confirmed for AMD CPUs. (Babu)
>   * Renamed "num_apic_ids" (v3) to "num_sharing_cache" to match spec
>     definition. (Babu)
> 
> Changes since v1:
>   * Renamed "l3_threads" to "num_apic_ids" in
>     encode_cache_cpuid8000001d(). (Yanan)
>   * Added the description of the original commit and add Cc.
> ---
>   target/i386/cpu.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index c77bcbc44d59..df56c7a449c8 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -331,7 +331,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
>                                          uint32_t *eax, uint32_t *ebx,
>                                          uint32_t *ecx, uint32_t *edx)
>   {
> -    uint32_t l3_threads;
> +    uint32_t num_sharing_cache;
>       assert(cache->size == cache->line_size * cache->associativity *
>                             cache->partitions * cache->sets);
>   
> @@ -340,11 +340,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
>   
>       /* L3 is shared among multiple cores */
>       if (cache->level == 3) {
> -        l3_threads = topo_info->cores_per_die * topo_info->threads_per_core;
> -        *eax |= (l3_threads - 1) << 14;
> +        num_sharing_cache = 1 << apicid_die_offset(topo_info);
>       } else {
> -        *eax |= ((topo_info->threads_per_core - 1) << 14);
> +        num_sharing_cache = 1 << apicid_core_offset(topo_info);
>       }
> +    *eax |= (num_sharing_cache - 1) << 14;
>   
>       assert(cache->line_size > 0);
>       assert(cache->partitions > 0);
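
For reference, a small standalone sketch (not QEMU code; the helper name is
made up) of the ShareId rule quoted from the APM above, which the CPUID
encoding in this patch has to stay consistent with:

#include <stdint.h>

/*
 * ShareId rule from the APM text quoted above: logical processors with
 * the same ShareId share the cache described by CPUID[0x8000001D].
 * num_sharing_cache is EAX[25:14] of that leaf.
 */
static uint32_t cache_share_id(uint32_t local_apic_id, uint32_t num_sharing_cache)
{
        uint32_t n = num_sharing_cache + 1;
        uint32_t bits = 0;

        /* If NumSharingCache + 1 is not a power of two, round it up. */
        while ((1u << bits) < n)
                bits++;

        return local_apic_id >> bits;   /* ShareId = LocalApicId >> log2(n) */
}

With the encoding above, NumSharingCache + 1 is 1 << apicid_core_offset() (or
1 << apicid_die_offset() for L3), which is already a power of two, so the shift
reduces exactly to those APIC ID offsets.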


^ permalink raw reply	[relevance 0%]

* Re: [PATCH 05/16] KVM: x86/mmu: Use synthetic page fault error code to indicate private faults
  2024-03-06 14:45  3%     ` Sean Christopherson
@ 2024-03-07  9:05  0%       ` Xu Yilun
  0 siblings, 0 replies; 200+ results
From: Xu Yilun @ 2024-03-07  9:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Yan Zhao, Isaku Yamahata,
	Michael Roth, Yu Zhang, Chao Peng, Fuad Tabba, David Matlack

On Wed, Mar 06, 2024 at 06:45:30AM -0800, Sean Christopherson wrote:
> On Wed, Mar 06, 2024, Xu Yilun wrote:
> > On Tue, Feb 27, 2024 at 06:41:36PM -0800, Sean Christopherson wrote:
> > > Add and use a synthetic, KVM-defined page fault error code to indicate
> > > whether a fault is to private vs. shared memory.  TDX and SNP have
> > > different mechanisms for reporting private vs. shared, and KVM's
> > > software-protected VMs have no mechanism at all.  Usurp an error code
> > > flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
> > > and friends.
> > > 
> > > Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
> > > for TDX and software-protected VMs as appropriate, but that would require
> > > *clearing* the flag for SEV and SEV-ES VMs, which support encrypted
> > > memory at the hardware layer, but don't utilize private memory at the
> > > KVM layer.
> > 
> > I see this alternative in other patchset.  And I still don't understand why
> > synthetic way is better after reading patch #5-7.  I assume for SEV(-ES) if
> > there is spurious PFERR_GUEST_ENC flag set in error code when private memory
> > is not used in KVM, then it is a HW issue or SW bug that needs to be caught
> > and warned, and by the way cleared.
> 
> The conundrum is that SEV(-ES) supports _encrypted_ memory, but cannot support
> what KVM calls "private" memory.  In quotes because SEV(-ES) memory encryption
> does provide confidentiality/privacy, but in KVM they don't support memslots that

I see.  I read up on the basics of SEV(-ES/SNP) and finally understand the
difference between encrypted and private.  SEV(-ES) provides only encryption;
SEV-SNP supports both encryption and private (ownership) memory.  But it seems
we are now treating encrypted and private as equivalent, i.e. there is no
"encrypted but shared" or "plain but private" memory from KVM's perspective.

> can be switched between private and shared, e.g. will return false for
> kvm_arch_has_private_mem().
> 
> And KVM _can't_ sanely use private/shared memslots for SEV(-ES), because it's
> impossible to intercept implicit conversions by the guest, i.e. KVM can't prevent
> the guest from encrypting a page that KVM thinks is private, and vice versa.

Is it because there is no #NPF for RMP violation?

> 
> But from hardware's perspective, while the APM is a bit misleading and I don't
> love the behavior, I can't really argue that it's a hardware bug if the CPU sets
> PFERR_GUEST_ENC on a fault where the guest access had C-bit=1, because the access
> _was_ indeed to encrypted memory.
> 
> Which is a long-winded way of saying the unwanted PFERR_GUEST_ENC faults aren't
> really spurious, nor are they hardware or software bugs, just another unfortunate
> side effect of the fact that the hypervisor can't intercept shared<->encrypted
> conversions for SEV(-ES) guests.  And that means that KVM can't WARN, because
> literally every SNP-capable CPU would yell when running SEV(-ES) guests.
> 
> I also don't like the idea of usurping PFERR_GUEST_ENC to have it mean something
> different in KVM as compared to how hardware defines/uses the flag.

Thanks for the explanation.  I agree: PFERR_GUEST_ENC should mean only
"encrypted", with a separate synthetic flag for "private".

Yilun

> 
> Lastly, the approach in Paolo's series to propagate PFERR_GUEST_ENC to .is_private
> iff the VM has private memory is brittle, in that it would be too easy for KVM
> code that has access to the error code to do the wrong thing and interpret
> PFERR_GUEST_ENC as meaning "private".

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 05/16] KVM: x86/mmu: Use synthetic page fault error code to indicate private faults
  @ 2024-03-06 14:45  3%     ` Sean Christopherson
  2024-03-07  9:05  0%       ` Xu Yilun
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2024-03-06 14:45 UTC (permalink / raw)
  To: Xu Yilun
  Cc: Paolo Bonzini, kvm, linux-kernel, Yan Zhao, Isaku Yamahata,
	Michael Roth, Yu Zhang, Chao Peng, Fuad Tabba, David Matlack

On Wed, Mar 06, 2024, Xu Yilun wrote:
> On Tue, Feb 27, 2024 at 06:41:36PM -0800, Sean Christopherson wrote:
> > Add and use a synthetic, KVM-defined page fault error code to indicate
> > whether a fault is to private vs. shared memory.  TDX and SNP have
> > different mechanisms for reporting private vs. shared, and KVM's
> > software-protected VMs have no mechanism at all.  Usurp an error code
> > flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
> > and friends.
> > 
> > Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
> > for TDX and software-protected VMs as appropriate, but that would require
> > *clearing* the flag for SEV and SEV-ES VMs, which support encrypted
> > memory at the hardware layer, but don't utilize private memory at the
> > KVM layer.
> 
> I see this alternative in other patchset.  And I still don't understand why
> synthetic way is better after reading patch #5-7.  I assume for SEV(-ES) if
> there is spurious PFERR_GUEST_ENC flag set in error code when private memory
> is not used in KVM, then it is a HW issue or SW bug that needs to be caught
> and warned, and by the way cleared.

The conundrum is that SEV(-ES) supports _encrypted_ memory, but cannot support
what KVM calls "private" memory.  In quotes because SEV(-ES) memory encryption
does provide confidentiality/privacy, but in KVM they don't support memslots that
can be switched between private and shared, e.g. will return false for
kvm_arch_has_private_mem().

And KVM _can't_ sanely use private/shared memslots for SEV(-ES), because it's
impossible to intercept implicit conversions by the guest, i.e. KVM can't prevent
the guest from encrypting a page that KVM thinks is private, and vice versa.

But from hardware's perspective, while the APM is a bit misleading and I don't
love the behavior, I can't really argue that it's a hardware bug if the CPU sets
PFERR_GUEST_ENC on a fault where the guest access had C-bit=1, because the access
_was_ indeed to encrypted memory.

Which is a long-winded way of saying the unwanted PFERR_GUEST_ENC faults aren't
really spurious, nor are they hardware or software bugs, just another unfortunate
side effect of the fact that the hypervisor can't intercept shared<->encrypted
conversions for SEV(-ES) guests.  And that means that KVM can't WARN, because
literally every SNP-capable CPU would yell when running SEV(-ES) guests.

I also don't like the idea of usurping PFERR_GUEST_ENC to have it mean something
different in KVM as compared to how hardware defines/uses the flag.

Lastly, the approach in Paolo's series to propagate PFERR_GUEST_ENC to .is_private
iff the VM has private memory is brittle, in that it would be too easy for KVM
code that has access to the error code to do the wrong thing and interpret
PFERR_GUEST_ENC as meaning "private".
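
To make the "synthetic flag" idea concrete, here is a rough sketch of the
approach being discussed.  It assumes kernel context (BIT_ULL, struct kvm,
PFERR_GUEST_ENC_MASK, kvm_arch_has_private_mem()), and the bit position and
helper names are illustrative rather than the series' exact definitions.

/*
 * KVM-defined, synthetic flag.  The bit is chosen well above anything
 * hardware currently generates (legacy #PF uses bits 31:0, see the WARN
 * added earlier in the series; SVM's #NPF-specific bits sit just above
 * bit 31), so it cannot collide with architectural error code bits.
 */
#define PFERR_PRIVATE_ACCESS	BIT_ULL(49)

/*
 * SVM #NPF path: translate hardware's "encrypted" bit into KVM's
 * "private" only for VMs that actually have private memslots, i.e. not
 * for SEV(-ES) guests.
 */
static u64 adjust_npf_error_code(struct kvm *kvm, u64 error_code)
{
	if (kvm_arch_has_private_mem(kvm) &&
	    (error_code & PFERR_GUEST_ENC_MASK))
		error_code |= PFERR_PRIVATE_ACCESS;

	return error_code;
}

/* Common MMU code keys off only the synthetic flag. */
static bool fault_is_private(u64 error_code)
{
	return error_code & PFERR_PRIVATE_ACCESS;
}

The point is that only vendor code that knows how "private" is reported (SNP's
C-bit, TDX's shared bit, or nothing at all for software-protected VMs) sets the
synthetic flag, and common MMU code never looks at PFERR_GUEST_ENC_MASK
directly.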

^ permalink raw reply	[relevance 3%]

* Re: [PATCH v2] kvm: set guest physical bits in CPUID.0x80000008
  2024-03-05 16:35  3% ` Sean Christopherson
@ 2024-03-06  5:43  0%   ` Xiaoyao Li
  0 siblings, 0 replies; 200+ results
From: Xiaoyao Li @ 2024-03-06  5:43 UTC (permalink / raw)
  To: Sean Christopherson, Gerd Hoffmann
  Cc: kvm, Tom Lendacky, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On 3/6/2024 12:35 AM, Sean Christopherson wrote:
> KVM: x86:
> 
> On Tue, Mar 05, 2024, Gerd Hoffmann wrote:
>> Set CPUID.0x80000008:EAX[23:16] to guest phys bits, i.e. the bits which
>> are actually addressable.  In most cases this is identical to the host
>> phys bits, but tdp restrictions (no 5-level paging) can limit this to
>> 48.
>>
>> Quoting AMD APM (revision 3.35):
>>
>>    23:16 GuestPhysAddrSize Maximum guest physical address size in bits.
>>                            This number applies only to guests using nested
>>                            paging. When this field is zero, refer to the
>>                            PhysAddrSize field for the maximum guest
>>                            physical address size. See “Secure Virtual
>>                            Machine” in APM Volume 2.
>>
>> Tom Lendacky confirmed the purpose of this field is software use,
>> hardware always returns zero here.
>>
>> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
>> ---
>>   arch/x86/kvm/mmu.h     |  2 ++
>>   arch/x86/kvm/cpuid.c   |  3 ++-
>>   arch/x86/kvm/mmu/mmu.c | 15 +++++++++++++++
>>   3 files changed, 19 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index 60f21bb4c27b..42b5212561c8 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>>   	return boot_cpu_data.x86_phys_bits;
>>   }
>>   
>> +int kvm_mmu_get_guest_phys_bits(void);
>> +
>>   void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>>   void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>>   void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index adba49afb5fe..12037f1b017e 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -1240,7 +1240,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>>   		else if (!g_phys_as)
> 
> Based on the new information that GuestPhysAddrSize is software-defined, and the
> fact that KVM and QEMU are planning on using GuestPhysAddrSize to communicate
> the maximum *addressable* GPA, deriving PhysAddrSize from GuestPhysAddrSize is
> wrong.
> 
> E.g. if KVM is running as L1 on top of a new KVM, on a CPU with MAXPHYADDR=52,
> and on a CPU without 5-level TDP, then KVM (as L1) will see:
> 
>    PhysAddrSize      = 52
>    GuestPhysAddrSize = 48
> 
> Propagating GuestPhysAddrSize to PhysAddrSize (which is confusingly g_phys_as)
> will yield an L2 with
> 
>    PhysAddrSize      = 48
>    GuestPhysAddrSize = 48
> 
> which is broken, because GPAs with bits 51:48!=0 are *legal*, but not addressable.
> 
>>   			g_phys_as = phys_as;
>>   
>> -		entry->eax = g_phys_as | (virt_as << 8);
>> +		entry->eax = g_phys_as | (virt_as << 8)
>> +			| kvm_mmu_get_guest_phys_bits() << 16;
> 
> The APM explicitly states that GuestPhysAddrSize only applies to NPT.  KVM should
> follow suit to avoid creating unnecessary ABI, and because KVM can address any
> legal GPA when using shadow paging.
> 
>>   		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
>>   		entry->edx = 0;
>>   		cpuid_entry_override(entry, CPUID_8000_0008_EBX);
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 2d6cdeab1f8a..8bebb3e96c8a 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5267,6 +5267,21 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>>   	return max_tdp_level;
>>   }
>>   
>> +/*
>> + * return the actually addressable guest phys bits, which might be
>> + * less than host phys bits due to tdp restrictions.
>> + */
>> +int kvm_mmu_get_guest_phys_bits(void)
>> +{
>> +	if (tdp_enabled && shadow_phys_bits > 48) {
>> +		if (tdp_root_level && tdp_root_level != PT64_ROOT_5LEVEL)
>> +			return 48;
>> +		if (max_tdp_level != PT64_ROOT_5LEVEL)
>> +			return 48;
> 
> I would prefer to not use shadow_phys_bits to cap the reported CPUID.0x8000_0008,
> so that the logic isn't spread across the CPUID code and the MMU.  I don't love
> that the two have duplicate logic, but there's no great way to handle that since
> the MMU needs to be able to determine the effective host MAXPHYADDR even if
> CPUID.0x8000_0008 is unsupported.
> 
> I'm thinking this, maybe spread across two patches: one to undo KVM's usage of
> GuestPhysAddrSize, and a second to then set GuestPhysAddrSize for userspace?

The code below looks good to me, and splitting it into two patches makes sense.

> ---
>   arch/x86/kvm/cpuid.c   | 38 ++++++++++++++++++++++++++++----------
>   arch/x86/kvm/mmu.h     |  2 ++
>   arch/x86/kvm/mmu/mmu.c |  5 +++++
>   3 files changed, 35 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index adba49afb5fe..ae03e69d7fb9 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1221,9 +1221,18 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>   		entry->eax = entry->ebx = entry->ecx = 0;
>   		break;
>   	case 0x80000008: {
> -		unsigned g_phys_as = (entry->eax >> 16) & 0xff;
> -		unsigned virt_as = max((entry->eax >> 8) & 0xff, 48U);
> -		unsigned phys_as = entry->eax & 0xff;
> +		unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U);
> +
> +		/*
> +		 * KVM's ABI is to report the effective MAXPHYADDR for the guest
> +		 * in PhysAddrSize (phys_as), and the maximum *addressable* GPA
> +		 * in GuestPhysAddrSize (g_phys_as).  GuestPhysAddrSize is valid
> +		 * if and only if TDP is enabled, in which case the max GPA that
> +		 * can be addressed by KVM may be less than the max GPA that can
> +		 * be legally generated by the guest, e.g. if MAXPHYADDR>48 but
> +		 * the CPU doesn't support 5-level TDP.
> +		 */
> +		unsigned int phys_as, g_phys_as;
>   
>   		/*
>   		 * If TDP (NPT) is disabled use the adjusted host MAXPHYADDR as
> @@ -1231,16 +1240,25 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>   		 * reductions in MAXPHYADDR for memory encryption affect shadow
>   		 * paging, too.
>   		 *
> -		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
> -		 * provided, use the raw bare metal MAXPHYADDR as reductions to
> -		 * the HPAs do not affect GPAs.
> +		 * If TDP is enabled, the effective guest MAXPHYADDR is the same
> +		 * as the raw bare metal MAXPHYADDR, as reductions to HPAs don't
> +		 * affect GPAs.  The max addressable GPA is the same as the max
> +		 * effective GPA, except that it's capped at 48 bits if 5-level
> +		 * TDP isn't supported (hardware processes bits 51:48 only when
> +		 * walking the fifth level page table).
>   		 */
> -		if (!tdp_enabled)
> -			g_phys_as = boot_cpu_data.x86_phys_bits;
> -		else if (!g_phys_as)
> +		if (!tdp_enabled) {
> +			phys_as = boot_cpu_data.x86_phys_bits;
> +			g_phys_as = 0;
> +		} else {
> +			phys_as = entry->eax & 0xff;
>   			g_phys_as = phys_as;
>   
> -		entry->eax = g_phys_as | (virt_as << 8);
> +			if (kvm_mmu_get_max_tdp_level() < 5)
> +				g_phys_as = min(g_phys_as, 48);
> +		}
> +
> +		entry->eax = phys_as | (virt_as << 8) | (g_phys_as << 16);
>   		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
>   		entry->edx = 0;
>   		cpuid_entry_override(entry, CPUID_8000_0008_EBX);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..b410a227c601 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>   	return boot_cpu_data.x86_phys_bits;
>   }
>   
> +u8 kvm_mmu_get_max_tdp_level(void);
> +
>   void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>   void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>   void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2d6cdeab1f8a..ffd32400fd8c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>   	return max_tdp_level;
>   }
>   
> +u8 kvm_mmu_get_max_tdp_level(void)
> +{
> +	return tdp_root_level ? tdp_root_level : max_tdp_level;
> +}
> +
>   static union kvm_mmu_page_role
>   kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>   				union kvm_cpu_role cpu_role)
> 
> base-commit: c0372e747726ce18a5fba8cdc71891bd795148f6


^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM: x86/svm/pmu: Set PerfMonV2 global control bits correctly
  2024-03-04 19:46  4%       ` Sean Christopherson
  2024-03-04 20:17  0%         ` Mingwei Zhang
  2024-03-05  2:31  4%         ` Like Xu
@ 2024-03-06  2:48  0%         ` Mi, Dapeng
  2 siblings, 0 replies; 200+ results
From: Mi, Dapeng @ 2024-03-06  2:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Sandipan Das, Like Xu, pbonzini, mizhang, jmattson,
	ravi.bangoria, nikunj.dadhania, santosh.shukla, manali.shukla,
	babu.moger, kvm list, linux-kernel


On 3/5/2024 3:46 AM, Sean Christopherson wrote:
> On Mon, Mar 04, 2024, Dapeng Mi wrote:
>> On 3/1/2024 5:00 PM, Sandipan Das wrote:
>>> On 3/1/2024 2:07 PM, Like Xu wrote:
>>>> On 1/3/2024 3:50 pm, Sandipan Das wrote:
>>>>> With PerfMonV2, a performance monitoring counter will start operating
>>>>> only when both the PERF_CTLx enable bit as well as the corresponding
>>>>> PerfCntrGlobalCtl enable bit are set.
>>>>>
>>>>> When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
>>>>> for a guest but the guest kernel does not support PerfMonV2 (such as
>>>>> kernels older than v5.19), the guest counters do not count since the
>>>>> PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
>>>>> writes to it.
>>>> If the vcpu has the PerfMonV2 feature, it should not work the way legacy
>>>> PMU does. Users need to use the new driver to operate the new hardware,
>>>> don't they ? One practical approach is that the hypervisor should not set
>>>> the PerfMonV2 bit for this unpatched 'v5.19' guest.
>>>>
>>> My understanding is that the legacy method of managing the counters should
>>> still work because the enable bits in PerfCntrGlobalCtl are expected to be
>>> set. The AMD PPR does mention that the PerfCntrEn bitfield of PerfCntrGlobalCtl
>>> is set to 0x3f after a system reset. That way, the guest kernel can use either
>> If so, please add the PPR description here as comments.
> Or even better, make that architectural behavior that's documented in the APM.
>
>>>>> ---
>>>>>     arch/x86/kvm/svm/pmu.c | 1 +
>>>>>     1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>>>>> index b6a7ad4d6914..14709c564d6a 100644
>>>>> --- a/arch/x86/kvm/svm/pmu.c
>>>>> +++ b/arch/x86/kvm/svm/pmu.c
>>>>> @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>>         if (pmu->version > 1) {
>>>>>             pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>>>>>             pmu->global_status_mask = pmu->global_ctrl_mask;
>>>>> +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
>> It seems to be more easily understand to calculate global_ctrl firstly and
>> then derive the globol_ctrl_mask (negative logic).
> Hrm, I'm torn.  On one hand, awful name aside (global_ctrl_mask should really be
> something like global_ctrl_rsvd_bits), the computation of the reserved bits should

Yeah, same feeling here. global_ctrl_mask is ambiguous; readers can hardly
get its real meaning from the name alone and have to read all the code.
global_ctrl_rsvd_bits looks like a better name. There are several other
variables with similar "xxx_mask" names. Sean, do you think it's a good
time to rename all of these variables? Thanks.


> come from the capabilities of the PMU, not from the RESET value.
>
> On the other hand, setting _all_ non-reserved bits will likely do the wrong thing
> if AMD ever adds bits in PerfCntGlobalCtl that aren't tied to general purpose
> counters.  But, that's a future theoretical problem, so I'm inclined to vote for
> Sandipan's approach.
>
>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>> index e886300f0f97..7ac9b080aba6 100644
>> --- a/arch/x86/kvm/svm/pmu.c
>> +++ b/arch/x86/kvm/svm/pmu.c
>> @@ -199,7 +199,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>> kvm_pmu_cap.num_counters_gp);
>>
>>          if (pmu->version > 1) {
>> -               pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters)
>> - 1);
>> +               pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
>> +               pmu->global_ctrl_mask = ~pmu->global_ctrl;
>>                  pmu->global_status_mask = pmu->global_ctrl_mask;
>>          }
>>
>>>>>         }
>>>>>           pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2] kvm: set guest physical bits in CPUID.0x80000008
  2024-03-05 10:35  5% [PATCH v2] kvm: set guest physical bits in CPUID.0x80000008 Gerd Hoffmann
@ 2024-03-05 16:35  3% ` Sean Christopherson
  2024-03-06  5:43  0%   ` Xiaoyao Li
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2024-03-05 16:35 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: kvm, Tom Lendacky, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)

KVM: x86:

On Tue, Mar 05, 2024, Gerd Hoffmann wrote:
> Set CPUID.0x80000008:EAX[23:16] to guest phys bits, i.e. the bits which
> are actually addressable.  In most cases this is identical to the host
> phys bits, but tdp restrictions (no 5-level paging) can limit this to
> 48.
> 
> Quoting AMD APM (revision 3.35):
> 
>   23:16 GuestPhysAddrSize Maximum guest physical address size in bits.
>                           This number applies only to guests using nested
>                           paging. When this field is zero, refer to the
>                           PhysAddrSize field for the maximum guest
>                           physical address size. See “Secure Virtual
>                           Machine” in APM Volume 2.
> 
> Tom Lendacky confirmed the purpose of this field is software use,
> hardware always returns zero here.
> 
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  arch/x86/kvm/mmu.h     |  2 ++
>  arch/x86/kvm/cpuid.c   |  3 ++-
>  arch/x86/kvm/mmu/mmu.c | 15 +++++++++++++++
>  3 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..42b5212561c8 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>  	return boot_cpu_data.x86_phys_bits;
>  }
>  
> +int kvm_mmu_get_guest_phys_bits(void);
> +
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>  void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>  void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index adba49afb5fe..12037f1b017e 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1240,7 +1240,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>  		else if (!g_phys_as)

Based on the new information that GuestPhysAddrSize is software-defined, and the
fact that KVM and QEMU are planning on using GuestPhysAddrSize to communicate
the maximum *addressable* GPA, deriving PhysAddrSize from GuestPhysAddrSize is
wrong.

E.g. if KVM is running as L1 on top of a new KVM, on a CPU with MAXPHYADDR=52,
and on a CPU without 5-level TDP, then KVM (as L1) will see:

  PhysAddrSize      = 52
  GuestPhysAddrSize = 48

Propagating GuestPhysAddrSize to PhysAddrSize (which is confusingly g_phys_as)
will yield an L2 with

  PhysAddrSize      = 48 
  GuestPhysAddrSize = 48

which is broken, because GPAs with bits 51:48!=0 are *legal*, but not addressable.

>  			g_phys_as = phys_as;
>  
> -		entry->eax = g_phys_as | (virt_as << 8);
> +		entry->eax = g_phys_as | (virt_as << 8)
> +			| kvm_mmu_get_guest_phys_bits() << 16;

The APM explicitly states that GuestPhysAddrSize only applies to NPT.  KVM should
follow suit to avoid creating unnecessary ABI, and because KVM can address any
legal GPA when using shadow paging.

>  		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
>  		entry->edx = 0;
>  		cpuid_entry_override(entry, CPUID_8000_0008_EBX);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2d6cdeab1f8a..8bebb3e96c8a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5267,6 +5267,21 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
>  	return max_tdp_level;
>  }
>  
> +/*
> + * return the actually addressable guest phys bits, which might be
> + * less than host phys bits due to tdp restrictions.
> + */
> +int kvm_mmu_get_guest_phys_bits(void)
> +{
> +	if (tdp_enabled && shadow_phys_bits > 48) {
> +		if (tdp_root_level && tdp_root_level != PT64_ROOT_5LEVEL)
> +			return 48;
> +		if (max_tdp_level != PT64_ROOT_5LEVEL)
> +			return 48;

I would prefer to not use shadow_phys_bits to cap the reported CPUID.0x8000_0008,
so that the logic isn't spread across the CPUID code and the MMU.  I don't love
that the two have duplicate logic, but there's no great way to handle that since
the MMU needs to be able to determine the effective host MAXPHYADDR even if
CPUID.0x8000_0008 is unsupported.

I'm thinking this, maybe spread across two patches: one to undo KVM's usage of
GuestPhysAddrSize, and a second to then set GuestPhysAddrSize for userspace?

---
 arch/x86/kvm/cpuid.c   | 38 ++++++++++++++++++++++++++++----------
 arch/x86/kvm/mmu.h     |  2 ++
 arch/x86/kvm/mmu/mmu.c |  5 +++++
 3 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index adba49afb5fe..ae03e69d7fb9 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1221,9 +1221,18 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		entry->eax = entry->ebx = entry->ecx = 0;
 		break;
 	case 0x80000008: {
-		unsigned g_phys_as = (entry->eax >> 16) & 0xff;
-		unsigned virt_as = max((entry->eax >> 8) & 0xff, 48U);
-		unsigned phys_as = entry->eax & 0xff;
+		unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U);
+
+		/*
+		 * KVM's ABI is to report the effective MAXPHYADDR for the guest
+		 * in PhysAddrSize (phys_as), and the maximum *addressable* GPA
+		 * in GuestPhysAddrSize (g_phys_as).  GuestPhysAddrSize is valid
+		 * if and only if TDP is enabled, in which case the max GPA that
+		 * can be addressed by KVM may be less than the max GPA that can
+		 * be legally generated by the guest, e.g. if MAXPHYADDR>48 but
+		 * the CPU doesn't support 5-level TDP.
+		 */
+		unsigned int phys_as, g_phys_as;
 
 		/*
 		 * If TDP (NPT) is disabled use the adjusted host MAXPHYADDR as
@@ -1231,16 +1240,25 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		 * reductions in MAXPHYADDR for memory encryption affect shadow
 		 * paging, too.
 		 *
-		 * If TDP is enabled but an explicit guest MAXPHYADDR is not
-		 * provided, use the raw bare metal MAXPHYADDR as reductions to
-		 * the HPAs do not affect GPAs.
+		 * If TDP is enabled, the effective guest MAXPHYADDR is the same
+		 * as the raw bare metal MAXPHYADDR, as reductions to HPAs don't
+		 * affect GPAs.  The max addressable GPA is the same as the max
+		 * effective GPA, except that it's capped at 48 bits if 5-level
+		 * TDP isn't supported (hardware processes bits 51:48 only when
+		 * walking the fifth level page table).
 		 */
-		if (!tdp_enabled)
-			g_phys_as = boot_cpu_data.x86_phys_bits;
-		else if (!g_phys_as)
+		if (!tdp_enabled) {
+			phys_as = boot_cpu_data.x86_phys_bits;
+			g_phys_as = 0;
+		} else {
+			phys_as = entry->eax & 0xff;
 			g_phys_as = phys_as;
 
-		entry->eax = g_phys_as | (virt_as << 8);
+			if (kvm_mmu_get_max_tdp_level() < 5)
+				g_phys_as = min(g_phys_as, 48);
+		}
+
+		entry->eax = phys_as | (virt_as << 8) | (g_phys_as << 16);
 		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
 		entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0008_EBX);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..b410a227c601 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
 	return boot_cpu_data.x86_phys_bits;
 }
 
+u8 kvm_mmu_get_max_tdp_level(void);
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d6cdeab1f8a..ffd32400fd8c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
 	return max_tdp_level;
 }
 
+u8 kvm_mmu_get_max_tdp_level(void)
+{
+	return tdp_root_level ? tdp_root_level : max_tdp_level;
+}
+
 static union kvm_mmu_page_role
 kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 				union kvm_cpu_role cpu_role)

base-commit: c0372e747726ce18a5fba8cdc71891bd795148f6
-- 


^ permalink raw reply related	[relevance 3%]

* [PATCH v2] kvm: set guest physical bits in CPUID.0x80000008
@ 2024-03-05 10:35  5% Gerd Hoffmann
  2024-03-05 16:35  3% ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Gerd Hoffmann @ 2024-03-05 10:35 UTC (permalink / raw)
  To: kvm
  Cc: Tom Lendacky, Gerd Hoffmann, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)

Set CPUID.0x80000008:EAX[23:16] to guest phys bits, i.e. the bits which
are actually addressable.  In most cases this is identical to the host
phys bits, but tdp restrictions (no 5-level paging) can limit this to
48.

Quoting AMD APM (revision 3.35):

  23:16 GuestPhysAddrSize Maximum guest physical address size in bits.
                          This number applies only to guests using nested
                          paging. When this field is zero, refer to the
                          PhysAddrSize field for the maximum guest
                          physical address size. See “Secure Virtual
                          Machine” in APM Volume 2.

Tom Lendacky confirmed the purpose of this field is software use,
hardware always returns zero here.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 arch/x86/kvm/mmu.h     |  2 ++
 arch/x86/kvm/cpuid.c   |  3 ++-
 arch/x86/kvm/mmu/mmu.c | 15 +++++++++++++++
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..42b5212561c8 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
 	return boot_cpu_data.x86_phys_bits;
 }
 
+int kvm_mmu_get_guest_phys_bits(void);
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index adba49afb5fe..12037f1b017e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1240,7 +1240,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		else if (!g_phys_as)
 			g_phys_as = phys_as;
 
-		entry->eax = g_phys_as | (virt_as << 8);
+		entry->eax = g_phys_as | (virt_as << 8)
+			| kvm_mmu_get_guest_phys_bits() << 16;
 		entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
 		entry->edx = 0;
 		cpuid_entry_override(entry, CPUID_8000_0008_EBX);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d6cdeab1f8a..8bebb3e96c8a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5267,6 +5267,21 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
 	return max_tdp_level;
 }
 
+/*
+ * return the actually addressable guest phys bits, which might be
+ * less than host phys bits due to tdp restrictions.
+ */
+int kvm_mmu_get_guest_phys_bits(void)
+{
+	if (tdp_enabled && shadow_phys_bits > 48) {
+		if (tdp_root_level && tdp_root_level != PT64_ROOT_5LEVEL)
+			return 48;
+		if (max_tdp_level != PT64_ROOT_5LEVEL)
+			return 48;
+	}
+	return shadow_phys_bits;
+}
+
 static union kvm_mmu_page_role
 kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 				union kvm_cpu_role cpu_role)
-- 
2.44.0


^ permalink raw reply related	[relevance 5%]

* Re: [PATCH] KVM: x86/svm/pmu: Set PerfMonV2 global control bits correctly
  2024-03-04 19:46  4%       ` Sean Christopherson
  2024-03-04 20:17  0%         ` Mingwei Zhang
@ 2024-03-05  2:31  4%         ` Like Xu
  2024-03-06  2:48  0%         ` Mi, Dapeng
  2 siblings, 0 replies; 200+ results
From: Like Xu @ 2024-03-05  2:31 UTC (permalink / raw)
  To: Sean Christopherson, Dapeng Mi
  Cc: Sandipan Das, pbonzini, mizhang, jmattson, ravi.bangoria,
	nikunj.dadhania, santosh.shukla, manali.shukla, babu.moger,
	kvm list, linux-kernel

On 5/3/2024 3:46 am, Sean Christopherson wrote:
> On Mon, Mar 04, 2024, Dapeng Mi wrote:
>>
>> On 3/1/2024 5:00 PM, Sandipan Das wrote:
>>> On 3/1/2024 2:07 PM, Like Xu wrote:
>>>> On 1/3/2024 3:50 pm, Sandipan Das wrote:
>>>>> With PerfMonV2, a performance monitoring counter will start operating
>>>>> only when both the PERF_CTLx enable bit as well as the corresponding
>>>>> PerfCntrGlobalCtl enable bit are set.
>>>>>
>>>>> When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
>>>>> for a guest but the guest kernel does not support PerfMonV2 (such as
>>>>> kernels older than v5.19), the guest counters do not count since the
>>>>> PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
>>>>> writes to it.
>>>> If the vcpu has the PerfMonV2 feature, it should not work the way legacy
>>>> PMU does. Users need to use the new driver to operate the new hardware,
>>>> don't they ? One practical approach is that the hypervisor should not set
>>>> the PerfMonV2 bit for this unpatched 'v5.19' guest.
>>>>
>>> My understanding is that the legacy method of managing the counters should
>>> still work because the enable bits in PerfCntrGlobalCtl are expected to be
>>> set. The AMD PPR does mention that the PerfCntrEn bitfield of PerfCntrGlobalCtl
>>> is set to 0x3f after a system reset. That way, the guest kernel can use either
>>
>> If so, please add the PPR description here as comments.
> 
> Or even better, make that architectural behavior that's documented in the APM.

On the AMD side, we can't even reason that "PerfMonV3" will be compatible with
"PerfMonV2" w/o APM clarification which is a concern for both driver/virt impl.

> 
>>>>> ---
>>>>>     arch/x86/kvm/svm/pmu.c | 1 +
>>>>>     1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>>>>> index b6a7ad4d6914..14709c564d6a 100644
>>>>> --- a/arch/x86/kvm/svm/pmu.c
>>>>> +++ b/arch/x86/kvm/svm/pmu.c
>>>>> @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>>         if (pmu->version > 1) {
>>>>>             pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
>>>>>             pmu->global_status_mask = pmu->global_ctrl_mask;
>>>>> +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
>>
>> It seems to be more easily understand to calculate global_ctrl firstly and
>> then derive the globol_ctrl_mask (negative logic).
> 
> Hrm, I'm torn.  On one hand, awful name aside (global_ctrl_mask should really be
> something like global_ctrl_rsvd_bits), the computation of the reserved bits should
> come from the capabilities of the PMU, not from the RESET value.
> 
> On the other hand, setting _all_ non-reserved bits will likely do the wrong thing
> if AMD ever adds bits in PerfCntGlobalCtl that aren't tied to general purpose
> counters.  But, that's a future theoretical problem, so I'm inclined to vote for
> Sandipan's approach.

I suspect that Intel hardware also has this behaviour [*] although guest
kernels using Intel pmu version 1 are pretty much non-existent.

[*] Table 10-1. IA-32 and Intel® 64 Processor States Following Power-up, Reset, 
or INIT (Contd.)

We need to update the selftest to guard this.

> 
>> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
>> index e886300f0f97..7ac9b080aba6 100644
>> --- a/arch/x86/kvm/svm/pmu.c
>> +++ b/arch/x86/kvm/svm/pmu.c
>> @@ -199,7 +199,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
>> kvm_pmu_cap.num_counters_gp);
>>
>>          if (pmu->version > 1) {
>> -               pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters)
>> - 1);
>> +               pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
>> +               pmu->global_ctrl_mask = ~pmu->global_ctrl;
>>                  pmu->global_status_mask = pmu->global_ctrl_mask;
>>          }
>>
>>>>>         }
>>>>>           pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
>>>
> 

^ permalink raw reply	[relevance 4%]

* Re: [PATCH] KVM: x86/svm/pmu: Set PerfMonV2 global control bits correctly
  2024-03-04 19:46  4%       ` Sean Christopherson
@ 2024-03-04 20:17  0%         ` Mingwei Zhang
  2024-03-05  2:31  4%         ` Like Xu
  2024-03-06  2:48  0%         ` Mi, Dapeng
  2 siblings, 0 replies; 200+ results
From: Mingwei Zhang @ 2024-03-04 20:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dapeng Mi, Sandipan Das, Like Xu, pbonzini, jmattson,
	ravi.bangoria, nikunj.dadhania, santosh.shukla, manali.shukla,
	babu.moger, kvm list, linux-kernel

On Mon, Mar 04, 2024, Sean Christopherson wrote:
> On Mon, Mar 04, 2024, Dapeng Mi wrote:
> > 
> > On 3/1/2024 5:00 PM, Sandipan Das wrote:
> > > On 3/1/2024 2:07 PM, Like Xu wrote:
> > > > On 1/3/2024 3:50 pm, Sandipan Das wrote:
> > > > > With PerfMonV2, a performance monitoring counter will start operating
> > > > > only when both the PERF_CTLx enable bit as well as the corresponding
> > > > > PerfCntrGlobalCtl enable bit are set.
> > > > > 
> > > > > When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
> > > > > for a guest but the guest kernel does not support PerfMonV2 (such as
> > > > > kernels older than v5.19), the guest counters do not count since the
> > > > > PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
> > > > > writes to it.
> > > > If the vcpu has the PerfMonV2 feature, it should not work the way legacy
> > > > PMU does. Users need to use the new driver to operate the new hardware,
> > > > don't they ? One practical approach is that the hypervisor should not set
> > > > the PerfMonV2 bit for this unpatched 'v5.19' guest.
> > > > 
> > > My understanding is that the legacy method of managing the counters should
> > > still work because the enable bits in PerfCntrGlobalCtl are expected to be
> > > set. The AMD PPR does mention that the PerfCntrEn bitfield of PerfCntrGlobalCtl
> > > is set to 0x3f after a system reset. That way, the guest kernel can use either
> > 
> > If so, please add the PPR description here as comments.
> 
> Or even better, make that architectural behavior that's documented in the APM.
> 
> > > > > ---
> > > > >    arch/x86/kvm/svm/pmu.c | 1 +
> > > > >    1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> > > > > index b6a7ad4d6914..14709c564d6a 100644
> > > > > --- a/arch/x86/kvm/svm/pmu.c
> > > > > +++ b/arch/x86/kvm/svm/pmu.c
> > > > > @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> > > > >        if (pmu->version > 1) {
> > > > >            pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
> > > > >            pmu->global_status_mask = pmu->global_ctrl_mask;
> > > > > +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
> > 
> > It seems to be more easily understand to calculate global_ctrl firstly and
> > then derive the globol_ctrl_mask (negative logic).
> 
> Hrm, I'm torn.  On one hand, awful name aside (global_ctrl_mask should really be
> something like global_ctrl_rsvd_bits), the computation of the reserved bits should
> come from the capabilities of the PMU, not from the RESET value.
> 
+1

> On the other hand, setting _all_ non-reserved bits will likely do the wrong thing
> if AMD ever adds bits in PerfCntGlobalCtl that aren't tied to general purpose
> counters.  But, that's a future theoretical problem, so I'm inclined to vote for
> Sandipan's approach.
> 

right. I am ok with either approach.

Thanks.
-Mingwei
> > diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> > index e886300f0f97..7ac9b080aba6 100644
> > --- a/arch/x86/kvm/svm/pmu.c
> > +++ b/arch/x86/kvm/svm/pmu.c
> > @@ -199,7 +199,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> > kvm_pmu_cap.num_counters_gp);
> > 
> >         if (pmu->version > 1) {
> > -               pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters)
> > - 1);
> > +               pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
> > +               pmu->global_ctrl_mask = ~pmu->global_ctrl;
> >                 pmu->global_status_mask = pmu->global_ctrl_mask;
> >         }
> > 
> > > > >        }
> > > > >          pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
> > > 

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM: x86/svm/pmu: Set PerfMonV2 global control bits correctly
  @ 2024-03-04 19:46  4%       ` Sean Christopherson
  2024-03-04 20:17  0%         ` Mingwei Zhang
                           ` (2 more replies)
  0 siblings, 3 replies; 200+ results
From: Sean Christopherson @ 2024-03-04 19:46 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Sandipan Das, Like Xu, pbonzini, mizhang, jmattson,
	ravi.bangoria, nikunj.dadhania, santosh.shukla, manali.shukla,
	babu.moger, kvm list, linux-kernel

On Mon, Mar 04, 2024, Dapeng Mi wrote:
> 
> On 3/1/2024 5:00 PM, Sandipan Das wrote:
> > On 3/1/2024 2:07 PM, Like Xu wrote:
> > > On 1/3/2024 3:50 pm, Sandipan Das wrote:
> > > > With PerfMonV2, a performance monitoring counter will start operating
> > > > only when both the PERF_CTLx enable bit as well as the corresponding
> > > > PerfCntrGlobalCtl enable bit are set.
> > > > 
> > > > When the PerfMonV2 CPUID feature bit (leaf 0x80000022 EAX bit 0) is set
> > > > for a guest but the guest kernel does not support PerfMonV2 (such as
> > > > kernels older than v5.19), the guest counters do not count since the
> > > > PerfCntrGlobalCtl MSR is initialized to zero and the guest kernel never
> > > > writes to it.
> > > If the vcpu has the PerfMonV2 feature, it should not work the way legacy
> > > PMU does. Users need to use the new driver to operate the new hardware,
> > > don't they ? One practical approach is that the hypervisor should not set
> > > the PerfMonV2 bit for this unpatched 'v5.19' guest.
> > > 
> > My understanding is that the legacy method of managing the counters should
> > still work because the enable bits in PerfCntrGlobalCtl are expected to be
> > set. The AMD PPR does mention that the PerfCntrEn bitfield of PerfCntrGlobalCtl
> > is set to 0x3f after a system reset. That way, the guest kernel can use either
> 
> If so, please add the PPR description here as comments.

Or even better, make that architectural behavior that's documented in the APM.

> > > > ---
> > > >    arch/x86/kvm/svm/pmu.c | 1 +
> > > >    1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> > > > index b6a7ad4d6914..14709c564d6a 100644
> > > > --- a/arch/x86/kvm/svm/pmu.c
> > > > +++ b/arch/x86/kvm/svm/pmu.c
> > > > @@ -205,6 +205,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> > > >        if (pmu->version > 1) {
> > > >            pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
> > > >            pmu->global_status_mask = pmu->global_ctrl_mask;
> > > > +        pmu->global_ctrl = ~pmu->global_ctrl_mask;
> 
> It seems to be more easily understand to calculate global_ctrl firstly and
> then derive the globol_ctrl_mask (negative logic).

Hrm, I'm torn.  On one hand, awful name aside (global_ctrl_mask should really be
something like global_ctrl_rsvd_bits), the computation of the reserved bits should
come from the capabilities of the PMU, not from the RESET value.

On the other hand, setting _all_ non-reserved bits will likely do the wrong thing
if AMD ever adds bits in PerfCntGlobalCtl that aren't tied to general purpose
counters.  But, that's a future theoretical problem, so I'm inclined to vote for
Sandipan's approach.
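
Purely as a sketch (assuming six general purpose counters, which lines up with
the 0x3f reset value mentioned above), the bit math both approaches describe
looks like this:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          unsigned int nr_arch_gp_counters = 6;   /* assumed for illustration */

          /* deriving either value first yields the same pair */
          uint64_t global_ctrl = (1ull << nr_arch_gp_counters) - 1;   /* 0x3f */
          uint64_t global_ctrl_mask = ~global_ctrl;                   /* reserved bits */

          printf("global_ctrl      = 0x%llx\n", (unsigned long long)global_ctrl);
          printf("global_ctrl_mask = 0x%llx\n", (unsigned long long)global_ctrl_mask);
          return 0;
  }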

> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index e886300f0f97..7ac9b080aba6 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -199,7 +199,8 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> kvm_pmu_cap.num_counters_gp);
> 
>         if (pmu->version > 1) {
> -               pmu->global_ctrl_mask = ~((1ull << pmu->nr_arch_gp_counters)
> - 1);
> +               pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
> +               pmu->global_ctrl_mask = ~pmu->global_ctrl;
>                 pmu->global_status_mask = pmu->global_ctrl_mask;
>         }
> 
> > > >        }
> > > >          pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
> > 

^ permalink raw reply	[relevance 4%]

* Re: [PATCH 06/16] KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero
  2024-02-29 22:11  0%   ` Huang, Kai
@ 2024-02-29 23:07  0%     ` Sean Christopherson
  2024-03-12  5:44  0%       ` Binbin Wu
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2024-02-29 23:07 UTC (permalink / raw)
  To: Kai Huang
  Cc: Paolo Bonzini, kvm, linux-kernel, Yan Zhao, Isaku Yamahata,
	Michael Roth, Yu Zhang, Chao Peng, Fuad Tabba, David Matlack

On Fri, Mar 01, 2024, Kai Huang wrote:
> 
> 
> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
> 
> I found "legacy #PF" is a little bit confusing but I couldn't figure out a
> better name either :-)
> 
> > as the error code for #PF is limited to 32 bits (and in practice, 16 bits
> > on Intel CPUS).  This behavior is architectural, is part of KVM's ABI
> > (see kvm_vcpu_events.error_code), and is explicitly documented as being
> > preserved for intecerpted #PF in both the APM:
> > 
> >    The error code saved in EXITINFO1 is the same as would be pushed onto
> >    the stack by a non-intercepted #PF exception in protected mode.
> > 
> > and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
> > 32-bit field.
> > 
> > Simply drop the upper bits of hardware provides garbage, as spurious
> 
> "of" -> "if" ?
> 
> > information should do no harm (though in all likelihood hardware is buggy
> > and the kernel is doomed).
> > 
> > Handling all upper 32 bits in the #PF path will allow moving the sanity
> > check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
> > which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
> > PFERR_GUEST_ENC_MASK without running afoul of the sanity check.
> > 
> > Note, this also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
> 
> "this" -> "this is" ?
> 
> > and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
> > bit minimizes the probability of a collision with the "other" vendor,
> > without needing to plumb more bits through microcode.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/kvm/mmu/mmu.c | 7 +++++++
> >   1 file changed, 7 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 7807bdcd87e8..5d892bd59c97 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4553,6 +4553,13 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >   	if (WARN_ON_ONCE(fault_address >> 32))
> >   		return -EFAULT;
> >   #endif
> > +	/*
> > +	 * Legacy #PF exception only have a 32-bit error code.  Simply drop the
> 
> "have" -> "has" ?

This one I'll fix by making "exception" plural.

Thanks much for the reviews!

> 
> > +	 * upper bits as KVM doesn't use them for #PF (because they are never
> > +	 * set), and to ensure there are no collisions with KVM-defined bits.
> > +	 */
> > +	if (WARN_ON_ONCE(error_code >> 32))
> > +		error_code = lower_32_bits(error_code);
> >   	vcpu->arch.l1tf_flush_l1d = true;
> >   	if (!flags) {
> Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 06/16] KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero
  2024-02-28  2:41  3% ` [PATCH 06/16] KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero Sean Christopherson
@ 2024-02-29 22:11  0%   ` Huang, Kai
  2024-02-29 23:07  0%     ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Huang, Kai @ 2024-02-29 22:11 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Yan Zhao, Isaku Yamahata, Michael Roth,
	Yu Zhang, Chao Peng, Fuad Tabba, David Matlack



On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,

I found "legacy #PF" is a little bit confusing but I couldn't figure out 
a better name either :-)

> as the error code for #PF is limited to 32 bits (and in practice, 16 bits
> on Intel CPUS).  This behavior is architectural, is part of KVM's ABI
> (see kvm_vcpu_events.error_code), and is explicitly documented as being
> preserved for intecerpted #PF in both the APM:
> 
>    The error code saved in EXITINFO1 is the same as would be pushed onto
>    the stack by a non-intercepted #PF exception in protected mode.
> 
> and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
> 32-bit field.
> 
> Simply drop the upper bits of hardware provides garbage, as spurious

"of" -> "if" ?

> information should do no harm (though in all likelihood hardware is buggy
> and the kernel is doomed).
> 
> Handling all upper 32 bits in the #PF path will allow moving the sanity
> check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
> which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
> PFERR_GUEST_ENC_MASK without running afoul of the sanity check.
> 
> Note, this also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)

"this" -> "this is" ?

> and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
> bit minimizes the probability of a collision with the "other" vendor,
> without needing to plumb more bits through microcode.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7807bdcd87e8..5d892bd59c97 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4553,6 +4553,13 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>   	if (WARN_ON_ONCE(fault_address >> 32))
>   		return -EFAULT;
>   #endif
> +	/*
> +	 * Legacy #PF exception only have a 32-bit error code.  Simply drop the

"have" -> "has" ?

> +	 * upper bits as KVM doesn't use them for #PF (because they are never
> +	 * set), and to ensure there are no collisions with KVM-defined bits.
> +	 */
> +	if (WARN_ON_ONCE(error_code >> 32))
> +		error_code = lower_32_bits(error_code);
>   
>   	vcpu->arch.l1tf_flush_l1d = true;
>   	if (!flags) {
Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v9 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  2024-02-27 10:32  4% ` [PATCH v9 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
@ 2024-02-29 15:13  0%   ` Moger, Babu
  2024-03-09 13:41  0%   ` Xiaoyao Li
  1 sibling, 0 replies; 200+ results
From: Moger, Babu @ 2024-02-29 15:13 UTC (permalink / raw)
  To: Zhao Liu, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	Daniel P . Berrangé,
	Xiaoyao Li
  Cc: qemu-devel, kvm, Zhenyu Wang, Zhuocheng Ding, Yongwei Ma, Zhao Liu



On 2/27/24 04:32, Zhao Liu wrote:
> From: Zhao Liu <zhao1.liu@intel.com>
> 
> The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
> for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
> the number of sharing threads directly.
> 
> From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
> means [1]:
> 
> The number of logical processors sharing this cache is the value of
> this field incremented by 1. To determine which logical processors are
> sharing a cache, determine a Share Id for each processor as follows:
> 
> ShareId = LocalApicId >> log2(NumSharingCache+1)
> 
> Logical processors with the same ShareId then share a cache. If
> NumSharingCache+1 is not a power of two, round it up to the next power
> of two.
> 
> From the description above, the calculation of this field should be the same
> as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
> APIC ID to calculate this field.
> 
> [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
>      Information
> 
> Cc: Babu Moger <babu.moger@amd.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> Signed-off-by: Zhao Liu <zhao1.liu@intel.com>


Reviewed-by: Babu Moger <babu.moger@amd.com>

> ---
> Changes since v7:
>  * Moved this patch after CPUID[4]'s similar change ("i386/cpu: Use APIC
>    ID offset to encode cache topo in CPUID[4]"). (Xiaoyao)
>  * Dropped Michael/Babu's Acked/Reviewed/Tested tags since the code
>    change due to the rebase.
>  * Re-added Yongwei's Tested tag For his re-testing (compilation on
>    Intel platforms).
> 
> Changes since v3:
>  * Rewrote the subject. (Babu)
>  * Deleted the original "comment/help" expression, as this behavior is
>    confirmed for AMD CPUs. (Babu)
>  * Renamed "num_apic_ids" (v3) to "num_sharing_cache" to match spec
>    definition. (Babu)
> 
> Changes since v1:
>  * Renamed "l3_threads" to "num_apic_ids" in
>    encode_cache_cpuid8000001d(). (Yanan)
>  * Added the description of the original commit and add Cc.
> ---
>  target/i386/cpu.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index c77bcbc44d59..df56c7a449c8 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -331,7 +331,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
>                                         uint32_t *eax, uint32_t *ebx,
>                                         uint32_t *ecx, uint32_t *edx)
>  {
> -    uint32_t l3_threads;
> +    uint32_t num_sharing_cache;
>      assert(cache->size == cache->line_size * cache->associativity *
>                            cache->partitions * cache->sets);
>  
> @@ -340,11 +340,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
>  
>      /* L3 is shared among multiple cores */
>      if (cache->level == 3) {
> -        l3_threads = topo_info->cores_per_die * topo_info->threads_per_core;
> -        *eax |= (l3_threads - 1) << 14;
> +        num_sharing_cache = 1 << apicid_die_offset(topo_info);
>      } else {
> -        *eax |= ((topo_info->threads_per_core - 1) << 14);
> +        num_sharing_cache = 1 << apicid_core_offset(topo_info);
>      }
> +    *eax |= (num_sharing_cache - 1) << 14;
>  
>      assert(cache->line_size > 0);
>      assert(cache->partitions > 0);

-- 
Thanks
Babu Moger

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 03/16] KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
  2024-02-28  4:43  0%   ` Dongli Zhang
@ 2024-02-28 16:16  5%     ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2024-02-28 16:16 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Paolo Bonzini, kvm, linux-kernel, Yan Zhao, Isaku Yamahata,
	Michael Roth, Yu Zhang, Chao Peng, Fuad Tabba, David Matlack

On Tue, Feb 27, 2024, Dongli Zhang wrote:
> 
> 
> On 2/27/24 18:41, Sean Christopherson wrote:
> > Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
> > guests, as specified by the APM:
> > 
> >  * Bit 34 (ENC):   Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
> >  * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
> >                    PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
> >  * Bit 36 (VMPL):  Set to 1 if the fault was caused by a VMPL permission
> >                    check failure, 0 otherwise.
> >  * Bit 37 (SSS):   Set to VMPL permission mask SSS (bit 4) value if VmplSSS is
> >                    enabled.
> 
> The above bits 34-37 do not match with the bits 31,34-36 in the patch.

Doh, good catch.  I copy+pasted this from the APM, but the RMP bit is defined
slightly earlier in the APM, and I missed SSS.  I'll fixup the changelog to talk
about RMP, and I think I'll add SSS in v2; at the very least, having the #define
will make it clear which bits are used.

Thanks!

^ permalink raw reply	[relevance 5%]

* Re: [PATCH 03/16] KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
  2024-02-28  2:41  5% ` [PATCH 03/16] KVM: x86: Define more SEV+ page fault error bits/flags for #NPF Sean Christopherson
@ 2024-02-28  4:43  0%   ` Dongli Zhang
  2024-02-28 16:16  5%     ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Dongli Zhang @ 2024-02-28  4:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Yan Zhao, Isaku Yamahata, Michael Roth,
	Yu Zhang, Chao Peng, Fuad Tabba, David Matlack



On 2/27/24 18:41, Sean Christopherson wrote:
> Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
> guests, as specified by the APM:
> 
>  * Bit 34 (ENC):   Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
>  * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
>                    PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
>  * Bit 36 (VMPL):  Set to 1 if the fault was caused by a VMPL permission
>                    check failure, 0 otherwise.
>  * Bit 37 (SSS):   Set to VMPL permission mask SSS (bit 4) value if VmplSSS is
>                    enabled.

The above bits 34-37 do not match with the bits 31,34-36 in the patch.

Dongli Zhang

> 
> Note, the APM is *extremely* misleading, and strongly implies that the
> above flags can _only_ be set for #NPF exits from SNP guests.  That is a
> lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_
> flavor of SEV guest on SNP capable hardware.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 88cc523bafa8..1e69743ef0fb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -261,8 +261,12 @@ enum x86_intercept_stage;
>  #define PFERR_FETCH_MASK	BIT(4)
>  #define PFERR_PK_MASK		BIT(5)
>  #define PFERR_SGX_MASK		BIT(15)
> +#define PFERR_GUEST_RMP_MASK	BIT_ULL(31)
>  #define PFERR_GUEST_FINAL_MASK	BIT_ULL(32)
>  #define PFERR_GUEST_PAGE_MASK	BIT_ULL(33)
> +#define PFERR_GUEST_ENC_MASK	BIT_ULL(34)
> +#define PFERR_GUEST_SIZEM_MASK	BIT_ULL(35)
> +#define PFERR_GUEST_VMPL_MASK	BIT_ULL(36)
>  #define PFERR_IMPLICIT_ACCESS	BIT_ULL(48)
>  
>  #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK |	\

^ permalink raw reply	[relevance 0%]

* [PATCH 06/16] KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero
    2024-02-28  2:41  5% ` [PATCH 03/16] KVM: x86: Define more SEV+ page fault error bits/flags for #NPF Sean Christopherson
  @ 2024-02-28  2:41  3% ` Sean Christopherson
  2024-02-29 22:11  0%   ` Huang, Kai
  2 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2024-02-28  2:41 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Yan Zhao, Isaku Yamahata, Michael Roth,
	Yu Zhang, Chao Peng, Fuad Tabba, David Matlack

WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
as the error code for #PF is limited to 32 bits (and in practice, 16 bits
on Intel CPUS).  This behavior is architectural, is part of KVM's ABI
(see kvm_vcpu_events.error_code), and is explicitly documented as being
preserved for intecerpted #PF in both the APM:

  The error code saved in EXITINFO1 is the same as would be pushed onto
  the stack by a non-intercepted #PF exception in protected mode.

and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
32-bit field.

Simply drop the upper bits of hardware provides garbage, as spurious
information should do no harm (though in all likelihood hardware is buggy
and the kernel is doomed).

Handling all upper 32 bits in the #PF path will allow moving the sanity
check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
PFERR_GUEST_ENC_MASK without running afoul of the sanity check.

Note, this also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
bit minimizes the probability of a collision with the "other" vendor,
without needing to plumb more bits through microcode.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7807bdcd87e8..5d892bd59c97 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4553,6 +4553,13 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 	if (WARN_ON_ONCE(fault_address >> 32))
 		return -EFAULT;
 #endif
+	/*
+	 * Legacy #PF exception only have a 32-bit error code.  Simply drop the
+	 * upper bits as KVM doesn't use them for #PF (because they are never
+	 * set), and to ensure there are no collisions with KVM-defined bits.
+	 */
+	if (WARN_ON_ONCE(error_code >> 32))
+		error_code = lower_32_bits(error_code);
 
 	vcpu->arch.l1tf_flush_l1d = true;
 	if (!flags) {
-- 
2.44.0.278.ge034bb2e1d-goog


^ permalink raw reply related	[relevance 3%]

* [PATCH 03/16] KVM: x86: Define more SEV+ page fault error bits/flags for #NPF
  @ 2024-02-28  2:41  5% ` Sean Christopherson
  2024-02-28  4:43  0%   ` Dongli Zhang
    2024-02-28  2:41  3% ` [PATCH 06/16] KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero Sean Christopherson
  2 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2024-02-28  2:41 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Yan Zhao, Isaku Yamahata, Michael Roth,
	Yu Zhang, Chao Peng, Fuad Tabba, David Matlack

Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
guests, as specified by the APM:

 * Bit 34 (ENC):   Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
 * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
                   PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
 * Bit 36 (VMPL):  Set to 1 if the fault was caused by a VMPL permission
                   check failure, 0 otherwise.
 * Bit 37 (SSS):   Set to VMPL permission mask SSS (bit 4) value if VmplSSS is
                   enabled.

Note, the APM is *extremely* misleading, and strongly implies that the
above flags can _only_ be set for #NPF exits from SNP guests.  That is a
lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_
flavor of SEV guest on SNP capable hardware.
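
For illustration only (this helper is not part of the patch and the name is
made up), a consumer of the new flag could test it like so; note that, per the
above, the bit can be set for any SEV flavor on SNP capable hardware:

  #include <stdbool.h>
  #include <stdint.h>

  #define PFERR_GUEST_ENC_MASK    (1ULL << 34)    /* mirrors the hunk below */

  /* Was the faulting access made with the guest's effective C-bit set? */
  static inline bool npf_was_encrypted_access(uint64_t error_code)
  {
          return (error_code & PFERR_GUEST_ENC_MASK) != 0;
  }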

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88cc523bafa8..1e69743ef0fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -261,8 +261,12 @@ enum x86_intercept_stage;
 #define PFERR_FETCH_MASK	BIT(4)
 #define PFERR_PK_MASK		BIT(5)
 #define PFERR_SGX_MASK		BIT(15)
+#define PFERR_GUEST_RMP_MASK	BIT_ULL(31)
 #define PFERR_GUEST_FINAL_MASK	BIT_ULL(32)
 #define PFERR_GUEST_PAGE_MASK	BIT_ULL(33)
+#define PFERR_GUEST_ENC_MASK	BIT_ULL(34)
+#define PFERR_GUEST_SIZEM_MASK	BIT_ULL(35)
+#define PFERR_GUEST_VMPL_MASK	BIT_ULL(36)
 #define PFERR_IMPLICIT_ACCESS	BIT_ULL(48)
 
 #define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK |	\
-- 
2.44.0.278.ge034bb2e1d-goog


^ permalink raw reply related	[relevance 5%]

* Re: [PATCH] KVM: SVM: Rename vmplX_ssp -> plX_ssp
  2024-02-27 20:03  4% [PATCH] KVM: SVM: Rename vmplX_ssp -> plX_ssp John Allen
@ 2024-02-27 21:28  0% ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2024-02-27 21:28 UTC (permalink / raw)
  To: Sean Christopherson, kvm, John Allen
  Cc: pbonzini, mlevitsk, bp, x86, thomas.lendacky, linux-kernel

On Tue, 27 Feb 2024 20:03:56 +0000, John Allen wrote:
> The SSP fields in the SEV-ES save area were mistakenly named vmplX_ssp
> instead of plX_ssp. Rename these to the correct names as defined in the
> APM.

Applied to kvm-x86 misc, thanks!  I put this into "misc" instead of "svm"
purely because I already sent the SVM pull request for 6.9, and the only other
patches that are SVM specific is my SEV-ES VMRUN series, and I'm not sure I'll
queue that for 6.9 or wait for 6.10 (and I'll probably throw it in its own
branch regardless).

[1/1] KVM: SVM: Rename vmplX_ssp -> plX_ssp
      https://github.com/kvm-x86/linux/commit/78ccfce77443

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[relevance 0%]

* [PATCH] KVM: SVM: Rename vmplX_ssp -> plX_ssp
@ 2024-02-27 20:03  4% John Allen
  2024-02-27 21:28  0% ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: John Allen @ 2024-02-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: seanjc, pbonzini, mlevitsk, bp, x86, thomas.lendacky,
	linux-kernel, John Allen

The SSP fields in the SEV-ES save area were mistakenly named vmplX_ssp
instead of plX_ssp. Rename these to the correct names as defined in the
APM.

Fixes: 6d3b3d34e39e ("KVM: SVM: Update the SEV-ES save area mapping")
Signed-off-by: John Allen <john.allen@amd.com>
---
 arch/x86/include/asm/svm.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 87a7b917d30e..728c98175b9c 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -358,10 +358,10 @@ struct sev_es_save_area {
 	struct vmcb_seg ldtr;
 	struct vmcb_seg idtr;
 	struct vmcb_seg tr;
-	u64 vmpl0_ssp;
-	u64 vmpl1_ssp;
-	u64 vmpl2_ssp;
-	u64 vmpl3_ssp;
+	u64 pl0_ssp;
+	u64 pl1_ssp;
+	u64 pl2_ssp;
+	u64 pl3_ssp;
 	u64 u_cet;
 	u8 reserved_0xc8[2];
 	u8 vmpl;
-- 
2.40.1


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH v2 5/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp
  2024-02-27 19:23  0%         ` Sean Christopherson
@ 2024-02-27 19:25  0%           ` John Allen
  0 siblings, 0 replies; 200+ results
From: John Allen @ 2024-02-27 19:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Tom Lendacky, kvm, weijiang.yang, rick.p.edgecombe, bp, pbonzini,
	mlevitsk, linux-kernel, x86

On Tue, Feb 27, 2024 at 11:23:33AM -0800, Sean Christopherson wrote:
> On Tue, Feb 27, 2024, John Allen wrote:
> > On Tue, Feb 27, 2024 at 01:15:09PM -0600, Tom Lendacky wrote:
> > > On 2/27/24 12:14, Sean Christopherson wrote:
> > > > On Mon, Feb 26, 2024, John Allen wrote:
> > > > > Rename SEV-ES save area SSP fields to be consistent with the APM.
> > > > > 
> > > > > Signed-off-by: John Allen <john.allen@amd.com>
> > > > > ---
> > > > >   arch/x86/include/asm/svm.h | 8 ++++----
> > > > >   1 file changed, 4 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> > > > > index 87a7b917d30e..728c98175b9c 100644
> > > > > --- a/arch/x86/include/asm/svm.h
> > > > > +++ b/arch/x86/include/asm/svm.h
> > > > > @@ -358,10 +358,10 @@ struct sev_es_save_area {
> > > > >   	struct vmcb_seg ldtr;
> > > > >   	struct vmcb_seg idtr;
> > > > >   	struct vmcb_seg tr;
> > > > > -	u64 vmpl0_ssp;
> > > > > -	u64 vmpl1_ssp;
> > > > > -	u64 vmpl2_ssp;
> > > > > -	u64 vmpl3_ssp;
> > > > > +	u64 pl0_ssp;
> > > > > +	u64 pl1_ssp;
> > > > > +	u64 pl2_ssp;
> > > > > +	u64 pl3_ssp;
> > > > 
> > > > Are these CPL fields, or VMPL fields?  Presumably it's the former since this is
> > > > a single save area.  If so, the changelog should call that out, i.e. make it clear
> > > > that the current names are outright bugs.  If these somehow really are VMPL fields,
> > > > I would prefer to diverge from the APM, because pl[0..3] is way too ambiguous in
> > > > that case.
> > > 
> > > Definitely not VMPL fields...  I guess I had VMPL levels on my mind when I
> > > was typing those names.
> > 
> > FWIW, the patch that accessed these fields has been omitted in this
> > version so if we just want to correct the names of these fields, this
> > patch can be pulled in separately from this series.
> 
> Nice!  Can you post this as a standalone patch, with a massage changelog to
> explain that the vmpl prefix was just a braino?
> 
> Thanks!

Will do.

Thanks,
John

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2 5/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp
  2024-02-27 19:19  0%       ` John Allen
@ 2024-02-27 19:23  0%         ` Sean Christopherson
  2024-02-27 19:25  0%           ` John Allen
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2024-02-27 19:23 UTC (permalink / raw)
  To: John Allen
  Cc: Tom Lendacky, kvm, weijiang.yang, rick.p.edgecombe, bp, pbonzini,
	mlevitsk, linux-kernel, x86

On Tue, Feb 27, 2024, John Allen wrote:
> On Tue, Feb 27, 2024 at 01:15:09PM -0600, Tom Lendacky wrote:
> > On 2/27/24 12:14, Sean Christopherson wrote:
> > > On Mon, Feb 26, 2024, John Allen wrote:
> > > > Rename SEV-ES save area SSP fields to be consistent with the APM.
> > > > 
> > > > Signed-off-by: John Allen <john.allen@amd.com>
> > > > ---
> > > >   arch/x86/include/asm/svm.h | 8 ++++----
> > > >   1 file changed, 4 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> > > > index 87a7b917d30e..728c98175b9c 100644
> > > > --- a/arch/x86/include/asm/svm.h
> > > > +++ b/arch/x86/include/asm/svm.h
> > > > @@ -358,10 +358,10 @@ struct sev_es_save_area {
> > > >   	struct vmcb_seg ldtr;
> > > >   	struct vmcb_seg idtr;
> > > >   	struct vmcb_seg tr;
> > > > -	u64 vmpl0_ssp;
> > > > -	u64 vmpl1_ssp;
> > > > -	u64 vmpl2_ssp;
> > > > -	u64 vmpl3_ssp;
> > > > +	u64 pl0_ssp;
> > > > +	u64 pl1_ssp;
> > > > +	u64 pl2_ssp;
> > > > +	u64 pl3_ssp;
> > > 
> > > Are these CPL fields, or VMPL fields?  Presumably it's the former since this is
> > > a single save area.  If so, the changelog should call that out, i.e. make it clear
> > > that the current names are outright bugs.  If these somehow really are VMPL fields,
> > > I would prefer to diverge from the APM, because pl[0..3] is way too ambiguous in
> > > that case.
> > 
> > Definitely not VMPL fields...  I guess I had VMPL levels on my mind when I
> > was typing those names.
> 
> FWIW, the patch that accessed these fields has been omitted in this
> version so if we just want to correct the names of these fields, this
> patch can be pulled in separately from this series.

Nice!  Can you post this as a standalone patch, with a massage changelog to
explain that the vmpl prefix was just a braino?

Thanks!

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2 5/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp
  2024-02-27 19:15  0%     ` Tom Lendacky
@ 2024-02-27 19:19  0%       ` John Allen
  2024-02-27 19:23  0%         ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: John Allen @ 2024-02-27 19:19 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Sean Christopherson, kvm, weijiang.yang, rick.p.edgecombe, bp,
	pbonzini, mlevitsk, linux-kernel, x86

On Tue, Feb 27, 2024 at 01:15:09PM -0600, Tom Lendacky wrote:
> On 2/27/24 12:14, Sean Christopherson wrote:
> > On Mon, Feb 26, 2024, John Allen wrote:
> > > Rename SEV-ES save area SSP fields to be consistent with the APM.
> > > 
> > > Signed-off-by: John Allen <john.allen@amd.com>
> > > ---
> > >   arch/x86/include/asm/svm.h | 8 ++++----
> > >   1 file changed, 4 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> > > index 87a7b917d30e..728c98175b9c 100644
> > > --- a/arch/x86/include/asm/svm.h
> > > +++ b/arch/x86/include/asm/svm.h
> > > @@ -358,10 +358,10 @@ struct sev_es_save_area {
> > >   	struct vmcb_seg ldtr;
> > >   	struct vmcb_seg idtr;
> > >   	struct vmcb_seg tr;
> > > -	u64 vmpl0_ssp;
> > > -	u64 vmpl1_ssp;
> > > -	u64 vmpl2_ssp;
> > > -	u64 vmpl3_ssp;
> > > +	u64 pl0_ssp;
> > > +	u64 pl1_ssp;
> > > +	u64 pl2_ssp;
> > > +	u64 pl3_ssp;
> > 
> > Are these CPL fields, or VMPL fields?  Presumably it's the former since this is
> > a single save area.  If so, the changelog should call that out, i.e. make it clear
> > that the current names are outright bugs.  If these somehow really are VMPL fields,
> > > I would prefer to diverge from the APM, because pl[0..3] is way too ambiguous in
> > that case.
> 
> Definitely not VMPL fields...  I guess I had VMPL levels on my mind when I
> was typing those names.

FWIW, the patch that accessed these fields has been omitted in this
version so if we just want to correct the names of these fields, this
patch can be pulled in separately from this series.

Thanks,
John

> 
> Thanks,
> Tom
> 
> > 
> > It's borderline if they're CPL fields, but Intel calls them PL[0..3]_SSP, so I'm
> > much less inclined to diverge from two other things in that case.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2 5/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp
  2024-02-27 18:14  4%   ` Sean Christopherson
@ 2024-02-27 19:15  0%     ` Tom Lendacky
  2024-02-27 19:19  0%       ` John Allen
  0 siblings, 1 reply; 200+ results
From: Tom Lendacky @ 2024-02-27 19:15 UTC (permalink / raw)
  To: Sean Christopherson, John Allen
  Cc: kvm, weijiang.yang, rick.p.edgecombe, bp, pbonzini, mlevitsk,
	linux-kernel, x86

On 2/27/24 12:14, Sean Christopherson wrote:
> On Mon, Feb 26, 2024, John Allen wrote:
>> Rename SEV-ES save area SSP fields to be consistent with the APM.
>>
>> Signed-off-by: John Allen <john.allen@amd.com>
>> ---
>>   arch/x86/include/asm/svm.h | 8 ++++----
>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
>> index 87a7b917d30e..728c98175b9c 100644
>> --- a/arch/x86/include/asm/svm.h
>> +++ b/arch/x86/include/asm/svm.h
>> @@ -358,10 +358,10 @@ struct sev_es_save_area {
>>   	struct vmcb_seg ldtr;
>>   	struct vmcb_seg idtr;
>>   	struct vmcb_seg tr;
>> -	u64 vmpl0_ssp;
>> -	u64 vmpl1_ssp;
>> -	u64 vmpl2_ssp;
>> -	u64 vmpl3_ssp;
>> +	u64 pl0_ssp;
>> +	u64 pl1_ssp;
>> +	u64 pl2_ssp;
>> +	u64 pl3_ssp;
> 
> Are these CPL fields, or VMPL fields?  Presumably it's the former since this is
> a single save area.  If so, the changelog should call that out, i.e. make it clear
> that the current names are outright bugs.  If these somehow really are VMPL fields,
> I would prefer to diverge from the APM, because pl[0..3] is way too ambiguous in
> that case.

Definitely not VMPL fields...  I guess I had VMPL levels on my mind when I 
was typing those names.

Thanks,
Tom

> 
> It's borderline if they're CPL fields, but Intel calls them PL[0..3]_SSP, so I'm
> much less inclined to diverge from two other things in that case.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2 5/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp
  2024-02-26 21:32  4% ` [PATCH v2 5/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp John Allen
@ 2024-02-27 18:14  4%   ` Sean Christopherson
  2024-02-27 19:15  0%     ` Tom Lendacky
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2024-02-27 18:14 UTC (permalink / raw)
  To: John Allen
  Cc: kvm, weijiang.yang, rick.p.edgecombe, thomas.lendacky, bp,
	pbonzini, mlevitsk, linux-kernel, x86

On Mon, Feb 26, 2024, John Allen wrote:
> Rename SEV-ES save area SSP fields to be consistent with the APM.
> 
> Signed-off-by: John Allen <john.allen@amd.com>
> ---
>  arch/x86/include/asm/svm.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 87a7b917d30e..728c98175b9c 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -358,10 +358,10 @@ struct sev_es_save_area {
>  	struct vmcb_seg ldtr;
>  	struct vmcb_seg idtr;
>  	struct vmcb_seg tr;
> -	u64 vmpl0_ssp;
> -	u64 vmpl1_ssp;
> -	u64 vmpl2_ssp;
> -	u64 vmpl3_ssp;
> +	u64 pl0_ssp;
> +	u64 pl1_ssp;
> +	u64 pl2_ssp;
> +	u64 pl3_ssp;

Are these CPL fields, or VMPL fields?  Presumably it's the former since this is
a single save area.  If so, the changelog should call that out, i.e. make it clear
that the current names are outright bugs.  If these somehow really are VMPL fields,
I would prefer to diverge from the APM, because pl[0..3] is way too ambiguous in
that case.

It's borderline if they're CPL fields, but Intel calls them PL[0..3]_SSP, so I'm
much less inclined to diverge from two other things in that case.

^ permalink raw reply	[relevance 4%]

* [PATCH v9 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  @ 2024-02-27 10:32  4% ` Zhao Liu
  2024-02-29 15:13  0%   ` Moger, Babu
  2024-03-09 13:41  0%   ` Xiaoyao Li
  0 siblings, 2 replies; 200+ results
From: Zhao Liu @ 2024-02-27 10:32 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	Daniel P . Berrangé,
	Xiaoyao Li
  Cc: qemu-devel, kvm, Zhenyu Wang, Zhuocheng Ding, Babu Moger,
	Yongwei Ma, Zhao Liu

From: Zhao Liu <zhao1.liu@intel.com>

The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
the number of sharing threads directly.

From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

From the description above, the calculation of this field should be the same
as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
APIC ID to calculate this field.

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
     Information
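
To make the ShareId arithmetic above concrete, here is a minimal stand-alone
sketch (illustration only, not part of this patch; the example values are
assumed):

  #include <stdint.h>
  #include <stdio.h>

  static uint32_t share_id(uint32_t local_apic_id, uint32_t num_sharing_cache)
  {
          uint32_t n = num_sharing_cache + 1;
          uint32_t shift = 0;

          /* shift = log2(n), with n rounded up to the next power of two */
          while ((1u << shift) < n)
                  shift++;

          return local_apic_id >> shift;
  }

  int main(void)
  {
          /* e.g. 12 logical processors share the L3: NumSharingCache = 11, shift = 4 */
          printf("ShareId = %u\n", share_id(0x23, 11));
          return 0;
  }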

Cc: Babu Moger <babu.moger@amd.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
---
Changes since v7:
 * Moved this patch after CPUID[4]'s similar change ("i386/cpu: Use APIC
   ID offset to encode cache topo in CPUID[4]"). (Xiaoyao)
 * Dropped Michael/Babu's Acked/Reviewed/Tested tags since the code
   change due to the rebase.
 * Re-added Yongwei's Tested tag For his re-testing (compilation on
   Intel platforms).

Changes since v3:
 * Rewrote the subject. (Babu)
 * Deleted the original "comment/help" expression, as this behavior is
   confirmed for AMD CPUs. (Babu)
 * Renamed "num_apic_ids" (v3) to "num_sharing_cache" to match spec
   definition. (Babu)

Changes since v1:
 * Renamed "l3_threads" to "num_apic_ids" in
   encode_cache_cpuid8000001d(). (Yanan)
 * Added the description of the original commit and add Cc.
---
 target/i386/cpu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index c77bcbc44d59..df56c7a449c8 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -331,7 +331,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
                                        uint32_t *eax, uint32_t *ebx,
                                        uint32_t *ecx, uint32_t *edx)
 {
-    uint32_t l3_threads;
+    uint32_t num_sharing_cache;
     assert(cache->size == cache->line_size * cache->associativity *
                           cache->partitions * cache->sets);
 
@@ -340,11 +340,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
 
     /* L3 is shared among multiple cores */
     if (cache->level == 3) {
-        l3_threads = topo_info->cores_per_die * topo_info->threads_per_core;
-        *eax |= (l3_threads - 1) << 14;
+        num_sharing_cache = 1 << apicid_die_offset(topo_info);
     } else {
-        *eax |= ((topo_info->threads_per_core - 1) << 14);
+        num_sharing_cache = 1 << apicid_core_offset(topo_info);
     }
+    *eax |= (num_sharing_cache - 1) << 14;
 
     assert(cache->line_size > 0);
     assert(cache->partitions > 0);
-- 
2.34.1


^ permalink raw reply related	[relevance 4%]

* [PATCH v2 5/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp
  @ 2024-02-26 21:32  4% ` John Allen
  2024-02-27 18:14  4%   ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: John Allen @ 2024-02-26 21:32 UTC (permalink / raw)
  To: kvm
  Cc: weijiang.yang, rick.p.edgecombe, seanjc, thomas.lendacky, bp,
	pbonzini, mlevitsk, linux-kernel, x86, John Allen

Rename SEV-ES save area SSP fields to be consistent with the APM.

Signed-off-by: John Allen <john.allen@amd.com>
---
 arch/x86/include/asm/svm.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 87a7b917d30e..728c98175b9c 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -358,10 +358,10 @@ struct sev_es_save_area {
 	struct vmcb_seg ldtr;
 	struct vmcb_seg idtr;
 	struct vmcb_seg tr;
-	u64 vmpl0_ssp;
-	u64 vmpl1_ssp;
-	u64 vmpl2_ssp;
-	u64 vmpl3_ssp;
+	u64 pl0_ssp;
+	u64 pl1_ssp;
+	u64 pl2_ssp;
+	u64 pl3_ssp;
 	u64 u_cet;
 	u8 reserved_0xc8[2];
 	u8 vmpl;
-- 
2.40.1


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH 5/9] KVM: SVM: Save shadow stack host state on VMRUN
  2023-11-02 18:07  4%   ` Maxim Levitsky
@ 2024-02-26 16:56  0%     ` John Allen
  0 siblings, 0 replies; 200+ results
From: John Allen @ 2024-02-26 16:56 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, linux-kernel, pbonzini, weijiang.yang, rick.p.edgecombe,
	seanjc, x86, thomas.lendacky, bp

On Thu, Nov 02, 2023 at 08:07:20PM +0200, Maxim Levitsky wrote:
> On Tue, 2023-10-10 at 20:02 +0000, John Allen wrote:
> > When running as an SEV-ES guest, the PL0_SSP, PL1_SSP, PL2_SSP, PL3_SSP,
> > and U_CET fields in the VMCB save area are type B, meaning the host
> > state is automatically loaded on a VMEXIT, but is not saved on a VMRUN.
> > The other shadow stack MSRs, S_CET, SSP, and ISST_ADDR are type A,
> > meaning they are loaded on VMEXIT and saved on VMRUN. PL0_SSP, PL1_SSP,
> > and PL2_SSP are currently unused. Manually save the other type B host
> > MSR values before VMRUN.
> > 
> > Signed-off-by: John Allen <john.allen@amd.com>
> > ---
> >  arch/x86/kvm/svm/sev.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index b9a0a939d59f..bb4b18baa6f7 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -3098,6 +3098,15 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
> >  		hostsa->dr2_addr_mask = amd_get_dr_addr_mask(2);
> >  		hostsa->dr3_addr_mask = amd_get_dr_addr_mask(3);
> >  	}
> > +
> > +	if (boot_cpu_has(X86_FEATURE_SHSTK)) {
> > +		/*
> > +		 * MSR_IA32_U_CET and MSR_IA32_PL3_SSP are restored on VMEXIT,
> > +		 * save the current host values.
> > +		 */
> > +		rdmsrl(MSR_IA32_U_CET, hostsa->u_cet);
> > +		rdmsrl(MSR_IA32_PL3_SSP, hostsa->pl3_ssp);
> > +	}
> >  }
> 
> 
> Do we actually need this patch?
> 
> Host FPU state is not encrypted, and as far as I understood from the common CET patch series,
> that on return to userspace these msrs will be restored.

Hi Maxim,

I think you're right on this. My next version omits the patch and
testing seems to confirm that it's not needed.

> 
> Best regards,
> 	Maxim Levitsky
> 
> 
> PS: AMD's APM is silent on how 'S_CET, SSP, and ISST_ADDR' are saved/restored for non encrypted guests.
> Are they also type A?
> 
> Can the VMSA table be trusted in general to provide the same swap type as for the non encrypted guests?

From what I gather, for a non-SEV-ES guest using the save area that is part
of the VMCB as opposed to the separate VMCB/VMSA for SEV-ES and SEV-SNP,
anything listed in that save area will effectively be swap type A. Does that
answer your question?

Thanks,
John

> 
> 
> >  
> >  void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
> 
> 

^ permalink raw reply	[relevance 0%]

* Re: [RFC PATCH v4 04/10] KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private
  @ 2024-02-22  2:05  4%       ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2024-02-22  2:05 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: isaku.yamahata, kvm, linux-kernel, Michael Roth, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Kai Huang, Zhi Wang,
	chen.bo, linux-coco, Chao Peng, Ackerley Tng, Vishal Annapurve,
	Yuan Yao

On Fri, Jul 21, 2023, Isaku Yamahata wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> Date: Wed, 14 Jun 2023 12:34:00 -0700
> Subject: [PATCH 4/8] KVM: x86: Use PFERR_GUEST_ENC_MASK to indicate fault is
>  private
> 
> SEV-SNP defines PFERR_GUEST_ENC_MASK (bit 32) in page-fault error bits to
> represent the guest page is encrypted.  Use the bit to designate that the
> page fault is private and that it requires looking up memory attributes.
> The vendor kvm page fault handler should set PFERR_GUEST_ENC_MASK bit based
> on their fault information.  It may or may not use the hardware value
> directly or parse the hardware value to set the bit.
> 
> For KVM_X86_SW_PROTECTED_VM, ask memory attributes for the fault
> privateness.  For async page fault, carry the bit and use it for kvm page
> fault handler.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

...

> @@ -4315,7 +4316,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	      work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
>  		return;
>  
> -	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
> +	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
> +			      true, NULL);

This is unnecessary, KVM doesn't suppoort async page fault behavior for private
memory (and doesn't need to, because guest_memmfd() doesn't support swap).

> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 7f9ec1e5b136..3a423403af01 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -295,13 +295,13 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		.user = err & PFERR_USER_MASK,
>  		.prefetch = prefetch,
>  		.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
> +		.is_private = err & PFERR_GUEST_ENC_MASK,

This breaks SEV and SEV-ES guests, because AFAICT, the APM is lying by defining
PFERR_GUEST_ENC_MASK in the context of SNP.  The flag isn't just set when running
SEV-SNP guests, it's set for all C-bit=1 effective accesses when running on SNP
capable hardware (at least, that's my observation).

Grumpiness about discovering yet another problem that I would have expected
_someone_ to stumble upon...

FYI, I'm going to post a rambling series to clean up code in the page fault path
(it started as a cleanup of the "no slot" code and then grew a few more heads).
One of the patches I'm going to include is something that looks like this patch,
but I'm going to use a KVM-defined synthetic bit, because stuffing a bit that KVM
would need to _clear_ on _some_ hardware is gross.
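
As a rough sketch of that synthetic-bit approach (the flag name and bit
position below are placeholders for illustration, not the actual series):

	/*
	 * Hypothetical KVM-defined flag, carved out of error code bits that
	 * hardware never sets, so KVM never has to clear it.
	 */
	#define PFERR_PRIVATE_ACCESS	BIT_ULL(49)

	...
		.is_private = err & PFERR_PRIVATE_ACCESS,

Only vendor code that has positively identified a private access, e.g. an SNP
#NPF with the C-bit set for a guest that really is an SNP guest, would set the
bit before calling into the common MMU code, so there is nothing to strip on
SEV/SEV-ES hardware.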

^ permalink raw reply	[relevance 4%]

* [PATCH v8 00/16] Add Secure TSC support for SNP guests
@ 2024-02-15 11:31  3% Nikunj A Dadhania
  0 siblings, 0 replies; 200+ results
From: Nikunj A Dadhania @ 2024-02-15 11:31 UTC (permalink / raw)
  To: linux-kernel, thomas.lendacky, bp, x86, kvm
  Cc: mingo, tglx, dave.hansen, pgonda, seanjc, pbonzini, nikunj

This patchset is also available at:

  https://github.com/AMDESE/linux-kvm/tree/sectsc-guest-latest

and is based on tip/x86/sev base commit ee8ff8768735
("crypto: ccp - Have it depend on AMD_IOMMU")

Overview
--------

Secure TSC allows guests to securely use the RDTSC/RDTSCP instructions, as the
parameters being used cannot be changed by the hypervisor once the guest is
launched. More details are in the AMD64 APM Vol 2, Section "Secure TSC".

During boot-up of the secondary CPUs, SecureTSC-enabled guests need to query
TSC info from the AMD Security Processor. This communication channel is
encrypted between the AMD Security Processor and the guest; the hypervisor is
just the conduit that delivers the guest messages to the AMD Security
Processor. Each message is protected with an AEAD (AES-256 GCM). See the "SEV
Secure Nested Paging Firmware ABI Specification" document (currently at
https://www.amd.com/system/files/TechDocs/56860.pdf), section "TSC Info".

Use a minimal GCM library, which is available at early boot, to
encrypt/decrypt the SNP Guest messages used to communicate with the AMD
Security Processor.
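
For reference, sealing one guest message with that library boils down to
something like the sketch below. This is illustrative only: the
aesgcm_expandkey()/aesgcm_encrypt() helpers are the lib/crypto AES-GCM
library API, but the msg/hdr field names and length constants here are
placeholders rather than the exact ones used in this series.

  struct aesgcm_ctx ctx;
  u8 iv[GCM_AES_IV_SIZE] = {};

  /* VMPCK is the 256-bit per-VMPL key from the SNP secrets page. */
  if (aesgcm_expandkey(&ctx, vmpck, VMPCK_KEY_LEN, AUTHTAG_LEN))
          return -EINVAL;

  /* The IV is derived from the 64-bit message sequence number. */
  memcpy(iv, &msg_seqno, min(sizeof(iv), sizeof(msg_seqno)));

  /* The payload is encrypted; selected header fields are only authenticated. */
  aesgcm_encrypt(&ctx, msg->payload, plaintext, plaintext_len,
                 &msg->hdr.algo, AAD_LEN, iv, msg->hdr.authtag);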

The sev-guest driver implements the communication between the guest and the
AMD Security Processor. As the TSC_INFO needs to be initialized during early
boot, before the SMP CPUs are started, move most of the sev-guest driver code
to kernel/sev.c and provide well-defined APIs so that the sev-guest driver can
reuse the same interface and avoid code duplication.

Patches:
01-08: Preparation and movement of sev-guest driver code
09-16: SecureTSC enablement patches.

Testing SecureTSC
-----------------

SecureTSC hypervisor patches based on top of SEV-SNP Guest MEMFD series:
https://github.com/AMDESE/linux-kvm/tree/sectsc-host-latest

QEMU changes:
https://github.com/nikunjad/qemu/tree/snp-securetsc-latest

QEMU commandline SEV-SNP with SecureTSC:

  qemu-system-x86_64 -cpu EPYC-Milan-v2,+securetsc,+invtsc -smp 4 \
    -object memory-backend-memfd,id=ram1,size=1G,share=true,prealloc=false,reserve=false \
    -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,secure-tsc=on \
    -machine q35,confidential-guest-support=sev0,memory-backend=ram1 \
    ...

Changelog:
----------
v8:
* Rebased on top of tip/x86/sev
* Use minimum size of IV or msg_seqno in memcpy
* Use arch/x86/include/asm/sev.h instead of sev-guest.h
* Use DEFINE_MUTEX for snp_guest_cmd_mutex
* Added Reviewed-by from Tom.
* Dropped Tested-by from patch 3/16
 
https://lore.kernel.org/kvm/20231220151358.2147066-1-nikunj@amd.com/

v7:
* Drop mutex from the snp_dev and add snp_guest_cmd_{lock,unlock} API
* Added comments for secrets page failure
* Added define for maximum supported VMPCK
* Updated comments why sev_status is used directly instead of
  cpu_feature_enabled()

https://lore.kernel.org/kvm/20231220151358.2147066-1-nikunj@amd.com/

v6: https://lore.kernel.org/lkml/20231128125959.1810039-1-nikunj@amd.com/
v5: https://lore.kernel.org/lkml/20231030063652.68675-1-nikunj@amd.com/
v4: https://lore.kernel.org/lkml/20230814055222.1056404-1-nikunj@amd.com/
v3: https://lore.kernel.org/lkml/20230722111909.15166-1-nikunj@amd.com/
v2: https://lore.kernel.org/r/20230307192449.24732-1-bp@alien8.de/
v1: https://lore.kernel.org/r/20230130120327.977460-1-nikunj@amd.com

Nikunj A Dadhania (16):
  virt: sev-guest: Use AES GCM crypto library
  virt: sev-guest: Replace dev_dbg with pr_debug
  virt: sev-guest: Add SNP guest request structure
  virt: sev-guest: Add vmpck_id to snp_guest_dev struct
  x86/sev: Cache the secrets page address
  virt: sev-guest: Move SNP Guest command mutex
  x86/sev: Move and reorganize sev guest request api
  x86/mm: Add generic guest initialization hook
  x86/cpufeatures: Add synthetic Secure TSC bit
  x86/sev: Add Secure TSC support for SNP guests
  x86/sev: Change TSC MSR behavior for Secure TSC enabled guests
  x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled
    guests
  x86/kvmclock: Skip kvmclock when Secure TSC is available
  x86/sev: Mark Secure TSC as reliable
  x86/cpu/amd: Do not print FW_BUG for Secure TSC
  x86/sev: Enable Secure TSC for SNP guests

 arch/x86/Kconfig                        |   1 +
 arch/x86/boot/compressed/sev.c          |   3 +-
 arch/x86/include/asm/cpufeatures.h      |   1 +
 arch/x86/include/asm/sev-common.h       |   1 +
 arch/x86/include/asm/sev.h              | 206 ++++++-
 arch/x86/include/asm/svm.h              |   6 +-
 arch/x86/include/asm/x86_init.h         |   2 +
 arch/x86/kernel/cpu/amd.c               |   3 +-
 arch/x86/kernel/kvmclock.c              |   2 +-
 arch/x86/kernel/sev-shared.c            |  10 +
 arch/x86/kernel/sev.c                   | 648 +++++++++++++++++++--
 arch/x86/kernel/x86_init.c              |   2 +
 arch/x86/mm/mem_encrypt.c               |  12 +-
 arch/x86/mm/mem_encrypt_amd.c           |  12 +
 drivers/virt/coco/sev-guest/Kconfig     |   3 -
 drivers/virt/coco/sev-guest/sev-guest.c | 725 ++++--------------------
 drivers/virt/coco/sev-guest/sev-guest.h |  63 --
 17 files changed, 937 insertions(+), 763 deletions(-)
 delete mode 100644 drivers/virt/coco/sev-guest/sev-guest.h


base-commit: ee8ff8768735edc3e013837c4416f819543ddc17
-- 
2.34.1


^ permalink raw reply	[relevance 3%]

* Re: [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type
  @ 2024-02-08 17:27  4%       ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2024-02-08 17:27 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-coco, linux-mm, linux-crypto, x86, linux-kernel,
	linux-fsdevel, pbonzini, isaku.yamahata, ackerleytng, vbabka,
	ashish.kalra, nikunj.dadhania, jroedel, pankaj.gupta,
	thomas.lendacky

On Wed, Feb 07, 2024, Michael Roth wrote:
> On Tue, Jan 30, 2024 at 05:13:00PM -0800, Sean Christopherson wrote:
> > On Mon, Oct 16, 2023, Michael Roth wrote:
> > > For KVM_X86_SNP_VM, only the PFERR_GUEST_ENC_MASK flag is needed to
> > > determine whether an #NPF is due to a private/shared access by the guest.
> > > Implement that handling here. Also add handling needed to deal with
> > > SNP guests which in some cases will make MMIO accesses with the
> > > encryption bit.
> > 
> > ...
> > 
> > > @@ -4356,12 +4357,19 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > >  			return RET_PF_EMULATE;
> > >  	}
> > >  
> > > -	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > > +	/*
> > > +	 * In some cases SNP guests will make MMIO accesses with the encryption
> > > +	 * bit set. Handle these via the normal MMIO fault path.
> > > +	 */
> > > +	if (!slot && private_fault && kvm_is_vm_type(vcpu->kvm, KVM_X86_SNP_VM))
> > > +		private_fault = false;
> > 
> > Why?  This is inarguably a guest bug.
> 
> AFAICT this isn't explicitly disallowed by the SNP spec.

There are _lots_ of things that aren't explicitly disallowed by the APM; that
doesn't mean that _KVM_ needs to actively support them.

I am *not* taking on more broken crud in KVM to work around OVMF's stupidity; the
KVM_X86_QUIRK_CD_NW_CLEARED quirk has taken up literally days of my time at this point.

> So KVM would need to allow for these cases in order to be fully compatible
> with existing SNP guests that do this.

No.  KVM does not yet support SNP, so as far as KVM's ABI goes, there are no
existing guests.  Yes, I realize that I am burying my head in the sand to some
extent, but it is simply not sustainable for KVM to keep trying to pick up the
pieces of poorly defined hardware specs and broken guest firmware.

> > > +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> > > +{
> > > +	bool private_fault = false;
> > > +
> > > +	if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM)) {
> > > +		private_fault = !!(err & PFERR_GUEST_ENC_MASK);
> > > +	} else if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM)) {
> > > +		/*
> > > +		 * This handling is for gmem self-tests and guests that treat
> > > +		 * userspace as the authority on whether a fault should be
> > > +		 * private or not.
> > > +		 */
> > > +		private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > > +	}
> > 
> > This can be more simply:
> > 
> > 	if (kvm_is_vm_type(kvm, KVM_X86_SNP_VM))
> > 		return !!(err & PFERR_GUEST_ENC_MASK);
> > 
> > 	if (kvm_is_vm_type(kvm, KVM_X86_SW_PROTECTED_VM))
> > 		return kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
> > 
> 
> Yes, indeed. But TDX has taken a different approach for the SW_PROTECTED_VM
> case, where they do this check in kvm_mmu_page_fault() and then synthesize
> the PFERR_GUEST_ENC_MASK into error_code before calling
> kvm_mmu_do_page_fault(). It's not in the v18 patchset AFAICT, but it's
> in the tdx-upstream git branch that corresponds to it:
> 
>   https://github.com/intel/tdx/commit/3717a903ef453aa7b62e7eb65f230566b7f158d4
> 
> Would you prefer that SNP adopt the same approach?

Ah, yes, 'twas my suggestion in the first place.  FWIW, I was just reviewing the
literal code here and wasn't paying much attention to the content.

https://lore.kernel.org/all/f474282d701aca7af00e4f7171445abb5e734c6f.1689893403.git.isaku.yamahata@intel.com

^ permalink raw reply	[relevance 4%]

* Re: [kvm-unit-tests PATCH v4 1/7] lib: s390x: Add ap library
  2024-02-06  8:48  0%     ` Harald Freudenberger
@ 2024-02-06 15:45  0%       ` Anthony Krowiak
  0 siblings, 0 replies; 200+ results
From: Anthony Krowiak @ 2024-02-06 15:45 UTC (permalink / raw)
  To: freude
  Cc: Janosch Frank, kvm, linux-s390, imbrenda, thuth, david, nsg, nrb,
	jjherne


On 2/6/24 3:48 AM, Harald Freudenberger wrote:
> On 2024-02-05 19:15, Anthony Krowiak wrote:
>> I made a few comments and suggestions. I am not very well-versed in
>> the inline assembly code, so I'll leave that up to someone who is more
>> knowledgeable. I copied @Harald since I believe it was him who wrote
>> it.
>>
>> On 2/2/24 9:59 AM, Janosch Frank wrote:
>>> Add functions and definitions needed to test the Adjunct
>>> Processor (AP) crypto interface.
>>>
>>> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
>>> ---
>>>   lib/s390x/ap.c | 97 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>   lib/s390x/ap.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>>>   s390x/Makefile |  1 +
>>>   3 files changed, 186 insertions(+)
>>>   create mode 100644 lib/s390x/ap.c
>>>   create mode 100644 lib/s390x/ap.h
>>>
>>> diff --git a/lib/s390x/ap.c b/lib/s390x/ap.c
>>> new file mode 100644
>>> index 00000000..9560ff64
>>> --- /dev/null
>>> +++ b/lib/s390x/ap.c
>>> @@ -0,0 +1,97 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * AP crypto functions
>>> + *
>>> + * Some parts taken from the Linux AP driver.
>>> + *
>>> + * Copyright IBM Corp. 2024
>>> + * Author: Janosch Frank <frankja@linux.ibm.com>
>>> + *       Tony Krowiak <akrowia@linux.ibm.com>
>>> + *       Martin Schwidefsky <schwidefsky@de.ibm.com>
>>> + *       Harald Freudenberger <freude@de.ibm.com>
>>> + */
>>> +
>>> +#include <libcflat.h>
>>> +#include <interrupt.h>
>>> +#include <ap.h>
>>> +#include <asm/time.h>
>>> +#include <asm/facility.h>
>>> +
>>> +int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status 
>>> *apqsw,
>>> +         struct pqap_r2 *r2)
>>> +{
>>> +    struct pqap_r0 r0 = {
>>> +        .ap = ap,
>>> +        .qn = qn,
>>> +        .fc = PQAP_TEST_APQ
>>> +    };
>>> +    uint64_t bogus_cc = 2;
>>> +    int cc;
>>> +
>>> +    /*
>>> +     * Test AP Queue
>>> +     *
>>> +     * Writes AP configuration information to the memory pointed
>>> +     * at by GR2.
>>> +     *
>>> +     * Inputs: GR0
>>> +     * Outputs: GR1 (APQSW), GR2 (tapq data)
>>> +     * Synchronous
>>> +     */
>>> +    asm volatile(
>>> +        "       tmll    %[bogus_cc],3\n"
>>> +        "    lgr    0,%[r0]\n"
>>> +        "    .insn    rre,0xb2af0000,0,0\n" /* PQAP */
>>> +        "    stg    1,%[apqsw]\n"
>>> +        "    stg    2,%[r2]\n"
>>> +        "    ipm    %[cc]\n"
>>> +        "    srl    %[cc],28\n"
>>> +        : [apqsw] "=&T" (*apqsw), [r2] "=&T" (*r2), [cc] "=&d" (cc)
>>> +        : [r0] "d" (r0), [bogus_cc] "d" (bogus_cc));
>>> +
>>> +    return cc;
>>> +}
>>> +
>>> +int ap_pqap_qci(struct ap_config_info *info)
>>> +{
>>> +    struct pqap_r0 r0 = { .fc = PQAP_QUERY_AP_CONF_INFO };
>>> +    uint64_t bogus_cc = 2;
>>> +    int cc;
>>> +
>>> +    /*
>>> +     * Query AP Configuration Information
>>> +     *
>>> +     * Writes AP configuration information to the memory pointed
>>> +     * at by GR2.
>>> +     *
>>> +     * Inputs: GR0, GR2 (QCI block address)
>>> +     * Outputs: memory at GR2 address
>>> +     * Synchronous
>>> +     */
>>> +    asm volatile(
>>> +        "       tmll    %[bogus_cc],3\n"
>>> +        "    lgr    0,%[r0]\n"
>>> +        "    lgr    2,%[info]\n"
>>> +        "    .insn    rre,0xb2af0000,0,0\n" /* PQAP */
>>> +        "    ipm    %[cc]\n"
>>> +        "    srl    %[cc],28\n"
>>> +        : [cc] "=&d" (cc)
>>> +        : [r0] "d" (r0), [info] "d" (info), [bogus_cc] "d" (bogus_cc)
>>> +        : "cc", "memory", "0", "2");
>>> +
>>> +    return cc;
>>> +}
>>> +
>>> +/* Will later be extended to a proper setup function */
>>> +bool ap_setup(void)
>>> +{
>>> +    /*
>>> +     * Base AP support has no STFLE or SCLP feature bit but the
>>> +     * PQAP QCI support is indicated via stfle bit 12. As this
>>> +     * library relies on QCI we bail out if it's not available.
>>> +     */
>>> +    if (!test_facility(12))
>>> +        return false;
>>
>>
>> The STFLE.12 can be turned off when starting the guest, so this may
>> not be a valid test.
>>
>> We use the ap_instructions_available function (in ap.h) which executes
>> the TAPQ command to verify whether the AP instructions are installed
>> or not. Maybe you can do something similar here:
>>
>>
>> static inline bool ap_instructions_available(void)
>>
>> {
>>     unsigned long reg0 = 0x0000;
>>     unsigned long reg1 = 0;
>>
>>     asm volatile(
>>         "    lgr    0,%[reg0]\n"        /* qid into gr0 */
>>         "    lghi    1,0\n"            /* 0 into gr1 */
>>         "    lghi    2,0\n"            /* 0 into gr2 */
>>         "    .insn    rre,0xb2af0000,0,0\n"    /* PQAP(TAPQ) */
>>         "0:    la    %[reg1],1\n"        /* 1 into reg1 */
>>         "1:\n"
>>         EX_TABLE(0b, 1b)
>>         : [reg1] "+&d" (reg1)
>>         : [reg0] "d" (reg0)
>>         : "cc", "0", "1", "2");
>>     return reg1 != 0;
>> }
>
> Well, as Janos wrote - these lib functions rely on having QCI available.
> However, be careful using the above function, as it sets up an exception
> table entry to catch the illegal instruction that will arise when no AP
> bus support is available.
>
>>> +
>>> +    return true;
>>> +}
>>> diff --git a/lib/s390x/ap.h b/lib/s390x/ap.h
>>> new file mode 100644
>>> index 00000000..b806513f
>>> --- /dev/null
>>> +++ b/lib/s390x/ap.h
>>> @@ -0,0 +1,88 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * AP definitions
>>> + *
>>> + * Some parts taken from the Linux AP driver.
>>> + *
>>> + * Copyright IBM Corp. 2024
>>> + * Author: Janosch Frank <frankja@linux.ibm.com>
>>> + *       Tony Krowiak <akrowia@linux.ibm.com>
>>> + *       Martin Schwidefsky <schwidefsky@de.ibm.com>
>>> + *       Harald Freudenberger <freude@de.ibm.com>
>>> + */
>>> +
>>> +#ifndef _S390X_AP_H_
>>> +#define _S390X_AP_H_
>>> +
>>> +enum PQAP_FC {
>>> +    PQAP_TEST_APQ,
>>> +    PQAP_RESET_APQ,
>>> +    PQAP_ZEROIZE_APQ,
>>> +    PQAP_QUEUE_INT_CONTRL,
>>> +    PQAP_QUERY_AP_CONF_INFO,
>>> +    PQAP_QUERY_AP_COMP_TYPE,
>
> What is that       ^^^^^^^^^^^^^ ?
> The QACT ? But that's only one PQAP subfunction.


PQAP(QBAP)?


>
>>> +    PQAP_BEST_AP,
>
> And this ^^^^^^^^^^^^ ? Never heard of it.
>
>>
>>
>> Maybe use abbreviations like your function names above?
>>
>>     PQAP_TAPQ,
>>     PQAP_RAPQ,
>>     PQAP_ZAPQ,
>>     PQAP_AQIC,
>>     PQAP_QCI,
>>     PQAP_QACT,
>>     PQAP_QBAP
>>
>>
>>> +};
>>> +
>>> +struct ap_queue_status {
>>> +    uint32_t pad0;            /* Ignored padding for zArch */
>>
>>
>> Bit 0 is the E (queue empty) bit.
>>
>>
>>> +    uint32_t empty        : 1;
>>> +    uint32_t replies_waiting: 1;
>>> +    uint32_t full        : 1;
>>> +    uint32_t pad1        : 4;
>>> +    uint32_t irq_enabled    : 1;
>>> +    uint32_t rc        : 8;
>>> +    uint32_t pad2        : 16;
>>> +} __attribute__((packed))  __attribute__((aligned(8)));
>>> +_Static_assert(sizeof(struct ap_queue_status) == sizeof(uint64_t), 
>>> "APQSW size");
>>> +
>>> +struct ap_config_info {
>>> +    uint8_t apsc     : 1;    /* S bit */
>>> +    uint8_t apxa     : 1;    /* N bit */
>>> +    uint8_t qact     : 1;    /* C bit */
>>> +    uint8_t rc8a     : 1;    /* R bit */
>>> +    uint8_t l     : 1;    /* L bit */
>>> +    uint8_t lext     : 3;    /* Lext bits */
>>> +    uint8_t reserved2[3];
>>> +    uint8_t Na;        /* max # of APs - 1 */
>>> +    uint8_t Nd;        /* max # of Domains - 1 */
>>> +    uint8_t reserved6[10];
>>> +    uint32_t apm[8];    /* AP ID mask */
>>> +    uint32_t aqm[8];    /* AP (usage) queue mask */
>>> +    uint32_t adm[8];    /* AP (control) domain mask */
>>> +    uint8_t _reserved4[16];
>>> +} __attribute__((aligned(8))) __attribute__ ((__packed__));
>>> +_Static_assert(sizeof(struct ap_config_info) == 128, "PQAP QCI size");
>>> +
>>> +struct pqap_r0 {
>>> +    uint32_t pad0;
>>> +    uint8_t fc;
>>> +    uint8_t t : 1;        /* Test facilities (TAPQ)*/
>>> +    uint8_t pad1 : 7;
>>> +    uint8_t ap;
>>
>>
>> This is the APID part of an APQN, so how about renaming to 'apid'
>>
>>
>>> +    uint8_t qn;
>>
>>
>> This is the APQI  part of an APQN, so how about renaming to 'apqi'
>>
>>
>>> +} __attribute__((packed)) __attribute__((aligned(8)));
>>> +
>>> +struct pqap_r2 {
>>> +    uint8_t s : 1;        /* Special Command facility */
>>> +    uint8_t m : 1;        /* AP4KM */
>>> +    uint8_t c : 1;        /* AP4KC */
>>> +    uint8_t cop : 1;    /* AP is in coprocessor mode */
>>> +    uint8_t acc : 1;    /* AP is in accelerator mode */
>>> +    uint8_t xcp : 1;    /* AP is in XCP-mode */
>>> +    uint8_t n : 1;        /* AP extended addressing facility */
>>> +    uint8_t pad_0 : 1;
>>> +    uint8_t pad_1[3];
>>
>>
>> Is there a reason why the 'Classification'  field is left out?
>>
>>
>>> +    uint8_t at;
>>> +    uint8_t nd;
>>> +    uint8_t pad_6;
>>> +    uint8_t pad_7 : 4;
>>> +    uint8_t qd : 4;
>>> +} __attribute__((packed))  __attribute__((aligned(8)));
>>> +_Static_assert(sizeof(struct pqap_r2) == sizeof(uint64_t), "pqap_r2 
>>> size");
>>> +
>
> Isn't this all going into the kernel? So consider using the
> arch/s390/include/asm/ap.h header file instead of re-defining the 
> structs.
> Feel free to improve this file with reworked structs and field names if
> that's required.
>
>>> +bool ap_setup(void);
>>> +int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status 
>>> *apqsw,
>>> +         struct pqap_r2 *r2);
>>> +int ap_pqap_qci(struct ap_config_info *info);
>>> +#endif
>>> diff --git a/s390x/Makefile b/s390x/Makefile
>>> index 7fce9f9d..4f6c627d 100644
>>> --- a/s390x/Makefile
>>> +++ b/s390x/Makefile
>>> @@ -110,6 +110,7 @@ cflatobjs += lib/s390x/malloc_io.o
>>>   cflatobjs += lib/s390x/uv.o
>>>   cflatobjs += lib/s390x/sie.o
>>>   cflatobjs += lib/s390x/fault.o
>>> +cflatobjs += lib/s390x/ap.o
>>>     OBJDIRS += lib/s390x
>>>
>

^ permalink raw reply	[relevance 0%]

* Re: [kvm-unit-tests PATCH v4 1/7] lib: s390x: Add ap library
  2024-02-05 18:15  0%   ` Anthony Krowiak
@ 2024-02-06  8:48  0%     ` Harald Freudenberger
  2024-02-06 15:45  0%       ` Anthony Krowiak
  0 siblings, 1 reply; 200+ results
From: Harald Freudenberger @ 2024-02-06  8:48 UTC (permalink / raw)
  To: Anthony Krowiak
  Cc: Janosch Frank, kvm, linux-s390, imbrenda, thuth, david, nsg, nrb,
	jjherne

On 2024-02-05 19:15, Anthony Krowiak wrote:
> I made a few comments and suggestions. I am not very well-versed in
> the inline assembly code, so I'll leave that up to someone who is more
> knowledgeable. I copied @Harald since I believe it was him who wrote
> it.
> 
> On 2/2/24 9:59 AM, Janosch Frank wrote:
>> Add functions and definitions needed to test the Adjunct
>> Processor (AP) crypto interface.
>> 
>> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
>> ---
>>   lib/s390x/ap.c | 97 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++
>>   lib/s390x/ap.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>>   s390x/Makefile |  1 +
>>   3 files changed, 186 insertions(+)
>>   create mode 100644 lib/s390x/ap.c
>>   create mode 100644 lib/s390x/ap.h
>> 
>> diff --git a/lib/s390x/ap.c b/lib/s390x/ap.c
>> new file mode 100644
>> index 00000000..9560ff64
>> --- /dev/null
>> +++ b/lib/s390x/ap.c
>> @@ -0,0 +1,97 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +/*
>> + * AP crypto functions
>> + *
>> + * Some parts taken from the Linux AP driver.
>> + *
>> + * Copyright IBM Corp. 2024
>> + * Author: Janosch Frank <frankja@linux.ibm.com>
>> + *	   Tony Krowiak <akrowia@linux.ibm.com>
>> + *	   Martin Schwidefsky <schwidefsky@de.ibm.com>
>> + *	   Harald Freudenberger <freude@de.ibm.com>
>> + */
>> +
>> +#include <libcflat.h>
>> +#include <interrupt.h>
>> +#include <ap.h>
>> +#include <asm/time.h>
>> +#include <asm/facility.h>
>> +
>> +int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status 
>> *apqsw,
>> +		 struct pqap_r2 *r2)
>> +{
>> +	struct pqap_r0 r0 = {
>> +		.ap = ap,
>> +		.qn = qn,
>> +		.fc = PQAP_TEST_APQ
>> +	};
>> +	uint64_t bogus_cc = 2;
>> +	int cc;
>> +
>> +	/*
>> +	 * Test AP Queue
>> +	 *
>> +	 * Writes AP configuration information to the memory pointed
>> +	 * at by GR2.
>> +	 *
>> +	 * Inputs: GR0
>> +	 * Outputs: GR1 (APQSW), GR2 (tapq data)
>> +	 * Synchronous
>> +	 */
>> +	asm volatile(
>> +		"       tmll	%[bogus_cc],3\n"
>> +		"	lgr	0,%[r0]\n"
>> +		"	.insn	rre,0xb2af0000,0,0\n" /* PQAP */
>> +		"	stg	1,%[apqsw]\n"
>> +		"	stg	2,%[r2]\n"
>> +		"	ipm	%[cc]\n"
>> +		"	srl	%[cc],28\n"
>> +		: [apqsw] "=&T" (*apqsw), [r2] "=&T" (*r2), [cc] "=&d" (cc)
>> +		: [r0] "d" (r0), [bogus_cc] "d" (bogus_cc));
>> +
>> +	return cc;
>> +}
>> +
>> +int ap_pqap_qci(struct ap_config_info *info)
>> +{
>> +	struct pqap_r0 r0 = { .fc = PQAP_QUERY_AP_CONF_INFO };
>> +	uint64_t bogus_cc = 2;
>> +	int cc;
>> +
>> +	/*
>> +	 * Query AP Configuration Information
>> +	 *
>> +	 * Writes AP configuration information to the memory pointed
>> +	 * at by GR2.
>> +	 *
>> +	 * Inputs: GR0, GR2 (QCI block address)
>> +	 * Outputs: memory at GR2 address
>> +	 * Synchronous
>> +	 */
>> +	asm volatile(
>> +		"       tmll	%[bogus_cc],3\n"
>> +		"	lgr	0,%[r0]\n"
>> +		"	lgr	2,%[info]\n"
>> +		"	.insn	rre,0xb2af0000,0,0\n" /* PQAP */
>> +		"	ipm	%[cc]\n"
>> +		"	srl	%[cc],28\n"
>> +		: [cc] "=&d" (cc)
>> +		: [r0] "d" (r0), [info] "d" (info), [bogus_cc] "d" (bogus_cc)
>> +		: "cc", "memory", "0", "2");
>> +
>> +	return cc;
>> +}
>> +
>> +/* Will later be extended to a proper setup function */
>> +bool ap_setup(void)
>> +{
>> +	/*
>> +	 * Base AP support has no STFLE or SCLP feature bit but the
>> +	 * PQAP QCI support is indicated via stfle bit 12. As this
>> +	 * library relies on QCI we bail out if it's not available.
>> +	 */
>> +	if (!test_facility(12))
>> +		return false;
> 
> 
> The STFLE.12 can be turned off when starting the guest, so this may
> not be a valid test.
> 
> We use the ap_instructions_available function (in ap.h) which executes
> the TAPQ command to verify whether the AP instructions are installed
> or not. Maybe you can do something similar here:
> 
> 
> static inline bool ap_instructions_available(void)
> 
> {
>     unsigned long reg0 = 0x0000;
>     unsigned long reg1 = 0;
> 
>     asm volatile(
>         "    lgr    0,%[reg0]\n"        /* qid into gr0 */
>         "    lghi    1,0\n"            /* 0 into gr1 */
>         "    lghi    2,0\n"            /* 0 into gr2 */
>         "    .insn    rre,0xb2af0000,0,0\n"    /* PQAP(TAPQ) */
>         "0:    la    %[reg1],1\n"        /* 1 into reg1 */
>         "1:\n"
>         EX_TABLE(0b, 1b)
>         : [reg1] "+&d" (reg1)
>         : [reg0] "d" (reg0)
>         : "cc", "0", "1", "2");
>     return reg1 != 0;
> }

Well, as Janos wrote - these lib functions rely on having QCI available.
However, be careful using the above function, as it sets up an exception
table entry to catch the illegal instruction that will arise when no AP
bus support is available.

>> +
>> +	return true;
>> +}
>> diff --git a/lib/s390x/ap.h b/lib/s390x/ap.h
>> new file mode 100644
>> index 00000000..b806513f
>> --- /dev/null
>> +++ b/lib/s390x/ap.h
>> @@ -0,0 +1,88 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +/*
>> + * AP definitions
>> + *
>> + * Some parts taken from the Linux AP driver.
>> + *
>> + * Copyright IBM Corp. 2024
>> + * Author: Janosch Frank <frankja@linux.ibm.com>
>> + *	   Tony Krowiak <akrowia@linux.ibm.com>
>> + *	   Martin Schwidefsky <schwidefsky@de.ibm.com>
>> + *	   Harald Freudenberger <freude@de.ibm.com>
>> + */
>> +
>> +#ifndef _S390X_AP_H_
>> +#define _S390X_AP_H_
>> +
>> +enum PQAP_FC {
>> +	PQAP_TEST_APQ,
>> +	PQAP_RESET_APQ,
>> +	PQAP_ZEROIZE_APQ,
>> +	PQAP_QUEUE_INT_CONTRL,
>> +	PQAP_QUERY_AP_CONF_INFO,
>> +	PQAP_QUERY_AP_COMP_TYPE,

What is that       ^^^^^^^^^^^^^ ?
The QACT ? But that's only one PQAP subfunction.

>> +	PQAP_BEST_AP,

And this ^^^^^^^^^^^^ ? Never heard of it.

> 
> 
> Maybe use abbreviations like your function names above?
> 
> 	PQAP_TAPQ,
> 	PQAP_RAPQ,
> 	PQAP_ZAPQ,
> 	PQAP_AQIC,
> 	PQAP_QCI,
> 	PQAP_QACT,
> 	PQAP_QBAP
> 
> 
>> +};
>> +
>> +struct ap_queue_status {
>> +	uint32_t pad0;			/* Ignored padding for zArch  */
> 
> 
> Bit 0 is the E (queue empty) bit.
> 
> 
>> +	uint32_t empty		: 1;
>> +	uint32_t replies_waiting: 1;
>> +	uint32_t full		: 1;
>> +	uint32_t pad1		: 4;
>> +	uint32_t irq_enabled	: 1;
>> +	uint32_t rc		: 8;
>> +	uint32_t pad2		: 16;
>> +} __attribute__((packed))  __attribute__((aligned(8)));
>> +_Static_assert(sizeof(struct ap_queue_status) == sizeof(uint64_t), 
>> "APQSW size");
>> +
>> +struct ap_config_info {
>> +	uint8_t apsc	 : 1;	/* S bit */
>> +	uint8_t apxa	 : 1;	/* N bit */
>> +	uint8_t qact	 : 1;	/* C bit */
>> +	uint8_t rc8a	 : 1;	/* R bit */
>> +	uint8_t l	 : 1;	/* L bit */
>> +	uint8_t lext	 : 3;	/* Lext bits */
>> +	uint8_t reserved2[3];
>> +	uint8_t Na;		/* max # of APs - 1 */
>> +	uint8_t Nd;		/* max # of Domains - 1 */
>> +	uint8_t reserved6[10];
>> +	uint32_t apm[8];	/* AP ID mask */
>> +	uint32_t aqm[8];	/* AP (usage) queue mask */
>> +	uint32_t adm[8];	/* AP (control) domain mask */
>> +	uint8_t _reserved4[16];
>> +} __attribute__((aligned(8))) __attribute__ ((__packed__));
>> +_Static_assert(sizeof(struct ap_config_info) == 128, "PQAP QCI 
>> size");
>> +
>> +struct pqap_r0 {
>> +	uint32_t pad0;
>> +	uint8_t fc;
>> +	uint8_t t : 1;		/* Test facilities (TAPQ)*/
>> +	uint8_t pad1 : 7;
>> +	uint8_t ap;
> 
> 
> This is the APID part of an APQN, so how about renaming to 'apid'
> 
> 
>> +	uint8_t qn;
> 
> 
> This is the APQI  part of an APQN, so how about renaming to 'apqi'
> 
> 
>> +} __attribute__((packed))  __attribute__((aligned(8)));
>> +
>> +struct pqap_r2 {
>> +	uint8_t s : 1;		/* Special Command facility */
>> +	uint8_t m : 1;		/* AP4KM */
>> +	uint8_t c : 1;		/* AP4KC */
>> +	uint8_t cop : 1;	/* AP is in coprocessor mode */
>> +	uint8_t acc : 1;	/* AP is in accelerator mode */
>> +	uint8_t xcp : 1;	/* AP is in XCP-mode */
>> +	uint8_t n : 1;		/* AP extended addressing facility */
>> +	uint8_t pad_0 : 1;
>> +	uint8_t pad_1[3];
> 
> 
> Is there a reason why the 'Classification'  field is left out?
> 
> 
>> +	uint8_t at;
>> +	uint8_t nd;
>> +	uint8_t pad_6;
>> +	uint8_t pad_7 : 4;
>> +	uint8_t qd : 4;
>> +} __attribute__((packed))  __attribute__((aligned(8)));
>> +_Static_assert(sizeof(struct pqap_r2) == sizeof(uint64_t), "pqap_r2 
>> size");
>> +

Isn't this all going into the kernel? So consider using the
arch/s390/include/asm/ap.h header file instead of re-defining the 
structs.
Feel free to improve this file with reworked structs and field names if
that's required.

>> +bool ap_setup(void);
>> +int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status 
>> *apqsw,
>> +		 struct pqap_r2 *r2);
>> +int ap_pqap_qci(struct ap_config_info *info);
>> +#endif
>> diff --git a/s390x/Makefile b/s390x/Makefile
>> index 7fce9f9d..4f6c627d 100644
>> --- a/s390x/Makefile
>> +++ b/s390x/Makefile
>> @@ -110,6 +110,7 @@ cflatobjs += lib/s390x/malloc_io.o
>>   cflatobjs += lib/s390x/uv.o
>>   cflatobjs += lib/s390x/sie.o
>>   cflatobjs += lib/s390x/fault.o
>> +cflatobjs += lib/s390x/ap.o
>>     OBJDIRS += lib/s390x
>> 

^ permalink raw reply	[relevance 0%]

* Re: [kvm-unit-tests PATCH v4 1/7] lib: s390x: Add ap library
  2024-02-02 14:59  4% ` [kvm-unit-tests PATCH v4 1/7] lib: s390x: Add ap library Janosch Frank
@ 2024-02-05 18:15  0%   ` Anthony Krowiak
  2024-02-06  8:48  0%     ` Harald Freudenberger
  0 siblings, 1 reply; 200+ results
From: Anthony Krowiak @ 2024-02-05 18:15 UTC (permalink / raw)
  To: Janosch Frank, kvm
  Cc: linux-s390, imbrenda, thuth, david, nsg, nrb, jjherne,
	Harald Freudenberger

I made a few comments and suggestions. I am not very well-versed in the 
inline assembly code, so I'll leave that up to someone who is more 
knowledgeable. I copied @Harald since I believe it was him who wrote it.

On 2/2/24 9:59 AM, Janosch Frank wrote:
> Add functions and definitions needed to test the Adjunct
> Processor (AP) crypto interface.
>
> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
> ---
>   lib/s390x/ap.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   lib/s390x/ap.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>   s390x/Makefile |  1 +
>   3 files changed, 186 insertions(+)
>   create mode 100644 lib/s390x/ap.c
>   create mode 100644 lib/s390x/ap.h
>
> diff --git a/lib/s390x/ap.c b/lib/s390x/ap.c
> new file mode 100644
> index 00000000..9560ff64
> --- /dev/null
> +++ b/lib/s390x/ap.c
> @@ -0,0 +1,97 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * AP crypto functions
> + *
> + * Some parts taken from the Linux AP driver.
> + *
> + * Copyright IBM Corp. 2024
> + * Author: Janosch Frank <frankja@linux.ibm.com>
> + *	   Tony Krowiak <akrowia@linux.ibm.com>
> + *	   Martin Schwidefsky <schwidefsky@de.ibm.com>
> + *	   Harald Freudenberger <freude@de.ibm.com>
> + */
> +
> +#include <libcflat.h>
> +#include <interrupt.h>
> +#include <ap.h>
> +#include <asm/time.h>
> +#include <asm/facility.h>
> +
> +int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
> +		 struct pqap_r2 *r2)
> +{
> +	struct pqap_r0 r0 = {
> +		.ap = ap,
> +		.qn = qn,
> +		.fc = PQAP_TEST_APQ
> +	};
> +	uint64_t bogus_cc = 2;
> +	int cc;
> +
> +	/*
> +	 * Test AP Queue
> +	 *
> +	 * Writes AP configuration information to the memory pointed
> +	 * at by GR2.
> +	 *
> +	 * Inputs: GR0
> +	 * Outputs: GR1 (APQSW), GR2 (tapq data)
> +	 * Synchronous
> +	 */
> +	asm volatile(
> +		"       tmll	%[bogus_cc],3\n"
> +		"	lgr	0,%[r0]\n"
> +		"	.insn	rre,0xb2af0000,0,0\n" /* PQAP */
> +		"	stg	1,%[apqsw]\n"
> +		"	stg	2,%[r2]\n"
> +		"	ipm	%[cc]\n"
> +		"	srl	%[cc],28\n"
> +		: [apqsw] "=&T" (*apqsw), [r2] "=&T" (*r2), [cc] "=&d" (cc)
> +		: [r0] "d" (r0), [bogus_cc] "d" (bogus_cc));
> +
> +	return cc;
> +}
> +
> +int ap_pqap_qci(struct ap_config_info *info)
> +{
> +	struct pqap_r0 r0 = { .fc = PQAP_QUERY_AP_CONF_INFO };
> +	uint64_t bogus_cc = 2;
> +	int cc;
> +
> +	/*
> +	 * Query AP Configuration Information
> +	 *
> +	 * Writes AP configuration information to the memory pointed
> +	 * at by GR2.
> +	 *
> +	 * Inputs: GR0, GR2 (QCI block address)
> +	 * Outputs: memory at GR2 address
> +	 * Synchronous
> +	 */
> +	asm volatile(
> +		"       tmll	%[bogus_cc],3\n"
> +		"	lgr	0,%[r0]\n"
> +		"	lgr	2,%[info]\n"
> +		"	.insn	rre,0xb2af0000,0,0\n" /* PQAP */
> +		"	ipm	%[cc]\n"
> +		"	srl	%[cc],28\n"
> +		: [cc] "=&d" (cc)
> +		: [r0] "d" (r0), [info] "d" (info), [bogus_cc] "d" (bogus_cc)
> +		: "cc", "memory", "0", "2");
> +
> +	return cc;
> +}
> +
> +/* Will later be extended to a proper setup function */
> +bool ap_setup(void)
> +{
> +	/*
> +	 * Base AP support has no STFLE or SCLP feature bit but the
> +	 * PQAP QCI support is indicated via stfle bit 12. As this
> +	 * library relies on QCI we bail out if it's not available.
> +	 */
> +	if (!test_facility(12))
> +		return false;


The STFLE.12 can be turned off when starting the guest, so this may not 
be a valid test.

We use the ap_instructions_available function (in ap.h) which executes 
the TAPQ command to verify whether the AP instructions are installed or 
not. Maybe you can do something similar here:


static inline bool ap_instructions_available(void)

{
     unsigned long reg0 = 0x0000;
     unsigned long reg1 = 0;

     asm volatile(
         "    lgr    0,%[reg0]\n"        /* qid into gr0 */
         "    lghi    1,0\n"            /* 0 into gr1 */
         "    lghi    2,0\n"            /* 0 into gr2 */
         "    .insn    rre,0xb2af0000,0,0\n"    /* PQAP(TAPQ) */
         "0:    la    %[reg1],1\n"        /* 1 into reg1 */
         "1:\n"
         EX_TABLE(0b, 1b)
         : [reg1] "+&d" (reg1)
         : [reg0] "d" (reg0)
         : "cc", "0", "1", "2");
     return reg1 != 0;
}
> +
> +	return true;
> +}
> diff --git a/lib/s390x/ap.h b/lib/s390x/ap.h
> new file mode 100644
> index 00000000..b806513f
> --- /dev/null
> +++ b/lib/s390x/ap.h
> @@ -0,0 +1,88 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * AP definitions
> + *
> + * Some parts taken from the Linux AP driver.
> + *
> + * Copyright IBM Corp. 2024
> + * Author: Janosch Frank <frankja@linux.ibm.com>
> + *	   Tony Krowiak <akrowia@linux.ibm.com>
> + *	   Martin Schwidefsky <schwidefsky@de.ibm.com>
> + *	   Harald Freudenberger <freude@de.ibm.com>
> + */
> +
> +#ifndef _S390X_AP_H_
> +#define _S390X_AP_H_
> +
> +enum PQAP_FC {
> +	PQAP_TEST_APQ,
> +	PQAP_RESET_APQ,
> +	PQAP_ZEROIZE_APQ,
> +	PQAP_QUEUE_INT_CONTRL,
> +	PQAP_QUERY_AP_CONF_INFO,
> +	PQAP_QUERY_AP_COMP_TYPE,
> +	PQAP_BEST_AP,


Maybe use abbreviations like your function names above?

	PQAP_TAPQ,
	PQAP_RAPQ,
	PQAP_ZAPQ,
	PQAP_AQIC,
	PQAP_QCI,
	PQAP_QACT,
	PQAP_QBAP


> +};
> +
> +struct ap_queue_status {
> +	uint32_t pad0;			/* Ignored padding for zArch  */


Bit 0 is the E (queue empty) bit.


> +	uint32_t empty		: 1;
> +	uint32_t replies_waiting: 1;
> +	uint32_t full		: 1;
> +	uint32_t pad1		: 4;
> +	uint32_t irq_enabled	: 1;
> +	uint32_t rc		: 8;
> +	uint32_t pad2		: 16;
> +} __attribute__((packed))  __attribute__((aligned(8)));
> +_Static_assert(sizeof(struct ap_queue_status) == sizeof(uint64_t), "APQSW size");
> +
> +struct ap_config_info {
> +	uint8_t apsc	 : 1;	/* S bit */
> +	uint8_t apxa	 : 1;	/* N bit */
> +	uint8_t qact	 : 1;	/* C bit */
> +	uint8_t rc8a	 : 1;	/* R bit */
> +	uint8_t l	 : 1;	/* L bit */
> +	uint8_t lext	 : 3;	/* Lext bits */
> +	uint8_t reserved2[3];
> +	uint8_t Na;		/* max # of APs - 1 */
> +	uint8_t Nd;		/* max # of Domains - 1 */
> +	uint8_t reserved6[10];
> +	uint32_t apm[8];	/* AP ID mask */
> +	uint32_t aqm[8];	/* AP (usage) queue mask */
> +	uint32_t adm[8];	/* AP (control) domain mask */
> +	uint8_t _reserved4[16];
> +} __attribute__((aligned(8))) __attribute__ ((__packed__));
> +_Static_assert(sizeof(struct ap_config_info) == 128, "PQAP QCI size");
> +
> +struct pqap_r0 {
> +	uint32_t pad0;
> +	uint8_t fc;
> +	uint8_t t : 1;		/* Test facilities (TAPQ)*/
> +	uint8_t pad1 : 7;
> +	uint8_t ap;


This is the APID part of an APQN, so how about renaming to 'apid'


> +	uint8_t qn;


This is the APQI  part of an APQN, so how about renaming to 'apqi'


> +} __attribute__((packed))  __attribute__((aligned(8)));
> +
> +struct pqap_r2 {
> +	uint8_t s : 1;		/* Special Command facility */
> +	uint8_t m : 1;		/* AP4KM */
> +	uint8_t c : 1;		/* AP4KC */
> +	uint8_t cop : 1;	/* AP is in coprocessor mode */
> +	uint8_t acc : 1;	/* AP is in accelerator mode */
> +	uint8_t xcp : 1;	/* AP is in XCP-mode */
> +	uint8_t n : 1;		/* AP extended addressing facility */
> +	uint8_t pad_0 : 1;
> +	uint8_t pad_1[3];


Is there a reason why the 'Classification'  field is left out?


> +	uint8_t at;
> +	uint8_t nd;
> +	uint8_t pad_6;
> +	uint8_t pad_7 : 4;
> +	uint8_t qd : 4;
> +} __attribute__((packed))  __attribute__((aligned(8)));
> +_Static_assert(sizeof(struct pqap_r2) == sizeof(uint64_t), "pqap_r2 size");
> +
> +bool ap_setup(void);
> +int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
> +		 struct pqap_r2 *r2);
> +int ap_pqap_qci(struct ap_config_info *info);
> +#endif
> diff --git a/s390x/Makefile b/s390x/Makefile
> index 7fce9f9d..4f6c627d 100644
> --- a/s390x/Makefile
> +++ b/s390x/Makefile
> @@ -110,6 +110,7 @@ cflatobjs += lib/s390x/malloc_io.o
>   cflatobjs += lib/s390x/uv.o
>   cflatobjs += lib/s390x/sie.o
>   cflatobjs += lib/s390x/fault.o
> +cflatobjs += lib/s390x/ap.o
>   
>   OBJDIRS += lib/s390x
>   

^ permalink raw reply	[relevance 0%]

* [kvm-unit-tests PATCH v4 3/7] lib: s390x: ap: Add proper ap setup code
    2024-02-02 14:59  4% ` [kvm-unit-tests PATCH v4 1/7] lib: s390x: Add ap library Janosch Frank
@ 2024-02-02 14:59  5% ` Janosch Frank
  1 sibling, 0 replies; 200+ results
From: Janosch Frank @ 2024-02-02 14:59 UTC (permalink / raw)
  To: kvm; +Cc: linux-s390, imbrenda, thuth, david, nsg, nrb, akrowiak, jjherne

For the next tests we need a valid queue, which means we need to grab the
QCI info and search for the first set bit in the AP and AQ masks.

Let's move from the STFLE bit 12 check to proper setup code that also
returns arrays of the APs and queue numbers.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 lib/s390x/ap.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++---
 lib/s390x/ap.h | 17 +++++++++-
 s390x/ap.c     |  4 ++-
 3 files changed, 105 insertions(+), 6 deletions(-)

diff --git a/lib/s390x/ap.c b/lib/s390x/ap.c
index 9560ff64..c85172a8 100644
--- a/lib/s390x/ap.c
+++ b/lib/s390x/ap.c
@@ -13,10 +13,18 @@
 
 #include <libcflat.h>
 #include <interrupt.h>
+#include <alloc.h>
+#include <bitops.h>
 #include <ap.h>
 #include <asm/time.h>
 #include <asm/facility.h>
 
+static uint8_t num_ap;
+static uint8_t num_queue;
+static uint8_t *array_ap;
+static uint8_t *array_queue;
+static struct ap_config_info qci;
+
 int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
 		 struct pqap_r2 *r2)
 {
@@ -82,16 +90,90 @@ int ap_pqap_qci(struct ap_config_info *info)
 	return cc;
 }
 
-/* Will later be extended to a proper setup function */
-bool ap_setup(void)
+static int get_entry(uint8_t *ptr, int i, size_t len)
 {
+	/* Skip over the last entry */
+	if (i)
+		i++;
+
+	for (; i < 8 * len; i++) {
+		/* Do we even need to check the byte? */
+		if (!ptr[i / 8]) {
+			i += 7;
+			continue;
+		}
+
+		/* Check the bit in big-endian order */
+		if (ptr[i / 8] & BIT(7 - (i % 8)))
+			return i;
+	}
+	return -1;
+}
+
+static void setup_info(void)
+{
+	int i, entry = 0;
+
+	ap_pqap_qci(&qci);
+
+	for (i = 0; i < AP_ARRAY_MAX_NUM; i++) {
+		entry = get_entry((uint8_t *)qci.apm, entry, sizeof(qci.apm));
+		if (entry == -1)
+			break;
+
+		if (!num_ap) {
+			/*
+			 * We have at least one AP, time to
+			 * allocate the queue arrays
+			 */
+			array_ap = malloc(AP_ARRAY_MAX_NUM);
+			array_queue = malloc(AP_ARRAY_MAX_NUM);
+		}
+		array_ap[num_ap] = entry;
+		num_ap += 1;
+	}
+
+	/* Without an AP we don't even need to look at the queues */
+	if (!num_ap)
+		return;
+
+	entry = 0;
+	for (i = 0; i < AP_ARRAY_MAX_NUM; i++) {
+		entry = get_entry((uint8_t *)qci.aqm, entry, sizeof(qci.aqm));
+		if (entry == -1)
+			return;
+
+		array_queue[num_queue] = entry;
+		num_queue += 1;
+	}
+
+}
+
+int ap_setup(uint8_t **ap_array, uint8_t **qn_array, uint8_t *naps, uint8_t *nqns)
+{
+	assert(!num_ap);
+
 	/*
 	 * Base AP support has no STFLE or SCLP feature bit but the
 	 * PQAP QCI support is indicated via stfle bit 12. As this
 	 * library relies on QCI we bail out if it's not available.
 	 */
 	if (!test_facility(12))
-		return false;
+		return AP_SETUP_NOINSTR;
 
-	return true;
+	/* Setup ap and queue arrays */
+	setup_info();
+
+	if (!num_ap)
+		return AP_SETUP_NOAPQN;
+
+	/* Only provide the data if the caller actually needs it. */
+	if (ap_array && qn_array && naps && nqns) {
+		*ap_array = array_ap;
+		*qn_array = array_queue;
+		*naps = num_ap;
+		*nqns = num_queue;
+	}
+
+	return AP_SETUP_READY;
 }
diff --git a/lib/s390x/ap.h b/lib/s390x/ap.h
index b806513f..8feca43a 100644
--- a/lib/s390x/ap.h
+++ b/lib/s390x/ap.h
@@ -81,7 +81,22 @@ struct pqap_r2 {
 } __attribute__((packed))  __attribute__((aligned(8)));
 _Static_assert(sizeof(struct pqap_r2) == sizeof(uint64_t), "pqap_r2 size");
 
-bool ap_setup(void);
+/*
+ * Maximum number of APQNs that we keep track of.
+ *
+ * This value is already way larger than the number of APQNs an AP test
+ * is probably going to use.
+ */
+#define AP_ARRAY_MAX_NUM	16
+
+/* Return values of ap_setup() */
+enum {
+	AP_SETUP_NOINSTR,	/* AP instructions not available */
+	AP_SETUP_NOAPQN,	/* Instructions available but no APQNs */
+	AP_SETUP_READY		/* Full setup complete, at least one APQN */
+};
+
+int ap_setup(uint8_t **ap_array, uint8_t **qn_array, uint8_t *naps, uint8_t *nqns);
 int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
 		 struct pqap_r2 *r2);
 int ap_pqap_qci(struct ap_config_info *info);
diff --git a/s390x/ap.c b/s390x/ap.c
index b3cee37a..5c458e7e 100644
--- a/s390x/ap.c
+++ b/s390x/ap.c
@@ -292,8 +292,10 @@ static void test_priv(void)
 
 int main(void)
 {
+	int setup_rc = ap_setup(NULL, NULL, NULL, NULL);
+
 	report_prefix_push("ap");
-	if (!ap_check()) {
+	if (setup_rc == AP_SETUP_NOINSTR) {
 		report_skip("AP instructions not available");
 		goto done;
 	}
-- 
2.40.1


^ permalink raw reply related	[relevance 5%]

* [kvm-unit-tests PATCH v4 1/7] lib: s390x: Add ap library
  @ 2024-02-02 14:59  4% ` Janosch Frank
  2024-02-05 18:15  0%   ` Anthony Krowiak
  2024-02-02 14:59  5% ` [kvm-unit-tests PATCH v4 3/7] lib: s390x: ap: Add proper ap setup code Janosch Frank
  1 sibling, 1 reply; 200+ results
From: Janosch Frank @ 2024-02-02 14:59 UTC (permalink / raw)
  To: kvm; +Cc: linux-s390, imbrenda, thuth, david, nsg, nrb, akrowiak, jjherne

Add functions and definitions needed to test the Adjunct
Processor (AP) crypto interface.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 lib/s390x/ap.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/s390x/ap.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
 s390x/Makefile |  1 +
 3 files changed, 186 insertions(+)
 create mode 100644 lib/s390x/ap.c
 create mode 100644 lib/s390x/ap.h

diff --git a/lib/s390x/ap.c b/lib/s390x/ap.c
new file mode 100644
index 00000000..9560ff64
--- /dev/null
+++ b/lib/s390x/ap.c
@@ -0,0 +1,97 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * AP crypto functions
+ *
+ * Some parts taken from the Linux AP driver.
+ *
+ * Copyright IBM Corp. 2024
+ * Author: Janosch Frank <frankja@linux.ibm.com>
+ *	   Tony Krowiak <akrowia@linux.ibm.com>
+ *	   Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *	   Harald Freudenberger <freude@de.ibm.com>
+ */
+
+#include <libcflat.h>
+#include <interrupt.h>
+#include <ap.h>
+#include <asm/time.h>
+#include <asm/facility.h>
+
+int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
+		 struct pqap_r2 *r2)
+{
+	struct pqap_r0 r0 = {
+		.ap = ap,
+		.qn = qn,
+		.fc = PQAP_TEST_APQ
+	};
+	uint64_t bogus_cc = 2;
+	int cc;
+
+	/*
+	 * Test AP Queue
+	 *
+	 * Writes AP configuration information to the memory pointed
+	 * at by GR2.
+	 *
+	 * Inputs: GR0
+	 * Outputs: GR1 (APQSW), GR2 (tapq data)
+	 * Synchronous
+	 */
+	asm volatile(
+		"       tmll	%[bogus_cc],3\n"
+		"	lgr	0,%[r0]\n"
+		"	.insn	rre,0xb2af0000,0,0\n" /* PQAP */
+		"	stg	1,%[apqsw]\n"
+		"	stg	2,%[r2]\n"
+		"	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		: [apqsw] "=&T" (*apqsw), [r2] "=&T" (*r2), [cc] "=&d" (cc)
+		: [r0] "d" (r0), [bogus_cc] "d" (bogus_cc));
+
+	return cc;
+}
+
+int ap_pqap_qci(struct ap_config_info *info)
+{
+	struct pqap_r0 r0 = { .fc = PQAP_QUERY_AP_CONF_INFO };
+	uint64_t bogus_cc = 2;
+	int cc;
+
+	/*
+	 * Query AP Configuration Information
+	 *
+	 * Writes AP configuration information to the memory pointed
+	 * at by GR2.
+	 *
+	 * Inputs: GR0, GR2 (QCI block address)
+	 * Outputs: memory at GR2 address
+	 * Synchronous
+	 */
+	asm volatile(
+		"       tmll	%[bogus_cc],3\n"
+		"	lgr	0,%[r0]\n"
+		"	lgr	2,%[info]\n"
+		"	.insn	rre,0xb2af0000,0,0\n" /* PQAP */
+		"	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		: [cc] "=&d" (cc)
+		: [r0] "d" (r0), [info] "d" (info), [bogus_cc] "d" (bogus_cc)
+		: "cc", "memory", "0", "2");
+
+	return cc;
+}
+
+/* Will later be extended to a proper setup function */
+bool ap_setup(void)
+{
+	/*
+	 * Base AP support has no STFLE or SCLP feature bit but the
+	 * PQAP QCI support is indicated via stfle bit 12. As this
+	 * library relies on QCI we bail out if it's not available.
+	 */
+	if (!test_facility(12))
+		return false;
+
+	return true;
+}
diff --git a/lib/s390x/ap.h b/lib/s390x/ap.h
new file mode 100644
index 00000000..b806513f
--- /dev/null
+++ b/lib/s390x/ap.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * AP definitions
+ *
+ * Some parts taken from the Linux AP driver.
+ *
+ * Copyright IBM Corp. 2024
+ * Author: Janosch Frank <frankja@linux.ibm.com>
+ *	   Tony Krowiak <akrowia@linux.ibm.com>
+ *	   Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *	   Harald Freudenberger <freude@de.ibm.com>
+ */
+
+#ifndef _S390X_AP_H_
+#define _S390X_AP_H_
+
+enum PQAP_FC {
+	PQAP_TEST_APQ,
+	PQAP_RESET_APQ,
+	PQAP_ZEROIZE_APQ,
+	PQAP_QUEUE_INT_CONTRL,
+	PQAP_QUERY_AP_CONF_INFO,
+	PQAP_QUERY_AP_COMP_TYPE,
+	PQAP_BEST_AP,
+};
+
+struct ap_queue_status {
+	uint32_t pad0;			/* Ignored padding for zArch  */
+	uint32_t empty		: 1;
+	uint32_t replies_waiting: 1;
+	uint32_t full		: 1;
+	uint32_t pad1		: 4;
+	uint32_t irq_enabled	: 1;
+	uint32_t rc		: 8;
+	uint32_t pad2		: 16;
+} __attribute__((packed))  __attribute__((aligned(8)));
+_Static_assert(sizeof(struct ap_queue_status) == sizeof(uint64_t), "APQSW size");
+
+struct ap_config_info {
+	uint8_t apsc	 : 1;	/* S bit */
+	uint8_t apxa	 : 1;	/* N bit */
+	uint8_t qact	 : 1;	/* C bit */
+	uint8_t rc8a	 : 1;	/* R bit */
+	uint8_t l	 : 1;	/* L bit */
+	uint8_t lext	 : 3;	/* Lext bits */
+	uint8_t reserved2[3];
+	uint8_t Na;		/* max # of APs - 1 */
+	uint8_t Nd;		/* max # of Domains - 1 */
+	uint8_t reserved6[10];
+	uint32_t apm[8];	/* AP ID mask */
+	uint32_t aqm[8];	/* AP (usage) queue mask */
+	uint32_t adm[8];	/* AP (control) domain mask */
+	uint8_t _reserved4[16];
+} __attribute__((aligned(8))) __attribute__ ((__packed__));
+_Static_assert(sizeof(struct ap_config_info) == 128, "PQAP QCI size");
+
+struct pqap_r0 {
+	uint32_t pad0;
+	uint8_t fc;
+	uint8_t t : 1;		/* Test facilities (TAPQ)*/
+	uint8_t pad1 : 7;
+	uint8_t ap;
+	uint8_t qn;
+} __attribute__((packed))  __attribute__((aligned(8)));
+
+struct pqap_r2 {
+	uint8_t s : 1;		/* Special Command facility */
+	uint8_t m : 1;		/* AP4KM */
+	uint8_t c : 1;		/* AP4KC */
+	uint8_t cop : 1;	/* AP is in coprocessor mode */
+	uint8_t acc : 1;	/* AP is in accelerator mode */
+	uint8_t xcp : 1;	/* AP is in XCP-mode */
+	uint8_t n : 1;		/* AP extended addressing facility */
+	uint8_t pad_0 : 1;
+	uint8_t pad_1[3];
+	uint8_t at;
+	uint8_t nd;
+	uint8_t pad_6;
+	uint8_t pad_7 : 4;
+	uint8_t qd : 4;
+} __attribute__((packed))  __attribute__((aligned(8)));
+_Static_assert(sizeof(struct pqap_r2) == sizeof(uint64_t), "pqap_r2 size");
+
+bool ap_setup(void);
+int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
+		 struct pqap_r2 *r2);
+int ap_pqap_qci(struct ap_config_info *info);
+#endif
diff --git a/s390x/Makefile b/s390x/Makefile
index 7fce9f9d..4f6c627d 100644
--- a/s390x/Makefile
+++ b/s390x/Makefile
@@ -110,6 +110,7 @@ cflatobjs += lib/s390x/malloc_io.o
 cflatobjs += lib/s390x/uv.o
 cflatobjs += lib/s390x/sie.o
 cflatobjs += lib/s390x/fault.o
+cflatobjs += lib/s390x/ap.o
 
 OBJDIRS += lib/s390x
 
-- 
2.40.1


^ permalink raw reply related	[relevance 4%]

* [PATCH v8 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  @ 2024-01-31 10:13  4% ` Zhao Liu
  0 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2024-01-31 10:13 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Paolo Bonzini,
	Richard Henderson, Eric Blake, Markus Armbruster,
	Marcelo Tosatti
  Cc: qemu-devel, kvm, Babu Moger, Xiaoyao Li, Zhenyu Wang,
	Zhuocheng Ding, Yongwei Ma, Zhao Liu

From: Zhao Liu <zhao1.liu@intel.com>

The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
for cpuid 0x8000001D") adds the cache topology for AMD CPUs by encoding
the number of sharing threads directly.

From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

From the description above, the calculation of this field should be the same
as that of CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the APIC ID
offsets to calculate this field.

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
     Information
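
As a quick illustration of the quoted formula (plain C; the helper name is
made up for this example only):

  static uint32_t x86_share_id(uint32_t local_apic_id, uint32_t num_sharing_cache)
  {
      uint32_t n = num_sharing_cache + 1;
      uint32_t shift = 0;

      /* Round NumSharingCache+1 up to the next power of two, then shift. */
      while ((1u << shift) < n) {
          shift++;
      }

      return local_apic_id >> shift;
  }

With this patch, NumSharingCache+1 is always 1 << apicid_die_offset() (L3) or
1 << apicid_core_offset() (L1/L2), so ShareId reduces to LocalApicId >> offset,
i.e. the same encoding already used for CPUID[4].EAX[bits 25:14].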

Cc: Babu Moger <babu.moger@amd.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
---
Changes since v7:
 * Moved this patch after CPUID[4]'s similar change ("i386/cpu: Use APIC
   ID offset to encode cache topo in CPUID[4]"). (Xiaoyao)
 * Dropped Michael/Babu's Acked/Reviewed/Tested tags since the code
   changed due to the rebase.
 * Re-added Yongwei's Tested tag for his re-testing (compilation on
   Intel platforms).

Changes since v3:
 * Rewrote the subject. (Babu)
 * Deleted the original "comment/help" expression, as this behavior is
   confirmed for AMD CPUs. (Babu)
 * Renamed "num_apic_ids" (v3) to "num_sharing_cache" to match spec
   definition. (Babu)

Changes since v1:
 * Renamed "l3_threads" to "num_apic_ids" in
   encode_cache_cpuid8000001d(). (Yanan)
 * Added the description of the original commit and add Cc.
---
 target/i386/cpu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 747cfb6cac03..65944645db5c 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -331,7 +331,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
                                        uint32_t *eax, uint32_t *ebx,
                                        uint32_t *ecx, uint32_t *edx)
 {
-    uint32_t l3_threads;
+    uint32_t num_sharing_cache;
     assert(cache->size == cache->line_size * cache->associativity *
                           cache->partitions * cache->sets);
 
@@ -340,11 +340,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
 
     /* L3 is shared among multiple cores */
     if (cache->level == 3) {
-        l3_threads = topo_info->cores_per_die * topo_info->threads_per_core;
-        *eax |= (l3_threads - 1) << 14;
+        num_sharing_cache = 1 << apicid_die_offset(topo_info);
     } else {
-        *eax |= ((topo_info->threads_per_core - 1) << 14);
+        num_sharing_cache = 1 << apicid_core_offset(topo_info);
     }
+    *eax |= (num_sharing_cache - 1) << 14;
 
     assert(cache->line_size > 0);
     assert(cache->partitions > 0);
-- 
2.34.1


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH v2 11/25] x86/sev: Adjust directmap to avoid inadvertant RMP faults
  2024-01-27 16:02  4%               ` Borislav Petkov
@ 2024-01-29 11:59  3%                 ` Borislav Petkov
  0 siblings, 0 replies; 200+ results
From: Borislav Petkov @ 2024-01-29 11:59 UTC (permalink / raw)
  To: Michael Roth
  Cc: x86, kvm, linux-coco, linux-mm, linux-crypto, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, tobin, vbabka, kirill, ak,
	tony.luck, sathyanarayanan.kuppuswamy, alpergun, jarkko,
	ashish.kalra, nikunj.dadhania, pankaj.gupta, liam.merwick

On Sat, Jan 27, 2024 at 05:02:49PM +0100, Borislav Petkov wrote:
> This function takes any PFN it gets passed in as it is. I don't care
> who its users are now or in the future and whether they pay attention to
> what they pass in - it needs to be properly defined.

Ok, we solved it offlist, here's the final version I have. It has
a comment explaining what I was asking.

---
From: Michael Roth <michael.roth@amd.com>
Date: Thu, 25 Jan 2024 22:11:11 -0600
Subject: [PATCH] x86/sev: Adjust the directmap to avoid inadvertent RMP faults

If the kernel uses a 2MB or larger directmap mapping to write to an
address, and that mapping contains any 4KB pages that are set to private
in the RMP table, an RMP #PF will trigger and cause a host crash.

SNP-aware code that owns the private PFNs will never attempt such
a write, but other kernel tasks writing to other PFNs in the range may
trigger these checks inadvertently by writing through a large directmap
mapping that happens to also map a private PFN.

Prevent this by splitting any 2MB+ mappings that might end up containing
a mix of private/shared PFNs as a result of a subsequent RMPUPDATE for
the PFN/rmp_level passed in.

Another way to handle this would be to limit the directmap to 4K
mappings on hosts that support SNP, but there is a potential risk of
performance regressions for certain host workloads.

Handling it as-needed results in the directmap being slowly split over
time, which lessens the risk of a performance regression since the more
the directmap gets split as a result of running SNP guests, the more
likely the host is being used primarily to run SNP guests, where
a mostly-split directmap is actually beneficial since there is less
chance of TLB flushing and cpa_lock contention being needed to perform
these splits.

Cases where a host knows in advance it wants to primarily run SNP guests
and wishes to pre-split the directmap can be handled by adding
a tunable in the future, but preliminary testing has shown this to not
provide a significant benefit in the common case of guests that are
backed primarily by 2MB THPs, so it does not seem to be warranted
currently and can be added later if a need arises.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240126041126.1927228-12-michael.roth@amd.com
---
 arch/x86/virt/svm/sev.c | 86 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index c0b4c2306e8d..8da9c5330ff0 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -368,6 +368,82 @@ int psmash(u64 pfn)
 }
 EXPORT_SYMBOL_GPL(psmash);
 
+/*
+ * If the kernel uses a 2MB or larger directmap mapping to write to an address,
+ * and that mapping contains any 4KB pages that are set to private in the RMP
+ * table, an RMP #PF will trigger and cause a host crash. Hypervisor code that
+ * owns the PFNs being transitioned will never attempt such a write, but other
+ * kernel tasks writing to other PFNs in the range may trigger these checks
+ * inadvertently due a large directmap mapping that happens to overlap such a
+ * PFN.
+ *
+ * Prevent this by splitting any 2MB+ mappings that might end up containing a
+ * mix of private/shared PFNs as a result of a subsequent RMPUPDATE for the
+ * PFN/rmp_level passed in.
+ *
+ * Note that there is no attempt here to scan all the RMP entries for the 2MB
+ * physical range, since it would only be worthwhile in determining if a
+ * subsequent RMPUPDATE for a 4KB PFN would result in all the entries being of
+ * the same shared/private state, thus avoiding the need to split the mapping.
+ * But that would mean the entries are currently in a mixed state, and so the
+ * mapping would have already been split as a result of prior transitions.
+ * And since the 4K split is only done if the mapping is 2MB+, and there isn't
+ * currently a mechanism in place to restore 2MB+ mappings, such a check would
+ * not provide any usable benefit.
+ *
+ * More specifics on how these checks are carried out can be found in APM
+ * Volume 2, "RMP and VMPL Access Checks".
+ */
+static int adjust_direct_map(u64 pfn, int rmp_level)
+{
+	unsigned long vaddr;
+	unsigned int level;
+	int npages, ret;
+	pte_t *pte;
+
+	/*
+	 * pfn_to_kaddr() will return a vaddr only within the direct
+	 * map range.
+	 */
+	vaddr = (unsigned long)pfn_to_kaddr(pfn);
+
+	/* Only 4KB/2MB RMP entries are supported by current hardware. */
+	if (WARN_ON_ONCE(rmp_level > PG_LEVEL_2M))
+		return -EINVAL;
+
+       if (!pfn_valid(pfn))
+               return -EINVAL;
+
+       if (rmp_level == PG_LEVEL_2M &&
+           (!IS_ALIGNED(pfn, PTRS_PER_PMD) ||
+            !pfn_valid(pfn + PTRS_PER_PMD - 1)))
+		return -EINVAL;
+
+	/*
+	 * If an entire 2MB physical range is being transitioned, then there is
+	 * no risk of RMP #PFs due to write accesses from overlapping mappings,
+	 * since even accesses from 1GB mappings will be treated as 2MB accesses
+	 * as far as RMP table checks are concerned.
+	 */
+	if (rmp_level == PG_LEVEL_2M)
+		return 0;
+
+	pte = lookup_address(vaddr, &level);
+	if (!pte || pte_none(*pte))
+		return 0;
+
+	if (level == PG_LEVEL_4K)
+		return 0;
+
+	npages = page_level_size(rmp_level) / PAGE_SIZE;
+	ret = set_memory_4k(vaddr, npages);
+	if (ret)
+		pr_warn("Failed to split direct map for PFN 0x%llx, ret: %d\n",
+			pfn, ret);
+
+	return ret;
+}
+
 /*
  * It is expected that those operations are seldom enough so that no mutual
  * exclusion of updaters is needed and thus the overlap error condition below
@@ -384,11 +460,16 @@ EXPORT_SYMBOL_GPL(psmash);
 static int rmpupdate(u64 pfn, struct rmp_state *state)
 {
 	unsigned long paddr = pfn << PAGE_SHIFT;
-	int ret;
+	int ret, level;
 
 	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
 		return -ENODEV;
 
+	level = RMP_TO_PG_LEVEL(state->pagesize);
+
+	if (adjust_direct_map(pfn, level))
+		return -EFAULT;
+
 	do {
 		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
 		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
@@ -398,7 +479,8 @@ static int rmpupdate(u64 pfn, struct rmp_state *state)
 	} while (ret == RMPUPDATE_FAIL_OVERLAP);
 
 	if (ret) {
-		pr_err("RMPUPDATE failed for PFN %llx, ret: %d\n", pfn, ret);
+		pr_err("RMPUPDATE failed for PFN %llx, pg_level: %d, ret: %d\n",
+		       pfn, level, ret);
 		dump_rmpentry(pfn);
 		dump_stack();
 		return -EFAULT;
-- 
2.43.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[relevance 3%]

* Re: [PATCH v2 11/25] x86/sev: Adjust directmap to avoid inadvertant RMP faults
  @ 2024-01-27 16:02  4%               ` Borislav Petkov
  2024-01-29 11:59  3%                 ` Borislav Petkov
  0 siblings, 1 reply; 200+ results
From: Borislav Petkov @ 2024-01-27 16:02 UTC (permalink / raw)
  To: Michael Roth
  Cc: x86, kvm, linux-coco, linux-mm, linux-crypto, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, tobin, vbabka, kirill, ak,
	tony.luck, sathyanarayanan.kuppuswamy, alpergun, jarkko,
	ashish.kalra, nikunj.dadhania, pankaj.gupta, liam.merwick

On Sat, Jan 27, 2024 at 09:45:06AM -0600, Michael Roth wrote:
> directmap maps all physical memory accessible by kernel, including text
> pages, so those are valid PFNs as far as this function is concerned.

Why don't you have a look at

Documentation/arch/x86/x86_64/mm.rst

to sync up on the nomenclature first?

   ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)

   ...

   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0

and so on.

> The expectation is that the caller is aware of what PFNs it is passing in,

There are no expectations. Have you written them down somewhere?

> This function only splits mappings in the 0xffff888000000000 directmap
> range.

This function takes any PFN it gets passed in as it is. I don't care
who its users are now or in the future and whether they pay attention
what they pass into - it needs to be properly defined.

Mike, please get on with the program. Use the right naming for the
function and basta.

IOW, this:

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 0a8f9334ec6e..652ee63e87fd 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -394,7 +394,7 @@ EXPORT_SYMBOL_GPL(psmash);
  * More specifics on how these checks are carried out can be found in APM
  * Volume 2, "RMP and VMPL Access Checks".
  */
-static int adjust_direct_map(u64 pfn, int rmp_level)
+static int split_pfn(u64 pfn, int rmp_level)
 {
 	unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);
 	unsigned int level;
@@ -405,7 +405,12 @@ static int adjust_direct_map(u64 pfn, int rmp_level)
 	if (WARN_ON_ONCE(rmp_level > PG_LEVEL_2M))
 		return -EINVAL;
 
-	if (WARN_ON_ONCE(rmp_level == PG_LEVEL_2M && !IS_ALIGNED(pfn, PTRS_PER_PMD)))
+       if (!pfn_valid(pfn))
+               return -EINVAL;
+
+       if (rmp_level == PG_LEVEL_2M &&
+           (!IS_ALIGNED(pfn, PTRS_PER_PMD) ||
+            !pfn_valid(pfn + PTRS_PER_PMD - 1)))
 		return -EINVAL;
 
 	/*
@@ -456,7 +461,7 @@ static int rmpupdate(u64 pfn, struct rmp_state *state)
 
 	level = RMP_TO_PG_LEVEL(state->pagesize);
 
-	if (adjust_direct_map(pfn, level))
+	if (split_pfn(pfn, level))
 		return -EFAULT;
 
 	do {


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[relevance 4%]

* Re: [PATCH v7 00/16] Add Secure TSC support for SNP guests
  2024-01-26  1:00  0%   ` Dionna Amalie Glaze
@ 2024-01-27  4:10  0%     ` Nikunj A. Dadhania
  0 siblings, 0 replies; 200+ results
From: Nikunj A. Dadhania @ 2024-01-27  4:10 UTC (permalink / raw)
  To: Dionna Amalie Glaze
  Cc: linux-kernel, thomas.lendacky, x86, kvm, bp, mingo, tglx,
	dave.hansen, pgonda, seanjc, pbonzini

On 1/26/2024 6:30 AM, Dionna Amalie Glaze wrote:
> On Wed, Jan 24, 2024 at 10:08 PM Nikunj A. Dadhania <nikunj@amd.com> wrote:
>>
>> On 12/20/2023 8:43 PM, Nikunj A Dadhania wrote:
>>> Secure TSC allows guests to securely use RDTSC/RDTSCP instructions as the
>>> parameters being used cannot be changed by hypervisor once the guest is
>>> launched. More details in the AMD64 APM Vol 2, Section "Secure TSC".
>>>
>>> During the boot-up of the secondary cpus, SecureTSC enabled guests need to
>>> query TSC info from AMD Security Processor. This communication channel is
>>> encrypted between the AMD Security Processor and the guest, the hypervisor
>>> is just the conduit to deliver the guest messages to the AMD Security
>>> Processor. Each message is protected with an AEAD (AES-256 GCM). See "SEV
>>> Secure Nested Paging Firmware ABI Specification" document (currently at
>>> https://www.amd.com/system/files/TechDocs/56860.pdf) section "TSC Info"
>>>
>>> Use a minimal GCM library to encrypt/decrypt SNP Guest messages to
>>> communicate with the AMD Security Processor which is available at early
>>> boot.
>>>
>>> SEV-guest driver has the implementation for guest and AMD Security
>>> Processor communication. As the TSC_INFO needs to be initialized during
>>> early boot before smp cpus are started, move most of the sev-guest driver
>>> code to kernel/sev.c and provide well defined APIs to the sev-guest driver
>>> to use the interface to avoid code-duplication.
>>>
>>> Patches:
>>> 01-08: Preparation and movement of sev-guest driver code
>>> 09-16: SecureTSC enablement patches.
>>>
>>> Testing SecureTSC
>>> -----------------
>>>
>>> SecureTSC hypervisor patches based on top of SEV-SNP Guest MEMFD series:
>>> https://github.com/nikunjad/linux/tree/snp-host-latest-securetsc_v5
>>>
>>> QEMU changes:
>>> https://github.com/nikunjad/qemu/tree/snp_securetsc_v5
>>>
>>> QEMU commandline SEV-SNP-UPM with SecureTSC:
>>>
>>>   qemu-system-x86_64 -cpu EPYC-Milan-v2,+secure-tsc,+invtsc -smp 4 \
>>>     -object memory-backend-memfd-private,id=ram1,size=1G,share=true \
>>>     -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,secure-tsc=on \
>>>     -machine q35,confidential-guest-support=sev0,memory-backend=ram1,kvm-type=snp \
>>>     ...
>>>
>>> Changelog:
>>> ----------
>>> v7:
>>> * Drop mutex from the snp_dev and add snp_guest_cmd_{lock,unlock} API
>>> * Added comments for secrets page failure
>>> * Added define for maximum supported VMPCK
>>> * Updated comments why sev_status is used directly instead of
>>>   cpu_feature_enabled()

I missed this in the change log:

    * Added Tested-by from Peter Gonda (https://lore.kernel.org/lkml/CAMkAt6pULjLVUO6Ys4Sq1a79d93_5w5URgLYNXY-aW2jSemruA@mail.gmail.com/)
>>
>> A gentle reminder.
>>
> 
> From the Google testing side of things, we may not get to this for
> another while.

Thanks Dionna 

Regards
Nikunj



^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM: VMX: Correct *intr_info content and *info2 for EPT_VIOLATION in get_exit_info()
  2024-01-12  6:51  4% [PATCH] KVM: VMX: Correct *intr_info content and *info2 for EPT_VIOLATION in get_exit_info() Robert Hoo
@ 2024-01-26 20:45  0% ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2024-01-26 20:45 UTC (permalink / raw)
  To: Robert Hoo; +Cc: pbonzini, kvm

On Fri, Jan 12, 2024, Robert Hoo wrote:
> Fill vmx::idt_vectoring_info in *intr_info, to align with
> svm_get_exit_info(), where *intr_info is for complement information about
> intercepts occurring during event delivery through IDT (APM 15.7.2
> Intercepts During IDT Interrupt Delivery), whose counterpart in
> VMX is IDT_VECTORING_INFO_FIELD (SDM 25.9.3 Information for VM Exits
> That Occur During Event Delivery), rather than VM_EXIT_INTR_INFO.
> 
> Fill *info2 with GUEST_PHYSICAL_ADDRESS in case of EPT_VIOLATION, also
> to align with SVM. It can be filled with other info for different exit
> reasons, like SVM's EXITINFO2.

Nothing in here says *why*.

> Fixes: 235ba74f008d ("KVM: x86: Add intr/vectoring info and error code to kvm_exit tracepoint")
> Signed-off-by: Robert Hoo <robert.hoo.linux@gmail.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d21f55f323ea..f1bf9f1fc561 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6141,14 +6141,26 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
>  
>  	*reason = vmx->exit_reason.full;
>  	*info1 = vmx_get_exit_qual(vcpu);
> +
>  	if (!(vmx->exit_reason.failed_vmentry)) {
> -		*info2 = vmx->idt_vectoring_info;
> -		*intr_info = vmx_get_intr_info(vcpu);
> +		*intr_info = vmx->idt_vectoring_info;
>  		if (is_exception_with_error_code(*intr_info))
> -			*error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> +			*error_code = vmcs_read32(IDT_VECTORING_ERROR_CODE);
>  		else
>  			*error_code = 0;
> -	} else {
> +
> +		/* various *info2 semantics according to exit reason */
> +		switch (vmx->exit_reason.basic) {
> +		case EXIT_REASON_EPT_VIOLATION:
> +			*info2 = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> +			break;
> +		/* To do: *info2 for other exit reasons */

I really, really don't want to go down this road.  As I stated in a different,
similar thread[*], I don't think this approach is sustainable.  I would much
rather people put time and effort into (a) developing BPF programs to extract info
from the VM-Entry and VM-Exit paths, (b) enhancing KVM in the areas where it's
painful or impossible for a BPF program to get at interesting data, and (c) start
upstreaming the BPF programs, e.g. somewhere in tools/.

[*] https://lore.kernel.org/all/ZRNC5IKXDq1yv0v3@google.com
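
To make that concrete, a minimal sketch of such a program might look like
the following (libbpf-style BPF C; the tracepoint field layout here is an
assumption, not taken from this thread, and must be checked against
/sys/kernel/tracing/events/kvm/kvm_exit/format before use):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Assumed layout: 8-byte common tracepoint header, then exit_reason. */
struct kvm_exit_args {
	__u64 common_header;
	__u32 exit_reason;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 256);
	__type(key, __u32);
	__type(value, __u64);
} exit_counts SEC(".maps");

SEC("tracepoint/kvm/kvm_exit")
int count_kvm_exits(struct kvm_exit_args *ctx)
{
	__u32 reason = ctx->exit_reason;
	__u64 *val, one = 1;

	val = bpf_map_lookup_elem(&exit_counts, &reason);
	if (val)
		__sync_fetch_and_add(val, 1);
	else
		bpf_map_update_elem(&exit_counts, &reason, &one, BPF_NOEXIST);

	return 0;
}

char LICENSE[] SEC("license") = "GPL";

Loaded via bpftool or a small libbpf loader, the map gives a per-exit-reason
breakdown that can be extended to info1/info2/intr_info without growing the
kvm_exit tracepoint or the uAPI.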

^ permalink raw reply	[relevance 0%]

* [PATCH v2 04/25] x86/sev: Add the host SEV-SNP initialization support
  @ 2024-01-26  4:11  2% ` Michael Roth
  2024-01-26  4:11  3% ` [PATCH v2 11/25] x86/sev: Adjust directmap to avoid inadvertant RMP faults Michael Roth
  1 sibling, 0 replies; 200+ results
From: Michael Roth @ 2024-01-26  4:11 UTC (permalink / raw)
  To: x86
  Cc: kvm, linux-coco, linux-mm, linux-crypto, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, tobin, bp, vbabka, kirill, ak,
	tony.luck, sathyanarayanan.kuppuswamy, alpergun, jarkko,
	ashish.kalra, nikunj.dadhania, pankaj.gupta, liam.merwick,
	Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

The memory integrity guarantees of SEV-SNP are enforced through a new
structure called the Reverse Map Table (RMP). The RMP is a single data
structure shared across the system that contains one entry for every 4K
page of DRAM that may be used by SEV-SNP VMs. The APM Volume 2 section
on Secure Nested Paging (SEV-SNP) details a number of steps needed to
detect/enable SEV-SNP and RMP table support on the host:

 - Detect SEV-SNP support based on CPUID bit
 - Initialize the RMP table memory reported by the RMP base/end MSR
   registers and configure IOMMU to be compatible with RMP access
   restrictions
 - Set the MtrrFixDramModEn bit in SYSCFG MSR
 - Set the SecureNestedPagingEn and VMPLEn bits in the SYSCFG MSR
 - Configure IOMMU

RMP table entry format is non-architectural and it can vary by
processor. It is defined by the PPR document for each respective CPU
family. Restrict SNP support to CPU models/families which are compatible
with the current RMP table entry format to guard against any undefined
behavior when running on other system types. Future models/support will
handle this through an architectural mechanism to allow for broader
compatibility.

SNP host code depends on CONFIG_KVM_AMD_SEV config flag which may be
enabled even when CONFIG_AMD_MEM_ENCRYPT isn't set, so update the
SNP-specific IOMMU helpers used here to rely on CONFIG_KVM_AMD_SEV
instead of CONFIG_AMD_MEM_ENCRYPT.
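
As a rough sizing illustration (numbers assumed for a 1 TiB host, not taken
from this patch): each 4KB page of DRAM is covered by a 16-byte RMP entry,
preceded by a fixed 16KB bookkeeping area at RMP_BASE, so:

  pages     = 1 TiB / 4 KiB            = 268,435,456
  rmp_size  = 268,435,456 * 16 bytes   = 4 GiB
  reserved  = 16 KiB + 4 GiB, plus entries covering the RMP table itself
              when it lies above the rest of RAM

which is the reservation the calc_rmp_sz check in snp_probe_rmptable_info()
below expects the BIOS to have set up between RMP_BASE and RMP_END.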

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/Kbuild                  |   2 +
 arch/x86/include/asm/msr-index.h |  11 +-
 arch/x86/include/asm/sev.h       |   6 +
 arch/x86/kernel/cpu/amd.c        |  16 +++
 arch/x86/virt/svm/Makefile       |   3 +
 arch/x86/virt/svm/sev.c          | 216 +++++++++++++++++++++++++++++++
 6 files changed, 253 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/virt/svm/Makefile
 create mode 100644 arch/x86/virt/svm/sev.c

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index 5a83da703e87..6a1f36df6a18 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -28,5 +28,7 @@ obj-y += net/
 
 obj-$(CONFIG_KEXEC_FILE) += purgatory/
 
+obj-y += virt/svm/
+
 # for cleaning
 subdir- += boot tools
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f1bd7b91b3c6..f482bc6a5ae7 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -599,6 +599,8 @@
 #define MSR_AMD64_SEV_ENABLED		BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
 #define MSR_AMD64_SEV_ES_ENABLED	BIT_ULL(MSR_AMD64_SEV_ES_ENABLED_BIT)
 #define MSR_AMD64_SEV_SNP_ENABLED	BIT_ULL(MSR_AMD64_SEV_SNP_ENABLED_BIT)
+#define MSR_AMD64_RMP_BASE		0xc0010132
+#define MSR_AMD64_RMP_END		0xc0010133
 
 /* SNP feature bits enabled by the hypervisor */
 #define MSR_AMD64_SNP_VTOM			BIT_ULL(3)
@@ -708,8 +710,15 @@
 #define MSR_K8_TOP_MEM1			0xc001001a
 #define MSR_K8_TOP_MEM2			0xc001001d
 #define MSR_AMD64_SYSCFG		0xc0010010
-#define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT	23
+#define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT 23
 #define MSR_AMD64_SYSCFG_MEM_ENCRYPT	BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
+#define MSR_AMD64_SYSCFG_SNP_EN_BIT	24
+#define MSR_AMD64_SYSCFG_SNP_EN		BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
+#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT 25
+#define MSR_AMD64_SYSCFG_SNP_VMPL_EN	BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
+#define MSR_AMD64_SYSCFG_MFDM_BIT	19
+#define MSR_AMD64_SYSCFG_MFDM		BIT_ULL(MSR_AMD64_SYSCFG_MFDM_BIT)
+
 #define MSR_K8_INT_PENDING_MSG		0xc0010055
 /* C1E active bits in int pending message */
 #define K8_INTP_C1E_ACTIVE_MASK		0x18000000
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 5b4a1ce3d368..1f59d8ba9776 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -243,4 +243,10 @@ static inline u64 snp_get_unsupported_features(u64 status) { return 0; }
 static inline u64 sev_get_status(void) { return 0; }
 #endif
 
+#ifdef CONFIG_KVM_AMD_SEV
+bool snp_probe_rmptable_info(void);
+#else
+static inline bool snp_probe_rmptable_info(void) { return false; }
+#endif
+
 #endif
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 79153e9b92b5..f48c51640c65 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -20,6 +20,7 @@
 #include <asm/delay.h>
 #include <asm/debugreg.h>
 #include <asm/resctrl.h>
+#include <asm/sev.h>
 
 #ifdef CONFIG_X86_64
 # include <asm/mmconfig.h>
@@ -584,6 +585,21 @@ static void bsp_init_amd(struct cpuinfo_x86 *c)
 		break;
 	}
 
+	if (cpu_has(c, X86_FEATURE_SEV_SNP)) {
+		/*
+		 * RMP table entry format is not architectural and it can vary by processor
+		 * and is defined by the per-processor PPR. Restrict SNP support to the
+		 * known CPU models and families for which the RMP table entry format
+		 * is currently defined.
+		 */
+		if (!boot_cpu_has(X86_FEATURE_ZEN3) &&
+		    !boot_cpu_has(X86_FEATURE_ZEN4) &&
+		    !boot_cpu_has(X86_FEATURE_ZEN5))
+			setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
+		else if (!snp_probe_rmptable_info())
+			setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
+	}
+
 	return;
 
 warn:
diff --git a/arch/x86/virt/svm/Makefile b/arch/x86/virt/svm/Makefile
new file mode 100644
index 000000000000..ef2a31bdcc70
--- /dev/null
+++ b/arch/x86/virt/svm/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_KVM_AMD_SEV) += sev.o
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
new file mode 100644
index 000000000000..575a9ff046cb
--- /dev/null
+++ b/arch/x86/virt/svm/sev.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD SVM-SEV Host Support.
+ *
+ * Copyright (C) 2023 Advanced Micro Devices, Inc.
+ *
+ * Author: Ashish Kalra <ashish.kalra@amd.com>
+ *
+ */
+
+#include <linux/cc_platform.h>
+#include <linux/printk.h>
+#include <linux/mm_types.h>
+#include <linux/set_memory.h>
+#include <linux/memblock.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/cpumask.h>
+#include <linux/iommu.h>
+#include <linux/amd-iommu.h>
+
+#include <asm/sev.h>
+#include <asm/processor.h>
+#include <asm/setup.h>
+#include <asm/svm.h>
+#include <asm/smp.h>
+#include <asm/cpu.h>
+#include <asm/apic.h>
+#include <asm/cpuid.h>
+#include <asm/cmdline.h>
+#include <asm/iommu.h>
+
+/*
+ * The RMP entry format is not architectural. The format is defined in PPR
+ * Family 19h Model 01h, Rev B1 processor.
+ */
+struct rmpentry {
+	u64	assigned	: 1,
+		pagesize	: 1,
+		immutable	: 1,
+		rsvd1		: 9,
+		gpa		: 39,
+		asid		: 10,
+		vmsa		: 1,
+		validated	: 1,
+		rsvd2		: 1;
+	u64 rsvd3;
+} __packed;
+
+/*
+ * The first 16KB from the RMP_BASE is used by the processor for the
+ * bookkeeping, the range needs to be added during the RMP entry lookup.
+ */
+#define RMPTABLE_CPU_BOOKKEEPING_SZ	0x4000
+
+static u64 probed_rmp_base, probed_rmp_size;
+static struct rmpentry *rmptable __ro_after_init;
+static u64 rmptable_max_pfn __ro_after_init;
+
+#undef pr_fmt
+#define pr_fmt(fmt)	"SEV-SNP: " fmt
+
+static int __mfd_enable(unsigned int cpu)
+{
+	u64 val;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+		return 0;
+
+	rdmsrl(MSR_AMD64_SYSCFG, val);
+
+	val |= MSR_AMD64_SYSCFG_MFDM;
+
+	wrmsrl(MSR_AMD64_SYSCFG, val);
+
+	return 0;
+}
+
+static __init void mfd_enable(void *arg)
+{
+	__mfd_enable(smp_processor_id());
+}
+
+static int __snp_enable(unsigned int cpu)
+{
+	u64 val;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+		return 0;
+
+	rdmsrl(MSR_AMD64_SYSCFG, val);
+
+	val |= MSR_AMD64_SYSCFG_SNP_EN;
+	val |= MSR_AMD64_SYSCFG_SNP_VMPL_EN;
+
+	wrmsrl(MSR_AMD64_SYSCFG, val);
+
+	return 0;
+}
+
+static __init void snp_enable(void *arg)
+{
+	__snp_enable(smp_processor_id());
+}
+
+#define RMP_ADDR_MASK GENMASK_ULL(51, 13)
+
+bool snp_probe_rmptable_info(void)
+{
+	u64 max_rmp_pfn, calc_rmp_sz, rmp_sz, rmp_base, rmp_end;
+
+	rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
+	rdmsrl(MSR_AMD64_RMP_END, rmp_end);
+
+	if (!(rmp_base & RMP_ADDR_MASK) || !(rmp_end & RMP_ADDR_MASK)) {
+		pr_err("Memory for the RMP table has not been reserved by BIOS\n");
+		return false;
+	}
+
+	if (rmp_base > rmp_end) {
+		pr_err("RMP configuration not valid: base=%#llx, end=%#llx\n", rmp_base, rmp_end);
+		return false;
+	}
+
+	rmp_sz = rmp_end - rmp_base + 1;
+
+	/*
+	 * Calculate the amount of memory that must be reserved by the BIOS to
+	 * address the whole RAM, including the bookkeeping area. The RMP itself
+	 * must also be covered.
+	 */
+	max_rmp_pfn = max_pfn;
+	if (PHYS_PFN(rmp_end) > max_pfn)
+		max_rmp_pfn = PHYS_PFN(rmp_end);
+
+	calc_rmp_sz = (max_rmp_pfn << 4) + RMPTABLE_CPU_BOOKKEEPING_SZ;
+
+	if (calc_rmp_sz > rmp_sz) {
+		pr_err("Memory reserved for the RMP table does not cover full system RAM (expected 0x%llx got 0x%llx)\n",
+		       calc_rmp_sz, rmp_sz);
+		return false;
+	}
+
+	probed_rmp_base = rmp_base;
+	probed_rmp_size = rmp_sz;
+
+	pr_info("RMP table physical range [0x%016llx - 0x%016llx]\n",
+		probed_rmp_base, probed_rmp_base + probed_rmp_size - 1);
+
+	return true;
+}
+
+/*
+ * Do the necessary preparations which are verified by the firmware as
+ * described in the SNP_INIT_EX firmware command description in the SNP
+ * firmware ABI spec.
+ */
+static int __init snp_rmptable_init(void)
+{
+	void *rmptable_start;
+	u64 rmptable_size;
+	u64 val;
+
+	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
+		return 0;
+
+	if (!amd_iommu_snp_en)
+		return 0;
+
+	if (!probed_rmp_size)
+		goto nosnp;
+
+	rmptable_start = memremap(probed_rmp_base, probed_rmp_size, MEMREMAP_WB);
+	if (!rmptable_start) {
+		pr_err("Failed to map RMP table\n");
+		return 1;
+	}
+
+	/*
+	 * Check if SEV-SNP is already enabled, this can happen in case of
+	 * kexec boot.
+	 */
+	rdmsrl(MSR_AMD64_SYSCFG, val);
+	if (val & MSR_AMD64_SYSCFG_SNP_EN)
+		goto skip_enable;
+
+	memset(rmptable_start, 0, probed_rmp_size);
+
+	/* Flush the caches to ensure that data is written before SNP is enabled. */
+	wbinvd_on_all_cpus();
+
+	/* MtrrFixDramModEn must be enabled on all the CPUs prior to enabling SNP. */
+	on_each_cpu(mfd_enable, NULL, 1);
+
+	on_each_cpu(snp_enable, NULL, 1);
+
+skip_enable:
+	rmptable_start += RMPTABLE_CPU_BOOKKEEPING_SZ;
+	rmptable_size = probed_rmp_size - RMPTABLE_CPU_BOOKKEEPING_SZ;
+
+	rmptable = (struct rmpentry *)rmptable_start;
+	rmptable_max_pfn = rmptable_size / sizeof(struct rmpentry) - 1;
+
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
+
+	return 0;
+
+nosnp:
+	setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
+	return -ENOSYS;
+}
+
+/*
+ * This must be called after the IOMMU has been initialized.
+ */
+device_initcall(snp_rmptable_init);
-- 
2.25.1


^ permalink raw reply related	[relevance 2%]

* [PATCH v2 11/25] x86/sev: Adjust directmap to avoid inadvertant RMP faults
    2024-01-26  4:11  2% ` [PATCH v2 04/25] x86/sev: Add the host SEV-SNP initialization support Michael Roth
@ 2024-01-26  4:11  3% ` Michael Roth
    1 sibling, 1 reply; 200+ results
From: Michael Roth @ 2024-01-26  4:11 UTC (permalink / raw)
  To: x86
  Cc: kvm, linux-coco, linux-mm, linux-crypto, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, tobin, bp, vbabka, kirill, ak,
	tony.luck, sathyanarayanan.kuppuswamy, alpergun, jarkko,
	ashish.kalra, nikunj.dadhania, pankaj.gupta, liam.merwick

If the kernel uses a 2MB or larger directmap mapping to write to an
address, and that mapping contains any 4KB pages that are set to private
in the RMP table, an RMP #PF will trigger and cause a host crash.
SNP-aware code that owns the private PFNs will never attempt such a
write, but other kernel tasks writing to other PFNs in the range may
trigger these checks inadvertently due to writing to those other PFNs
via a large directmap mapping that happens to also map a private PFN.

Prevent this by splitting any 2MB+ mappings that might end up containing
a mix of private/shared PFNs as a result of a subsequent RMPUPDATE for
the PFN/rmp_level passed in.

Another way to handle this would be to limit the directmap to 4K
mappings in the case of hosts that support SNP, but there is a potential
risk of performance regressions for certain host workloads. Handling it
as-needed results in the directmap being slowly split over time, which
lessens the risk of a performance regression since the more the
directmap gets split as a result of running SNP guests, the more likely
the host is being used primarily to run SNP guests, where a mostly-split
directmap is actually beneficial since there is less chance of TLB
flushing and cpa_lock contention being needed to perform these splits.

Cases where a host knows in advance it wants to primarily run SNP guests
and wishes to pre-split the directmap can be handled by adding a
tuneable in the future, but preliminary testing has shown this to not
provide a significant benefit in the common case of guests that are
backed primarily by 2MB THPs, so it does not seem to be warranted
currently and can be added later if a need arises in the future.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/virt/svm/sev.c | 75 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 73 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 16b3d8139649..1a13eff78c9d 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -368,6 +368,71 @@ int psmash(u64 pfn)
 }
 EXPORT_SYMBOL_GPL(psmash);
 
+/*
+ * If the kernel uses a 2MB or larger directmap mapping to write to an address,
+ * and that mapping contains any 4KB pages that are set to private in the RMP
+ * table, an RMP #PF will trigger and cause a host crash. Hypervisor code that
+ * owns the PFNs being transitioned will never attempt such a write, but other
+ * kernel tasks writing to other PFNs in the range may trigger these checks
+ * inadvertently due to a large directmap mapping that happens to overlap such a
+ * PFN.
+ *
+ * Prevent this by splitting any 2MB+ mappings that might end up containing a
+ * mix of private/shared PFNs as a result of a subsequent RMPUPDATE for the
+ * PFN/rmp_level passed in.
+ *
+ * Note that there is no attempt here to scan all the RMP entries for the 2MB
+ * physical range, since it would only be worthwhile in determining if a
+ * subsequent RMPUPDATE for a 4KB PFN would result in all the entries being of
+ * the same shared/private state, thus avoiding the need to split the mapping.
+ * But that would mean the entries are currently in a mixed state, and so the
+ * mapping would have already been split as a result of prior transitions.
+ * And since the 4K split is only done if the mapping is 2MB+, and there isn't
+ * currently a mechanism in place to restore 2MB+ mappings, such a check would
+ * not provide any usable benefit.
+ *
+ * More specifics on how these checks are carried out can be found in APM
+ * Volume 2, "RMP and VMPL Access Checks".
+ */
+static int adjust_direct_map(u64 pfn, int rmp_level)
+{
+	unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);
+	unsigned int level;
+	int npages, ret;
+	pte_t *pte;
+
+	/* Only 4KB/2MB RMP entries are supported by current hardware. */
+	if (WARN_ON_ONCE(rmp_level > PG_LEVEL_2M))
+		return -EINVAL;
+
+	if (WARN_ON_ONCE(rmp_level == PG_LEVEL_2M && !IS_ALIGNED(pfn, PTRS_PER_PMD)))
+		return -EINVAL;
+
+	/*
+	 * If an entire 2MB physical range is being transitioned, then there is
+	 * no risk of RMP #PFs due to write accesses from overlapping mappings,
+	 * since even accesses from 1GB mappings will be treated as 2MB accesses
+	 * as far as RMP table checks are concerned.
+	 */
+	if (rmp_level == PG_LEVEL_2M)
+		return 0;
+
+	pte = lookup_address(vaddr, &level);
+	if (!pte || pte_none(*pte))
+		return 0;
+
+	if (level == PG_LEVEL_4K)
+		return 0;
+
+	npages = page_level_size(rmp_level) / PAGE_SIZE;
+	ret = set_memory_4k(vaddr, npages);
+	if (ret)
+		pr_warn("Failed to split direct map for PFN 0x%llx, ret: %d\n",
+			pfn, ret);
+
+	return ret;
+}
+
 /*
  * It is expected that those operations are seldom enough so that no mutual
  * exclusion of updaters is needed and thus the overlap error condition below
@@ -384,11 +449,16 @@ EXPORT_SYMBOL_GPL(psmash);
 static int rmpupdate(u64 pfn, struct rmp_state *state)
 {
 	unsigned long paddr = pfn << PAGE_SHIFT;
-	int ret;
+	int ret, level;
 
 	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
 		return -ENODEV;
 
+	level = RMP_TO_PG_LEVEL(state->pagesize);
+
+	if (adjust_direct_map(pfn, level))
+		return -EFAULT;
+
 	do {
 		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
 		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
@@ -398,7 +468,8 @@ static int rmpupdate(u64 pfn, struct rmp_state *state)
 	} while (ret == RMPUPDATE_FAIL_OVERLAP);
 
 	if (ret) {
-		pr_err("RMPUPDATE failed for PFN %llx, ret: %d\n", pfn, ret);
+		pr_err("RMPUPDATE failed for PFN %llx, pg_level: %d, ret: %d\n",
+		       pfn, level, ret);
 		dump_rmpentry(pfn);
 		dump_stack();
 		return -EFAULT;
-- 
2.25.1


^ permalink raw reply related	[relevance 3%]

* Re: [PATCH v7 00/16] Add Secure TSC support for SNP guests
  2024-01-25  6:08  0% ` Nikunj A. Dadhania
@ 2024-01-26  1:00  0%   ` Dionna Amalie Glaze
  2024-01-27  4:10  0%     ` Nikunj A. Dadhania
  0 siblings, 1 reply; 200+ results
From: Dionna Amalie Glaze @ 2024-01-26  1:00 UTC (permalink / raw)
  To: nikunj
  Cc: linux-kernel, thomas.lendacky, x86, kvm, bp, mingo, tglx,
	dave.hansen, pgonda, seanjc, pbonzini

On Wed, Jan 24, 2024 at 10:08 PM Nikunj A. Dadhania <nikunj@amd.com> wrote:
>
> On 12/20/2023 8:43 PM, Nikunj A Dadhania wrote:
> > Secure TSC allows guests to securely use RDTSC/RDTSCP instructions as the
> > parameters being used cannot be changed by hypervisor once the guest is
> > launched. More details in the AMD64 APM Vol 2, Section "Secure TSC".
> >
> > During the boot-up of the secondary cpus, SecureTSC enabled guests need to
> > query TSC info from AMD Security Processor. This communication channel is
> > encrypted between the AMD Security Processor and the guest, the hypervisor
> > is just the conduit to deliver the guest messages to the AMD Security
> > Processor. Each message is protected with an AEAD (AES-256 GCM). See "SEV
> > Secure Nested Paging Firmware ABI Specification" document (currently at
> > https://www.amd.com/system/files/TechDocs/56860.pdf) section "TSC Info"
> >
> > Use a minimal GCM library to encrypt/decrypt SNP Guest messages to
> > communicate with the AMD Security Processor which is available at early
> > boot.
> >
> > SEV-guest driver has the implementation for guest and AMD Security
> > Processor communication. As the TSC_INFO needs to be initialized during
> > early boot before smp cpus are started, move most of the sev-guest driver
> > code to kernel/sev.c and provide well defined APIs to the sev-guest driver
> > to use the interface to avoid code-duplication.
> >
> > Patches:
> > 01-08: Preparation and movement of sev-guest driver code
> > 09-16: SecureTSC enablement patches.
> >
> > Testing SecureTSC
> > -----------------
> >
> > SecureTSC hypervisor patches based on top of SEV-SNP Guest MEMFD series:
> > https://github.com/nikunjad/linux/tree/snp-host-latest-securetsc_v5
> >
> > QEMU changes:
> > https://github.com/nikunjad/qemu/tree/snp_securetsc_v5
> >
> > QEMU commandline SEV-SNP-UPM with SecureTSC:
> >
> >   qemu-system-x86_64 -cpu EPYC-Milan-v2,+secure-tsc,+invtsc -smp 4 \
> >     -object memory-backend-memfd-private,id=ram1,size=1G,share=true \
> >     -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,secure-tsc=on \
> >     -machine q35,confidential-guest-support=sev0,memory-backend=ram1,kvm-type=snp \
> >     ...
> >
> > Changelog:
> > ----------
> > v7:
> > * Drop mutex from the snp_dev and add snp_guest_cmd_{lock,unlock} API
> > * Added comments for secrets page failure
> > * Added define for maximum supported VMPCK
> > * Updated comments why sev_status is used directly instead of
> >   cpu_feature_enabled()
>
> A gentle reminder.
>

From the Google testing side of things, we may not get to this for
another while.

> Regards
> Nikunj
>


-- 
-Dionna Glaze, PhD (she/her)

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v7 00/16] Add Secure TSC support for SNP guests
  2023-12-20 15:13  2% [PATCH v7 00/16] Add Secure TSC support for SNP guests Nikunj A Dadhania
@ 2024-01-25  6:08  0% ` Nikunj A. Dadhania
  2024-01-26  1:00  0%   ` Dionna Amalie Glaze
  0 siblings, 1 reply; 200+ results
From: Nikunj A. Dadhania @ 2024-01-25  6:08 UTC (permalink / raw)
  To: linux-kernel, thomas.lendacky, x86, kvm
  Cc: bp, mingo, tglx, dave.hansen, dionnaglaze, pgonda, seanjc, pbonzini

On 12/20/2023 8:43 PM, Nikunj A Dadhania wrote:
> Secure TSC allows guests to securely use RDTSC/RDTSCP instructions as the
> parameters being used cannot be changed by hypervisor once the guest is
> launched. More details in the AMD64 APM Vol 2, Section "Secure TSC".
> 
> During the boot-up of the secondary cpus, SecureTSC enabled guests need to
> query TSC info from AMD Security Processor. This communication channel is
> encrypted between the AMD Security Processor and the guest, the hypervisor
> is just the conduit to deliver the guest messages to the AMD Security
> Processor. Each message is protected with an AEAD (AES-256 GCM). See "SEV
> Secure Nested Paging Firmware ABI Specification" document (currently at
> https://www.amd.com/system/files/TechDocs/56860.pdf) section "TSC Info"
> 
> Use a minimal GCM library to encrypt/decrypt SNP Guest messages to
> communicate with the AMD Security Processor which is available at early
> boot.
> 
> SEV-guest driver has the implementation for guest and AMD Security
> Processor communication. As the TSC_INFO needs to be initialized during
> early boot before smp cpus are started, move most of the sev-guest driver
> code to kernel/sev.c and provide well defined APIs to the sev-guest driver
> to use the interface to avoid code-duplication.
> 
> Patches:
> 01-08: Preparation and movement of sev-guest driver code
> 09-16: SecureTSC enablement patches.
> 
> Testing SecureTSC
> -----------------
> 
> SecureTSC hypervisor patches based on top of SEV-SNP Guest MEMFD series:
> https://github.com/nikunjad/linux/tree/snp-host-latest-securetsc_v5
> 
> QEMU changes:
> https://github.com/nikunjad/qemu/tree/snp_securetsc_v5
> 
> QEMU commandline SEV-SNP-UPM with SecureTSC:
> 
>   qemu-system-x86_64 -cpu EPYC-Milan-v2,+secure-tsc,+invtsc -smp 4 \
>     -object memory-backend-memfd-private,id=ram1,size=1G,share=true \
>     -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,secure-tsc=on \
>     -machine q35,confidential-guest-support=sev0,memory-backend=ram1,kvm-type=snp \
>     ...
> 
> Changelog:
> ----------
> v7:
> * Drop mutex from the snp_dev and add snp_guest_cmd_{lock,unlock} API
> * Added comments for secrets page failure
> * Added define for maximum supported VMPCK
> * Updated comments why sev_status is used directly instead of
>   cpu_feature_enabled()

A gentle reminder.

Regards
Nikunj


^ permalink raw reply	[relevance 0%]

* [PATCH v4 6/6] s390/vfio-ap: do not reset queue removed from host config
                     ` (3 preceding siblings ...)
  2024-01-15 18:54  5% ` [PATCH v4 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
@ 2024-01-15 18:54  9% ` Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2024-01-15 18:54 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, agordeev, gor, Tony Krowiak, stable

From: Tony Krowiak <akrowiak@linux.ibm.com>

When a queue is unbound from the vfio_ap device driver, it is reset to
ensure its crypto data is not leaked when it is bound to another device
driver. If the queue is unbound due to the fact that the adapter or domain
was removed from the host's AP configuration, then attempting to reset it
will fail with response code 01 (APQN not valid) being returned from the
reset command. Let's ensure that the queue is assigned to the host's
configuration before resetting it.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Jason J. Herne <jjherne@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: eeb386aeb5b7 ("s390/vfio-ap: handle config changed and scan complete notification")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 550c936c413d..983b3b16196c 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -2215,10 +2215,10 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 	q = dev_get_drvdata(&apdev->device);
 	get_update_locks_for_queue(q);
 	matrix_mdev = q->matrix_mdev;
+	apid = AP_QID_CARD(q->apqn);
+	apqi = AP_QID_QUEUE(q->apqn);
 
 	if (matrix_mdev) {
-		apid = AP_QID_CARD(q->apqn);
-		apqi = AP_QID_QUEUE(q->apqn);
 		/* If the queue is assigned to the guest's AP configuration */
 		if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 		    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
@@ -2234,8 +2234,16 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 		}
 	}
 
-	vfio_ap_mdev_reset_queue(q);
-	flush_work(&q->reset_work);
+	/*
+	 * If the queue is not in the host's AP configuration, then resetting
+	 * it will fail with response code 01 (APQN not valid); so, let's make
+	 * sure it is in the host's config.
+	 */
+	if (test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm) &&
+	    test_bit_inv(apqi, (unsigned long *)matrix_dev->info.aqm)) {
+		vfio_ap_mdev_reset_queue(q);
+		flush_work(&q->reset_work);
+	}
 
 done:
 	if (matrix_mdev)
-- 
2.43.0


^ permalink raw reply related	[relevance 9%]

* [PATCH v4 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config
    2024-01-15 18:54 20% ` [PATCH v4 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
  2024-01-15 18:54 14% ` [PATCH v4 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
@ 2024-01-15 18:54 12% ` Tony Krowiak
  2024-01-15 18:54  5% ` [PATCH v4 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
  2024-01-15 18:54  9% ` [PATCH v4 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2024-01-15 18:54 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, agordeev, gor, Tony Krowiak, stable

From: Tony Krowiak <akrowiak@linux.ibm.com>

When filtering the adapters from the configuration profile for a guest to
create or update a guest's AP configuration, if the APID of an adapter and
the APQI of a domain identify a queue device that is not bound to the
vfio_ap device driver, the APID of the adapter will be filtered because an
individual APQN cannot be filtered; APQNs are assigned to an AP
configuration as a matrix of APIDs and APQIs. Consequently, a
guest will not have access to all of the queues associated with the
filtered adapter. If the queues are subsequently made available again to
the guest, they should re-appear in a reset state; so, let's make sure all
queues associated with an adapter unplugged from the guest are reset.

In order to identify the set of queues that need to be reset, let's allow a
vfio_ap_queue object to be stored simultaneously in both a hashtable and a
list: a hashtable to store all of the queues assigned to a matrix mdev,
and a list to store the subset of queues that need to be reset. For
example, when an adapter is hot unplugged from a
guest, all guest queues associated with that adapter must be reset. Since
that may be a subset of those assigned to the matrix mdev, they can be
stored in a list that can be passed to the vfio_ap_mdev_reset_queues
function.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c     | 171 +++++++++++++++++++-------
 drivers/s390/crypto/vfio_ap_private.h |  11 +-
 2 files changed, 133 insertions(+), 49 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index e45d0bf7b126..44cd29aace8e 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -32,7 +32,8 @@
 
 #define AP_RESET_INTERVAL		20	/* Reset sleep interval (20ms)		*/
 
-static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable);
+static int vfio_ap_mdev_reset_queues(struct ap_matrix_mdev *matrix_mdev);
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist);
 static struct vfio_ap_queue *vfio_ap_find_queue(int apqn);
 static const struct vfio_device_ops vfio_ap_matrix_dev_ops;
 static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q);
@@ -665,16 +666,23 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
  *				device driver.
  *
  * @matrix_mdev: the matrix mdev whose matrix is to be filtered.
+ * @apm_filtered: a 256-bit bitmap for storing the APIDs filtered from the
+ *		  guest's AP configuration that are still in the host's AP
+ *		  configuration.
  *
  * Note: If an APQN referencing a queue device that is not bound to the vfio_ap
  *	 driver, its APID will be filtered from the guest's APCB. The matrix
  *	 structure precludes filtering an individual APQN, so its APID will be
- *	 filtered.
+ *	 filtered. Consequently, all queues associated with the adapter that
+ *	 are in the host's AP configuration must be reset. If queues are
+ *	 subsequently made available again to the guest, they should re-appear
+ *	 in a reset state
  *
  * Return: a boolean value indicating whether the KVM guest's APCB was changed
  *	   by the filtering or not.
  */
-static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev,
+				       unsigned long *apm_filtered)
 {
 	unsigned long apid, apqi, apqn;
 	DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -684,6 +692,7 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 	bitmap_copy(prev_shadow_apm, matrix_mdev->shadow_apcb.apm, AP_DEVICES);
 	bitmap_copy(prev_shadow_aqm, matrix_mdev->shadow_apcb.aqm, AP_DOMAINS);
 	vfio_ap_matrix_init(&matrix_dev->info, &matrix_mdev->shadow_apcb);
+	bitmap_clear(apm_filtered, 0, AP_DEVICES);
 
 	/*
 	 * Copy the adapters, domains and control domains to the shadow_apcb
@@ -709,8 +718,16 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 			apqn = AP_MKQID(apid, apqi);
 			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
 			if (!q || q->reset_status.response_code) {
-				clear_bit_inv(apid,
-					      matrix_mdev->shadow_apcb.apm);
+				clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
+
+				/*
+				 * If the adapter was previously plugged into
+				 * the guest, let's let the caller know that
+				 * the APID was filtered.
+				 */
+				if (test_bit_inv(apid, prev_shadow_apm))
+					set_bit_inv(apid, apm_filtered);
+
 				break;
 			}
 		}
@@ -812,7 +829,7 @@ static void vfio_ap_mdev_remove(struct mdev_device *mdev)
 
 	mutex_lock(&matrix_dev->guests_lock);
 	mutex_lock(&matrix_dev->mdevs_lock);
-	vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
+	vfio_ap_mdev_reset_queues(matrix_mdev);
 	vfio_ap_mdev_unlink_fr_queues(matrix_mdev);
 	list_del(&matrix_mdev->node);
 	mutex_unlock(&matrix_dev->mdevs_lock);
@@ -922,6 +939,47 @@ static void vfio_ap_mdev_link_adapter(struct ap_matrix_mdev *matrix_mdev,
 				       AP_MKQID(apid, apqi));
 }
 
+static int reset_queues_for_apids(struct ap_matrix_mdev *matrix_mdev,
+				  unsigned long *apm_reset)
+{
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
+	unsigned long apid, apqi;
+	int apqn, ret = 0;
+
+	if (bitmap_empty(apm_reset, AP_DEVICES))
+		return 0;
+
+	INIT_LIST_HEAD(&qlist);
+
+	for_each_set_bit_inv(apid, apm_reset, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+				     AP_DOMAINS) {
+			/*
+			 * If the domain is not in the host's AP configuration,
+			 * then resetting it will fail with response code 01
+			 * (APQN not valid).
+			 */
+			if (!test_bit_inv(apqi,
+					  (unsigned long *)matrix_dev->info.aqm))
+				continue;
+
+			apqn = AP_MKQID(apid, apqi);
+			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
+
+			if (q)
+				list_add_tail(&q->reset_qnode, &qlist);
+		}
+	}
+
+	ret = vfio_ap_mdev_reset_qlist(&qlist);
+
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode)
+		list_del(&q->reset_qnode);
+
+	return ret;
+}
+
 /**
  * assign_adapter_store - parses the APID from @buf and sets the
  * corresponding bit in the mediated matrix device's APM
@@ -962,6 +1020,7 @@ static ssize_t assign_adapter_store(struct device *dev,
 {
 	int ret;
 	unsigned long apid;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -991,8 +1050,10 @@ static ssize_t assign_adapter_store(struct device *dev,
 
 	vfio_ap_mdev_link_adapter(matrix_mdev, apid);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+		reset_queues_for_apids(matrix_mdev, apm_filtered);
+	}
 
 	ret = count;
 done:
@@ -1023,11 +1084,12 @@ static struct vfio_ap_queue
  *				 adapter was assigned.
  * @matrix_mdev: the matrix mediated device to which the adapter was assigned.
  * @apid: the APID of the unassigned adapter.
- * @qtable: table for storing queues associated with unassigned adapter.
+ * @qlist: list for storing queues associated with unassigned adapter that
+ *	   need to be reset.
  */
 static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 					unsigned long apid,
-					struct ap_queue_table *qtable)
+					struct list_head *qlist)
 {
 	unsigned long apqi;
 	struct vfio_ap_queue *q;
@@ -1035,11 +1097,10 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 	for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
 		q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
 
-		if (q && qtable) {
+		if (q && qlist) {
 			if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 			    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
-				hash_add(qtable->queues, &q->mdev_qnode,
-					 q->apqn);
+				list_add_tail(&q->reset_qnode, qlist);
 		}
 	}
 }
@@ -1047,26 +1108,23 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 static void vfio_ap_mdev_hot_unplug_adapter(struct ap_matrix_mdev *matrix_mdev,
 					    unsigned long apid)
 {
-	int loop_cursor;
-	struct vfio_ap_queue *q;
-	struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
 
-	hash_init(qtable->queues);
-	vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, qtable);
+	INIT_LIST_HEAD(&qlist);
+	vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, &qlist);
 
 	if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm)) {
 		clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 
-	vfio_ap_mdev_reset_queues(qtable);
+	vfio_ap_mdev_reset_qlist(&qlist);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
 		vfio_ap_unlink_mdev_fr_queue(q);
-		hash_del(&q->mdev_qnode);
+		list_del(&q->reset_qnode);
 	}
-
-	kfree(qtable);
 }
 
 /**
@@ -1167,6 +1225,7 @@ static ssize_t assign_domain_store(struct device *dev,
 {
 	int ret;
 	unsigned long apqi;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -1196,8 +1255,10 @@ static ssize_t assign_domain_store(struct device *dev,
 
 	vfio_ap_mdev_link_domain(matrix_mdev, apqi);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+		reset_queues_for_apids(matrix_mdev, apm_filtered);
+	}
 
 	ret = count;
 done:
@@ -1210,7 +1271,7 @@ static DEVICE_ATTR_WO(assign_domain);
 
 static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 				       unsigned long apqi,
-				       struct ap_queue_table *qtable)
+				       struct list_head *qlist)
 {
 	unsigned long apid;
 	struct vfio_ap_queue *q;
@@ -1218,11 +1279,10 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
 		q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
 
-		if (q && qtable) {
+		if (q && qlist) {
 			if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 			    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
-				hash_add(qtable->queues, &q->mdev_qnode,
-					 q->apqn);
+				list_add_tail(&q->reset_qnode, qlist);
 		}
 	}
 }
@@ -1230,26 +1290,23 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 static void vfio_ap_mdev_hot_unplug_domain(struct ap_matrix_mdev *matrix_mdev,
 					   unsigned long apqi)
 {
-	int loop_cursor;
-	struct vfio_ap_queue *q;
-	struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
 
-	hash_init(qtable->queues);
-	vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, qtable);
+	INIT_LIST_HEAD(&qlist);
+	vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, &qlist);
 
 	if (test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
 		clear_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm);
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 
-	vfio_ap_mdev_reset_queues(qtable);
+	vfio_ap_mdev_reset_qlist(&qlist);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
 		vfio_ap_unlink_mdev_fr_queue(q);
-		hash_del(&q->mdev_qnode);
+		list_del(&q->reset_qnode);
 	}
-
-	kfree(qtable);
 }
 
 /**
@@ -1604,7 +1661,7 @@ static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev)
 		get_update_locks_for_kvm(kvm);
 
 		kvm_arch_crypto_clear_masks(kvm);
-		vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
+		vfio_ap_mdev_reset_queues(matrix_mdev);
 		kvm_put_kvm(kvm);
 		matrix_mdev->kvm = NULL;
 
@@ -1740,15 +1797,33 @@ static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q)
 	}
 }
 
-static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable)
+static int vfio_ap_mdev_reset_queues(struct ap_matrix_mdev *matrix_mdev)
 {
 	int ret = 0, loop_cursor;
 	struct vfio_ap_queue *q;
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode)
+	hash_for_each(matrix_mdev->qtable.queues, loop_cursor, q, mdev_qnode)
 		vfio_ap_mdev_reset_queue(q);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	hash_for_each(matrix_mdev->qtable.queues, loop_cursor, q, mdev_qnode) {
+		flush_work(&q->reset_work);
+
+		if (q->reset_status.response_code)
+			ret = -EIO;
+	}
+
+	return ret;
+}
+
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist)
+{
+	int ret = 0;
+	struct vfio_ap_queue *q;
+
+	list_for_each_entry(q, qlist, reset_qnode)
+		vfio_ap_mdev_reset_queue(q);
+
+	list_for_each_entry(q, qlist, reset_qnode) {
 		flush_work(&q->reset_work);
 
 		if (q->reset_status.response_code)
@@ -1934,7 +2009,7 @@ static ssize_t vfio_ap_mdev_ioctl(struct vfio_device *vdev,
 		ret = vfio_ap_mdev_get_device_info(arg);
 		break;
 	case VFIO_DEVICE_RESET:
-		ret = vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
+		ret = vfio_ap_mdev_reset_queues(matrix_mdev);
 		break;
 	case VFIO_DEVICE_GET_IRQ_INFO:
 			ret = vfio_ap_get_irq_info(arg);
@@ -2080,6 +2155,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 {
 	int ret;
 	struct vfio_ap_queue *q;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev;
 
 	ret = sysfs_create_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2112,15 +2188,17 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 		    !bitmap_empty(matrix_mdev->aqm_add, AP_DOMAINS))
 			goto done;
 
-		if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+		if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+			reset_queues_for_apids(matrix_mdev, apm_filtered);
+		}
 	}
 
 done:
 	dev_set_drvdata(&apdev->device, q);
 	release_update_locks_for_mdev(matrix_mdev);
 
-	return 0;
+	return ret;
 
 err_remove_group:
 	sysfs_remove_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2464,6 +2542,7 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
 
 static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 {
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
 
 	mutex_lock(&matrix_mdev->kvm->lock);
@@ -2477,7 +2556,7 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 					 matrix_mdev->adm_add, AP_DOMAINS);
 
 	if (filter_adapters || filter_domains)
-		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
+		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered);
 
 	if (filter_cdoms)
 		do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
@@ -2485,6 +2564,8 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 	if (do_hotplug)
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
+	reset_queues_for_apids(matrix_mdev, apm_filtered);
+
 	mutex_unlock(&matrix_dev->mdevs_lock);
 	mutex_unlock(&matrix_mdev->kvm->lock);
 }
diff --git a/drivers/s390/crypto/vfio_ap_private.h b/drivers/s390/crypto/vfio_ap_private.h
index 88aff8b81f2f..20eac8b0f0b9 100644
--- a/drivers/s390/crypto/vfio_ap_private.h
+++ b/drivers/s390/crypto/vfio_ap_private.h
@@ -83,10 +83,10 @@ struct ap_matrix {
 };
 
 /**
- * struct ap_queue_table - a table of queue objects.
- *
- * @queues: a hashtable of queues (struct vfio_ap_queue).
- */
+  * struct ap_queue_table - a table of queue objects.
+  *
+  * @queues: a hashtable of queues (struct vfio_ap_queue).
+  */
 struct ap_queue_table {
 	DECLARE_HASHTABLE(queues, 8);
 };
@@ -133,6 +133,8 @@ struct ap_matrix_mdev {
  * @apqn: the APQN of the AP queue device
  * @saved_isc: the guest ISC registered with the GIB interface
  * @mdev_qnode: allows the vfio_ap_queue struct to be added to a hashtable
+ * @reset_qnode: allows the vfio_ap_queue struct to be added to a list of queues
+ *		 that need to be reset
  * @reset_status: the status from the last reset of the queue
  * @reset_work: work to wait for queue reset to complete
  */
@@ -143,6 +145,7 @@ struct vfio_ap_queue {
 #define VFIO_AP_ISC_INVALID 0xff
 	unsigned char saved_isc;
 	struct hlist_node mdev_qnode;
+	struct list_head reset_qnode;
 	struct ap_queue_status reset_status;
 	struct work_struct reset_work;
 };
-- 
2.43.0


^ permalink raw reply related	[relevance 12%]

* [PATCH v4 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver
                     ` (2 preceding siblings ...)
  2024-01-15 18:54 12% ` [PATCH v4 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
@ 2024-01-15 18:54  5% ` Tony Krowiak
  2024-01-15 18:54  9% ` [PATCH v4 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2024-01-15 18:54 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, agordeev, gor, Tony Krowiak, stable

From: Tony Krowiak <akrowiak@linux.ibm.com>

When a queue is unbound from the vfio_ap device driver, if that queue is
assigned to a guest's AP configuration, its associated adapter is removed
because queues are defined to a guest via a matrix of adapters and
domains; so, it is not possible to remove a single queue.

If an adapter is removed from the guest's AP configuration, all associated
queues must be reset to prevent leaking crypto data should any of them be
assigned to a different guest or device driver. The one caveat is that if
the queue is being removed because the adapter or domain has been removed
from the host's AP configuration, then an attempt to reset the queue will
fail with response code 01, AP-queue number not valid; so resetting these
queues should be skipped.
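
For readers unfamiliar with the matrix constraint, a minimal userspace
sketch of why unplugging the adapter implies resetting its whole column of
queues follows; the APQN packing, adapter 0x14 and domains 4-6 are
illustrative assumptions rather than anything taken from the driver:

#include <stdio.h>

#define NR_DOMAINS 256

/* Hypothetical stand-in for AP_MKQID(): APID in bits 15:8, APQI in bits 7:0. */
static unsigned int mk_apqn(unsigned int apid, unsigned int apqi)
{
        return (apid << 8) | apqi;
}

int main(void)
{
        /* Domains 4, 5 and 6 are plugged into the guest (shadow AQM). */
        int guest_domains[NR_DOMAINS] = { [4] = 1, [5] = 1, [6] = 1 };
        unsigned int apid = 0x14, apqi;

        /*
         * Unbinding any one of 14.0004-14.0006 unplugs adapter 0x14 as a
         * whole, so every queue in that adapter's column is collected for
         * reset.
         */
        for (apqi = 0; apqi < NR_DOMAINS; apqi++)
                if (guest_domains[apqi])
                        printf("reset APQN %04x\n", mk_apqn(apid, apqi));

        return 0;
}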

Acked-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Fixes: 09d31ff78793 ("s390/vfio-ap: hot plug/unplug of AP devices when probed/removed")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 76 +++++++++++++++++--------------
 1 file changed, 41 insertions(+), 35 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 44cd29aace8e..550c936c413d 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -939,45 +939,45 @@ static void vfio_ap_mdev_link_adapter(struct ap_matrix_mdev *matrix_mdev,
 				       AP_MKQID(apid, apqi));
 }
 
+static void collect_queues_to_reset(struct ap_matrix_mdev *matrix_mdev,
+				    unsigned long apid,
+				    struct list_head *qlist)
+{
+	struct vfio_ap_queue *q;
+	unsigned long  apqi;
+
+	for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm, AP_DOMAINS) {
+		q = vfio_ap_mdev_get_queue(matrix_mdev, AP_MKQID(apid, apqi));
+		if (q)
+			list_add_tail(&q->reset_qnode, qlist);
+	}
+}
+
+static void reset_queues_for_apid(struct ap_matrix_mdev *matrix_mdev,
+				  unsigned long apid)
+{
+	struct list_head qlist;
+
+	INIT_LIST_HEAD(&qlist);
+	collect_queues_to_reset(matrix_mdev, apid, &qlist);
+	vfio_ap_mdev_reset_qlist(&qlist);
+}
+
 static int reset_queues_for_apids(struct ap_matrix_mdev *matrix_mdev,
 				  unsigned long *apm_reset)
 {
-	struct vfio_ap_queue *q, *tmpq;
 	struct list_head qlist;
-	unsigned long apid, apqi;
-	int apqn, ret = 0;
+	unsigned long apid;
 
 	if (bitmap_empty(apm_reset, AP_DEVICES))
 		return 0;
 
 	INIT_LIST_HEAD(&qlist);
 
-	for_each_set_bit_inv(apid, apm_reset, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
-				     AP_DOMAINS) {
-			/*
-			 * If the domain is not in the host's AP configuration,
-			 * then resetting it will fail with response code 01
-			 * (APQN not valid).
-			 */
-			if (!test_bit_inv(apqi,
-					  (unsigned long *)matrix_dev->info.aqm))
-				continue;
-
-			apqn = AP_MKQID(apid, apqi);
-			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
-
-			if (q)
-				list_add_tail(&q->reset_qnode, &qlist);
-		}
-	}
+	for_each_set_bit_inv(apid, apm_reset, AP_DEVICES)
+		collect_queues_to_reset(matrix_mdev, apid, &qlist);
 
-	ret = vfio_ap_mdev_reset_qlist(&qlist);
-
-	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode)
-		list_del(&q->reset_qnode);
-
-	return ret;
+	return vfio_ap_mdev_reset_qlist(&qlist);
 }
 
 /**
@@ -2217,24 +2217,30 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 	matrix_mdev = q->matrix_mdev;
 
 	if (matrix_mdev) {
-		vfio_ap_unlink_queue_fr_mdev(q);
-
 		apid = AP_QID_CARD(q->apqn);
 		apqi = AP_QID_QUEUE(q->apqn);
-
-		/*
-		 * If the queue is assigned to the guest's APCB, then remove
-		 * the adapter's APID from the APCB and hot it into the guest.
-		 */
+		/* If the queue is assigned to the guest's AP configuration */
 		if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 		    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
+			/*
+			 * Since the queues are defined via a matrix of adapters
+			 * and domains, it is not possible to hot unplug a
+			 * single queue; so, let's unplug the adapter.
+			 */
 			clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+			reset_queues_for_apid(matrix_mdev, apid);
+			goto done;
 		}
 	}
 
 	vfio_ap_mdev_reset_queue(q);
 	flush_work(&q->reset_work);
+
+done:
+	if (matrix_mdev)
+		vfio_ap_unlink_queue_fr_mdev(q);
+
 	dev_set_drvdata(&apdev->device, NULL);
 	kfree(q);
 	release_update_locks_for_mdev(matrix_mdev);
-- 
2.43.0


^ permalink raw reply related	[relevance 5%]

* [PATCH v4 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration
    2024-01-15 18:54 20% ` [PATCH v4 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
@ 2024-01-15 18:54 14% ` Tony Krowiak
  2024-01-15 18:54 12% ` [PATCH v4 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2024-01-15 18:54 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, agordeev, gor, Tony Krowiak, stable

From: Tony Krowiak <akrowiak@linux.ibm.com>

While filtering the mdev matrix, it doesn't make sense - and will have
unexpected results - to filter an APID from the matrix if the APID or one
of the associated APQIs is not in the host's AP configuration. There are
two reasons for this:

1. An adapter or domain that is not in the host's AP configuration can be
   assigned to the matrix; this is known as over-provisioning. Queue
   devices, however, are only created for adapters and domains in the
   host's AP configuration, so there will be no queues associated with an
   over-provisioned adapter or domain to filter.

2. The adapter or domain may have been externally removed from the host's
   configuration via an SE or HMC attached to a DPM enabled LPAR. In this
   case, the vfio_ap device driver would have been notified by the AP bus
   via the on_config_changed callback and the adapter or domain would
   have already been filtered.

Since the matrix_mdev->shadow_apcb.apm and matrix_mdev->shadow_apcb.aqm are
copied from the mdev matrix sans the APIDs and APQIs not in the host's AP
configuration, let's loop over those bitmaps instead of those assigned to
the matrix.
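
For illustration, a small userspace sketch of why iterating the shadow
bitmaps suffices; 64-bit masks stand in for the 256-bit APM bitmaps and the
values are made up:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t mdev_apm = 0x15;   /* adapters 0, 2 and 4 assigned to the mdev */
        uint64_t host_apm = 0x05;   /* adapter 4 is over-provisioned            */
        uint64_t shadow_apm = mdev_apm & host_apm;  /* what bitmap_and() yields */
        int apid;

        /*
         * Looping over shadow_apm never visits adapter 4, so there is no
         * pointless queue lookup for a device that cannot exist.
         */
        for (apid = 0; apid < 64; apid++)
                if (shadow_apm & (1ULL << apid))
                        printf("inspect APID %d\n", apid);

        return 0;
}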

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 1f7a6c106786..e825e13847fe 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -695,8 +695,9 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 	bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
 		   (unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
 
-	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
+	for_each_set_bit_inv(apid, matrix_mdev->shadow_apcb.apm, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+				     AP_DOMAINS) {
 			/*
 			 * If the APQN is not bound to the vfio_ap device
 			 * driver, then we can't assign it to the guest's
-- 
2.43.0


^ permalink raw reply related	[relevance 14%]

* [PATCH v4 1/6] s390/vfio-ap: always filter entire AP matrix
  @ 2024-01-15 18:54 20% ` Tony Krowiak
  2024-01-15 18:54 14% ` [PATCH v4 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2024-01-15 18:54 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, agordeev, gor, Tony Krowiak, stable

From: Tony Krowiak <akrowiak@linux.ibm.com>

The vfio_ap_mdev_filter_matrix function is called whenever a new adapter or
domain is assigned to the mdev. The purpose of the function is to update
the guest's AP configuration by filtering the matrix of adapters and
domains assigned to the mdev. When an adapter or domain is assigned, only
the APQNs associated with the APID of the new adapter or APQI of the new
domain are inspected. If an APQN does not reference a queue device bound to
the vfio_ap device driver, then its APID will be filtered from the mdev's
matrix when updating the guest's AP configuration.

Inspecting only the APID of the new adapter or APQI of the new domain will
result in passing AP queues through to a guest that are not bound to the
vfio_ap device driver under certain circumstances. Consider the following:

guest's AP configuration (all also assigned to the mdev's matrix):
14.0004
14.0005
14.0006
16.0004
16.0005
16.0006

unassign domain 4
unbind queue 16.0005
assign domain 4

When domain 4 is re-assigned, since only domain 4 will be inspected, the
APQNs that will be examined will be:
14.0004
16.0004

Since both of those APQNs reference queue devices that are bound to the
vfio_ap device driver, nothing will get filtered from the mdev's matrix
when updating the guest's AP configuration. Consequently, queue 16.0005
will get passed through despite not being bound to the driver. This
violates the Linux device model requirement that a guest shall only be
given access to devices bound to the device driver facilitating their
pass-through.

To resolve this problem, every adapter and domain assigned to the mdev will
be inspected when filtering the mdev's matrix.
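
A toy sketch of the scenario above; the bound() predicate simply hard-codes
the unbound queue 16.0005, and the point is that a full-matrix pass finds
what a delta-only pass (domain 4 alone) would miss:

#include <stdio.h>

static int bound(unsigned int apid, unsigned int apqi)
{
        /* Queue 16.0005 was unbound from the vfio_ap device driver. */
        return !(apid == 0x16 && apqi == 5);
}

int main(void)
{
        unsigned int apids[] = { 0x14, 0x16 }, apqis[] = { 4, 5, 6 };
        unsigned int i, j;

        /*
         * A delta-only pass over re-assigned domain 4 sees only 14.0004 and
         * 16.0004, both bound, so nothing gets filtered.  A pass over the
         * entire matrix also inspects 16.0005 and filters APID 0x16.
         */
        for (i = 0; i < 2; i++)
                for (j = 0; j < 3; j++)
                        if (!bound(apids[i], apqis[j]))
                                printf("filter APID %02x (queue %02x.%04x unbound)\n",
                                       apids[i], apids[i], apqis[j]);

        return 0;
}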

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 57 +++++++++----------------------
 1 file changed, 17 insertions(+), 40 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index acb710d3d7bc..1f7a6c106786 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -674,8 +674,7 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
  * Return: a boolean value indicating whether the KVM guest's APCB was changed
  *	   by the filtering or not.
  */
-static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
-				       struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 {
 	unsigned long apid, apqi, apqn;
 	DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -696,8 +695,8 @@ static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
 	bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
 		   (unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
 
-	for_each_set_bit_inv(apid, apm, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, aqm, AP_DOMAINS) {
+	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
 			/*
 			 * If the APQN is not bound to the vfio_ap device
 			 * driver, then we can't assign it to the guest's
@@ -962,7 +961,6 @@ static ssize_t assign_adapter_store(struct device *dev,
 {
 	int ret;
 	unsigned long apid;
-	DECLARE_BITMAP(apm_delta, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -991,11 +989,8 @@ static ssize_t assign_adapter_store(struct device *dev,
 	}
 
 	vfio_ap_mdev_link_adapter(matrix_mdev, apid);
-	memset(apm_delta, 0, sizeof(apm_delta));
-	set_bit_inv(apid, apm_delta);
 
-	if (vfio_ap_mdev_filter_matrix(apm_delta,
-				       matrix_mdev->matrix.aqm, matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
 	ret = count;
@@ -1171,7 +1166,6 @@ static ssize_t assign_domain_store(struct device *dev,
 {
 	int ret;
 	unsigned long apqi;
-	DECLARE_BITMAP(aqm_delta, AP_DOMAINS);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -1200,11 +1194,8 @@ static ssize_t assign_domain_store(struct device *dev,
 	}
 
 	vfio_ap_mdev_link_domain(matrix_mdev, apqi);
-	memset(aqm_delta, 0, sizeof(aqm_delta));
-	set_bit_inv(apqi, aqm_delta);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm, aqm_delta,
-				       matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
 	ret = count;
@@ -2109,9 +2100,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 	if (matrix_mdev) {
 		vfio_ap_mdev_link_queue(matrix_mdev, q);
 
-		if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm,
-					       matrix_mdev->matrix.aqm,
-					       matrix_mdev))
+		if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 	dev_set_drvdata(&apdev->device, q);
@@ -2461,34 +2450,22 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
 
 static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 {
-	bool do_hotplug = false;
-	int filter_domains = 0;
-	int filter_adapters = 0;
-	DECLARE_BITMAP(apm, AP_DEVICES);
-	DECLARE_BITMAP(aqm, AP_DOMAINS);
+	bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
 
 	mutex_lock(&matrix_mdev->kvm->lock);
 	mutex_lock(&matrix_dev->mdevs_lock);
 
-	filter_adapters = bitmap_and(apm, matrix_mdev->matrix.apm,
-				     matrix_mdev->apm_add, AP_DEVICES);
-	filter_domains = bitmap_and(aqm, matrix_mdev->matrix.aqm,
-				    matrix_mdev->aqm_add, AP_DOMAINS);
-
-	if (filter_adapters && filter_domains)
-		do_hotplug |= vfio_ap_mdev_filter_matrix(apm, aqm, matrix_mdev);
-	else if (filter_adapters)
-		do_hotplug |=
-			vfio_ap_mdev_filter_matrix(apm,
-						   matrix_mdev->shadow_apcb.aqm,
-						   matrix_mdev);
-	else
-		do_hotplug |=
-			vfio_ap_mdev_filter_matrix(matrix_mdev->shadow_apcb.apm,
-						   aqm, matrix_mdev);
+	filter_adapters = bitmap_intersects(matrix_mdev->matrix.apm,
+					    matrix_mdev->apm_add, AP_DEVICES);
+	filter_domains = bitmap_intersects(matrix_mdev->matrix.aqm,
+					   matrix_mdev->aqm_add, AP_DOMAINS);
+	filter_cdoms = bitmap_intersects(matrix_mdev->matrix.adm,
+					 matrix_mdev->adm_add, AP_DOMAINS);
+
+	if (filter_adapters || filter_domains)
+		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
 
-	if (bitmap_intersects(matrix_mdev->matrix.adm, matrix_mdev->adm_add,
-			      AP_DOMAINS))
+	if (filter_cdoms)
 		do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
 
 	if (do_hotplug)
-- 
2.43.0


^ permalink raw reply related	[relevance 20%]

* Re: [PATCH v7 15/16] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  2024-01-15  4:27  0%       ` Xiaoyao Li
@ 2024-01-15 14:54  0%         ` Zhao Liu
  0 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2024-01-15 14:54 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Eduardo Habkost, Marcel Apfelbaum, Michael S . Tsirkin,
	Richard Henderson, Paolo Bonzini, Marcelo Tosatti, qemu-devel,
	kvm, Zhenyu Wang, Zhuocheng Ding, Zhao Liu, Babu Moger,
	Yongwei Ma

Hi Xiaoyao,

On Mon, Jan 15, 2024 at 12:27:43PM +0800, Xiaoyao Li wrote:
> Date: Mon, 15 Jan 2024 12:27:43 +0800
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> Subject: Re: [PATCH v7 15/16] i386: Use offsets get NumSharingCache for
>  CPUID[0x8000001D].EAX[bits 25:14]
> 
> On 1/15/2024 11:48 AM, Zhao Liu wrote:
> > Hi Xiaoyao,
> > 
> > On Sun, Jan 14, 2024 at 10:42:41PM +0800, Xiaoyao Li wrote:
> > > Date: Sun, 14 Jan 2024 22:42:41 +0800
> > > From: Xiaoyao Li <xiaoyao.li@intel.com>
> > > Subject: Re: [PATCH v7 15/16] i386: Use offsets get NumSharingCache for
> > >   CPUID[0x8000001D].EAX[bits 25:14]
> > > 
> > > On 1/8/2024 4:27 PM, Zhao Liu wrote:
> > > > From: Zhao Liu <zhao1.liu@intel.com>
> > > > 
> > > > The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
> > > > for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
> > > > the number of sharing threads directly.
> > > > 
> > > >   From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
> > > > means [1]:
> > > > 
> > > > The number of logical processors sharing this cache is the value of
> > > > this field incremented by 1. To determine which logical processors are
> > > > sharing a cache, determine a Share Id for each processor as follows:
> > > > 
> > > > ShareId = LocalApicId >> log2(NumSharingCache+1)
> > > > 
> > > > Logical processors with the same ShareId then share a cache. If
> > > > NumSharingCache+1 is not a power of two, round it up to the next power
> > > > of two.
> > > > 
> > > >   From the description above, the calculation of this field should be same
> > > > as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
> > > > APIC ID to calculate this field.
> > > > 
> > > > [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
> > > >        Information
> > > 
> > > this patch can be dropped because we have next patch.
> > 
> > This patch is mainly used to explicitly emphasize the change in the
> > encoding method and its compliance with the AMD spec... I didn't test on
> > an AMD machine, so the more granular patch would make it easier for the
> > community to review and test.
> 
> then please move this patch ahead, e.g., after patch 2.
>

OK. Thanks!
-Zhao


^ permalink raw reply	[relevance 0%]

* Re: [PATCH v7 15/16] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  2024-01-15  3:48  0%     ` Zhao Liu
@ 2024-01-15  4:27  0%       ` Xiaoyao Li
  2024-01-15 14:54  0%         ` Zhao Liu
  0 siblings, 1 reply; 200+ results
From: Xiaoyao Li @ 2024-01-15  4:27 UTC (permalink / raw)
  To: Zhao Liu
  Cc: Eduardo Habkost, Marcel Apfelbaum, Michael S . Tsirkin,
	Richard Henderson, Paolo Bonzini, Marcelo Tosatti, qemu-devel,
	kvm, Zhenyu Wang, Zhuocheng Ding, Zhao Liu, Babu Moger,
	Yongwei Ma

On 1/15/2024 11:48 AM, Zhao Liu wrote:
> Hi Xiaoyao,
> 
> On Sun, Jan 14, 2024 at 10:42:41PM +0800, Xiaoyao Li wrote:
>> Date: Sun, 14 Jan 2024 22:42:41 +0800
>> From: Xiaoyao Li <xiaoyao.li@intel.com>
>> Subject: Re: [PATCH v7 15/16] i386: Use offsets get NumSharingCache for
>>   CPUID[0x8000001D].EAX[bits 25:14]
>>
>> On 1/8/2024 4:27 PM, Zhao Liu wrote:
>>> From: Zhao Liu <zhao1.liu@intel.com>
>>>
>>> The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
>>> for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
>>> the number of sharing threads directly.
>>>
>>>   From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
>>> means [1]:
>>>
>>> The number of logical processors sharing this cache is the value of
>>> this field incremented by 1. To determine which logical processors are
>>> sharing a cache, determine a Share Id for each processor as follows:
>>>
>>> ShareId = LocalApicId >> log2(NumSharingCache+1)
>>>
>>> Logical processors with the same ShareId then share a cache. If
>>> NumSharingCache+1 is not a power of two, round it up to the next power
>>> of two.
>>>
>>>   From the description above, the calculation of this field should be same
>>> as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
>>> APIC ID to calculate this field.
>>>
>>> [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
>>>        Information
>>
>> this patch can be dropped because we have next patch.
> 
> This patch is mainly used to explicitly emphasize the change in the
> encoding method and its compliance with the AMD spec... I didn't test on
> an AMD machine, so the more granular patch would make it easier for the
> community to review and test.

then please move this patch ahead, e.g., after patch 2.

> Thanks,
> Zhao
> 
>>
>>> Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
>>> Reviewed-by: Babu Moger <babu.moger@amd.com>
>>> Tested-by: Babu Moger <babu.moger@amd.com>
>>> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>>> ---
>>> Changes since v3:
>>>    * Rewrite the subject. (Babu)
>>>    * Delete the original "comment/help" expression, as this behavior is
>>>      confirmed for AMD CPUs. (Babu)
>>>    * Rename "num_apic_ids" (v3) to "num_sharing_cache" to match spec
>>>      definition. (Babu)
>>>
>>> Changes since v1:
>>>    * Rename "l3_threads" to "num_apic_ids" in
>>>      encode_cache_cpuid8000001d(). (Yanan)
>>>    * Add the description of the original commit and add Cc.
>>> ---
>>>    target/i386/cpu.c | 10 ++++------
>>>    1 file changed, 4 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
>>> index b23e8190dc68..8a4d72f6f760 100644
>>> --- a/target/i386/cpu.c
>>> +++ b/target/i386/cpu.c
>>> @@ -483,7 +483,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
>>>                                           uint32_t *eax, uint32_t *ebx,
>>>                                           uint32_t *ecx, uint32_t *edx)
>>>    {
>>> -    uint32_t l3_threads;
>>> +    uint32_t num_sharing_cache;
>>>        assert(cache->size == cache->line_size * cache->associativity *
>>>                              cache->partitions * cache->sets);
>>> @@ -492,13 +492,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
>>>        /* L3 is shared among multiple cores */
>>>        if (cache->level == 3) {
>>> -        l3_threads = topo_info->modules_per_die *
>>> -                     topo_info->cores_per_module *
>>> -                     topo_info->threads_per_core;
>>> -        *eax |= (l3_threads - 1) << 14;
>>> +        num_sharing_cache = 1 << apicid_die_offset(topo_info);
>>>        } else {
>>> -        *eax |= ((topo_info->threads_per_core - 1) << 14);
>>> +        num_sharing_cache = 1 << apicid_core_offset(topo_info);
>>>        }
>>> +    *eax |= (num_sharing_cache - 1) << 14;
>>>        assert(cache->line_size > 0);
>>>        assert(cache->partitions > 0);
>>


^ permalink raw reply	[relevance 0%]

* Re: [PATCH v7 15/16] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  2024-01-14 14:42  0%   ` Xiaoyao Li
@ 2024-01-15  3:48  0%     ` Zhao Liu
  2024-01-15  4:27  0%       ` Xiaoyao Li
  0 siblings, 1 reply; 200+ results
From: Zhao Liu @ 2024-01-15  3:48 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Eduardo Habkost, Marcel Apfelbaum, Michael S . Tsirkin,
	Richard Henderson, Paolo Bonzini, Marcelo Tosatti, qemu-devel,
	kvm, Zhenyu Wang, Zhuocheng Ding, Zhao Liu, Babu Moger,
	Yongwei Ma

Hi Xiaoyao,

On Sun, Jan 14, 2024 at 10:42:41PM +0800, Xiaoyao Li wrote:
> Date: Sun, 14 Jan 2024 22:42:41 +0800
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> Subject: Re: [PATCH v7 15/16] i386: Use offsets get NumSharingCache for
>  CPUID[0x8000001D].EAX[bits 25:14]
> 
> On 1/8/2024 4:27 PM, Zhao Liu wrote:
> > From: Zhao Liu <zhao1.liu@intel.com>
> > 
> > The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
> > for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
> > the number of sharing threads directly.
> > 
> >  From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
> > means [1]:
> > 
> > The number of logical processors sharing this cache is the value of
> > this field incremented by 1. To determine which logical processors are
> > sharing a cache, determine a Share Id for each processor as follows:
> > 
> > ShareId = LocalApicId >> log2(NumSharingCache+1)
> > 
> > Logical processors with the same ShareId then share a cache. If
> > NumSharingCache+1 is not a power of two, round it up to the next power
> > of two.
> > 
> >  From the description above, the calculation of this field should be same
> > as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
> > APIC ID to calculate this field.
> > 
> > [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
> >       Information
> 
> this patch can be dropped because we have next patch.

This patch is mainly used to explicitly emphasize the change in the
encoding method and its compliance with the AMD spec... I didn't test on
an AMD machine, so the more granular patch would make it easier for the
community to review and test.

Thanks,
Zhao

> 
> > Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
> > Reviewed-by: Babu Moger <babu.moger@amd.com>
> > Tested-by: Babu Moger <babu.moger@amd.com>
> > Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> > Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > Changes since v3:
> >   * Rewrite the subject. (Babu)
> >   * Delete the original "comment/help" expression, as this behavior is
> >     confirmed for AMD CPUs. (Babu)
> >   * Rename "num_apic_ids" (v3) to "num_sharing_cache" to match spec
> >     definition. (Babu)
> > 
> > Changes since v1:
> >   * Rename "l3_threads" to "num_apic_ids" in
> >     encode_cache_cpuid8000001d(). (Yanan)
> >   * Add the description of the original commit and add Cc.
> > ---
> >   target/i386/cpu.c | 10 ++++------
> >   1 file changed, 4 insertions(+), 6 deletions(-)
> > 
> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > index b23e8190dc68..8a4d72f6f760 100644
> > --- a/target/i386/cpu.c
> > +++ b/target/i386/cpu.c
> > @@ -483,7 +483,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
> >                                          uint32_t *eax, uint32_t *ebx,
> >                                          uint32_t *ecx, uint32_t *edx)
> >   {
> > -    uint32_t l3_threads;
> > +    uint32_t num_sharing_cache;
> >       assert(cache->size == cache->line_size * cache->associativity *
> >                             cache->partitions * cache->sets);
> > @@ -492,13 +492,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
> >       /* L3 is shared among multiple cores */
> >       if (cache->level == 3) {
> > -        l3_threads = topo_info->modules_per_die *
> > -                     topo_info->cores_per_module *
> > -                     topo_info->threads_per_core;
> > -        *eax |= (l3_threads - 1) << 14;
> > +        num_sharing_cache = 1 << apicid_die_offset(topo_info);
> >       } else {
> > -        *eax |= ((topo_info->threads_per_core - 1) << 14);
> > +        num_sharing_cache = 1 << apicid_core_offset(topo_info);
> >       }
> > +    *eax |= (num_sharing_cache - 1) << 14;
> >       assert(cache->line_size > 0);
> >       assert(cache->partitions > 0);
> 

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v7 15/16] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  2024-01-08  8:27  4% ` [PATCH v7 15/16] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
@ 2024-01-14 14:42  0%   ` Xiaoyao Li
  2024-01-15  3:48  0%     ` Zhao Liu
  0 siblings, 1 reply; 200+ results
From: Xiaoyao Li @ 2024-01-14 14:42 UTC (permalink / raw)
  To: Zhao Liu, Eduardo Habkost, Marcel Apfelbaum, Michael S . Tsirkin,
	Richard Henderson, Paolo Bonzini, Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Zhuocheng Ding, Zhao Liu,
	Babu Moger, Yongwei Ma

On 1/8/2024 4:27 PM, Zhao Liu wrote:
> From: Zhao Liu <zhao1.liu@intel.com>
> 
> The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
> for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
> the number of sharing threads directly.
> 
>  From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
> means [1]:
> 
> The number of logical processors sharing this cache is the value of
> this field incremented by 1. To determine which logical processors are
> sharing a cache, determine a Share Id for each processor as follows:
> 
> ShareId = LocalApicId >> log2(NumSharingCache+1)
> 
> Logical processors with the same ShareId then share a cache. If
> NumSharingCache+1 is not a power of two, round it up to the next power
> of two.
> 
>  From the description above, the calculation of this field should be same
> as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
> APIC ID to calculate this field.
> 
> [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
>       Information

this patch can be dropped because we have next patch.

> Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
> Reviewed-by: Babu Moger <babu.moger@amd.com>
> Tested-by: Babu Moger <babu.moger@amd.com>
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> Changes since v3:
>   * Rewrite the subject. (Babu)
>   * Delete the original "comment/help" expression, as this behavior is
>     confirmed for AMD CPUs. (Babu)
>   * Rename "num_apic_ids" (v3) to "num_sharing_cache" to match spec
>     definition. (Babu)
> 
> Changes since v1:
>   * Rename "l3_threads" to "num_apic_ids" in
>     encode_cache_cpuid8000001d(). (Yanan)
>   * Add the description of the original commit and add Cc.
> ---
>   target/i386/cpu.c | 10 ++++------
>   1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index b23e8190dc68..8a4d72f6f760 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -483,7 +483,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
>                                          uint32_t *eax, uint32_t *ebx,
>                                          uint32_t *ecx, uint32_t *edx)
>   {
> -    uint32_t l3_threads;
> +    uint32_t num_sharing_cache;
>       assert(cache->size == cache->line_size * cache->associativity *
>                             cache->partitions * cache->sets);
>   
> @@ -492,13 +492,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
>   
>       /* L3 is shared among multiple cores */
>       if (cache->level == 3) {
> -        l3_threads = topo_info->modules_per_die *
> -                     topo_info->cores_per_module *
> -                     topo_info->threads_per_core;
> -        *eax |= (l3_threads - 1) << 14;
> +        num_sharing_cache = 1 << apicid_die_offset(topo_info);
>       } else {
> -        *eax |= ((topo_info->threads_per_core - 1) << 14);
> +        num_sharing_cache = 1 << apicid_core_offset(topo_info);
>       }
> +    *eax |= (num_sharing_cache - 1) << 14;
>   
>       assert(cache->line_size > 0);
>       assert(cache->partitions > 0);
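
For reference, a minimal userspace sketch of the ShareId computation in the
APM text quoted above; the APIC IDs and the example field value of 11 are
made up:

#include <stdio.h>
#include <stdint.h>

/* Smallest power of two >= x (x > 0). */
static uint32_t roundup_pow2(uint32_t x)
{
        uint32_t p = 1;

        while (p < x)
                p <<= 1;
        return p;
}

static uint32_t share_id(uint32_t local_apic_id, uint32_t num_sharing_cache)
{
        uint32_t n = roundup_pow2(num_sharing_cache + 1);
        uint32_t shift = 0;

        while ((1u << shift) < n)
                shift++;
        return local_apic_id >> shift;
}

int main(void)
{
        /*
         * CPUID[0x8000001D].EAX[25:14] = 11 means 12 threads share the L3;
         * 12 rounds up to 16, so ShareId = LocalApicId >> 4.
         */
        printf("ShareId(apic 0x13) = %u\n", share_id(0x13, 11));
        printf("ShareId(apic 0x23) = %u\n", share_id(0x23, 11));
        return 0;
}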


^ permalink raw reply	[relevance 0%]

* [PATCH] KVM: VMX: Correct *intr_info content and *info2 for EPT_VIOLATION in get_exit_info()
@ 2024-01-12  6:51  4% Robert Hoo
  2024-01-26 20:45  0% ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Robert Hoo @ 2024-01-12  6:51 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm; +Cc: Robert Hoo

Fill vmx::idt_vectoring_info in *intr_info, to align with
svm_get_exit_info(), where *intr_info is for complement information about
intercepts occurring during event delivery through IDT (APM 15.7.2
Intercepts During IDT Interrupt Delivery), whose counterpart in
VMX is IDT_VECTORING_INFO_FIELD (SDM 25.9.3 Information for VM Exits
That Occur During Event Delivery), rather than VM_EXIT_INTR_INFO.

Fill *info2 with GUEST_PHYSICAL_ADDRESS in case of EPT_VIOLATION, also
to align with SVM. It can be filled with other info for different exit
reasons, like SVM's EXITINFO2.

Fixes: 235ba74f008d ("KVM: x86: Add intr/vectoring info and error code to kvm_exit tracepoint")
Signed-off-by: Robert Hoo <robert.hoo.linux@gmail.com>
---
 arch/x86/kvm/vmx/vmx.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d21f55f323ea..f1bf9f1fc561 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6141,14 +6141,26 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 
 	*reason = vmx->exit_reason.full;
 	*info1 = vmx_get_exit_qual(vcpu);
+
 	if (!(vmx->exit_reason.failed_vmentry)) {
-		*info2 = vmx->idt_vectoring_info;
-		*intr_info = vmx_get_intr_info(vcpu);
+		*intr_info = vmx->idt_vectoring_info;
 		if (is_exception_with_error_code(*intr_info))
-			*error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+			*error_code = vmcs_read32(IDT_VECTORING_ERROR_CODE);
 		else
 			*error_code = 0;
-	} else {
+
+		/* various *info2 semantics according to exit reason */
+		switch (vmx->exit_reason.basic) {
+		case EXIT_REASON_EPT_VIOLATION:
+			*info2 = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+			break;
+		/* To do: *info2 for other exit reasons */
+		default:
+			*info2 = 0;
+			break;
+		}
+
+	} else {
 		*info2 = 0;
 		*intr_info = 0;
 		*error_code = 0;

base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
-- 
2.39.3


^ permalink raw reply related	[relevance 4%]

* [PATCH v3 6/6] s390/vfio-ap: do not reset queue removed from host config
                     ` (3 preceding siblings ...)
  2024-01-11 14:38  5% ` [PATCH v3 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
@ 2024-01-11 14:38  5% ` Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2024-01-11 14:38 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

When a queue is unbound from the vfio_ap device driver, it is reset to
ensure its crypto data is not leaked when it is bound to another device
driver. If the queue is unbound because the adapter or domain was removed
from the host's AP configuration, then attempting to reset it will fail
with response code 01 (APID not valid) being returned from the
reset command. Let's ensure that the queue is assigned to the host's
configuration before resetting it.
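
A rough userspace approximation of that check; 64-bit masks stand in for
the 256-bit host masks, and the MSB-first bit numbering of the *_inv
helpers is assumed:

#include <stdio.h>
#include <stdint.h>

/* Bit 0 is the most significant bit of the (single) word. */
static int test_bit_inv64(unsigned int nr, uint64_t mask)
{
        return (mask >> (63 - nr)) & 1;
}

int main(void)
{
        uint64_t host_apm = 1ULL << (63 - 0x14);  /* adapter 0x14 configured */
        uint64_t host_aqm = 1ULL << (63 - 0x04);  /* domain 0x04 configured  */
        unsigned int apid = 0x14, apqi = 0x05;    /* queue 14.0005           */

        /*
         * Domain 5 is gone from the host's configuration, so skip the reset
         * rather than eat a response code 01 from the reset command.
         */
        if (test_bit_inv64(apid, host_apm) && test_bit_inv64(apqi, host_aqm))
                printf("reset %02x.%04x\n", apid, apqi);
        else
                printf("skip reset of %02x.%04x\n", apid, apqi);

        return 0;
}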

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Jason J. Herne <jjherne@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: eeb386aeb5b7 ("s390/vfio-ap: handle config changed and scan complete notification")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index e014108067dc..84decb0d5c97 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -2197,6 +2197,8 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 	q = dev_get_drvdata(&apdev->device);
 	get_update_locks_for_queue(q);
 	matrix_mdev = q->matrix_mdev;
+	apid = AP_QID_CARD(q->apqn);
+	apqi = AP_QID_QUEUE(q->apqn);
 
 	if (matrix_mdev) {
 		/* If the queue is assigned to the guest's AP configuration */
@@ -2214,8 +2216,16 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 		}
 	}
 
-	vfio_ap_mdev_reset_queue(q);
-	flush_work(&q->reset_work);
+	/*
+	 * If the queue is not in the host's AP configuration, then resetting
+	 * it will fail with response code 01, (APQN not valid); so, let's make
+	 * sure it is in the host's config.
+	 */
+	if (test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm) &&
+	    test_bit_inv(apqi, (unsigned long *)matrix_dev->info.aqm)) {
+		vfio_ap_mdev_reset_queue(q);
+		flush_work(&q->reset_work);
+	}
 
 done:
 	if (matrix_mdev)
-- 
2.43.0


^ permalink raw reply related	[relevance 5%]

* [PATCH v3 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration
    2024-01-11 14:38 20% ` [PATCH v3 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
@ 2024-01-11 14:38 14% ` Tony Krowiak
  2024-01-11 14:38 13% ` [PATCH v3 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2024-01-11 14:38 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

While filtering the mdev matrix, it doesn't make sense - and will have
unexpected results - to filter an APID from the matrix if the APID or one
of the associated APQIs is not in the host's AP configuration. There are
two reasons for this:

1. An adapter or domain that is not in the host's AP configuration can be
   assigned to the matrix; this is known as over-provisioning. Queue
   devices, however, are only created for adapters and domains in the
   host's AP configuration, so there will be no queues associated with an
   over-provisioned adapter or domain to filter.

2. The adapter or domain may have been externally removed from the host's
   configuration via an SE or HMC attached to a DPM enabled LPAR. In this
   case, the vfio_ap device driver would have been notified by the AP bus
   via the on_config_changed callback and the adapter or domain would
   have already been filtered.

Since the matrix_mdev->shadow_apcb.apm and matrix_mdev->shadow_apcb.aqm are
copied from the mdev matrix sans the APIDs and APQIs not in the host's AP
configuration, let's loop over those bitmaps instead of those assigned to
the matrix.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 9382b32e5bd1..47232e19a50e 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -691,8 +691,9 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 	bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
 		   (unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
 
-	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
+	for_each_set_bit_inv(apid, matrix_mdev->shadow_apcb.apm, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+				     AP_DOMAINS) {
 			/*
 			 * If the APQN is not bound to the vfio_ap device
 			 * driver, then we can't assign it to the guest's
-- 
2.43.0


^ permalink raw reply related	[relevance 14%]

* [PATCH v3 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver
                     ` (2 preceding siblings ...)
  2024-01-11 14:38 13% ` [PATCH v3 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
@ 2024-01-11 14:38  5% ` Tony Krowiak
  2024-01-11 14:38  5% ` [PATCH v3 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2024-01-11 14:38 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

When a queue is unbound from the vfio_ap device driver, if that queue is
assigned to a guest's AP configuration, its associated adapter is removed
because queues are defined to a guest via a matrix of adapters and
domains; so, it is not possible to remove a single queue.

If an adapter is removed from the guest's AP configuration, all associated
queues must be reset to prevent leaking crypto data should any of them be
assigned to a different guest or device driver. The one caveat is that if
the queue is being removed because the adapter or domain has been removed
from the host's AP configuration, then an attempt to reset the queue will
fail with response code 01, AP-queue number not valid; so resetting these
queues should be skipped.

Acked-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Fixes: 09d31ff78793 ("s390/vfio-ap: hot plug/unplug of AP devices when probed/removed")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 78 ++++++++++++++++---------------
 1 file changed, 41 insertions(+), 37 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 11f8f0bcc7ed..e014108067dc 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -935,45 +935,45 @@ static void vfio_ap_mdev_link_adapter(struct ap_matrix_mdev *matrix_mdev,
 				       AP_MKQID(apid, apqi));
 }
 
+static void collect_queues_to_reset(struct ap_matrix_mdev *matrix_mdev,
+				    unsigned long apid,
+				    struct list_head *qlist)
+{
+	struct vfio_ap_queue *q;
+	unsigned long  apqi;
+
+	for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm, AP_DOMAINS) {
+		q = vfio_ap_mdev_get_queue(matrix_mdev, AP_MKQID(apid, apqi));
+		if (q)
+			list_add_tail(&q->reset_qnode, qlist);
+	}
+}
+
+static void reset_queues_for_apid(struct ap_matrix_mdev *matrix_mdev,
+				  unsigned long apid)
+{
+	struct list_head qlist;
+
+	INIT_LIST_HEAD(&qlist);
+	collect_queues_to_reset(matrix_mdev, apid, &qlist);
+	vfio_ap_mdev_reset_qlist(&qlist);
+}
+
 static int reset_queues_for_apids(struct ap_matrix_mdev *matrix_mdev,
 				  unsigned long *apm_reset)
 {
-	struct vfio_ap_queue *q, *tmpq;
 	struct list_head qlist;
-	unsigned long apid, apqi;
-	int apqn, ret = 0;
+	unsigned long apid;
 
 	if (bitmap_empty(apm_reset, AP_DEVICES))
 		return 0;
 
 	INIT_LIST_HEAD(&qlist);
 
-	for_each_set_bit_inv(apid, apm_reset, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
-				     AP_DOMAINS) {
-			/*
-			 * If the domain is not in the host's AP configuration,
-			 * then resetting it will fail with response code 01
-			 * (APQN not valid).
-			 */
-			if (!test_bit_inv(apqi,
-					  (unsigned long *)matrix_dev->info.aqm))
-				continue;
-
-			apqn = AP_MKQID(apid, apqi);
-			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
-
-			if (q)
-				list_add_tail(&q->reset_qnode, &qlist);
-		}
-	}
+	for_each_set_bit_inv(apid, apm_reset, AP_DEVICES)
+		collect_queues_to_reset(matrix_mdev, apid, &qlist);
 
-	ret = vfio_ap_mdev_reset_qlist(&qlist);
-
-	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode)
-		list_del(&q->reset_qnode);
-
-	return ret;
+	return vfio_ap_mdev_reset_qlist(&qlist);
 }
 
 /**
@@ -2199,24 +2199,28 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 	matrix_mdev = q->matrix_mdev;
 
 	if (matrix_mdev) {
-		vfio_ap_unlink_queue_fr_mdev(q);
-
-		apid = AP_QID_CARD(q->apqn);
-		apqi = AP_QID_QUEUE(q->apqn);
-
-		/*
-		 * If the queue is assigned to the guest's APCB, then remove
-		 * the adapter's APID from the APCB and hot it into the guest.
-		 */
+		/* If the queue is assigned to the guest's AP configuration */
 		if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 		    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
+			/*
+			 * Since the queues are defined via a matrix of adapters
+			 * and domains, it is not possible to hot unplug a
+			 * single queue; so, let's unplug the adapter.
+			 */
 			clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+			reset_queues_for_apid(matrix_mdev, apid);
+			goto done;
 		}
 	}
 
 	vfio_ap_mdev_reset_queue(q);
 	flush_work(&q->reset_work);
+
+done:
+	if (matrix_mdev)
+		vfio_ap_unlink_queue_fr_mdev(q);
+
 	dev_set_drvdata(&apdev->device, NULL);
 	kfree(q);
 	release_update_locks_for_mdev(matrix_mdev);
-- 
2.43.0


^ permalink raw reply related	[relevance 5%]

* [PATCH v3 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config
    2024-01-11 14:38 20% ` [PATCH v3 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
  2024-01-11 14:38 14% ` [PATCH v3 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
@ 2024-01-11 14:38 13% ` Tony Krowiak
  2024-01-11 14:38  5% ` [PATCH v3 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
  2024-01-11 14:38  5% ` [PATCH v3 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2024-01-11 14:38 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

When filtering the adapters from the configuration profile for a guest to
create or update a guest's AP configuration, if the APID of an adapter and
the APQI of a domain identify a queue device that is not bound to the
vfio_ap device driver, the APID of the adapter will be filtered because an
individual APQN can not be filtered due to the fact the APQNs are assigned
to an AP configuration as a matrix of APIDs and APQIs. Consequently, a
guest will not have access to all of the queues associated with the
filtered adapter. If the queues are subsequently made available again to
the guest, they should re-appear in a reset state; so, let's make sure all
queues associated with an adapter unplugged from the guest are reset.

In order to identify the set of queues that need to be reset, let's allow a
vfio_ap_queue object to be simultaneously stored in both a hashtable and a
list: A hashtable used to store all of the queues assigned
to a matrix mdev; and/or, a list used to store a subset of the queues that
need to be reset. For example, when an adapter is hot unplugged from a
guest, all guest queues associated with that adapter must be reset. Since
that may be a subset of those assigned to the matrix mdev, they can be
stored in a list that can be passed to the vfio_ap_mdev_reset_queues
function.
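
A toy sketch of the two-container idea, with plain pointers standing in for
the kernel hashtable and list_head; the names and values are illustrative
only:

#include <stdio.h>
#include <stddef.h>

struct queue {
        unsigned int apqn;
        struct queue *next_in_table;    /* stands in for mdev_qnode  */
        struct queue *next_to_reset;    /* stands in for reset_qnode */
};

int main(void)
{
        struct queue q1 = { .apqn = 0x1404 }, q2 = { .apqn = 0x1405 };
        struct queue *table = NULL, *reset_list = NULL, *q;

        /* Both queues are tracked by the mdev... */
        q1.next_in_table = table;  table = &q1;
        q2.next_in_table = table;  table = &q2;

        /* ...but only q2 is put on the ad-hoc reset list right now. */
        q2.next_to_reset = reset_list;  reset_list = &q2;

        for (q = reset_list; q; q = q->next_to_reset)
                printf("reset APQN %04x\n", q->apqn);

        return 0;
}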

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c     | 171 +++++++++++++++++++-------
 drivers/s390/crypto/vfio_ap_private.h |  11 +-
 2 files changed, 133 insertions(+), 49 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 26bd4aca497a..11f8f0bcc7ed 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -32,7 +32,8 @@
 
 #define AP_RESET_INTERVAL		20	/* Reset sleep interval (20ms)		*/
 
-static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable);
+static int vfio_ap_mdev_reset_queues(struct ap_matrix_mdev *matrix_mdev);
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist);
 static struct vfio_ap_queue *vfio_ap_find_queue(int apqn);
 static const struct vfio_device_ops vfio_ap_matrix_dev_ops;
 static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q);
@@ -661,16 +662,23 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
  *				device driver.
  *
  * @matrix_mdev: the matrix mdev whose matrix is to be filtered.
+ * @apm_filtered: a 256-bit bitmap for storing the APIDs filtered from the
+ *		  guest's AP configuration that are still in the host's AP
+ *		  configuration.
  *
  * Note: If an APQN referencing a queue device that is not bound to the vfio_ap
  *	 driver, its APID will be filtered from the guest's APCB. The matrix
  *	 structure precludes filtering an individual APQN, so its APID will be
- *	 filtered.
+ *	 filtered. Consequently, all queues associated with the adapter that
+ *	 are in the host's AP configuration must be reset. If queues are
+ *	 subsequently made available again to the guest, they should re-appear
+ *	 in a reset state
  *
  * Return: a boolean value indicating whether the KVM guest's APCB was changed
  *	   by the filtering or not.
  */
-static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev,
+				       unsigned long *apm_filtered)
 {
 	unsigned long apid, apqi, apqn;
 	DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -680,6 +688,7 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 	bitmap_copy(prev_shadow_apm, matrix_mdev->shadow_apcb.apm, AP_DEVICES);
 	bitmap_copy(prev_shadow_aqm, matrix_mdev->shadow_apcb.aqm, AP_DOMAINS);
 	vfio_ap_matrix_init(&matrix_dev->info, &matrix_mdev->shadow_apcb);
+	bitmap_clear(apm_filtered, 0, AP_DEVICES);
 
 	/*
 	 * Copy the adapters, domains and control domains to the shadow_apcb
@@ -705,8 +714,16 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 			apqn = AP_MKQID(apid, apqi);
 			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
 			if (!q || q->reset_status.response_code) {
-				clear_bit_inv(apid,
-					      matrix_mdev->shadow_apcb.apm);
+				clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
+
+				/*
+				 * If the adapter was previously plugged into
+				 * the guest, let's let the caller know that
+				 * the APID was filtered.
+				 */
+				if (test_bit_inv(apid, prev_shadow_apm))
+					set_bit_inv(apid, apm_filtered);
+
 				break;
 			}
 		}
@@ -808,7 +825,7 @@ static void vfio_ap_mdev_remove(struct mdev_device *mdev)
 
 	mutex_lock(&matrix_dev->guests_lock);
 	mutex_lock(&matrix_dev->mdevs_lock);
-	vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
+	vfio_ap_mdev_reset_queues(matrix_mdev);
 	vfio_ap_mdev_unlink_fr_queues(matrix_mdev);
 	list_del(&matrix_mdev->node);
 	mutex_unlock(&matrix_dev->mdevs_lock);
@@ -918,6 +935,47 @@ static void vfio_ap_mdev_link_adapter(struct ap_matrix_mdev *matrix_mdev,
 				       AP_MKQID(apid, apqi));
 }
 
+static int reset_queues_for_apids(struct ap_matrix_mdev *matrix_mdev,
+				  unsigned long *apm_reset)
+{
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
+	unsigned long apid, apqi;
+	int apqn, ret = 0;
+
+	if (bitmap_empty(apm_reset, AP_DEVICES))
+		return 0;
+
+	INIT_LIST_HEAD(&qlist);
+
+	for_each_set_bit_inv(apid, apm_reset, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+				     AP_DOMAINS) {
+			/*
+			 * If the domain is not in the host's AP configuration,
+			 * then resetting it will fail with response code 01
+			 * (APQN not valid).
+			 */
+			if (!test_bit_inv(apqi,
+					  (unsigned long *)matrix_dev->info.aqm))
+				continue;
+
+			apqn = AP_MKQID(apid, apqi);
+			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
+
+			if (q)
+				list_add_tail(&q->reset_qnode, &qlist);
+		}
+	}
+
+	ret = vfio_ap_mdev_reset_qlist(&qlist);
+
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode)
+		list_del(&q->reset_qnode);
+
+	return ret;
+}
+
 /**
  * assign_adapter_store - parses the APID from @buf and sets the
  * corresponding bit in the mediated matrix device's APM
@@ -958,6 +1016,7 @@ static ssize_t assign_adapter_store(struct device *dev,
 {
 	int ret;
 	unsigned long apid;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -987,8 +1046,10 @@ static ssize_t assign_adapter_store(struct device *dev,
 
 	vfio_ap_mdev_link_adapter(matrix_mdev, apid);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+		reset_queues_for_apids(matrix_mdev, apm_filtered);
+	}
 
 	ret = count;
 done:
@@ -1019,11 +1080,12 @@ static struct vfio_ap_queue
  *				 adapter was assigned.
  * @matrix_mdev: the matrix mediated device to which the adapter was assigned.
  * @apid: the APID of the unassigned adapter.
- * @qtable: table for storing queues associated with unassigned adapter.
+ * @qlist: list for storing queues associated with unassigned adapter that
+ *	   need to be reset.
  */
 static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 					unsigned long apid,
-					struct ap_queue_table *qtable)
+					struct list_head *qlist)
 {
 	unsigned long apqi;
 	struct vfio_ap_queue *q;
@@ -1031,11 +1093,10 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 	for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
 		q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
 
-		if (q && qtable) {
+		if (q && qlist) {
 			if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 			    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
-				hash_add(qtable->queues, &q->mdev_qnode,
-					 q->apqn);
+				list_add_tail(&q->reset_qnode, qlist);
 		}
 	}
 }
@@ -1043,26 +1104,23 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 static void vfio_ap_mdev_hot_unplug_adapter(struct ap_matrix_mdev *matrix_mdev,
 					    unsigned long apid)
 {
-	int loop_cursor;
-	struct vfio_ap_queue *q;
-	struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
 
-	hash_init(qtable->queues);
-	vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, qtable);
+	INIT_LIST_HEAD(&qlist);
+	vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, &qlist);
 
 	if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm)) {
 		clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 
-	vfio_ap_mdev_reset_queues(qtable);
+	vfio_ap_mdev_reset_qlist(&qlist);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
 		vfio_ap_unlink_mdev_fr_queue(q);
-		hash_del(&q->mdev_qnode);
+		list_del(&q->reset_qnode);
 	}
-
-	kfree(qtable);
 }
 
 /**
@@ -1163,6 +1221,7 @@ static ssize_t assign_domain_store(struct device *dev,
 {
 	int ret;
 	unsigned long apqi;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -1192,8 +1251,10 @@ static ssize_t assign_domain_store(struct device *dev,
 
 	vfio_ap_mdev_link_domain(matrix_mdev, apqi);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+		reset_queues_for_apids(matrix_mdev, apm_filtered);
+	}
 
 	ret = count;
 done:
@@ -1206,7 +1267,7 @@ static DEVICE_ATTR_WO(assign_domain);
 
 static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 				       unsigned long apqi,
-				       struct ap_queue_table *qtable)
+				       struct list_head *qlist)
 {
 	unsigned long apid;
 	struct vfio_ap_queue *q;
@@ -1214,11 +1275,10 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
 		q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
 
-		if (q && qtable) {
+		if (q && qlist) {
 			if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 			    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
-				hash_add(qtable->queues, &q->mdev_qnode,
-					 q->apqn);
+				list_add_tail(&q->reset_qnode, qlist);
 		}
 	}
 }
@@ -1226,26 +1286,23 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 static void vfio_ap_mdev_hot_unplug_domain(struct ap_matrix_mdev *matrix_mdev,
 					   unsigned long apqi)
 {
-	int loop_cursor;
-	struct vfio_ap_queue *q;
-	struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
 
-	hash_init(qtable->queues);
-	vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, qtable);
+	INIT_LIST_HEAD(&qlist);
+	vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, &qlist);
 
 	if (test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
 		clear_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm);
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 
-	vfio_ap_mdev_reset_queues(qtable);
+	vfio_ap_mdev_reset_qlist(&qlist);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
 		vfio_ap_unlink_mdev_fr_queue(q);
-		hash_del(&q->mdev_qnode);
+		list_del(&q->reset_qnode);
 	}
-
-	kfree(qtable);
 }
 
 /**
@@ -1600,7 +1657,7 @@ static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev)
 		get_update_locks_for_kvm(kvm);
 
 		kvm_arch_crypto_clear_masks(kvm);
-		vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
+		vfio_ap_mdev_reset_queues(matrix_mdev);
 		kvm_put_kvm(kvm);
 		matrix_mdev->kvm = NULL;
 
@@ -1736,15 +1793,33 @@ static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q)
 	}
 }
 
-static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable)
+static int vfio_ap_mdev_reset_queues(struct ap_matrix_mdev *matrix_mdev)
 {
 	int ret = 0, loop_cursor;
 	struct vfio_ap_queue *q;
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode)
+	hash_for_each(matrix_mdev->qtable.queues, loop_cursor, q, mdev_qnode)
 		vfio_ap_mdev_reset_queue(q);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	hash_for_each(matrix_mdev->qtable.queues, loop_cursor, q, mdev_qnode) {
+		flush_work(&q->reset_work);
+
+		if (q->reset_status.response_code)
+			ret = -EIO;
+	}
+
+	return ret;
+}
+
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist)
+{
+	int ret = 0;
+	struct vfio_ap_queue *q;
+
+	list_for_each_entry(q, qlist, reset_qnode)
+		vfio_ap_mdev_reset_queue(q);
+
+	list_for_each_entry(q, qlist, reset_qnode) {
 		flush_work(&q->reset_work);
 
 		if (q->reset_status.response_code)
@@ -1930,7 +2005,7 @@ static ssize_t vfio_ap_mdev_ioctl(struct vfio_device *vdev,
 		ret = vfio_ap_mdev_get_device_info(arg);
 		break;
 	case VFIO_DEVICE_RESET:
-		ret = vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
+		ret = vfio_ap_mdev_reset_queues(matrix_mdev);
 		break;
 	case VFIO_DEVICE_GET_IRQ_INFO:
 			ret = vfio_ap_get_irq_info(arg);
@@ -2062,6 +2137,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 {
 	int ret;
 	struct vfio_ap_queue *q;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev;
 
 	ret = sysfs_create_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2094,15 +2170,17 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 		    !bitmap_empty(matrix_mdev->aqm_add, AP_DOMAINS))
 			goto done;
 
-		if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+		if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+			reset_queues_for_apids(matrix_mdev, apm_filtered);
+		}
 	}
 
 done:
 	dev_set_drvdata(&apdev->device, q);
 	release_update_locks_for_mdev(matrix_mdev);
 
-	return 0;
+	return ret;
 
 err_remove_group:
 	sysfs_remove_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2446,6 +2524,7 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
 
 static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 {
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
 
 	mutex_lock(&matrix_mdev->kvm->lock);
@@ -2459,7 +2538,7 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 					 matrix_mdev->adm_add, AP_DOMAINS);
 
 	if (filter_adapters || filter_domains)
-		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
+		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered);
 
 	if (filter_cdoms)
 		do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
@@ -2467,6 +2546,8 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 	if (do_hotplug)
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
+	reset_queues_for_apids(matrix_mdev, apm_filtered);
+
 	mutex_unlock(&matrix_dev->mdevs_lock);
 	mutex_unlock(&matrix_mdev->kvm->lock);
 }
diff --git a/drivers/s390/crypto/vfio_ap_private.h b/drivers/s390/crypto/vfio_ap_private.h
index 88aff8b81f2f..20eac8b0f0b9 100644
--- a/drivers/s390/crypto/vfio_ap_private.h
+++ b/drivers/s390/crypto/vfio_ap_private.h
@@ -83,10 +83,10 @@ struct ap_matrix {
 };
 
 /**
- * struct ap_queue_table - a table of queue objects.
- *
- * @queues: a hashtable of queues (struct vfio_ap_queue).
- */
+  * struct ap_queue_table - a table of queue objects.
+  *
+  * @queues: a hashtable of queues (struct vfio_ap_queue).
+  */
 struct ap_queue_table {
 	DECLARE_HASHTABLE(queues, 8);
 };
@@ -133,6 +133,8 @@ struct ap_matrix_mdev {
  * @apqn: the APQN of the AP queue device
  * @saved_isc: the guest ISC registered with the GIB interface
  * @mdev_qnode: allows the vfio_ap_queue struct to be added to a hashtable
+ * @reset_qnode: allows the vfio_ap_queue struct to be added to a list of queues
+ *		 that need to be reset
  * @reset_status: the status from the last reset of the queue
  * @reset_work: work to wait for queue reset to complete
  */
@@ -143,6 +145,7 @@ struct vfio_ap_queue {
 #define VFIO_AP_ISC_INVALID 0xff
 	unsigned char saved_isc;
 	struct hlist_node mdev_qnode;
+	struct list_head reset_qnode;
 	struct ap_queue_status reset_status;
 	struct work_struct reset_work;
 };
-- 
2.43.0


^ permalink raw reply related	[relevance 13%]

* [PATCH v3 1/6] s390/vfio-ap: always filter entire AP matrix
  @ 2024-01-11 14:38 20% ` Tony Krowiak
  2024-01-11 14:38 14% ` [PATCH v3 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2024-01-11 14:38 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

The vfio_ap_mdev_filter_matrix function is called whenever a new adapter or
domain is assigned to the mdev. The purpose of the function is to update
the guest's AP configuration by filtering the matrix of adapters and
domains assigned to the mdev. When an adapter or domain is assigned, only
the APQNs associated with the APID of the new adapter or APQI of the new
domain are inspected. If an APQN does not reference a queue device bound to
the vfio_ap device driver, then it's APID will be filtered from the mdev's
matrix when updating the guest's AP configuration.

Inspecting only the APID of the new adapter or APQI of the new domain will
result in passing AP queues through to a guest that are not bound to the
vfio_ap device driver under certain circumstances. Consider the following:

guest's AP configuration (all also assigned to the mdev's matrix):
14.0004
14.0005
14.0006
16.0004
16.0005
16.0006

unassign domain 4
unbind queue 16.0005
assign domain 4

When domain 4 is re-assigned, since only domain 4 will be inspected, the
APQNs that will be examined will be:
14.0004
16.0004

Since both of those APQNs reference queue devices that are bound to the
vfio_ap device driver, nothing will get filtered from the mdev's matrix
when updating the guest's AP configuration. Consequently, queue 16.0005
will get passed through despite not being bound to the driver. This
violates the linux device model requirement that a guest shall only be
given access to devices bound to the device driver facilitating their
pass-through.

To resolve this problem, every adapter and domain assigned to the mdev will
be inspected when filtering the mdev's matrix.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 57 +++++++++----------------------
 1 file changed, 17 insertions(+), 40 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 4db538a55192..9382b32e5bd1 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -670,8 +670,7 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
  * Return: a boolean value indicating whether the KVM guest's APCB was changed
  *	   by the filtering or not.
  */
-static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
-				       struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 {
 	unsigned long apid, apqi, apqn;
 	DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -692,8 +691,8 @@ static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
 	bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
 		   (unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
 
-	for_each_set_bit_inv(apid, apm, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, aqm, AP_DOMAINS) {
+	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
 			/*
 			 * If the APQN is not bound to the vfio_ap device
 			 * driver, then we can't assign it to the guest's
@@ -958,7 +957,6 @@ static ssize_t assign_adapter_store(struct device *dev,
 {
 	int ret;
 	unsigned long apid;
-	DECLARE_BITMAP(apm_delta, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -987,11 +985,8 @@ static ssize_t assign_adapter_store(struct device *dev,
 	}
 
 	vfio_ap_mdev_link_adapter(matrix_mdev, apid);
-	memset(apm_delta, 0, sizeof(apm_delta));
-	set_bit_inv(apid, apm_delta);
 
-	if (vfio_ap_mdev_filter_matrix(apm_delta,
-				       matrix_mdev->matrix.aqm, matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
 	ret = count;
@@ -1167,7 +1162,6 @@ static ssize_t assign_domain_store(struct device *dev,
 {
 	int ret;
 	unsigned long apqi;
-	DECLARE_BITMAP(aqm_delta, AP_DOMAINS);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -1196,11 +1190,8 @@ static ssize_t assign_domain_store(struct device *dev,
 	}
 
 	vfio_ap_mdev_link_domain(matrix_mdev, apqi);
-	memset(aqm_delta, 0, sizeof(aqm_delta));
-	set_bit_inv(apqi, aqm_delta);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm, aqm_delta,
-				       matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
 	ret = count;
@@ -2091,9 +2082,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 	if (matrix_mdev) {
 		vfio_ap_mdev_link_queue(matrix_mdev, q);
 
-		if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm,
-					       matrix_mdev->matrix.aqm,
-					       matrix_mdev))
+		if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 	dev_set_drvdata(&apdev->device, q);
@@ -2443,34 +2432,22 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
 
 static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 {
-	bool do_hotplug = false;
-	int filter_domains = 0;
-	int filter_adapters = 0;
-	DECLARE_BITMAP(apm, AP_DEVICES);
-	DECLARE_BITMAP(aqm, AP_DOMAINS);
+	bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
 
 	mutex_lock(&matrix_mdev->kvm->lock);
 	mutex_lock(&matrix_dev->mdevs_lock);
 
-	filter_adapters = bitmap_and(apm, matrix_mdev->matrix.apm,
-				     matrix_mdev->apm_add, AP_DEVICES);
-	filter_domains = bitmap_and(aqm, matrix_mdev->matrix.aqm,
-				    matrix_mdev->aqm_add, AP_DOMAINS);
-
-	if (filter_adapters && filter_domains)
-		do_hotplug |= vfio_ap_mdev_filter_matrix(apm, aqm, matrix_mdev);
-	else if (filter_adapters)
-		do_hotplug |=
-			vfio_ap_mdev_filter_matrix(apm,
-						   matrix_mdev->shadow_apcb.aqm,
-						   matrix_mdev);
-	else
-		do_hotplug |=
-			vfio_ap_mdev_filter_matrix(matrix_mdev->shadow_apcb.apm,
-						   aqm, matrix_mdev);
+	filter_adapters = bitmap_intersects(matrix_mdev->matrix.apm,
+					    matrix_mdev->apm_add, AP_DEVICES);
+	filter_domains = bitmap_intersects(matrix_mdev->matrix.aqm,
+					   matrix_mdev->aqm_add, AP_DOMAINS);
+	filter_cdoms = bitmap_intersects(matrix_mdev->matrix.adm,
+					 matrix_mdev->adm_add, AP_DOMAINS);
+
+	if (filter_adapters || filter_domains)
+		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
 
-	if (bitmap_intersects(matrix_mdev->matrix.adm, matrix_mdev->adm_add,
-			      AP_DOMAINS))
+	if (filter_cdoms)
 		do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
 
 	if (do_hotplug)
-- 
2.43.0


^ permalink raw reply related	[relevance 20%]

* Re: [PATCH v2 6/6] s390/vfio-ap: do not reset queue removed from host config
  2023-12-12 21:25  5% ` [PATCH v2 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
@ 2024-01-10 15:44  0%   ` Jason J. Herne
  0 siblings, 0 replies; 200+ results
From: Jason J. Herne @ 2024-01-10 15:44 UTC (permalink / raw)
  To: Tony Krowiak, linux-s390, linux-kernel, kvm
  Cc: borntraeger, pasic, pbonzini, frankja, imbrenda, alex.williamson,
	kwankhede, stable



On 12/12/23 4:25 PM, Tony Krowiak wrote:
> When a queue is unbound from the vfio_ap device driver, it is reset to
> ensure its crypto data is not leaked when it is bound to another device
> driver. If the queue is unbound due to the fact that the adapter or domain
> was removed from the host's AP configuration, then attempting to reset it
> will fail with response code 01 (APID not valid) getting returned from the
> reset command. Let's ensure that the queue is assigned to the host's
> configuration before resetting it.
> 
> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> Fixes: eeb386aeb5b7 ("s390/vfio-ap: handle config changed and scan complete notification")
> Cc: <stable@vger.kernel.org>
> ---
>   drivers/s390/crypto/vfio_ap_ops.c | 14 ++++++++++++--
>   1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index e014108067dc..84decb0d5c97 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -2197,6 +2197,8 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
>   	q = dev_get_drvdata(&apdev->device);
>   	get_update_locks_for_queue(q);
>   	matrix_mdev = q->matrix_mdev;
> +	apid = AP_QID_CARD(q->apqn);
> +	apqi = AP_QID_QUEUE(q->apqn);
>   
>   	if (matrix_mdev) {
>   		/* If the queue is assigned to the guest's AP configuration */
> @@ -2214,8 +2216,16 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
>   		}
>   	}
>   
> -	vfio_ap_mdev_reset_queue(q);
> -	flush_work(&q->reset_work);
> +	/*
> +	 * If the queue is not in the host's AP configuration, then resetting
> +	 * it will fail with response code 01, (APQN not valid); so, let's make
> +	 * sure it is in the host's config.
> +	 */
> +	if (test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm) &&
> +	    test_bit_inv(apqi, (unsigned long *)matrix_dev->info.aqm)) {
> +		vfio_ap_mdev_reset_queue(q);
> +		flush_work(&q->reset_work);
> +	}
>   
>   done:
>   	if (matrix_mdev)

Reviewed-by: Jason J. Herne <jjherne@linux.ibm.com>

^ permalink raw reply	[relevance 0%]

* [PATCH v7 15/16] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  @ 2024-01-08  8:27  4% ` Zhao Liu
  2024-01-14 14:42  0%   ` Xiaoyao Li
  0 siblings, 1 reply; 200+ results
From: Zhao Liu @ 2024-01-08  8:27 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Michael S . Tsirkin,
	Richard Henderson, Paolo Bonzini, Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Zhuocheng Ding, Zhao Liu,
	Babu Moger, Yongwei Ma

From: Zhao Liu <zhao1.liu@intel.com>

The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
the number of sharing threads directly.

From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

From the description above, the calculation of this field should be the
same as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
APIC ID to calculate this field.

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
     Information
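
As a quick illustration (not part of this patch), here is a minimal
user-level sketch of the Share Id derivation quoted above; the helper
name and the use of GCC builtins are assumptions made for the example:

#include <stdint.h>

/* ShareId = LocalApicId >> log2(NumSharingCache + 1), rounded up to a power of two */
static uint32_t share_id(uint32_t local_apic_id, uint32_t num_sharing_cache)
{
	uint32_t n = num_sharing_cache + 1;

	/* If NumSharingCache+1 is not a power of two, round it up. */
	if (n & (n - 1))
		n = 1U << (32 - __builtin_clz(n));

	return local_apic_id >> __builtin_ctz(n);	/* shift by log2(n) */
}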

Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
Changes since v3:
 * Rewrite the subject. (Babu)
 * Delete the original "comment/help" expression, as this behavior is
   confirmed for AMD CPUs. (Babu)
 * Rename "num_apic_ids" (v3) to "num_sharing_cache" to match spec
   definition. (Babu)

Changes since v1:
 * Rename "l3_threads" to "num_apic_ids" in
   encode_cache_cpuid8000001d(). (Yanan)
 * Add the description of the original commit and add Cc.
---
 target/i386/cpu.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index b23e8190dc68..8a4d72f6f760 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -483,7 +483,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
                                        uint32_t *eax, uint32_t *ebx,
                                        uint32_t *ecx, uint32_t *edx)
 {
-    uint32_t l3_threads;
+    uint32_t num_sharing_cache;
     assert(cache->size == cache->line_size * cache->associativity *
                           cache->partitions * cache->sets);
 
@@ -492,13 +492,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
 
     /* L3 is shared among multiple cores */
     if (cache->level == 3) {
-        l3_threads = topo_info->modules_per_die *
-                     topo_info->cores_per_module *
-                     topo_info->threads_per_core;
-        *eax |= (l3_threads - 1) << 14;
+        num_sharing_cache = 1 << apicid_die_offset(topo_info);
     } else {
-        *eax |= ((topo_info->threads_per_core - 1) << 14);
+        num_sharing_cache = 1 << apicid_core_offset(topo_info);
     }
+    *eax |= (num_sharing_cache - 1) << 14;
 
     assert(cache->line_size > 0);
     assert(cache->partitions > 0);
-- 
2.34.1


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH v1 04/26] x86/sev: Add the host SEV-SNP initialization support
  @ 2024-01-04 11:16  3%   ` Borislav Petkov
  0 siblings, 0 replies; 200+ results
From: Borislav Petkov @ 2024-01-04 11:16 UTC (permalink / raw)
  To: Michael Roth
  Cc: x86, kvm, linux-coco, linux-mm, linux-crypto, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, tobin, vbabka, kirill, ak,
	tony.luck, sathyanarayanan.kuppuswamy, alpergun, jarkko,
	ashish.kalra, nikunj.dadhania, pankaj.gupta, liam.merwick,
	zhi.a.wang, Brijesh Singh

On Sat, Dec 30, 2023 at 10:19:32AM -0600, Michael Roth wrote:
> From: Brijesh Singh <brijesh.singh@amd.com>
> 
> The memory integrity guarantees of SEV-SNP are enforced through a new
> structure called the Reverse Map Table (RMP). The RMP is a single data
> structure shared across the system that contains one entry for every 4K
> page of DRAM that may be used by SEV-SNP VMs. APM2 section 15.36 details
> a number of steps needed to detect/enable SEV-SNP and RMP table support
> on the host:
> 
>  - Detect SEV-SNP support based on CPUID bit
>  - Initialize the RMP table memory reported by the RMP base/end MSR
>    registers and configure IOMMU to be compatible with RMP access
>    restrictions
>  - Set the MtrrFixDramModEn bit in SYSCFG MSR
>  - Set the SecureNestedPagingEn and VMPLEn bits in the SYSCFG MSR
>  - Configure IOMMU
> 
> RMP table entry format is non-architectural and it can vary by
> processor. It is defined by the PPR. Restrict SNP support to CPU
> models/families which are compatible with the current RMP table entry
> format to guard against any undefined behavior when running on other
> system types. Future models/support will handle this through an
> architectural mechanism to allow for broader compatibility.
> 
> SNP host code depends on CONFIG_KVM_AMD_SEV config flag, which may be
> enabled even when CONFIG_AMD_MEM_ENCRYPT isn't set, so update the
> SNP-specific IOMMU helpers used here to rely on CONFIG_KVM_AMD_SEV
> instead of CONFIG_AMD_MEM_ENCRYPT.

Small fixups to the commit message:

    The memory integrity guarantees of SEV-SNP are enforced through a new
    structure called the Reverse Map Table (RMP). The RMP is a single data
    structure shared across the system that contains one entry for every 4K
    page of DRAM that may be used by SEV-SNP VMs. The APM v2 section on
    Secure Nested Paging (SEV-SNP) details a number of steps needed to
    detect/enable SEV-SNP and RMP table support on the host:
    
     - Detect SEV-SNP support based on CPUID bit
     - Initialize the RMP table memory reported by the RMP base/end MSR
       registers and configure IOMMU to be compatible with RMP access
       restrictions
     - Set the MtrrFixDramModEn bit in SYSCFG MSR
     - Set the SecureNestedPagingEn and VMPLEn bits in the SYSCFG MSR
     - Configure IOMMU
    
    The RMP table entry format is non-architectural and it can vary by
    processor. It is defined by the PPR document for each respective CPU
    family. Restrict SNP support to CPU models/families which are compatible
    with the current RMP table entry format to guard against any undefined
    behavior when running on other system types. Future models/support will
    handle this through an architectural mechanism to allow for broader
    compatibility.
    
    The SNP host code depends on CONFIG_KVM_AMD_SEV config flag which may
    be enabled even when CONFIG_AMD_MEM_ENCRYPT isn't set, so update the
    SNP-specific IOMMU helpers used here to rely on CONFIG_KVM_AMD_SEV
    instead of CONFIG_AMD_MEM_ENCRYPT.

> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index f1bd7b91b3c6..15ce1269f270 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -599,6 +599,8 @@
>  #define MSR_AMD64_SEV_ENABLED		BIT_ULL(MSR_AMD64_SEV_ENABLED_BIT)
>  #define MSR_AMD64_SEV_ES_ENABLED	BIT_ULL(MSR_AMD64_SEV_ES_ENABLED_BIT)
>  #define MSR_AMD64_SEV_SNP_ENABLED	BIT_ULL(MSR_AMD64_SEV_SNP_ENABLED_BIT)
> +#define MSR_AMD64_RMP_BASE		0xc0010132
> +#define MSR_AMD64_RMP_END		0xc0010133
>  
>  /* SNP feature bits enabled by the hypervisor */
>  #define MSR_AMD64_SNP_VTOM			BIT_ULL(3)
> @@ -709,7 +711,14 @@
>  #define MSR_K8_TOP_MEM2			0xc001001d
>  #define MSR_AMD64_SYSCFG		0xc0010010
>  #define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT	23
> -#define MSR_AMD64_SYSCFG_MEM_ENCRYPT	BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
> +#define MSR_AMD64_SYSCFG_MEM_ENCRYPT		BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
> +#define MSR_AMD64_SYSCFG_SNP_EN_BIT		24
> +#define MSR_AMD64_SYSCFG_SNP_EN		BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
> +#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT	25
> +#define MSR_AMD64_SYSCFG_SNP_VMPL_EN		BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
> +#define MSR_AMD64_SYSCFG_MFDM_BIT		19
> +#define MSR_AMD64_SYSCFG_MFDM			BIT_ULL(MSR_AMD64_SYSCFG_MFDM_BIT)
> +

Fix the vertical alignment:

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 15ce1269f270..f482bc6a5ae7 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -710,14 +710,14 @@
 #define MSR_K8_TOP_MEM1			0xc001001a
 #define MSR_K8_TOP_MEM2			0xc001001d
 #define MSR_AMD64_SYSCFG		0xc0010010
-#define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT	23
-#define MSR_AMD64_SYSCFG_MEM_ENCRYPT		BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
-#define MSR_AMD64_SYSCFG_SNP_EN_BIT		24
+#define MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT 23
+#define MSR_AMD64_SYSCFG_MEM_ENCRYPT	BIT_ULL(MSR_AMD64_SYSCFG_MEM_ENCRYPT_BIT)
+#define MSR_AMD64_SYSCFG_SNP_EN_BIT	24
 #define MSR_AMD64_SYSCFG_SNP_EN		BIT_ULL(MSR_AMD64_SYSCFG_SNP_EN_BIT)
-#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT	25
-#define MSR_AMD64_SYSCFG_SNP_VMPL_EN		BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
-#define MSR_AMD64_SYSCFG_MFDM_BIT		19
-#define MSR_AMD64_SYSCFG_MFDM			BIT_ULL(MSR_AMD64_SYSCFG_MFDM_BIT)
+#define MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT 25
+#define MSR_AMD64_SYSCFG_SNP_VMPL_EN	BIT_ULL(MSR_AMD64_SYSCFG_SNP_VMPL_EN_BIT)
+#define MSR_AMD64_SYSCFG_MFDM_BIT	19
+#define MSR_AMD64_SYSCFG_MFDM		BIT_ULL(MSR_AMD64_SYSCFG_MFDM_BIT)
 
 #define MSR_K8_INT_PENDING_MSG		0xc0010055
 /* C1E active bits in int pending message */

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[relevance 3%]

* [PATCH v11 24/35] KVM: SEV: Add support to handle RMP nested page faults
    2023-12-30 17:23  3% ` [PATCH v11 21/35] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
  2023-12-30 17:23  4% ` [PATCH v11 22/35] KVM: SEV: Add support to handle " Michael Roth
@ 2023-12-30 17:23  2% ` Michael Roth
  2 siblings, 0 replies; 200+ results
From: Michael Roth @ 2023-12-30 17:23 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, zhi.a.wang, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When
hardware encounters RMP check failure caused by the guest memory access
it raises the #NPF. The error code contains additional information on
the access type. See the APM volume 2 for additional information.

When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/sev.h |  3 ++
 arch/x86/kvm/svm/sev.c     | 92 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c     | 21 +++++++--
 arch/x86/kvm/svm/svm.h     |  1 +
 4 files changed, 113 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 435ba9bc4510..e84dd1d2d8ab 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -90,6 +90,9 @@ extern bool handle_vc_boot_ghcb(struct pt_regs *regs);
 /* RMUPDATE detected 4K page and 2MB page overlap. */
 #define RMPUPDATE_FAIL_OVERLAP		4
 
+/* PSMASH failed due to concurrent access by another CPU */
+#define PSMASH_FAIL_INUSE		3
+
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
 #define RMP_PG_SIZE_2M			1
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 8b6143110411..ad1aea7f6266 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3276,6 +3276,13 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(kvm_pfn_t pfn)
+{
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	return psmash(pfn);
+}
+
 static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -3835,3 +3842,88 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return p;
 }
+
+void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = vcpu->kvm;
+	int order, rmp_level, ret;
+	bool assigned;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	gfn = gpa >> PAGE_SHIFT;
+
+	/*
+	 * The only time RMP faults occur for shared pages is when the guest is
+	 * triggering an RMP fault for an implicit page-state change from
+	 * shared->private. Implicit page-state changes are forwarded to
+	 * userspace via KVM_EXIT_MEMORY_FAULT events, however, so RMP faults
+	 * for shared pages should not end up here.
+	 */
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, size-mismatch for non-private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);
+	if (ret) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no private backing page for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+	if (ret || !assigned) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
+				    gpa, pfn, ret);
+		goto out;
+	}
+
+	/*
+	 * There are 2 cases where a PSMASH may be needed to resolve an #NPF
+	 * with PFERR_GUEST_RMP_BIT set:
+	 *
+	 * 1) RMPADJUST/PVALIDATE can trigger an #NPF with PFERR_GUEST_SIZEM
+	 *    bit set if the guest issues them with a smaller granularity than
+	 *    what is indicated by the page-size bit in the 2MB-aligned RMP
+	 *    entry for the PFN that backs the GPA.
+	 *
+	 * 2) Guest access via NPT can trigger an #NPF if the NPT mapping is
+	 *    smaller than what is indicated by the 2MB-aligned RMP entry for
+	 *    the PFN that backs the GPA.
+	 *
+	 * In both these cases, the corresponding 2M RMP entry needs to
+	 * be PSMASH'd to 512 4K RMP entries.  If the RMP entry is already
+	 * split into 4K RMP entries, then this is likely a spurious case which
+	 * can occur when there are concurrent accesses by the guest to a 2MB
+	 * GPA range that is backed by a 2MB-aligned PFN whose RMP entry is in
+	 * the process of being PSMASH'd into 4K entries. These cases should
+	 * resolve automatically on subsequent accesses, so just ignore them
+	 * here.
+	 */
+	if (rmp_level == PG_LEVEL_4K) {
+		pr_debug_ratelimited("%s: Spurious RMP fault for GPA 0x%llx, error_code 0x%llx",
+				     __func__, gpa, error_code);
+		goto out;
+	}
+
+	pr_debug_ratelimited("%s: Splitting 2M RMP entry for GPA 0x%llx, error_code 0x%llx",
+			     __func__, gpa, error_code);
+	ret = snp_rmptable_psmash(pfn);
+	if (ret && ret != PSMASH_FAIL_INUSE)
+		pr_err_ratelimited("SEV: Unable to split RMP entry for GPA 0x%llx PFN 0x%llx ret %d\n",
+				   gpa, pfn, ret);
+
+	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+out:
+	put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 18d55df7fa5f..4367da074612 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2051,15 +2051,28 @@ static int pf_interception(struct kvm_vcpu *vcpu)
 static int npf_interception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	int rc;
 
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
 
 	trace_kvm_page_fault(vcpu, fault_address, error_code);
-	return kvm_mmu_page_fault(vcpu, fault_address, error_code,
-			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
-			svm->vmcb->control.insn_bytes : NULL,
-			svm->vmcb->control.insn_len);
+	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+				svm->vmcb->control.insn_bytes : NULL,
+				svm->vmcb->control.insn_len);
+
+	/*
+	 * rc == 0 indicates a userspace exit is needed to handle page
+	 * transitions, so do that first before updating the RMP table.
+	 */
+	if (error_code & PFERR_GUEST_RMP_MASK) {
+		if (rc == 0)
+			return rc;
+		handle_rmp_page_fault(vcpu, fault_address, error_code);
+	}
+
+	return rc;
 }
 
 static int db_interception(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 2bee24017bae..fb98d88d8124 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -717,6 +717,7 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
 void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
 void sev_es_unmap_ghcb(struct vcpu_svm *svm);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
+void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 
 /* vmenter.S */
 
-- 
2.25.1


^ permalink raw reply related	[relevance 2%]

* [PATCH v11 22/35] KVM: SEV: Add support to handle Page State Change VMGEXIT
    2023-12-30 17:23  3% ` [PATCH v11 21/35] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
@ 2023-12-30 17:23  4% ` Michael Roth
  2023-12-30 17:23  2% ` [PATCH v11 24/35] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  2 siblings, 0 replies; 200+ results
From: Michael Roth @ 2023-12-30 17:23 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, zhi.a.wang, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change NAE event
as defined in the GHCB specification version 2.

Forward these requests to userspace as KVM_EXIT_VMGEXITs, similar to how
it is done for requests that don't use a GHCB page.
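
Not part of this patch, but for illustration, the userspace side could
walk the shared page roughly as sketched below. The psc_hdr/psc_entry
layouts are paraphrased from the GHCB spec, and set_private_shared() is
a hypothetical helper that issues the KVM_SET_MEMORY_ATTRIBUTES ioctls
mentioned below:

#include <stdint.h>

struct psc_hdr {
	uint16_t cur_entry;
	uint16_t end_entry;
	uint32_t reserved;
} __attribute__((packed));

struct psc_entry {
	uint64_t cur_page  : 12;
	uint64_t gfn       : 40;
	uint64_t operation : 4;
	uint64_t pagesize  : 1;
	uint64_t reserved  : 7;
} __attribute__((packed));

/* 'buf' is the VMM's mapping of run->vmgexit.psc.shared_gpa. */
static uint64_t handle_psc(int vm_fd, void *buf)
{
	struct psc_hdr *hdr = buf;
	struct psc_entry *e = (struct psc_entry *)(hdr + 1);
	uint16_t i;

	for (i = hdr->cur_entry; i <= hdr->end_entry; i++, hdr->cur_entry++) {
		uint64_t gpa = (uint64_t)e[i].gfn << 12;
		uint64_t len = e[i].pagesize ? 0x200000 : 0x1000;

		/* Flip the GPA range private<->shared as requested by the guest. */
		if (set_private_shared(vm_fd, gpa, len, e[i].operation))
			return 1;	/* placeholder; real code returns a GHCB-defined error */
	}

	return 0;	/* success */
}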

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 Documentation/virt/kvm/api.rst | 14 ++++++++++++++
 arch/x86/kvm/svm/sev.c         | 16 ++++++++++++++++
 include/uapi/linux/kvm.h       |  5 +++++
 3 files changed, 35 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 682490230feb..2a526b4f8e06 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7036,6 +7036,7 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 		/* KVM_EXIT_VMGEXIT */
 		struct kvm_user_vmgexit {
 		#define KVM_USER_VMGEXIT_PSC_MSR	1
+		#define KVM_USER_VMGEXIT_PSC		2
 			__u32 type; /* KVM_USER_VMGEXIT_* type */
 			union {
 				struct {
@@ -7045,9 +7046,14 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 					__u8 op;
 					__u32 ret;
 				} psc_msr;
+				struct {
+					__u64 shared_gpa;
+					__u64 ret;
+				} psc;
 			};
 		};
 
+
 If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
 has issued a VMGEXIT instruction (as documented by the AMD Architecture
 Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
@@ -7065,6 +7071,14 @@ update the private/shared state of the GPA using the corresponding
 KVM_SET_MEMORY_ATTRIBUTES ioctl. The 'ret' field is to be set to 0 by
 userspace on success, or some non-zero value on failure.
 
+For the KVM_USER_VMGEXIT_PSC type, the psc union type is used. The kernel
+will supply the GPA of the Page State Structure defined in the GHCB spec.
+Userspace will process this structure as defined by the GHCB, and issue
+KVM_SET_MEMORY_ATTRIBUTES ioctls to set the GPAs therein to the expected
+private/shared state. Userspace will return a value in 'ret' that is in
+agreement with the GHCB-defined return values that the guest will expect
+in the SW_EXITINFO2 field of the GHCB in response to these requests.
+
 6. Capabilities that can be enabled on vCPUs
 ============================================
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 37e65d5700b8..8b6143110411 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3087,6 +3087,7 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 	case SVM_VMGEXIT_AP_JUMP_TABLE:
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 	case SVM_VMGEXIT_HV_FEATURES:
+	case SVM_VMGEXIT_PSC:
 		break;
 	default:
 		reason = GHCB_ERR_INVALID_EVENT;
@@ -3305,6 +3306,15 @@ static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
 	return 0; /* forward request to userspace */
 }
 
+static int snp_complete_psc(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, vcpu->run->vmgexit.psc.ret);
+
+	return 1; /* resume guest */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3542,6 +3552,12 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 
 		ret = 1;
 		break;
+	case SVM_VMGEXIT_PSC:
+		vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+		vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC;
+		vcpu->run->vmgexit.psc.shared_gpa = svm->sev_es.sw_scratch;
+		vcpu->arch.complete_userspace_io = snp_complete_psc;
+		break;
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 62093ddf7ec3..e0599144387b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -169,6 +169,7 @@ struct kvm_xen_exit {
 
 struct kvm_user_vmgexit {
 #define KVM_USER_VMGEXIT_PSC_MSR	1
+#define KVM_USER_VMGEXIT_PSC		2
 	__u32 type; /* KVM_USER_VMGEXIT_* type */
 	union {
 		struct {
@@ -178,6 +179,10 @@ struct kvm_user_vmgexit {
 			__u8 op;
 			__u32 ret;
 		} psc_msr;
+		struct {
+			__u64 shared_gpa;
+			__u64 ret;
+		} psc;
 	};
 };
 
-- 
2.25.1


^ permalink raw reply related	[relevance 4%]

* [PATCH v11 21/35] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT
  @ 2023-12-30 17:23  3% ` Michael Roth
  2023-12-30 17:23  4% ` [PATCH v11 22/35] KVM: SEV: Add support to handle " Michael Roth
  2023-12-30 17:23  2% ` [PATCH v11 24/35] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  2 siblings, 0 replies; 200+ results
From: Michael Roth @ 2023-12-30 17:23 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, zhi.a.wang, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change MSR protocol
as defined in the GHCB specification.

When using gmem, private/shared memory is allocated through separate
pools, and KVM relies on userspace issuing a KVM_SET_MEMORY_ATTRIBUTES
KVM ioctl to tell the KVM MMU whether or not a particular GFN should be
backed by private memory.

Forward these page state change requests to userspace so that it can
issue the expected KVM ioctls. The KVM MMU will handle updating the RMP
entries when it is ready to map a private page into a guest.

Define a new KVM_EXIT_VMGEXIT for exits of this type, and structure it
so that it can be extended for other cases where VMGEXITs need some
level of handling in userspace.
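
Not part of this patch, but to illustrate the intended flow, a VMM
handler for this exit might look roughly like the sketch below; 'vm_fd',
the 4K granularity and the failure value are assumptions made for the
example:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: services a KVM_EXIT_VMGEXIT of type KVM_USER_VMGEXIT_PSC_MSR. */
static void handle_psc_msr(int vm_fd, struct kvm_run *run)
{
	struct kvm_memory_attributes attrs = {
		.address    = run->vmgexit.psc_msr.gpa,
		.size       = 0x1000,
		.attributes = run->vmgexit.psc_msr.op == KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE ?
			      KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
	};

	/* 0 reports success back to the guest, non-zero reports failure. */
	run->vmgexit.psc_msr.ret = ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) ? 1 : 0;
}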

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 Documentation/virt/kvm/api.rst    | 33 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/sev-common.h |  6 ++++++
 arch/x86/kvm/svm/sev.c            | 33 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h          | 17 ++++++++++++++++
 4 files changed, 89 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 3ec0b7a455a0..682490230feb 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7031,6 +7031,39 @@ Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
+::
+
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit {
+		#define KVM_USER_VMGEXIT_PSC_MSR	1
+			__u32 type; /* KVM_USER_VMGEXIT_* type */
+			union {
+				struct {
+					__u64 gpa;
+		#define KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE	1
+		#define KVM_USER_VMGEXIT_PSC_MSR_OP_SHARED	2
+					__u8 op;
+					__u32 ret;
+				} psc_msr;
+			};
+		};
+
+If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
+has issued a VMGEXIT instruction (as documented by the AMD Architecture
+Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
+userspace. These are generally handled by the host kernel, but in some
+cases some aspects of handling a VMGEXIT are handled by userspace.
+
+A kvm_user_vmgexit structure is defined to encapsulate the data to be
+sent to or returned by userspace. The type field defines the specific type
+of exit that needs to be serviced, and that type is used as a discriminator
+to determine which union type should be used for input/output.
+
+For the KVM_USER_VMGEXIT_PSC_MSR type, the psc_msr union type is used. The
+kernel will supply the 'gpa' and 'op' fields, and userspace is expected to
+update the private/shared state of the GPA using the corresponding
+KVM_SET_MEMORY_ATTRIBUTES ioctl. The 'ret' field is to be set to 0 by
+userspace on success, or some non-zero value on failure.
 
 6. Capabilities that can be enabled on vCPUs
 ============================================
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 1006bfffe07a..6d68db812de1 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -101,11 +101,17 @@ enum psc_op {
 	/* GHCBData[11:0] */				\
 	GHCB_MSR_PSC_REQ)
 
+#define GHCB_MSR_PSC_REQ_TO_GFN(msr) (((msr) & GENMASK_ULL(51, 12)) >> 12)
+#define GHCB_MSR_PSC_REQ_TO_OP(msr) (((msr) & GENMASK_ULL(55, 52)) >> 52)
+
 #define GHCB_MSR_PSC_RESP		0x015
 #define GHCB_MSR_PSC_RESP_VAL(val)			\
 	/* GHCBData[63:32] */				\
 	(((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
 
+/* Set highest bit as a generic error response */
+#define GHCB_MSR_PSC_RESP_ERROR (BIT_ULL(63) | GHCB_MSR_PSC_RESP)
+
 /* GHCB Hypervisor Feature Request/Response */
 #define GHCB_MSR_HV_FT_REQ		0x080
 #define GHCB_MSR_HV_FT_RESP		0x081
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 0b8837e21705..37e65d5700b8 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3275,6 +3275,36 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	u64 vmm_ret = vcpu->run->vmgexit.psc_msr.ret;
+
+	set_ghcb_msr(svm, (vmm_ret << 32) | GHCB_MSR_PSC_RESP);
+
+	return 1; /* resume guest */
+}
+
+static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
+{
+	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
+	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
+		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
+		return 1; /* resume guest */
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
+	vcpu->run->vmgexit.psc_msr.gpa = gpa;
+	vcpu->run->vmgexit.psc_msr.op = op;
+	vcpu->arch.complete_userspace_io = snp_complete_psc_msr;
+
+	return 0; /* forward request to userspace */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3373,6 +3403,9 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				  GHCB_MSR_INFO_POS);
 		break;
 	}
+	case GHCB_MSR_PSC_REQ:
+		ret = snp_begin_psc_msr(vcpu, control->ghcb_gpa);
+		break;
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5218075fe1f4..62093ddf7ec3 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -167,6 +167,20 @@ struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_user_vmgexit {
+#define KVM_USER_VMGEXIT_PSC_MSR	1
+	__u32 type; /* KVM_USER_VMGEXIT_* type */
+	union {
+		struct {
+			__u64 gpa;
+#define KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE	1
+#define KVM_USER_VMGEXIT_PSC_MSR_OP_SHARED	2
+			__u8 op;
+			__u32 ret;
+		} psc_msr;
+	};
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -210,6 +224,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_VMGEXIT          40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -470,6 +485,8 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit vmgexit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1


^ permalink raw reply related	[relevance 3%]

* [PATCH v7 00/16] Add Secure TSC support for SNP guests
@ 2023-12-20 15:13  2% Nikunj A Dadhania
  2024-01-25  6:08  0% ` Nikunj A. Dadhania
  0 siblings, 1 reply; 200+ results
From: Nikunj A Dadhania @ 2023-12-20 15:13 UTC (permalink / raw)
  To: linux-kernel, thomas.lendacky, x86, kvm
  Cc: bp, mingo, tglx, dave.hansen, dionnaglaze, pgonda, seanjc,
	pbonzini, nikunj

Secure TSC allows guests to securely use RDTSC/RDTSCP instructions as the
parameters being used cannot be changed by hypervisor once the guest is
launched. More details in the AMD64 APM Vol 2, Section "Secure TSC".

During the boot-up of the secondary cpus, SecureTSC enabled guests need to
query TSC info from AMD Security Processor. This communication channel is
encrypted between the AMD Security Processor and the guest, the hypervisor
is just the conduit to deliver the guest messages to the AMD Security
Processor. Each message is protected with an AEAD (AES-256 GCM). See "SEV
Secure Nested Paging Firmware ABI Specification" document (currently at
https://www.amd.com/system/files/TechDocs/56860.pdf) section "TSC Info"

Use a minimal GCM library to encrypt/decrypt SNP Guest messages to
communicate with the AMD Security Processor which is available at early
boot.
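
For illustration (not part of this series), a minimal sketch of sealing
one guest message, assuming the minimal GCM library referred to above is
the lib/crypto AES-GCM helper behind <crypto/gcm.h> and that the message
sequence number is used as the IV:

#include <crypto/gcm.h>
#include <linux/string.h>

/* 'ctx' was set up earlier with aesgcm_expandkey(ctx, vmpck, 32, 16). */
static void seal_guest_msg(struct aesgcm_ctx *ctx, u8 *dst, const u8 *payload,
			   int len, const u8 *hdr, int hdr_len,
			   u64 msg_seqno, u8 *authtag)
{
	u8 iv[GCM_AES_IV_SIZE] = {};

	/* The monotonically increasing message sequence number doubles as the IV. */
	memcpy(iv, &msg_seqno, sizeof(msg_seqno));

	/* The message header is authenticated (AAD) but not encrypted. */
	aesgcm_encrypt(ctx, dst, payload, len, hdr, hdr_len, iv, authtag);
}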

SEV-guest driver has the implementation for guest and AMD Security
Processor communication. As the TSC_INFO needs to be initialized during
early boot before smp cpus are started, move most of the sev-guest driver
code to kernel/sev.c and provide well defined APIs to the sev-guest driver
to use the interface to avoid code-duplication.

Patches:
01-08: Preparation and movement of sev-guest driver code
09-16: SecureTSC enablement patches.

Testing SecureTSC
-----------------

SecureTSC hypervisor patches based on top of SEV-SNP Guest MEMFD series:
https://github.com/nikunjad/linux/tree/snp-host-latest-securetsc_v5

QEMU changes:
https://github.com/nikunjad/qemu/tree/snp_securetsc_v5

QEMU commandline SEV-SNP-UPM with SecureTSC:

  qemu-system-x86_64 -cpu EPYC-Milan-v2,+secure-tsc,+invtsc -smp 4 \
    -object memory-backend-memfd-private,id=ram1,size=1G,share=true \
    -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,secure-tsc=on \
    -machine q35,confidential-guest-support=sev0,memory-backend=ram1,kvm-type=snp \
    ...

Changelog:
----------
v7:
* Drop mutex from the snp_dev and add snp_guest_cmd_{lock,unlock} API
* Added comments for secrets page failure
* Added define for maximum supported VMPCK
* Updated comments why sev_status is used directly instead of
  cpu_feature_enabled()

v6:
* Add synthetic SecureTSC x86 feature bit
* Drop {__enc,dec}_payload() as they are pretty small and has one caller.
* Instead of a pointer use data_npages as variable
* Beautify struct snp_guest_req
* Make vmpck_id as unsigned int in snp_assign_vmpck()
* Move most of the functions to end of sev.c file
* Update commit/comments/error messages
* Mark free_shared_pages and alloc_shared_pages as inline
* Free snp_dev->certs_data when guest driver is removed
* Add lockdep assert in snp_inc_msg_seqno()
* Drop redundant enc_init NULL check
* Move SNP_TSC_INFO_REQ_SZ define out of structure
* Rename guest_tsc_{scale,offset} to snp_tsc_{scale,offset}
* Add new linux termination error code GHCB_TERM_SECURE_TSC
* Initialize and use cmd_mutex in snp_get_tsc_info()
* Set TSC as reliable in sme_early_init()
* Do not print firmware bug for Secure TSC enabled guests

https://lore.kernel.org/lkml/20231128125959.1810039-1-nikunj@amd.com/

v5:
* Rebased on v6.6 kernel
* Dropped link tag in first patch
* Dropped get_ctx_authsize() as it was redundant

https://lore.kernel.org/lkml/20231030063652.68675-1-nikunj@amd.com/

v4:
* Drop handle_guest_request() and handle_guest_request_ext()
* Drop NULL check for key
* Corrected commit subject
* Added Reviewed-by from Tom

https://lore.kernel.org/lkml/20230814055222.1056404-1-nikunj@amd.com/

v3:
* Updated commit messages
* Made snp_setup_psp_messaging() generic that is accessed by both the
  kernel and the driver
* Moved most of the context information to sev.c, sev-guest driver
  does not need to know the secrets page layout anymore
* Add CC_ATTR_GUEST_SECURE_TSC early in the series therefore it can be
  used in later patches.
* Removed data_gpa and data_npages from struct snp_req_data, as certs_data
  and its size is passed to handle_guest_request_ext()
* Make vmpck_id as unsigned int
* Dropped unnecessary usage of memzero_explicit()
* Cache secrets_pa instead of remapping the cc_blob always
* Rebase on top of v6.4 kernel
https://lore.kernel.org/lkml/20230722111909.15166-1-nikunj@amd.com/

v2:
* Rebased on top of v6.3-rc3 that has Boris's sev-guest cleanup series
  https://lore.kernel.org/r/20230307192449.24732-1-bp@alien8.de/

v1: https://lore.kernel.org/r/20230130120327.977460-1-nikunj@amd.com/

Nikunj A Dadhania (16):
  virt: sev-guest: Use AES GCM crypto library
  virt: sev-guest: Replace dev_dbg with pr_debug
  virt: sev-guest: Add SNP guest request structure
  virt: sev-guest: Add vmpck_id to snp_guest_dev struct
  x86/sev: Cache the secrets page address
  virt: sev-guest: Move SNP Guest command mutex
  x86/sev: Move and reorganize sev guest request api
  x86/mm: Add generic guest initialization hook
  x86/cpufeatures: Add synthetic Secure TSC bit
  x86/sev: Add Secure TSC support for SNP guests
  x86/sev: Change TSC MSR behavior for Secure TSC enabled guests
  x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled
    guests
  x86/kvmclock: Skip kvmclock when Secure TSC is available
  x86/sev: Mark Secure TSC as reliable
  x86/cpu/amd: Do not print FW_BUG for Secure TSC
  x86/sev: Enable Secure TSC for SNP guests

 arch/x86/Kconfig                        |   1 +
 arch/x86/boot/compressed/sev.c          |   3 +-
 arch/x86/include/asm/cpufeatures.h      |   1 +
 arch/x86/include/asm/sev-common.h       |   1 +
 arch/x86/include/asm/sev-guest.h        | 190 +++++++
 arch/x86/include/asm/sev.h              |  21 +-
 arch/x86/include/asm/svm.h              |   6 +-
 arch/x86/include/asm/x86_init.h         |   2 +
 arch/x86/kernel/cpu/amd.c               |   3 +-
 arch/x86/kernel/kvmclock.c              |   2 +-
 arch/x86/kernel/sev-shared.c            |  10 +
 arch/x86/kernel/sev.c                   | 649 +++++++++++++++++++++--
 arch/x86/kernel/x86_init.c              |   2 +
 arch/x86/mm/mem_encrypt.c               |  12 +-
 arch/x86/mm/mem_encrypt_amd.c           |  12 +
 drivers/virt/coco/sev-guest/Kconfig     |   3 -
 drivers/virt/coco/sev-guest/sev-guest.c | 668 +++---------------------
 drivers/virt/coco/sev-guest/sev-guest.h |  63 ---
 18 files changed, 917 insertions(+), 732 deletions(-)
 create mode 100644 arch/x86/include/asm/sev-guest.h
 delete mode 100644 drivers/virt/coco/sev-guest/sev-guest.h


base-commit: ceb6a6f023fd3e8b07761ed900352ef574010bcb
-- 
2.34.1


^ permalink raw reply	[relevance 2%]

* [PATCH v2 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver
                     ` (2 preceding siblings ...)
  2023-12-12 21:25 13% ` [PATCH v2 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
@ 2023-12-12 21:25  5% ` Tony Krowiak
  2023-12-12 21:25  5% ` [PATCH v2 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-12-12 21:25 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

When a queue is unbound from the vfio_ap device driver, if that queue is
assigned to a guest's AP configuration, its associated adapter is removed
because queues are defined to a guest via a matrix of adapters and
domains; so, it is not possible to remove a single queue.

If an adapter is removed from the guest's AP configuration, all associated
queues must be reset to prevent leaking crypto data should any of them be
assigned to a different guest or device driver. The one caveat is that if
the queue is being removed because the adapter or domain has been removed
from the host's AP configuration, then an attempt to reset the queue will
fail with response code 01, AP-queue number not valid; so resetting these
queues should be skipped.

Acked-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Fixes: 09d31ff78793 ("s390/vfio-ap: hot plug/unplug of AP devices when probed/removed")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 78 ++++++++++++++++---------------
 1 file changed, 41 insertions(+), 37 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 11f8f0bcc7ed..e014108067dc 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -935,45 +935,45 @@ static void vfio_ap_mdev_link_adapter(struct ap_matrix_mdev *matrix_mdev,
 				       AP_MKQID(apid, apqi));
 }
 
+static void collect_queues_to_reset(struct ap_matrix_mdev *matrix_mdev,
+				    unsigned long apid,
+				    struct list_head *qlist)
+{
+	struct vfio_ap_queue *q;
+	unsigned long  apqi;
+
+	for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm, AP_DOMAINS) {
+		q = vfio_ap_mdev_get_queue(matrix_mdev, AP_MKQID(apid, apqi));
+		if (q)
+			list_add_tail(&q->reset_qnode, qlist);
+	}
+}
+
+static void reset_queues_for_apid(struct ap_matrix_mdev *matrix_mdev,
+				  unsigned long apid)
+{
+	struct list_head qlist;
+
+	INIT_LIST_HEAD(&qlist);
+	collect_queues_to_reset(matrix_mdev, apid, &qlist);
+	vfio_ap_mdev_reset_qlist(&qlist);
+}
+
 static int reset_queues_for_apids(struct ap_matrix_mdev *matrix_mdev,
 				  unsigned long *apm_reset)
 {
-	struct vfio_ap_queue *q, *tmpq;
 	struct list_head qlist;
-	unsigned long apid, apqi;
-	int apqn, ret = 0;
+	unsigned long apid;
 
 	if (bitmap_empty(apm_reset, AP_DEVICES))
 		return 0;
 
 	INIT_LIST_HEAD(&qlist);
 
-	for_each_set_bit_inv(apid, apm_reset, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
-				     AP_DOMAINS) {
-			/*
-			 * If the domain is not in the host's AP configuration,
-			 * then resetting it will fail with response code 01
-			 * (APQN not valid).
-			 */
-			if (!test_bit_inv(apqi,
-					  (unsigned long *)matrix_dev->info.aqm))
-				continue;
-
-			apqn = AP_MKQID(apid, apqi);
-			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
-
-			if (q)
-				list_add_tail(&q->reset_qnode, &qlist);
-		}
-	}
+	for_each_set_bit_inv(apid, apm_reset, AP_DEVICES)
+		collect_queues_to_reset(matrix_mdev, apid, &qlist);
 
-	ret = vfio_ap_mdev_reset_qlist(&qlist);
-
-	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode)
-		list_del(&q->reset_qnode);
-
-	return ret;
+	return vfio_ap_mdev_reset_qlist(&qlist);
 }
 
 /**
@@ -2199,24 +2199,28 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 	matrix_mdev = q->matrix_mdev;
 
 	if (matrix_mdev) {
-		vfio_ap_unlink_queue_fr_mdev(q);
-
-		apid = AP_QID_CARD(q->apqn);
-		apqi = AP_QID_QUEUE(q->apqn);
-
-		/*
-		 * If the queue is assigned to the guest's APCB, then remove
-		 * the adapter's APID from the APCB and hot it into the guest.
-		 */
+		/* If the queue is assigned to the guest's AP configuration */
 		if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 		    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
+			/*
+			 * Since the queues are defined via a matrix of adapters
+			 * and domains, it is not possible to hot unplug a
+			 * single queue; so, let's unplug the adapter.
+			 */
 			clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+			reset_queues_for_apid(matrix_mdev, apid);
+			goto done;
 		}
 	}
 
 	vfio_ap_mdev_reset_queue(q);
 	flush_work(&q->reset_work);
+
+done:
+	if (matrix_mdev)
+		vfio_ap_unlink_queue_fr_mdev(q);
+
 	dev_set_drvdata(&apdev->device, NULL);
 	kfree(q);
 	release_update_locks_for_mdev(matrix_mdev);
-- 
2.43.0


^ permalink raw reply related	[relevance 5%]

* [PATCH v2 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration
    2023-12-12 21:25 20% ` [PATCH v2 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
@ 2023-12-12 21:25 14% ` Tony Krowiak
  2023-12-12 21:25 13% ` [PATCH v2 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-12-12 21:25 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

While filtering the mdev matrix, it doesn't make sense - and will have
unexpected results - to filter an APID from the matrix if the APID or one
of the associated APQIs is not in the host's AP configuration. There are
two reasons for this:

1. An adapter or domain that is not in the host's AP configuration can be
   assigned to the matrix; this is known as over-provisioning. Queue
   devices, however, are only created for adapters and domains in the
   host's AP configuration, so there will be no queues associated with an
   over-provisioned adapter or domain to filter.

2. The adapter or domain may have been externally removed from the host's
   configuration via an SE or HMC attached to a DPM enabled LPAR. In this
   case, the vfio_ap device driver would have been notified by the AP bus
   via the on_config_changed callback and the adapter or domain would
   have already been filtered.

Since the matrix_mdev->shadow_apcb.apm and matrix_mdev->shadow_apcb.aqm are
copied from the mdev matrix sans the APIDs and APQIs not in the host's AP
configuration, let's loop over those bitmaps instead of those assigned to
the matrix.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
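Aside: a minimal userspace sketch of the intersection idea described above.
This is not driver code; 64-bit words stand in for the 256-bit APM masks,
the adapter numbers are made up, and normal bit order is used instead of
the inverted (*_inv) bit helpers the driver uses.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* bit n set => adapter n assigned to the mdev / present on the host */
	uint64_t mdev_apm = (1ULL << 0x14) | (1ULL << 0x16) | (1ULL << 0x20);
	uint64_t host_apm = (1ULL << 0x14) | (1ULL << 0x16);

	/* shadow mask: only what is both assigned and in the host config */
	uint64_t shadow_apm = mdev_apm & host_apm;

	/* the over-provisioned adapter 0x20 never shows up in this loop */
	for (int apid = 0; apid < 64; apid++)
		if (shadow_apm & (1ULL << apid))
			printf("inspect adapter %02x\n", apid);

	return 0;
}
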
 drivers/s390/crypto/vfio_ap_ops.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 9382b32e5bd1..47232e19a50e 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -691,8 +691,9 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 	bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
 		   (unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
 
-	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
+	for_each_set_bit_inv(apid, matrix_mdev->shadow_apcb.apm, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+				     AP_DOMAINS) {
 			/*
 			 * If the APQN is not bound to the vfio_ap device
 			 * driver, then we can't assign it to the guest's
-- 
2.43.0


^ permalink raw reply related	[relevance 14%]

* [PATCH v2 1/6] s390/vfio-ap: always filter entire AP matrix
  @ 2023-12-12 21:25 20% ` Tony Krowiak
  2023-12-12 21:25 14% ` [PATCH v2 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-12-12 21:25 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

The vfio_ap_mdev_filter_matrix function is called whenever a new adapter or
domain is assigned to the mdev. The purpose of the function is to update
the guest's AP configuration by filtering the matrix of adapters and
domains assigned to the mdev. When an adapter or domain is assigned, only
the APQNs associated with the APID of the new adapter or APQI of the new
domain are inspected. If an APQN does not reference a queue device bound to
the vfio_ap device driver, then its APID will be filtered from the mdev's
matrix when updating the guest's AP configuration.

Inspecting only the APID of the new adapter or APQI of the new domain will
result in passing AP queues through to a guest that are not bound to the
vfio_ap device driver under certain circumstances. Consider the following:

guest's AP configuration (all also assigned to the mdev's matrix):
14.0004
14.0005
14.0006
16.0004
16.0005
16.0006

unassign domain 4
unbind queue 16.0005
assign domain 4

When domain 4 is re-assigned, since only domain 4 will be inspected, the
APQNs that will be examined will be:
14.0004
16.0004

Since both of those APQNs reference queue devices that are bound to the
vfio_ap device driver, nothing will get filtered from the mdev's matrix
when updating the guest's AP configuration. Consequently, queue 16.0005
will get passed through despite not being bound to the driver. This
violates the linux device model requirement that a guest shall only be
given access to devices bound to the device driver facilitating their
pass-through.

To resolve this problem, every adapter and domain assigned to the mdev will
be inspected when filtering the mdev's matrix.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
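Aside: a small standalone sketch of the scenario described above. This is
not driver code; the APQN() macro and the bound() helper are assumptions
made for the example.

#include <stdbool.h>
#include <stdio.h>

#define APQN(apid, apqi) (((apid) << 8) | (apqi))

/* assumed helper: does this APQN reference a queue bound to vfio_ap? */
static bool bound(int apqn)
{
	return apqn != APQN(0x16, 0x05);	/* queue 16.0005 was unbound */
}

int main(void)
{
	int apids[] = { 0x14, 0x16 };
	int apqis[] = { 0x04, 0x05, 0x06 };

	/* inspect every assigned adapter against every assigned domain */
	for (int i = 0; i < 2; i++) {
		bool keep = true;

		for (int j = 0; j < 3; j++)
			if (!bound(APQN(apids[i], apqis[j])))
				keep = false;	/* one unbound queue filters the APID */

		/* inspecting only the re-assigned domain 4 would have kept APID 16 */
		printf("APID %02x %s\n", apids[i], keep ? "plugged" : "filtered");
	}

	return 0;
}
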
 drivers/s390/crypto/vfio_ap_ops.c | 57 +++++++++----------------------
 1 file changed, 17 insertions(+), 40 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 4db538a55192..9382b32e5bd1 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -670,8 +670,7 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
  * Return: a boolean value indicating whether the KVM guest's APCB was changed
  *	   by the filtering or not.
  */
-static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
-				       struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 {
 	unsigned long apid, apqi, apqn;
 	DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -692,8 +691,8 @@ static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
 	bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
 		   (unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
 
-	for_each_set_bit_inv(apid, apm, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, aqm, AP_DOMAINS) {
+	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
 			/*
 			 * If the APQN is not bound to the vfio_ap device
 			 * driver, then we can't assign it to the guest's
@@ -958,7 +957,6 @@ static ssize_t assign_adapter_store(struct device *dev,
 {
 	int ret;
 	unsigned long apid;
-	DECLARE_BITMAP(apm_delta, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -987,11 +985,8 @@ static ssize_t assign_adapter_store(struct device *dev,
 	}
 
 	vfio_ap_mdev_link_adapter(matrix_mdev, apid);
-	memset(apm_delta, 0, sizeof(apm_delta));
-	set_bit_inv(apid, apm_delta);
 
-	if (vfio_ap_mdev_filter_matrix(apm_delta,
-				       matrix_mdev->matrix.aqm, matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
 	ret = count;
@@ -1167,7 +1162,6 @@ static ssize_t assign_domain_store(struct device *dev,
 {
 	int ret;
 	unsigned long apqi;
-	DECLARE_BITMAP(aqm_delta, AP_DOMAINS);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -1196,11 +1190,8 @@ static ssize_t assign_domain_store(struct device *dev,
 	}
 
 	vfio_ap_mdev_link_domain(matrix_mdev, apqi);
-	memset(aqm_delta, 0, sizeof(aqm_delta));
-	set_bit_inv(apqi, aqm_delta);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm, aqm_delta,
-				       matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
 	ret = count;
@@ -2091,9 +2082,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 	if (matrix_mdev) {
 		vfio_ap_mdev_link_queue(matrix_mdev, q);
 
-		if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm,
-					       matrix_mdev->matrix.aqm,
-					       matrix_mdev))
+		if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 	dev_set_drvdata(&apdev->device, q);
@@ -2443,34 +2432,22 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
 
 static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 {
-	bool do_hotplug = false;
-	int filter_domains = 0;
-	int filter_adapters = 0;
-	DECLARE_BITMAP(apm, AP_DEVICES);
-	DECLARE_BITMAP(aqm, AP_DOMAINS);
+	bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
 
 	mutex_lock(&matrix_mdev->kvm->lock);
 	mutex_lock(&matrix_dev->mdevs_lock);
 
-	filter_adapters = bitmap_and(apm, matrix_mdev->matrix.apm,
-				     matrix_mdev->apm_add, AP_DEVICES);
-	filter_domains = bitmap_and(aqm, matrix_mdev->matrix.aqm,
-				    matrix_mdev->aqm_add, AP_DOMAINS);
-
-	if (filter_adapters && filter_domains)
-		do_hotplug |= vfio_ap_mdev_filter_matrix(apm, aqm, matrix_mdev);
-	else if (filter_adapters)
-		do_hotplug |=
-			vfio_ap_mdev_filter_matrix(apm,
-						   matrix_mdev->shadow_apcb.aqm,
-						   matrix_mdev);
-	else
-		do_hotplug |=
-			vfio_ap_mdev_filter_matrix(matrix_mdev->shadow_apcb.apm,
-						   aqm, matrix_mdev);
+	filter_adapters = bitmap_intersects(matrix_mdev->matrix.apm,
+					    matrix_mdev->apm_add, AP_DEVICES);
+	filter_domains = bitmap_intersects(matrix_mdev->matrix.aqm,
+					   matrix_mdev->aqm_add, AP_DOMAINS);
+	filter_cdoms = bitmap_intersects(matrix_mdev->matrix.adm,
+					 matrix_mdev->adm_add, AP_DOMAINS);
+
+	if (filter_adapters || filter_domains)
+		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
 
-	if (bitmap_intersects(matrix_mdev->matrix.adm, matrix_mdev->adm_add,
-			      AP_DOMAINS))
+	if (filter_cdoms)
 		do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
 
 	if (do_hotplug)
-- 
2.43.0


^ permalink raw reply related	[relevance 20%]

* [PATCH v2 6/6] s390/vfio-ap: do not reset queue removed from host config
                     ` (3 preceding siblings ...)
  2023-12-12 21:25  5% ` [PATCH v2 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
@ 2023-12-12 21:25  5% ` Tony Krowiak
  2024-01-10 15:44  0%   ` Jason J. Herne
  4 siblings, 1 reply; 200+ results
From: Tony Krowiak @ 2023-12-12 21:25 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

When a queue is unbound from the vfio_ap device driver, it is reset to
ensure its crypto data is not leaked when it is bound to another device
driver. If the queue is unbound due to the fact that the adapter or domain
was removed from the host's AP configuration, then attempting to reset it
will fail with response code 01 (APQN not valid) returned by the
reset command. Let's ensure that the queue is assigned to the host's
configuration before resetting it.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Fixes: eeb386aeb5b7 ("s390/vfio-ap: handle config changed and scan complete notification")
Cc: <stable@vger.kernel.org>
---
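Aside: a tiny standalone sketch of the guard described above. This is not
driver code; the *_in_host() helpers are stand-ins for the test_bit_inv()
checks against the host's AP configuration masks.

#include <stdbool.h>
#include <stdio.h>

static bool apid_in_host(int apid) { return apid != 0x16; }	/* 0x16 removed */
static bool apqi_in_host(int apqi) { (void)apqi; return true; }

static void maybe_reset(int apid, int apqi)
{
	/* only reset when both halves of the APQN are still configured */
	if (apid_in_host(apid) && apqi_in_host(apqi))
		printf("reset %02x.%04x\n", apid, apqi);
	else
		printf("skip  %02x.%04x (reset would fail with rc 01)\n", apid, apqi);
}

int main(void)
{
	maybe_reset(0x14, 0x04);
	maybe_reset(0x16, 0x04);
	return 0;
}
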
 drivers/s390/crypto/vfio_ap_ops.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index e014108067dc..84decb0d5c97 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -2197,6 +2197,8 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 	q = dev_get_drvdata(&apdev->device);
 	get_update_locks_for_queue(q);
 	matrix_mdev = q->matrix_mdev;
+	apid = AP_QID_CARD(q->apqn);
+	apqi = AP_QID_QUEUE(q->apqn);
 
 	if (matrix_mdev) {
 		/* If the queue is assigned to the guest's AP configuration */
@@ -2214,8 +2216,16 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 		}
 	}
 
-	vfio_ap_mdev_reset_queue(q);
-	flush_work(&q->reset_work);
+	/*
+	 * If the queue is not in the host's AP configuration, then resetting
+	 * it will fail with response code 01, (APQN not valid); so, let's make
+	 * sure it is in the host's config.
+	 */
+	if (test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm) &&
+	    test_bit_inv(apqi, (unsigned long *)matrix_dev->info.aqm)) {
+		vfio_ap_mdev_reset_queue(q);
+		flush_work(&q->reset_work);
+	}
 
 done:
 	if (matrix_mdev)
-- 
2.43.0


^ permalink raw reply related	[relevance 5%]

* [PATCH v2 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config
    2023-12-12 21:25 20% ` [PATCH v2 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
  2023-12-12 21:25 14% ` [PATCH v2 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
@ 2023-12-12 21:25 13% ` Tony Krowiak
  2023-12-12 21:25  5% ` [PATCH v2 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
  2023-12-12 21:25  5% ` [PATCH v2 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-12-12 21:25 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

When filtering the adapters from the configuration profile for a guest to
create or update a guest's AP configuration, if the APID of an adapter and
the APQI of a domain identify a queue device that is not bound to the
vfio_ap device driver, the APID of the adapter will be filtered because an
individual APQN cannot be filtered; APQNs are assigned to an AP
configuration as a matrix of APIDs and APQIs. Consequently, a
guest will not have access to all of the queues associated with the
filtered adapter. If the queues are subsequently made available again to
the guest, they should re-appear in a reset state; so, let's make sure all
queues associated with an adapter unplugged from the guest are reset.

In order to identify the set of queues that need to be reset, let's allow a
vfio_ap_queue object to be simultaneously stored in both a hashtable and a
list: A hashtable used to store all of the queues assigned
to a matrix mdev; and/or, a list used to store a subset of the queues that
need to be reset. For example, when an adapter is hot unplugged from a
guest, all guest queues associated with that adapter must be reset. Since
that may be a subset of those assigned to the matrix mdev, they can be
stored in a list that can be passed to the vfio_ap_mdev_reset_queues
function.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
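Aside: a minimal userspace sketch of the dual-membership idea described
above. This is not driver code; hand-rolled next pointers stand in for the
hashtable and list machinery used by the real vfio_ap_queue.

#include <stdio.h>

struct queue {
	int apqn;
	struct queue *table_next;	/* stands in for the mdev_qnode hash link */
	struct queue *reset_next;	/* stands in for the reset_qnode list link */
};

int main(void)
{
	struct queue q1 = { .apqn = 0x1404 }, q2 = { .apqn = 0x1604 };
	struct queue *table = NULL, *reset = NULL;

	/* every queue of the mdev is tracked in the table */
	q1.table_next = table; table = &q1;
	q2.table_next = table; table = &q2;

	/* only the subset that needs resetting goes on the reset list */
	q2.reset_next = reset; reset = &q2;

	for (struct queue *q = table; q; q = q->table_next)
		printf("tracked   %04x\n", q->apqn);
	for (struct queue *q = reset; q; q = q->reset_next)
		printf("resetting %04x\n", q->apqn);

	return 0;
}
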
 drivers/s390/crypto/vfio_ap_ops.c     | 171 +++++++++++++++++++-------
 drivers/s390/crypto/vfio_ap_private.h |  11 +-
 2 files changed, 133 insertions(+), 49 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 26bd4aca497a..11f8f0bcc7ed 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -32,7 +32,8 @@
 
 #define AP_RESET_INTERVAL		20	/* Reset sleep interval (20ms)		*/
 
-static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable);
+static int vfio_ap_mdev_reset_queues(struct ap_matrix_mdev *matrix_mdev);
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist);
 static struct vfio_ap_queue *vfio_ap_find_queue(int apqn);
 static const struct vfio_device_ops vfio_ap_matrix_dev_ops;
 static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q);
@@ -661,16 +662,23 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
  *				device driver.
  *
  * @matrix_mdev: the matrix mdev whose matrix is to be filtered.
+ * @apm_filtered: a 256-bit bitmap for storing the APIDs filtered from the
+ *		  guest's AP configuration that are still in the host's AP
+ *		  configuration.
  *
  * Note: If an APQN referencing a queue device that is not bound to the vfio_ap
  *	 driver, its APID will be filtered from the guest's APCB. The matrix
  *	 structure precludes filtering an individual APQN, so its APID will be
- *	 filtered.
+ *	 filtered. Consequently, all queues associated with the adapter that
+ *	 are in the host's AP configuration must be reset. If queues are
+ *	 subsequently made available again to the guest, they should re-appear
+ *	 in a reset state
  *
  * Return: a boolean value indicating whether the KVM guest's APCB was changed
  *	   by the filtering or not.
  */
-static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev,
+				       unsigned long *apm_filtered)
 {
 	unsigned long apid, apqi, apqn;
 	DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -680,6 +688,7 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 	bitmap_copy(prev_shadow_apm, matrix_mdev->shadow_apcb.apm, AP_DEVICES);
 	bitmap_copy(prev_shadow_aqm, matrix_mdev->shadow_apcb.aqm, AP_DOMAINS);
 	vfio_ap_matrix_init(&matrix_dev->info, &matrix_mdev->shadow_apcb);
+	bitmap_clear(apm_filtered, 0, AP_DEVICES);
 
 	/*
 	 * Copy the adapters, domains and control domains to the shadow_apcb
@@ -705,8 +714,16 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 			apqn = AP_MKQID(apid, apqi);
 			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
 			if (!q || q->reset_status.response_code) {
-				clear_bit_inv(apid,
-					      matrix_mdev->shadow_apcb.apm);
+				clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
+
+				/*
+				 * If the adapter was previously plugged into
+				 * the guest, let's let the caller know that
+				 * the APID was filtered.
+				 */
+				if (test_bit_inv(apid, prev_shadow_apm))
+					set_bit_inv(apid, apm_filtered);
+
 				break;
 			}
 		}
@@ -808,7 +825,7 @@ static void vfio_ap_mdev_remove(struct mdev_device *mdev)
 
 	mutex_lock(&matrix_dev->guests_lock);
 	mutex_lock(&matrix_dev->mdevs_lock);
-	vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
+	vfio_ap_mdev_reset_queues(matrix_mdev);
 	vfio_ap_mdev_unlink_fr_queues(matrix_mdev);
 	list_del(&matrix_mdev->node);
 	mutex_unlock(&matrix_dev->mdevs_lock);
@@ -918,6 +935,47 @@ static void vfio_ap_mdev_link_adapter(struct ap_matrix_mdev *matrix_mdev,
 				       AP_MKQID(apid, apqi));
 }
 
+static int reset_queues_for_apids(struct ap_matrix_mdev *matrix_mdev,
+				  unsigned long *apm_reset)
+{
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
+	unsigned long apid, apqi;
+	int apqn, ret = 0;
+
+	if (bitmap_empty(apm_reset, AP_DEVICES))
+		return 0;
+
+	INIT_LIST_HEAD(&qlist);
+
+	for_each_set_bit_inv(apid, apm_reset, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+				     AP_DOMAINS) {
+			/*
+			 * If the domain is not in the host's AP configuration,
+			 * then resetting it will fail with response code 01
+			 * (APQN not valid).
+			 */
+			if (!test_bit_inv(apqi,
+					  (unsigned long *)matrix_dev->info.aqm))
+				continue;
+
+			apqn = AP_MKQID(apid, apqi);
+			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
+
+			if (q)
+				list_add_tail(&q->reset_qnode, &qlist);
+		}
+	}
+
+	ret = vfio_ap_mdev_reset_qlist(&qlist);
+
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode)
+		list_del(&q->reset_qnode);
+
+	return ret;
+}
+
 /**
  * assign_adapter_store - parses the APID from @buf and sets the
  * corresponding bit in the mediated matrix device's APM
@@ -958,6 +1016,7 @@ static ssize_t assign_adapter_store(struct device *dev,
 {
 	int ret;
 	unsigned long apid;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -987,8 +1046,10 @@ static ssize_t assign_adapter_store(struct device *dev,
 
 	vfio_ap_mdev_link_adapter(matrix_mdev, apid);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+		reset_queues_for_apids(matrix_mdev, apm_filtered);
+	}
 
 	ret = count;
 done:
@@ -1019,11 +1080,12 @@ static struct vfio_ap_queue
  *				 adapter was assigned.
  * @matrix_mdev: the matrix mediated device to which the adapter was assigned.
  * @apid: the APID of the unassigned adapter.
- * @qtable: table for storing queues associated with unassigned adapter.
+ * @qlist: list for storing queues associated with unassigned adapter that
+ *	   need to be reset.
  */
 static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 					unsigned long apid,
-					struct ap_queue_table *qtable)
+					struct list_head *qlist)
 {
 	unsigned long apqi;
 	struct vfio_ap_queue *q;
@@ -1031,11 +1093,10 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 	for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
 		q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
 
-		if (q && qtable) {
+		if (q && qlist) {
 			if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 			    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
-				hash_add(qtable->queues, &q->mdev_qnode,
-					 q->apqn);
+				list_add_tail(&q->reset_qnode, qlist);
 		}
 	}
 }
@@ -1043,26 +1104,23 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 static void vfio_ap_mdev_hot_unplug_adapter(struct ap_matrix_mdev *matrix_mdev,
 					    unsigned long apid)
 {
-	int loop_cursor;
-	struct vfio_ap_queue *q;
-	struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
 
-	hash_init(qtable->queues);
-	vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, qtable);
+	INIT_LIST_HEAD(&qlist);
+	vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, &qlist);
 
 	if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm)) {
 		clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 
-	vfio_ap_mdev_reset_queues(qtable);
+	vfio_ap_mdev_reset_qlist(&qlist);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
 		vfio_ap_unlink_mdev_fr_queue(q);
-		hash_del(&q->mdev_qnode);
+		list_del(&q->reset_qnode);
 	}
-
-	kfree(qtable);
 }
 
 /**
@@ -1163,6 +1221,7 @@ static ssize_t assign_domain_store(struct device *dev,
 {
 	int ret;
 	unsigned long apqi;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -1192,8 +1251,10 @@ static ssize_t assign_domain_store(struct device *dev,
 
 	vfio_ap_mdev_link_domain(matrix_mdev, apqi);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+		reset_queues_for_apids(matrix_mdev, apm_filtered);
+	}
 
 	ret = count;
 done:
@@ -1206,7 +1267,7 @@ static DEVICE_ATTR_WO(assign_domain);
 
 static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 				       unsigned long apqi,
-				       struct ap_queue_table *qtable)
+				       struct list_head *qlist)
 {
 	unsigned long apid;
 	struct vfio_ap_queue *q;
@@ -1214,11 +1275,10 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
 		q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
 
-		if (q && qtable) {
+		if (q && qlist) {
 			if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 			    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
-				hash_add(qtable->queues, &q->mdev_qnode,
-					 q->apqn);
+				list_add_tail(&q->reset_qnode, qlist);
 		}
 	}
 }
@@ -1226,26 +1286,23 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 static void vfio_ap_mdev_hot_unplug_domain(struct ap_matrix_mdev *matrix_mdev,
 					   unsigned long apqi)
 {
-	int loop_cursor;
-	struct vfio_ap_queue *q;
-	struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
 
-	hash_init(qtable->queues);
-	vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, qtable);
+	INIT_LIST_HEAD(&qlist);
+	vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, &qlist);
 
 	if (test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
 		clear_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm);
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 
-	vfio_ap_mdev_reset_queues(qtable);
+	vfio_ap_mdev_reset_qlist(&qlist);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
 		vfio_ap_unlink_mdev_fr_queue(q);
-		hash_del(&q->mdev_qnode);
+		list_del(&q->reset_qnode);
 	}
-
-	kfree(qtable);
 }
 
 /**
@@ -1600,7 +1657,7 @@ static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev)
 		get_update_locks_for_kvm(kvm);
 
 		kvm_arch_crypto_clear_masks(kvm);
-		vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
+		vfio_ap_mdev_reset_queues(matrix_mdev);
 		kvm_put_kvm(kvm);
 		matrix_mdev->kvm = NULL;
 
@@ -1736,15 +1793,33 @@ static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q)
 	}
 }
 
-static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable)
+static int vfio_ap_mdev_reset_queues(struct ap_matrix_mdev *matrix_mdev)
 {
 	int ret = 0, loop_cursor;
 	struct vfio_ap_queue *q;
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode)
+	hash_for_each(matrix_mdev->qtable.queues, loop_cursor, q, mdev_qnode)
 		vfio_ap_mdev_reset_queue(q);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	hash_for_each(matrix_mdev->qtable.queues, loop_cursor, q, mdev_qnode) {
+		flush_work(&q->reset_work);
+
+		if (q->reset_status.response_code)
+			ret = -EIO;
+	}
+
+	return ret;
+}
+
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist)
+{
+	int ret = 0;
+	struct vfio_ap_queue *q;
+
+	list_for_each_entry(q, qlist, reset_qnode)
+		vfio_ap_mdev_reset_queue(q);
+
+	list_for_each_entry(q, qlist, reset_qnode) {
 		flush_work(&q->reset_work);
 
 		if (q->reset_status.response_code)
@@ -1930,7 +2005,7 @@ static ssize_t vfio_ap_mdev_ioctl(struct vfio_device *vdev,
 		ret = vfio_ap_mdev_get_device_info(arg);
 		break;
 	case VFIO_DEVICE_RESET:
-		ret = vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
+		ret = vfio_ap_mdev_reset_queues(matrix_mdev);
 		break;
 	case VFIO_DEVICE_GET_IRQ_INFO:
 			ret = vfio_ap_get_irq_info(arg);
@@ -2062,6 +2137,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 {
 	int ret;
 	struct vfio_ap_queue *q;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev;
 
 	ret = sysfs_create_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2094,15 +2170,17 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 		    !bitmap_empty(matrix_mdev->aqm_add, AP_DOMAINS))
 			goto done;
 
-		if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+		if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+			reset_queues_for_apids(matrix_mdev, apm_filtered);
+		}
 	}
 
 done:
 	dev_set_drvdata(&apdev->device, q);
 	release_update_locks_for_mdev(matrix_mdev);
 
-	return 0;
+	return ret;
 
 err_remove_group:
 	sysfs_remove_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2446,6 +2524,7 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
 
 static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 {
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
 
 	mutex_lock(&matrix_mdev->kvm->lock);
@@ -2459,7 +2538,7 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 					 matrix_mdev->adm_add, AP_DOMAINS);
 
 	if (filter_adapters || filter_domains)
-		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
+		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered);
 
 	if (filter_cdoms)
 		do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
@@ -2467,6 +2546,8 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 	if (do_hotplug)
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
+	reset_queues_for_apids(matrix_mdev, apm_filtered);
+
 	mutex_unlock(&matrix_dev->mdevs_lock);
 	mutex_unlock(&matrix_mdev->kvm->lock);
 }
diff --git a/drivers/s390/crypto/vfio_ap_private.h b/drivers/s390/crypto/vfio_ap_private.h
index 88aff8b81f2f..20eac8b0f0b9 100644
--- a/drivers/s390/crypto/vfio_ap_private.h
+++ b/drivers/s390/crypto/vfio_ap_private.h
@@ -83,10 +83,10 @@ struct ap_matrix {
 };
 
 /**
- * struct ap_queue_table - a table of queue objects.
- *
- * @queues: a hashtable of queues (struct vfio_ap_queue).
- */
+  * struct ap_queue_table - a table of queue objects.
+  *
+  * @queues: a hashtable of queues (struct vfio_ap_queue).
+  */
 struct ap_queue_table {
 	DECLARE_HASHTABLE(queues, 8);
 };
@@ -133,6 +133,8 @@ struct ap_matrix_mdev {
  * @apqn: the APQN of the AP queue device
  * @saved_isc: the guest ISC registered with the GIB interface
  * @mdev_qnode: allows the vfio_ap_queue struct to be added to a hashtable
+ * @reset_qnode: allows the vfio_ap_queue struct to be added to a list of queues
+ *		 that need to be reset
  * @reset_status: the status from the last reset of the queue
  * @reset_work: work to wait for queue reset to complete
  */
@@ -143,6 +145,7 @@ struct vfio_ap_queue {
 #define VFIO_AP_ISC_INVALID 0xff
 	unsigned char saved_isc;
 	struct hlist_node mdev_qnode;
+	struct list_head reset_qnode;
 	struct ap_queue_status reset_status;
 	struct work_struct reset_work;
 };
-- 
2.43.0


^ permalink raw reply related	[relevance 13%]

* Re: [PATCH v10 06/50] x86/sev: Add the host SEV-SNP initialization support
  2023-11-07 19:00  0%     ` Kalra, Ashish
  2023-11-07 19:19  0%       ` Borislav Petkov
@ 2023-12-08 17:09  0%       ` Jeremi Piotrowski
  1 sibling, 0 replies; 200+ results
From: Jeremi Piotrowski @ 2023-12-08 17:09 UTC (permalink / raw)
  To: Kalra, Ashish, Borislav Petkov, Michael Roth
  Cc: kvm, linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, vbabka, kirill,
	ak, tony.luck, marcorr, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, nikunj.dadhania, pankaj.gupta, liam.merwick, zhi.a.wang,
	Brijesh Singh

On 07/11/2023 20:00, Kalra, Ashish wrote:
> Hello Boris,
> 
> Addressing of some of the remaining comments:
> 
> On 11/7/2023 10:31 AM, Borislav Petkov wrote:
>> On Mon, Oct 16, 2023 at 08:27:35AM -0500, Michael Roth wrote:
>>> +static bool early_rmptable_check(void)
>>> +{
>>> +    u64 rmp_base, rmp_size;
>>> +
>>> +    /*
>>> +     * For early BSP initialization, max_pfn won't be set up yet, wait until
>>> +     * it is set before performing the RMP table calculations.
>>> +     */
>>> +    if (!max_pfn)
>>> +        return true;
>>
>> This already says that this is called at the wrong point during init.
>>
>> Right now we have
>>
>> early_identify_cpu -> early_init_amd -> early_detect_mem_encrypt
>>
>> which runs only on the BSP but then early_init_amd() is called in
>> init_amd() too so that it takes care of the APs too.
>>
>> Which ends up doing a lot of unnecessary work on each AP in
>> early_detect_mem_encrypt() like calculating the RMP size on each AP
>> unnecessarily where this needs to happen exactly once.
>>
>> Is there any reason why this function cannot be moved to init_amd()
>> where it'll do the normal, per-AP init?
>>
>> And the stuff that needs to happen once, needs to be called once too.
>>
>>> +
>>> +    return snp_get_rmptable_info(&rmp_base, &rmp_size);
>>> +}
>>> +
>>>   static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>>   {
>>>       u64 msr;
>>> @@ -659,6 +674,9 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>>           if (!(msr & MSR_K7_HWCR_SMMLOCK))
>>>               goto clear_sev;
>>>   +        if (cpu_has(c, X86_FEATURE_SEV_SNP) && !early_rmptable_check())
>>> +            goto clear_snp;
>>> +
>>>           return;
>>>     clear_all:
>>> @@ -666,6 +684,7 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>>   clear_sev:
>>>           setup_clear_cpu_cap(X86_FEATURE_SEV);
>>>           setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
>>> +clear_snp:
>>>           setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>>>       }
>>>   }
>>
>> ...
>>
>>> +bool snp_get_rmptable_info(u64 *start, u64 *len)
>>> +{
>>> +    u64 max_rmp_pfn, calc_rmp_sz, rmp_sz, rmp_base, rmp_end;
>>> +
>>> +    rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
>>> +    rdmsrl(MSR_AMD64_RMP_END, rmp_end);
>>> +
>>> +    if (!(rmp_base & RMP_ADDR_MASK) || !(rmp_end & RMP_ADDR_MASK)) {
>>> +        pr_err("Memory for the RMP table has not been reserved by BIOS\n");
>>> +        return false;
>>> +    }
>>
>> If you're masking off bits 0-12 above...
>>
>>> +
>>> +    if (rmp_base > rmp_end) {
>>
>> ... why aren't you using the masked out vars further on?
>>
>> I know, the hw will say, yeah, those bits are 0 but still. IOW, do:
>>
>>     rmp_base &= RMP_ADDR_MASK;
>>     rmp_end  &= RMP_ADDR_MASK;
>>
>> after reading them.
>>
>>> +        pr_err("RMP configuration not valid: base=%#llx, end=%#llx\n", rmp_base, rmp_end);
>>> +        return false;
>>> +    }
>>> +
>>> +    rmp_sz = rmp_end - rmp_base + 1;
>>> +
>>> +    /*
>>> +     * Calculate the amount the memory that must be reserved by the BIOS to
>>> +     * address the whole RAM, including the bookkeeping area. The RMP itself
>>> +     * must also be covered.
>>> +     */
>>> +    max_rmp_pfn = max_pfn;
>>> +    if (PHYS_PFN(rmp_end) > max_pfn)
>>> +        max_rmp_pfn = PHYS_PFN(rmp_end);
>>> +
>>> +    calc_rmp_sz = (max_rmp_pfn << 4) + RMPTABLE_CPU_BOOKKEEPING_SZ;
>>> +
>>> +    if (calc_rmp_sz > rmp_sz) {
>>> +        pr_err("Memory reserved for the RMP table does not cover full system RAM (expected 0x%llx got 0x%llx)\n",
>>> +               calc_rmp_sz, rmp_sz);
>>> +        return false;
>>> +    }
>>> +
>>> +    *start = rmp_base;
>>> +    *len = rmp_sz;
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static __init int __snp_rmptable_init(void)
>>> +{
>>> +    u64 rmp_base, rmp_size;
>>> +    void *rmp_start;
>>> +    u64 val;
>>> +
>>> +    if (!snp_get_rmptable_info(&rmp_base, &rmp_size))
>>> +        return 1;
>>> +
>>> +    pr_info("RMP table physical address [0x%016llx - 0x%016llx]\n",
>>
>> That's "RMP table physical range"
>>
>>> +        rmp_base, rmp_base + rmp_size - 1);
>>> +
>>> +    rmp_start = memremap(rmp_base, rmp_size, MEMREMAP_WB);
>>> +    if (!rmp_start) {
>>> +        pr_err("Failed to map RMP table addr 0x%llx size 0x%llx\n", rmp_base, rmp_size);
>>
>> No need to dump rmp_base and rmp_size again here - you're dumping them
>> above.
>>
>>> +        return 1;
>>> +    }
>>> +
>>> +    /*
>>> +     * Check if SEV-SNP is already enabled, this can happen in case of
>>> +     * kexec boot.
>>> +     */
>>> +    rdmsrl(MSR_AMD64_SYSCFG, val);
>>> +    if (val & MSR_AMD64_SYSCFG_SNP_EN)
>>> +        goto skip_enable;
>>> +
>>> +    /* Initialize the RMP table to zero */
>>
>> Again: useless comment.
>>
>>> +    memset(rmp_start, 0, rmp_size);
>>> +
>>> +    /* Flush the caches to ensure that data is written before SNP is enabled. */
>>> +    wbinvd_on_all_cpus();
>>> +
>>> +    /* MFDM must be enabled on all the CPUs prior to enabling SNP. */
>>
>> First of all, use the APM bit name here pls: MtrrFixDramModEn.
>>
>> And then, for the life of me, I can't find any mention in the APM why
>> this bit is needed. Neither in "15.36.2 Enabling SEV-SNP" nor in
>> "15.34.3 Enabling SEV".
>>
>> Looking at the bit defintions of WrMem an RdMem - read and write
>> requests get directed to system memory instead of MMIO so I guess you
>> don't want to be able to write MMIO for certain physical ranges when SNP
>> is enabled but it'll be good to have this properly explained instead of
>> a "this must happen" information-less sentence.
> 
> This is a prerequisite for SNP_INIT as per the SNP Firmware ABI specifications, section 8.8.2:
> 
> From the SNP FW ABI specs:
> 
> If INIT_RMP is 1, then the firmware ensures the following system requirements are met:
> • SYSCFG[MemoryEncryptionModEn] must be set to 1 across all cores. (SEV must be
> enabled.)
> • SYSCFG[SecureNestedPagingEn] must be set to 1 across all cores.
> • SYSCFG[VMPLEn] must be set to 1 across all cores.
> • SYSCFG[MFDM] must be set to 1 across all cores.

Hi Ashish,

I just noticed that the kernel shouts at me about this bit when I offline->online a CPU in
an SNP host:

[2692586.589194] smpboot: CPU 63 is now offline
[2692589.366822] [Firmware Warn]: MTRR: CPU 0: SYSCFG[MtrrFixDramModEn] not cleared by BIOS, clearing this bit
[2692589.376582] smpboot: Booting Node 0 Processor 63 APIC 0x3f
[2692589.378070] [Firmware Warn]: MTRR: CPU 63: SYSCFG[MtrrFixDramModEn] not cleared by BIOS, clearing this bit
[2692589.388845] microcode: CPU63: new patch_level=0x0a0011d1

Now I understand if you say "CPU offlining is not supported" but there's nothing currently
blocking it.

Best wishes,
Jeremi 

> • VM_HSAVE_PA (MSR C001_0117) must be set to 0h across all cores.
> • HWCR[SmmLock] (MSR C001_0015) must be set to 1 across all cores.
> 
> So, this platform enabling code for SNP needs to ensure that these conditions are met before SNP_INIT is called.
> 
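
Aside: the sizing check discussed in this thread works out to 16 bytes of
RMP entry per 4 KiB page (hence max_rmp_pfn << 4) plus a fixed bookkeeping
area. A small standalone sketch of the arithmetic; the 16 KiB bookkeeping
size is an assumption made for the example.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t ram_bytes   = 16ULL << 30;	/* 16 GiB of system RAM */
	uint64_t max_pfn     = ram_bytes >> 12;	/* number of 4 KiB pages */
	uint64_t bookkeeping = 16 * 1024;	/* assumed bookkeeping size */

	/* 16 bytes of RMP entry per page plus the bookkeeping area */
	uint64_t calc_rmp_sz = (max_pfn << 4) + bookkeeping;

	printf("BIOS must reserve at least %llu MiB for the RMP\n",
	       (unsigned long long)(calc_rmp_sz >> 20));

	return 0;
}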


^ permalink raw reply	[relevance 0%]

* [PATCH v1 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver
                     ` (2 preceding siblings ...)
  2023-12-08 16:22 13% ` [PATCH v1 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
@ 2023-12-08 16:22 10% ` Tony Krowiak
  2023-12-08 16:22  5% ` [PATCH v1 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-12-08 16:22 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

When a queue is unbound from the vfio_ap device driver, if that queue is
assigned to a guest's AP configuration, its associated adapter is removed
because queues are defined to a guest via a matrix of adapters and
domains; so, it is not possible to remove a single queue.

If an adapter is removed from the guest's AP configuration, all associated
queues must be reset to prevent leaking crypto data should any of them be
assigned to a different guest or device driver. The one caveat is that if
the queue is being removed because the adapter or domain has been removed
from the host's AP configuration, then an attempt to reset the queue will
fail with response code 01, AP-queue number not valid; so resetting these
queues should be skipped.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Fixes: 09d31ff78793 ("s390/vfio-ap: hot plug/unplug of AP devices when probed/removed")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 39 ++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index f08321385058..5db11d50b4b0 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -2187,6 +2187,23 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 	return ret;
 }
 
+static void reset_queues_for_apid(struct ap_matrix_mdev *matrix_mdev,
+				  unsigned long apid)
+{
+	DECLARE_BITMAP(apm_reset, AP_DEVICES);
+
+	/*
+	 * If the adapter is not in the host's AP configuration, then resetting
+	 * any queue for that adapter will fail with response code 01, (APQN not
+	 * valid).
+	 */
+	if (test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm)) {
+		bitmap_clear(apm_reset, 0, AP_DEVICES);
+		set_bit_inv(apid, apm_reset);
+		reset_queues_for_apids(matrix_mdev, apm_reset);
+	}
+}
+
 void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 {
 	unsigned long apid, apqi;
@@ -2199,24 +2216,28 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 	matrix_mdev = q->matrix_mdev;
 
 	if (matrix_mdev) {
-		vfio_ap_unlink_queue_fr_mdev(q);
-
-		apid = AP_QID_CARD(q->apqn);
-		apqi = AP_QID_QUEUE(q->apqn);
-
-		/*
-		 * If the queue is assigned to the guest's APCB, then remove
-		 * the adapter's APID from the APCB and hot it into the guest.
-		 */
+		/* If the queue is assigned to the guest's AP configuration */
 		if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 		    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
+			/*
+			 * Since the queues are defined via a matrix of adapters
+			 * and domains, it is not possible to hot unplug a
+			 * single queue; so, let's unplug the adapter.
+			 */
 			clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+			reset_queues_for_apid(matrix_mdev, apid);
+			goto done;
 		}
 	}
 
 	vfio_ap_mdev_reset_queue(q);
 	flush_work(&q->reset_work);
+
+done:
+	if (matrix_mdev)
+		vfio_ap_unlink_queue_fr_mdev(q);
+
 	dev_set_drvdata(&apdev->device, NULL);
 	kfree(q);
 	release_update_locks_for_mdev(matrix_mdev);
-- 
2.43.0


^ permalink raw reply related	[relevance 10%]

* [PATCH v1 6/6] s390/vfio-ap: do not reset queue removed from host config
                     ` (3 preceding siblings ...)
  2023-12-08 16:22 10% ` [PATCH v1 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
@ 2023-12-08 16:22  5% ` Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-12-08 16:22 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

When a queue is unbound from the vfio_ap device driver, it is reset to
ensure its crypto data is not leaked when it is bound to another device
driver. If the queue is unbound due to the fact that the adapter or domain
was removed from the host's AP configuration, then attempting to reset it
will fail with response code 01 (APQN not valid) returned by the
reset command. Let's ensure that the queue is assigned to the host's
configuration before resetting it.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Fixes: eeb386aeb5b7 ("s390/vfio-ap: handle config changed and scan complete notification")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 5db11d50b4b0..b6928fc3b395 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -2214,6 +2214,8 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 	q = dev_get_drvdata(&apdev->device);
 	get_update_locks_for_queue(q);
 	matrix_mdev = q->matrix_mdev;
+	apid = AP_QID_CARD(q->apqn);
+	apqi = AP_QID_QUEUE(q->apqn);
 
 	if (matrix_mdev) {
 		/* If the queue is assigned to the guest's AP configuration */
@@ -2231,8 +2233,16 @@ void vfio_ap_mdev_remove_queue(struct ap_device *apdev)
 		}
 	}
 
-	vfio_ap_mdev_reset_queue(q);
-	flush_work(&q->reset_work);
+	/*
+	 * If the queue is not in the host's AP configuration, then resetting
+	 * it will fail with response code 01, (APQN not valid); so, let's make
+	 * sure it is in the host's config.
+	 */
+	if (test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm) &&
+	    test_bit_inv(apqi, (unsigned long *)matrix_dev->info.aqm)) {
+		vfio_ap_mdev_reset_queue(q);
+		flush_work(&q->reset_work);
+	}
 
 done:
 	if (matrix_mdev)
-- 
2.43.0


^ permalink raw reply related	[relevance 5%]

* [PATCH v1 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config
    2023-12-08 16:22 20% ` [PATCH v1 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
  2023-12-08 16:22 14% ` [PATCH v1 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
@ 2023-12-08 16:22 13% ` Tony Krowiak
  2023-12-08 16:22 10% ` [PATCH v1 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
  2023-12-08 16:22  5% ` [PATCH v1 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-12-08 16:22 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

When filtering the adapters from the configuration profile for a guest to
create or update a guest's AP configuration, if the APID of an adapter and
the APQI of a domain identify a queue device that is not bound to the
vfio_ap device driver, the APID of the adapter will be filtered because an
individual APQN cannot be filtered; APQNs are assigned to an AP
configuration as a matrix of APIDs and APQIs. Consequently, a
guest will not have access to all of the queues associated with the
filtered adapter. If the queues are subsequently made available again to
the guest, they should re-appear in a reset state; so, let's make sure all
queues associated with an adapter unplugged from the guest are reset.

In order to identify the set of queues that need to be reset, let's allow a
vfio_ap_queue object to be simultaneously stored in both a hashtable and a
list: a hashtable used to store all of the queues assigned
to a matrix mdev, and a list used to store a subset of the queues that
need to be reset. For example, when an adapter is hot unplugged from a
guest, all guest queues associated with that adapter must be reset. Since
that may be a subset of those assigned to the matrix mdev, they can be
stored in a list that can be passed to the vfio_ap_mdev_reset_queues
function.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c     | 157 +++++++++++++++++++-------
 drivers/s390/crypto/vfio_ap_private.h |  11 +-
 2 files changed, 126 insertions(+), 42 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 26bd4aca497a..f08321385058 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -33,6 +33,7 @@
 #define AP_RESET_INTERVAL		20	/* Reset sleep interval (20ms)		*/
 
 static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable);
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist);
 static struct vfio_ap_queue *vfio_ap_find_queue(int apqn);
 static const struct vfio_device_ops vfio_ap_matrix_dev_ops;
 static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q);
@@ -661,16 +662,23 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
  *				device driver.
  *
  * @matrix_mdev: the matrix mdev whose matrix is to be filtered.
+ * @apm_filtered: a 256-bit bitmap for storing the APIDs filtered from the
+ *		  guest's AP configuration that are still in the host's AP
+ *		  configuration.
  *
  * Note: If an APQN referencing a queue device that is not bound to the vfio_ap
  *	 driver, its APID will be filtered from the guest's APCB. The matrix
  *	 structure precludes filtering an individual APQN, so its APID will be
- *	 filtered.
+ *	 filtered. Consequently, all queues associated with the adapter that
+ *	 are in the host's AP configuration must be reset. If queues are
+ *	 subsequently made available again to the guest, they should re-appear
+ *	 in a reset state
  *
  * Return: a boolean value indicating whether the KVM guest's APCB was changed
  *	   by the filtering or not.
  */
-static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev,
+				       unsigned long *apm_filtered)
 {
 	unsigned long apid, apqi, apqn;
 	DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -680,6 +688,7 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 	bitmap_copy(prev_shadow_apm, matrix_mdev->shadow_apcb.apm, AP_DEVICES);
 	bitmap_copy(prev_shadow_aqm, matrix_mdev->shadow_apcb.aqm, AP_DOMAINS);
 	vfio_ap_matrix_init(&matrix_dev->info, &matrix_mdev->shadow_apcb);
+	bitmap_clear(apm_filtered, 0, AP_DEVICES);
 
 	/*
 	 * Copy the adapters, domains and control domains to the shadow_apcb
@@ -705,8 +714,16 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 			apqn = AP_MKQID(apid, apqi);
 			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
 			if (!q || q->reset_status.response_code) {
-				clear_bit_inv(apid,
-					      matrix_mdev->shadow_apcb.apm);
+				clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
+
+				/*
+				 * If the adapter was previously plugged into
+				 * the guest, let's let the caller know that
+				 * the APID was filtered.
+				 */
+				if (test_bit_inv(apid, prev_shadow_apm))
+					set_bit_inv(apid, apm_filtered);
+
 				break;
 			}
 		}
@@ -918,6 +935,47 @@ static void vfio_ap_mdev_link_adapter(struct ap_matrix_mdev *matrix_mdev,
 				       AP_MKQID(apid, apqi));
 }
 
+static int reset_queues_for_apids(struct ap_matrix_mdev *matrix_mdev,
+				  unsigned long *apm_reset)
+{
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
+	unsigned long apid, apqi;
+	int apqn, ret = 0;
+
+	if (bitmap_empty(apm_reset, AP_DEVICES))
+		return 0;
+
+	INIT_LIST_HEAD(&qlist);
+
+	for_each_set_bit_inv(apid, apm_reset, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+				     AP_DOMAINS) {
+			/*
+			 * If the domain is not in the host's AP configuration,
+			 * then resetting it will fail with response code 01
+			 * (APQN not valid).
+			 */
+			if (!test_bit_inv(apqi,
+					  (unsigned long *)matrix_dev->info.aqm))
+				continue;
+
+			apqn = AP_MKQID(apid, apqi);
+			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
+
+			if (q)
+				list_add_tail(&q->reset_qnode, &qlist);
+		}
+	}
+
+	ret = vfio_ap_mdev_reset_qlist(&qlist);
+
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode)
+		list_del(&q->reset_qnode);
+
+	return ret;
+}
+
 /**
  * assign_adapter_store - parses the APID from @buf and sets the
  * corresponding bit in the mediated matrix device's APM
@@ -958,6 +1016,7 @@ static ssize_t assign_adapter_store(struct device *dev,
 {
 	int ret;
 	unsigned long apid;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -987,8 +1046,10 @@ static ssize_t assign_adapter_store(struct device *dev,
 
 	vfio_ap_mdev_link_adapter(matrix_mdev, apid);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+		reset_queues_for_apids(matrix_mdev, apm_filtered);
+	}
 
 	ret = count;
 done:
@@ -1019,11 +1080,12 @@ static struct vfio_ap_queue
  *				 adapter was assigned.
  * @matrix_mdev: the matrix mediated device to which the adapter was assigned.
  * @apid: the APID of the unassigned adapter.
- * @qtable: table for storing queues associated with unassigned adapter.
+ * @qlist: list for storing queues associated with unassigned adapter that
+ *	   need to be reset.
  */
 static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 					unsigned long apid,
-					struct ap_queue_table *qtable)
+					struct list_head *qlist)
 {
 	unsigned long apqi;
 	struct vfio_ap_queue *q;
@@ -1031,11 +1093,10 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 	for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
 		q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
 
-		if (q && qtable) {
+		if (q && qlist) {
 			if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 			    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
-				hash_add(qtable->queues, &q->mdev_qnode,
-					 q->apqn);
+				list_add_tail(&q->reset_qnode, qlist);
 		}
 	}
 }
@@ -1043,26 +1104,23 @@ static void vfio_ap_mdev_unlink_adapter(struct ap_matrix_mdev *matrix_mdev,
 static void vfio_ap_mdev_hot_unplug_adapter(struct ap_matrix_mdev *matrix_mdev,
 					    unsigned long apid)
 {
-	int loop_cursor;
-	struct vfio_ap_queue *q;
-	struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
 
-	hash_init(qtable->queues);
-	vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, qtable);
+	INIT_LIST_HEAD(&qlist);
+	vfio_ap_mdev_unlink_adapter(matrix_mdev, apid, &qlist);
 
 	if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm)) {
 		clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 
-	vfio_ap_mdev_reset_queues(qtable);
+	vfio_ap_mdev_reset_qlist(&qlist);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
 		vfio_ap_unlink_mdev_fr_queue(q);
-		hash_del(&q->mdev_qnode);
+		list_del(&q->reset_qnode);
 	}
-
-	kfree(qtable);
 }
 
 /**
@@ -1163,6 +1221,7 @@ static ssize_t assign_domain_store(struct device *dev,
 {
 	int ret;
 	unsigned long apqi;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -1192,8 +1251,10 @@ static ssize_t assign_domain_store(struct device *dev,
 
 	vfio_ap_mdev_link_domain(matrix_mdev, apqi);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+		reset_queues_for_apids(matrix_mdev, apm_filtered);
+	}
 
 	ret = count;
 done:
@@ -1206,7 +1267,7 @@ static DEVICE_ATTR_WO(assign_domain);
 
 static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 				       unsigned long apqi,
-				       struct ap_queue_table *qtable)
+				       struct list_head *qlist)
 {
 	unsigned long apid;
 	struct vfio_ap_queue *q;
@@ -1214,11 +1275,10 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
 		q = vfio_ap_unlink_apqn_fr_mdev(matrix_mdev, apid, apqi);
 
-		if (q && qtable) {
+		if (q && qlist) {
 			if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
 			    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
-				hash_add(qtable->queues, &q->mdev_qnode,
-					 q->apqn);
+				list_add_tail(&q->reset_qnode, qlist);
 		}
 	}
 }
@@ -1226,26 +1286,23 @@ static void vfio_ap_mdev_unlink_domain(struct ap_matrix_mdev *matrix_mdev,
 static void vfio_ap_mdev_hot_unplug_domain(struct ap_matrix_mdev *matrix_mdev,
 					   unsigned long apqi)
 {
-	int loop_cursor;
-	struct vfio_ap_queue *q;
-	struct ap_queue_table *qtable = kzalloc(sizeof(*qtable), GFP_KERNEL);
+	struct vfio_ap_queue *q, *tmpq;
+	struct list_head qlist;
 
-	hash_init(qtable->queues);
-	vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, qtable);
+	INIT_LIST_HEAD(&qlist);
+	vfio_ap_mdev_unlink_domain(matrix_mdev, apqi, &qlist);
 
 	if (test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm)) {
 		clear_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm);
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 
-	vfio_ap_mdev_reset_queues(qtable);
+	vfio_ap_mdev_reset_qlist(&qlist);
 
-	hash_for_each(qtable->queues, loop_cursor, q, mdev_qnode) {
+	list_for_each_entry_safe(q, tmpq, &qlist, reset_qnode) {
 		vfio_ap_unlink_mdev_fr_queue(q);
-		hash_del(&q->mdev_qnode);
+		list_del(&q->reset_qnode);
 	}
-
-	kfree(qtable);
 }
 
 /**
@@ -1754,6 +1811,24 @@ static int vfio_ap_mdev_reset_queues(struct ap_queue_table *qtable)
 	return ret;
 }
 
+static int vfio_ap_mdev_reset_qlist(struct list_head *qlist)
+{
+	int ret = 0;
+	struct vfio_ap_queue *q;
+
+	list_for_each_entry(q, qlist, reset_qnode)
+		vfio_ap_mdev_reset_queue(q);
+
+	list_for_each_entry(q, qlist, reset_qnode) {
+		flush_work(&q->reset_work);
+
+		if (q->reset_status.response_code)
+			ret = -EIO;
+	}
+
+	return ret;
+}
+
 static int vfio_ap_mdev_open_device(struct vfio_device *vdev)
 {
 	struct ap_matrix_mdev *matrix_mdev =
@@ -2062,6 +2137,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 {
 	int ret;
 	struct vfio_ap_queue *q;
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev;
 
 	ret = sysfs_create_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2094,15 +2170,17 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 		    !bitmap_empty(matrix_mdev->aqm_add, AP_DOMAINS))
 			goto done;
 
-		if (vfio_ap_mdev_filter_matrix(matrix_mdev))
+		if (vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered)) {
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
+			reset_queues_for_apids(matrix_mdev, apm_filtered);
+		}
 	}
 
 done:
 	dev_set_drvdata(&apdev->device, q);
 	release_update_locks_for_mdev(matrix_mdev);
 
-	return 0;
+	return ret;
 
 err_remove_group:
 	sysfs_remove_group(&apdev->device.kobj, &vfio_queue_attr_group);
@@ -2446,6 +2524,7 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
 
 static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 {
+	DECLARE_BITMAP(apm_filtered, AP_DEVICES);
 	bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
 
 	mutex_lock(&matrix_mdev->kvm->lock);
@@ -2459,7 +2538,7 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 					 matrix_mdev->adm_add, AP_DOMAINS);
 
 	if (filter_adapters || filter_domains)
-		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
+		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev, apm_filtered);
 
 	if (filter_cdoms)
 		do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
@@ -2467,6 +2546,8 @@ static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 	if (do_hotplug)
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
+	reset_queues_for_apids(matrix_mdev, apm_filtered);
+
 	mutex_unlock(&matrix_dev->mdevs_lock);
 	mutex_unlock(&matrix_mdev->kvm->lock);
 }
diff --git a/drivers/s390/crypto/vfio_ap_private.h b/drivers/s390/crypto/vfio_ap_private.h
index 88aff8b81f2f..20eac8b0f0b9 100644
--- a/drivers/s390/crypto/vfio_ap_private.h
+++ b/drivers/s390/crypto/vfio_ap_private.h
@@ -83,10 +83,10 @@ struct ap_matrix {
 };
 
 /**
- * struct ap_queue_table - a table of queue objects.
- *
- * @queues: a hashtable of queues (struct vfio_ap_queue).
- */
+  * struct ap_queue_table - a table of queue objects.
+  *
+  * @queues: a hashtable of queues (struct vfio_ap_queue).
+  */
 struct ap_queue_table {
 	DECLARE_HASHTABLE(queues, 8);
 };
@@ -133,6 +133,8 @@ struct ap_matrix_mdev {
  * @apqn: the APQN of the AP queue device
  * @saved_isc: the guest ISC registered with the GIB interface
  * @mdev_qnode: allows the vfio_ap_queue struct to be added to a hashtable
+ * @reset_qnode: allows the vfio_ap_queue struct to be added to a list of queues
+ *		 that need to be reset
  * @reset_status: the status from the last reset of the queue
  * @reset_work: work to wait for queue reset to complete
  */
@@ -143,6 +145,7 @@ struct vfio_ap_queue {
 #define VFIO_AP_ISC_INVALID 0xff
 	unsigned char saved_isc;
 	struct hlist_node mdev_qnode;
+	struct list_head reset_qnode;
 	struct ap_queue_status reset_status;
 	struct work_struct reset_work;
 };
-- 
2.43.0


^ permalink raw reply related	[relevance 13%]

* [PATCH v1 1/6] s390/vfio-ap: always filter entire AP matrix
  @ 2023-12-08 16:22 20% ` Tony Krowiak
  2023-12-08 16:22 14% ` [PATCH v1 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-12-08 16:22 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

The vfio_ap_mdev_filter_matrix function is called whenever a new adapter or
domain is assigned to the mdev. The purpose of the function is to update
the guest's AP configuration by filtering the matrix of adapters and
domains assigned to the mdev. When an adapter or domain is assigned, only
the APQNs associated with the APID of the new adapter or APQI of the new
domain are inspected. If an APQN does not reference a queue device bound to
the vfio_ap device driver, then its APID will be filtered from the mdev's
matrix when updating the guest's AP configuration.

Inspecting only the APID of the new adapter or APQI of the new domain will
result in passing AP queues through to a guest that are not bound to the
vfio_ap device driver under certain circumstances. Consider the following:

guest's AP configuration (all also assigned to the mdev's matrix):
14.0004
14.0005
14.0006
16.0004
16.0005
16.0006

unassign domain 4
unbind queue 16.0005
assign domain 4

When domain 4 is re-assigned, since only domain 4 will be inspected, the
APQNs that will be examined will be:
14.0004
16.0004

Since both of those APQNs reference queue devices that are bound to the
vfio_ap device driver, nothing will get filtered from the mdev's matrix
when updating the guest's AP configuration. Consequently, queue 16.0005
will get passed through despite not being bound to the driver. This
violates the Linux device model requirement that a guest shall only be
given access to devices bound to the device driver facilitating their
pass-through.

To resolve this problem, every adapter and domain assigned to the mdev will
be inspected when filtering the mdev's matrix.
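
Roughly, the new filtering pass amounts to the following sketch, where
apqn_is_bound() is only a stand-in name for the driver's real "is this APQN
bound to vfio_ap?" lookup:

  for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
          for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
                  /* Any unbound APQN disqualifies the whole adapter. */
                  if (!apqn_is_bound(AP_MKQID(apid, apqi)))
                          clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
          }
  }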

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 57 +++++++++----------------------
 1 file changed, 17 insertions(+), 40 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 4db538a55192..9382b32e5bd1 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -670,8 +670,7 @@ static bool vfio_ap_mdev_filter_cdoms(struct ap_matrix_mdev *matrix_mdev)
  * Return: a boolean value indicating whether the KVM guest's APCB was changed
  *	   by the filtering or not.
  */
-static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
-				       struct ap_matrix_mdev *matrix_mdev)
+static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 {
 	unsigned long apid, apqi, apqn;
 	DECLARE_BITMAP(prev_shadow_apm, AP_DEVICES);
@@ -692,8 +691,8 @@ static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
 	bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
 		   (unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
 
-	for_each_set_bit_inv(apid, apm, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, aqm, AP_DOMAINS) {
+	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
 			/*
 			 * If the APQN is not bound to the vfio_ap device
 			 * driver, then we can't assign it to the guest's
@@ -958,7 +957,6 @@ static ssize_t assign_adapter_store(struct device *dev,
 {
 	int ret;
 	unsigned long apid;
-	DECLARE_BITMAP(apm_delta, AP_DEVICES);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -987,11 +985,8 @@ static ssize_t assign_adapter_store(struct device *dev,
 	}
 
 	vfio_ap_mdev_link_adapter(matrix_mdev, apid);
-	memset(apm_delta, 0, sizeof(apm_delta));
-	set_bit_inv(apid, apm_delta);
 
-	if (vfio_ap_mdev_filter_matrix(apm_delta,
-				       matrix_mdev->matrix.aqm, matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
 	ret = count;
@@ -1167,7 +1162,6 @@ static ssize_t assign_domain_store(struct device *dev,
 {
 	int ret;
 	unsigned long apqi;
-	DECLARE_BITMAP(aqm_delta, AP_DOMAINS);
 	struct ap_matrix_mdev *matrix_mdev = dev_get_drvdata(dev);
 
 	mutex_lock(&ap_perms_mutex);
@@ -1196,11 +1190,8 @@ static ssize_t assign_domain_store(struct device *dev,
 	}
 
 	vfio_ap_mdev_link_domain(matrix_mdev, apqi);
-	memset(aqm_delta, 0, sizeof(aqm_delta));
-	set_bit_inv(apqi, aqm_delta);
 
-	if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm, aqm_delta,
-				       matrix_mdev))
+	if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 		vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 
 	ret = count;
@@ -2091,9 +2082,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 	if (matrix_mdev) {
 		vfio_ap_mdev_link_queue(matrix_mdev, q);
 
-		if (vfio_ap_mdev_filter_matrix(matrix_mdev->matrix.apm,
-					       matrix_mdev->matrix.aqm,
-					       matrix_mdev))
+		if (vfio_ap_mdev_filter_matrix(matrix_mdev))
 			vfio_ap_mdev_update_guest_apcb(matrix_mdev);
 	}
 	dev_set_drvdata(&apdev->device, q);
@@ -2443,34 +2432,22 @@ void vfio_ap_on_cfg_changed(struct ap_config_info *cur_cfg_info,
 
 static void vfio_ap_mdev_hot_plug_cfg(struct ap_matrix_mdev *matrix_mdev)
 {
-	bool do_hotplug = false;
-	int filter_domains = 0;
-	int filter_adapters = 0;
-	DECLARE_BITMAP(apm, AP_DEVICES);
-	DECLARE_BITMAP(aqm, AP_DOMAINS);
+	bool filter_domains, filter_adapters, filter_cdoms, do_hotplug = false;
 
 	mutex_lock(&matrix_mdev->kvm->lock);
 	mutex_lock(&matrix_dev->mdevs_lock);
 
-	filter_adapters = bitmap_and(apm, matrix_mdev->matrix.apm,
-				     matrix_mdev->apm_add, AP_DEVICES);
-	filter_domains = bitmap_and(aqm, matrix_mdev->matrix.aqm,
-				    matrix_mdev->aqm_add, AP_DOMAINS);
-
-	if (filter_adapters && filter_domains)
-		do_hotplug |= vfio_ap_mdev_filter_matrix(apm, aqm, matrix_mdev);
-	else if (filter_adapters)
-		do_hotplug |=
-			vfio_ap_mdev_filter_matrix(apm,
-						   matrix_mdev->shadow_apcb.aqm,
-						   matrix_mdev);
-	else
-		do_hotplug |=
-			vfio_ap_mdev_filter_matrix(matrix_mdev->shadow_apcb.apm,
-						   aqm, matrix_mdev);
+	filter_adapters = bitmap_intersects(matrix_mdev->matrix.apm,
+					    matrix_mdev->apm_add, AP_DEVICES);
+	filter_domains = bitmap_intersects(matrix_mdev->matrix.aqm,
+					   matrix_mdev->aqm_add, AP_DOMAINS);
+	filter_cdoms = bitmap_intersects(matrix_mdev->matrix.adm,
+					 matrix_mdev->adm_add, AP_DOMAINS);
+
+	if (filter_adapters || filter_domains)
+		do_hotplug = vfio_ap_mdev_filter_matrix(matrix_mdev);
 
-	if (bitmap_intersects(matrix_mdev->matrix.adm, matrix_mdev->adm_add,
-			      AP_DOMAINS))
+	if (filter_cdoms)
 		do_hotplug |= vfio_ap_mdev_filter_cdoms(matrix_mdev);
 
 	if (do_hotplug)
-- 
2.43.0


^ permalink raw reply related	[relevance 20%]

* [PATCH v1 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration
    2023-12-08 16:22 20% ` [PATCH v1 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
@ 2023-12-08 16:22 14% ` Tony Krowiak
  2023-12-08 16:22 13% ` [PATCH v1 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-12-08 16:22 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, borntraeger, pasic, pbonzini, frankja, imbrenda,
	alex.williamson, kwankhede, stable

While filtering the mdev matrix, it doesn't make sense - and will have
unexpected results - to filter an APID from the matrix if the APID or one
of the associated APQIs is not in the host's AP configuration. There are
two reasons for this:

1. An adapter or domain that is not in the host's AP configuration can be
   assigned to the matrix; this is known as over-provisioning. Queue
   devices, however, are only created for adapters and domains in the
   host's AP configuration, so there will be no queues associated with an
   over-provisioned adapter or domain to filter.

2. The adapter or domain may have been externally removed from the host's
   configuration via an SE or HMC attached to a DPM enabled LPAR. In this
   case, the vfio_ap device driver would have been notified by the AP bus
   via the on_config_changed callback and the adapter or domain would
   have already been filtered.

Since the matrix_mdev->shadow_apcb.apm and matrix_mdev->shadow_apcb.aqm are
copied from the mdev matrix sans the APIDs and APQIs not in the host's AP
configuration, let's loop over those bitmaps instead of those assigned to
the matrix.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Fixes: 48cae940c31d ("s390/vfio-ap: refresh guest's APCB by filtering AP resources assigned to mdev")
Cc: <stable@vger.kernel.org>
---
 drivers/s390/crypto/vfio_ap_ops.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 9382b32e5bd1..47232e19a50e 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -691,8 +691,9 @@ static bool vfio_ap_mdev_filter_matrix(struct ap_matrix_mdev *matrix_mdev)
 	bitmap_and(matrix_mdev->shadow_apcb.aqm, matrix_mdev->matrix.aqm,
 		   (unsigned long *)matrix_dev->info.aqm, AP_DOMAINS);
 
-	for_each_set_bit_inv(apid, matrix_mdev->matrix.apm, AP_DEVICES) {
-		for_each_set_bit_inv(apqi, matrix_mdev->matrix.aqm, AP_DOMAINS) {
+	for_each_set_bit_inv(apid, matrix_mdev->shadow_apcb.apm, AP_DEVICES) {
+		for_each_set_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm,
+				     AP_DOMAINS) {
 			/*
 			 * If the APQN is not bound to the vfio_ap device
 			 * driver, then we can't assign it to the guest's
-- 
2.43.0


^ permalink raw reply related	[relevance 14%]

* Re: [PATCH v6 00/16] Add Secure TSC support for SNP guests
  2023-11-28 12:59  2% [PATCH v6 00/16] Add Secure TSC support for SNP guests Nikunj A Dadhania
@ 2023-12-06 17:46  0% ` Peter Gonda
  0 siblings, 0 replies; 200+ results
From: Peter Gonda @ 2023-12-06 17:46 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: linux-kernel, thomas.lendacky, x86, kvm, bp, mingo, tglx,
	dave.hansen, dionnaglaze, seanjc, pbonzini

On Tue, Nov 28, 2023 at 6:00 AM Nikunj A Dadhania <nikunj@amd.com> wrote:
>
> Secure TSC allows guests to securely use RDTSC/RDTSCP instructions as the
> parameters being used cannot be changed by the hypervisor once the guest is
> launched. More details in the AMD64 APM Vol 2, Section "Secure TSC".
>
> During the boot-up of the secondary cpus, SecureTSC enabled guests need to
> query TSC info from the AMD Security Processor. This communication channel is
> encrypted between the AMD Security Processor and the guest; the hypervisor
> is just the conduit to deliver the guest messages to the AMD Security
> Processor. Each message is protected with an AEAD (AES-256 GCM). See "SEV
> Secure Nested Paging Firmware ABI Specification" document (currently at
> https://www.amd.com/system/files/TechDocs/56860.pdf) section "TSC Info"
>
> Use a minimal GCM library to encrypt/decrypt SNP Guest messages to
> communicate with the AMD Security Processor which is available at early
> boot.
>
> The SEV-guest driver has the implementation for guest and AMD Security
> Processor communication. As the TSC_INFO needs to be initialized during
> early boot before smp cpus are started, move most of the sev-guest driver
> code to kernel/sev.c and provide well-defined APIs for the sev-guest driver
> to use, avoiding code duplication.
>
> Patches:
> 01-08: Preparation and movement of sev-guest driver code
> 09-16: SecureTSC enablement patches.
>
> Testing SecureTSC
> -----------------
>
> SecureTSC hypervisor patches based on top of SEV-SNP Guest MEMFD series:
> https://github.com/nikunjad/linux/tree/snp-host-latest-securetsc_v5
>
> QEMU changes:
> https://github.com/nikunjad/qemu/tree/snp_securetsc_v5
>
> QEMU commandline SEV-SNP-UPM with SecureTSC:
>
>   qemu-system-x86_64 -cpu EPYC-Milan-v2,+secure-tsc,+invtsc -smp 4 \
>     -object memory-backend-memfd-private,id=ram1,size=1G,share=true \
>     -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,secure-tsc=on \
>     -machine q35,confidential-guest-support=sev0,memory-backend=ram1,kvm-type=snp \
>     ...

Thanks Nikunj!

I was able to modify my SNP host kernel to support SecureTSC based off
of your `snp-host-latest-securetsc_v5` and use that to test this
series. It seemed to work as intended in the happy path, but I didn't
spend much time trying any corner cases. I also checked that the series
continues to work without SecureTSC enabled for the VM.

Tested-by: Peter Gonda <pgonda@google.com>

^ permalink raw reply	[relevance 0%]

* Re: [RFC PATCH 0/4] KVM: SEV: Limit cache flush operations in sev guest memory reclaim events
  2023-12-01 22:30  4%             ` Sean Christopherson
@ 2023-12-02  6:21  0%               ` Mingwei Zhang
  0 siblings, 0 replies; 200+ results
From: Mingwei Zhang @ 2023-12-02  6:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ashish Kalra, Jacky Li, Paolo Bonzini, Ovidiu Panait,
	Liam Merwick, David Rientjes, David Kaplan, Peter Gonda, kvm,
	Michael Roth

On Fri, Dec 01, 2023, Sean Christopherson wrote:
> On Fri, Dec 01, 2023, Mingwei Zhang wrote:
> > On Fri, Dec 1, 2023 at 2:13 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Dec 01, 2023, Mingwei Zhang wrote:
> > > > On Fri, Dec 1, 2023 at 1:30 PM Kalra, Ashish <ashish.kalra@amd.com> wrote:
> > > > > For SNP + gmem, where the HVA ranges covered by the MMU notifiers are
> > > > > not acting on encrypted pages, we are ignoring MMU invalidation
> > > > > notifiers for SNP guests as part of the SNP host patches being posted
> > > > > upstream and instead relying on gmem's own invalidation stuff to clean
> > > > > them up on a per-folio basis.
> > > > >
> > > > > Thanks,
> > > > > Ashish
> > > >
> > > > oh, I have no question about that. This series only applies to
> > > > SEV/SEV-ES type of VMs.
> > > >
> > > > For SNP + guest_memfd, I don't see the implementation details, but I
> > > > doubt you can ignore mmu_notifiers if the request does cover some
> > > > encrypted memory in error cases or corner cases. Does the SNP enforce
> > > > the usage of guest_memfd? How do we prevent exceptional cases? I am
> > > > sure you guys already figured out the answers, so I don't plan to dig
> > > > deeper until SNP host pages are accepted.
> > >
> > > KVM will not allow SNP guests to map VMA-based memory as encrypted/private, full
> > > stop.  Any invalidations initiated by mmu_notifiers will therefore apply only to
> > > shared memory.
> > 
> > Remind me. If I (as a SEV-SNP guest) flip the C-bit in my own x86 page
> > table and write to some of the pages, am I generating encrypted dirty
> > cache lines?
> 
> No.  See Table 15-39. "RMP Memory Access Checks" in the APM (my attempts to copy
> it to plain text failed miserably).
> 
> For accesses with effective C-bit == 0, the page must be Hypervisor-Owned.  For
> effective C-bit == 1, the page must be fully assigned to the guest.  Violation
> of those rules generates #VMEXIT.
> 
> A missing Validated attribute causes a #VC, but that case has lower priority than
> the above checks.

Thank you. The RMP check seems to be mandatory on all memory accesses when
SNP is active. Hope that property remains an invariant regardless of any
future optimization.

Thanks.
-Mingwei

^ permalink raw reply	[relevance 0%]

* Re: [RFC PATCH 0/4] KVM: SEV: Limit cache flush operations in sev guest memory reclaim events
  @ 2023-12-01 22:30  4%             ` Sean Christopherson
  2023-12-02  6:21  0%               ` Mingwei Zhang
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2023-12-01 22:30 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Ashish Kalra, Jacky Li, Paolo Bonzini, Ovidiu Panait,
	Liam Merwick, David Rientjes, David Kaplan, Peter Gonda, kvm,
	Michael Roth

On Fri, Dec 01, 2023, Mingwei Zhang wrote:
> On Fri, Dec 1, 2023 at 2:13 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Dec 01, 2023, Mingwei Zhang wrote:
> > > On Fri, Dec 1, 2023 at 1:30 PM Kalra, Ashish <ashish.kalra@amd.com> wrote:
> > > > For SNP + gmem, where the HVA ranges covered by the MMU notifiers are
> > > > not acting on encrypted pages, we are ignoring MMU invalidation
> > > > notifiers for SNP guests as part of the SNP host patches being posted
> > > > upstream and instead relying on gmem's own invalidation stuff to clean
> > > > them up on a per-folio basis.
> > > >
> > > > Thanks,
> > > > Ashish
> > >
> > > oh, I have no question about that. This series only applies to
> > > SEV/SEV-ES type of VMs.
> > >
> > > For SNP + guest_memfd, I don't see the implementation details, but I
> > > doubt you can ignore mmu_notifiers if the request does cover some
> > > encrypted memory in error cases or corner cases. Does the SNP enforce
> > > the usage of guest_memfd? How do we prevent exceptional cases? I am
> > > sure you guys already figured out the answers, so I don't plan to dig
> > > deeper until SNP host pages are accepted.
> >
> > KVM will not allow SNP guests to map VMA-based memory as encrypted/private, full
> > stop.  Any invalidations initiated by mmu_notifiers will therefore apply only to
> > shared memory.
> 
> Remind me. If I (as a SEV-SNP guest) flip the C-bit in my own x86 page
> table and write to some of the pages, am I generating encrypted dirty
> cache lines?

No.  See Table 15-39. "RMP Memory Access Checks" in the APM (my attempts to copy
it to plain text failed miserably).

For accesses with effective C-bit == 0, the page must be Hypervisor-Owned.  For
effective C-bit == 1, the page must be fully assigned to the guest.  Violation
of those rules generates #VMEXIT.

A missing Validated attribute causes a #VC, but that case has lower priority than
the above checks.

^ permalink raw reply	[relevance 4%]

* [PATCH v6 00/16] Add Secure TSC support for SNP guests
@ 2023-11-28 12:59  2% Nikunj A Dadhania
  2023-12-06 17:46  0% ` Peter Gonda
  0 siblings, 1 reply; 200+ results
From: Nikunj A Dadhania @ 2023-11-28 12:59 UTC (permalink / raw)
  To: linux-kernel, thomas.lendacky, x86, kvm
  Cc: bp, mingo, tglx, dave.hansen, dionnaglaze, pgonda, seanjc,
	pbonzini, nikunj

Secure TSC allows guests to securely use RDTSC/RDTSCP instructions as the
parameters being used cannot be changed by the hypervisor once the guest is
launched. More details in the AMD64 APM Vol 2, Section "Secure TSC".

During the boot-up of the secondary cpus, SecureTSC enabled guests need to
query TSC info from the AMD Security Processor. This communication channel is
encrypted between the AMD Security Processor and the guest; the hypervisor
is just the conduit to deliver the guest messages to the AMD Security
Processor. Each message is protected with an AEAD (AES-256 GCM). See "SEV
Secure Nested Paging Firmware ABI Specification" document (currently at
https://www.amd.com/system/files/TechDocs/56860.pdf) section "TSC Info"

Use a minimal GCM library to encrypt/decrypt SNP Guest messages to
communicate with the AMD Security Processor which is available at early
boot.

The SEV-guest driver has the implementation for guest and AMD Security
Processor communication. As the TSC_INFO needs to be initialized during
early boot before smp cpus are started, move most of the sev-guest driver
code to kernel/sev.c and provide well-defined APIs for the sev-guest driver
to use, avoiding code duplication.

Patches:
01-08: Preparation and movement of sev-guest driver code
09-16: SecureTSC enablement patches.

Testing SecureTSC
-----------------

SecureTSC hypervisor patches based on top of SEV-SNP Guest MEMFD series:
https://github.com/nikunjad/linux/tree/snp-host-latest-securetsc_v5

QEMU changes:
https://github.com/nikunjad/qemu/tree/snp_securetsc_v5

QEMU commandline SEV-SNP-UPM with SecureTSC:

  qemu-system-x86_64 -cpu EPYC-Milan-v2,+secure-tsc,+invtsc -smp 4 \
    -object memory-backend-memfd-private,id=ram1,size=1G,share=true \
    -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,secure-tsc=on \
    -machine q35,confidential-guest-support=sev0,memory-backend=ram1,kvm-type=snp \
    ...

Changelog:
----------
v6:
* Add synthetic SecureTSC x86 feature bit
* Drop {__enc,dec}_payload() as they are pretty small and each has one caller.
* Instead of a pointer, use data_npages as a variable
* Beautify struct snp_guest_req
* Make vmpck_id as unsigned int in snp_assign_vmpck()
* Move most of the functions to end of sev.c file
* Update commit/comments/error messages
* Mark free_shared_pages and alloc_shared_pages as inline
* Free snp_dev->certs_data when guest driver is removed
* Add lockdep assert in snp_inc_msg_seqno()
* Drop redundant enc_init NULL check
* Move SNP_TSC_INFO_REQ_SZ define out of structure
* Rename guest_tsc_{scale,offset} to snp_tsc_{scale,offset}
* Add new linux termination error code GHCB_TERM_SECURE_TSC
* Initialize and use cmd_mutex in snp_get_tsc_info()
* Set TSC as reliable in sme_early_init()
* Do not print firmware bug for Secure TSC enabled guests

v5:
* Rebased on v6.6 kernel
* Dropped link tag in first patch
* Dropped get_ctx_authsize() as it was redundant

https://lore.kernel.org/lkml/20231030063652.68675-1-nikunj@amd.com/

v4:
* Drop handle_guest_request() and handle_guest_request_ext()
* Drop NULL check for key
* Corrected commit subject
* Added Reviewed-by from Tom

https://lore.kernel.org/lkml/20230814055222.1056404-1-nikunj@amd.com/

v3:
* Updated commit messages
* Made snp_setup_psp_messaging() generic so that it can be accessed by both
  the kernel and the driver
* Moved most of the context information to sev.c, sev-guest driver
  does not need to know the secrets page layout anymore
* Add CC_ATTR_GUEST_SECURE_TSC early in the series so that it can be
  used in later patches.
* Removed data_gpa and data_npages from struct snp_req_data, as certs_data
  and its size is passed to handle_guest_request_ext()
* Make vmpck_id as unsigned int
* Dropped unnecessary usage of memzero_explicit()
* Cache secrets_pa instead of remapping the cc_blob always
* Rebase on top of v6.4 kernel
https://lore.kernel.org/lkml/20230722111909.15166-1-nikunj@amd.com/

v2:
* Rebased on top of v6.3-rc3 that has Boris's sev-guest cleanup series
  https://lore.kernel.org/r/20230307192449.24732-1-bp@alien8.de/

v1: https://lore.kernel.org/r/20230130120327.977460-1-nikunj@amd.com/

Nikunj A Dadhania (16):
  virt: sev-guest: Use AES GCM crypto library
  virt: sev-guest: Move mutex to SNP guest device structure
  virt: sev-guest: Replace dev_dbg with pr_debug
  virt: sev-guest: Add SNP guest request structure
  virt: sev-guest: Add vmpck_id to snp_guest_dev struct
  x86/sev: Cache the secrets page address
  x86/sev: Move and reorganize sev guest request api
  x86/mm: Add generic guest initialization hook
  x86/cpufeatures: Add synthetic Secure TSC bit
  x86/sev: Add Secure TSC support for SNP guests
  x86/sev: Change TSC MSR behavior for Secure TSC enabled guests
  x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled
    guests
  x86/kvmclock: Skip kvmclock when Secure TSC is available
  x86/sev: Mark Secure TSC as reliable
  x86/cpu/amd: Do not print FW_BUG for Secure TSC
  x86/sev: Enable Secure TSC for SNP guests

 arch/x86/Kconfig                        |   1 +
 arch/x86/boot/compressed/sev.c          |   3 +-
 arch/x86/include/asm/cpufeatures.h      |   1 +
 arch/x86/include/asm/sev-common.h       |   1 +
 arch/x86/include/asm/sev-guest.h        | 191 +++++++
 arch/x86/include/asm/sev.h              |  20 +-
 arch/x86/include/asm/svm.h              |   6 +-
 arch/x86/include/asm/x86_init.h         |   2 +
 arch/x86/kernel/cpu/amd.c               |   3 +-
 arch/x86/kernel/kvmclock.c              |   2 +-
 arch/x86/kernel/sev-shared.c            |  10 +
 arch/x86/kernel/sev.c                   | 622 ++++++++++++++++++++--
 arch/x86/kernel/x86_init.c              |   2 +
 arch/x86/mm/mem_encrypt.c               |  12 +-
 arch/x86/mm/mem_encrypt_amd.c           |  11 +
 drivers/virt/coco/sev-guest/Kconfig     |   3 -
 drivers/virt/coco/sev-guest/sev-guest.c | 661 +++---------------------
 drivers/virt/coco/sev-guest/sev-guest.h |  63 ---
 18 files changed, 888 insertions(+), 726 deletions(-)
 create mode 100644 arch/x86/include/asm/sev-guest.h
 delete mode 100644 drivers/virt/coco/sev-guest/sev-guest.h


base-commit: 98b1cc82c4affc16f5598d4fa14b1858671b2263
-- 
2.34.1


^ permalink raw reply	[relevance 2%]

* Re: [PATCH 1/2] Revert "nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB"
  2023-10-18 19:41  5% ` [PATCH 1/2] Revert "nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB" Sean Christopherson
@ 2023-11-20 11:41  0%   ` Maxim Levitsky
  0 siblings, 0 replies; 200+ results
From: Maxim Levitsky @ 2023-11-20 11:41 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Stefan Sterz

On Wed, 2023-10-18 at 12:41 -0700, Sean Christopherson wrote:
> Revert KVM's made-up consistency check on SVM's TLB control.  The APM says
> that unsupported encodings are reserved, but the APM doesn't state that
> VMRUN checks for a supported encoding.  Unless something is called out
> in "Canonicalization and Consistency Checks" or listed as MBZ (Must Be
> Zero), AMD behavior is typically to let software shoot itself in the foot.
> 
> This reverts commit 174a921b6975ef959dd82ee9e8844067a62e3ec1.
> 
> Fixes: 174a921b6975 ("nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB")
> Reported-by: Stefan Sterz <s.sterz@proxmox.com>
> Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/nested.c | 15 ---------------
>  1 file changed, 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 3fea8c47679e..60891b9ce25f 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -247,18 +247,6 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size)
>  	    kvm_vcpu_is_legal_gpa(vcpu, addr + size - 1);
>  }
>  
> -static bool nested_svm_check_tlb_ctl(struct kvm_vcpu *vcpu, u8 tlb_ctl)
> -{
> -	/* Nested FLUSHBYASID is not supported yet.  */
> -	switch(tlb_ctl) {
> -		case TLB_CONTROL_DO_NOTHING:
> -		case TLB_CONTROL_FLUSH_ALL_ASID:
> -			return true;
> -		default:
> -			return false;
> -	}
> -}
> -
>  static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
>  					 struct vmcb_ctrl_area_cached *control)
>  {
> @@ -278,9 +266,6 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
>  					   IOPM_SIZE)))
>  		return false;
>  
> -	if (CC(!nested_svm_check_tlb_ctl(vcpu, control->tlb_ctl)))
> -		return false;
> -
>  	if (CC((control->int_ctl & V_NMI_ENABLE_MASK) &&
>  	       !vmcb12_is_intercept(control, INTERCEPT_NMI))) {
>  		return false;


Yes, after checking Jim's comment (*) on this, I still agree that the revert is OK.
KVM never passes through the tlb_ctl field (but does copy it to the cache),
thus there is no need to sanitize it.

https://www.spinics.net/lists/kvm/msg316072.html


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[relevance 0%]

* [kvm-unit-tests PATCH v3 3/7] lib: s390x: ap: Add proper ap setup code
    2023-11-17 15:19  4% ` [kvm-unit-tests PATCH v3 1/7] lib: s390x: Add ap library Janosch Frank
@ 2023-11-17 15:19  5% ` Janosch Frank
  1 sibling, 0 replies; 200+ results
From: Janosch Frank @ 2023-11-17 15:19 UTC (permalink / raw)
  To: kvm; +Cc: linux-s390, imbrenda, thuth, david, nsg, nrb

For the next tests we need a valid queue, which means we need to grab
the QCI info and search for the first set bits in the AP and AQ masks.

Let's move from the stfle 12 check to proper setup code that also
returns arrays of the available APs and queue numbers.
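
As a rough usage sketch of the new interface (test_one_apqn() is a made-up
placeholder, not an existing helper), a caller that wants the discovered
arrays could do:

  uint8_t *aps, *qns;
  uint8_t naps, nqns;
  int setup_rc = ap_setup(&aps, &qns, &naps, &nqns);

  if (setup_rc == AP_SETUP_READY)
          /* aps[0]/qns[0] are the first usable AP and queue indices */
          test_one_apqn(aps[0], qns[0]);
  else
          report_skip("AP setup incomplete (rc %d)", setup_rc);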

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 lib/s390x/ap.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++---
 lib/s390x/ap.h | 17 +++++++++-
 s390x/ap.c     |  4 ++-
 3 files changed, 105 insertions(+), 6 deletions(-)

diff --git a/lib/s390x/ap.c b/lib/s390x/ap.c
index 17a32d66..9ba5a037 100644
--- a/lib/s390x/ap.c
+++ b/lib/s390x/ap.c
@@ -13,10 +13,18 @@
 
 #include <libcflat.h>
 #include <interrupt.h>
+#include <alloc.h>
+#include <bitops.h>
 #include <ap.h>
 #include <asm/time.h>
 #include <asm/facility.h>
 
+static uint8_t num_ap;
+static uint8_t num_queue;
+static uint8_t *array_ap;
+static uint8_t *array_queue;
+static struct ap_config_info qci;
+
 int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
 		 struct pqap_r2 *r2)
 {
@@ -78,16 +86,90 @@ int ap_pqap_qci(struct ap_config_info *info)
 	return cc;
 }
 
-/* Will later be extended to a proper setup function */
-bool ap_setup(void)
+static int get_entry(uint8_t *ptr, int i, size_t len)
 {
+	/* Skip over the last entry */
+	if (i)
+		i++;
+
+	for (; i < 8 * len; i++) {
+		/* Do we even need to check the byte? */
+		if (!ptr[i / 8]) {
+			i += 7;
+			continue;
+		}
+
+		/* Check the bit in big-endian order */
+		if (ptr[i / 8] & BIT(7 - (i % 8)))
+			return i;
+	}
+	return -1;
+}
+
+static void setup_info(void)
+{
+	int i, entry = 0;
+
+	ap_pqap_qci(&qci);
+
+	for (i = 0; i < AP_ARRAY_MAX_NUM; i++) {
+		entry = get_entry((uint8_t *)qci.apm, entry, sizeof(qci.apm));
+		if (entry == -1)
+			break;
+
+		if (!num_ap) {
+			/*
+			 * We have at least one AP, time to
+			 * allocate the queue arrays
+			 */
+			array_ap = malloc(AP_ARRAY_MAX_NUM);
+			array_queue = malloc(AP_ARRAY_MAX_NUM);
+		}
+		array_ap[num_ap] = entry;
+		num_ap += 1;
+	}
+
+	/* Without an AP we don't even need to look at the queues */
+	if (!num_ap)
+		return;
+
+	entry = 0;
+	for (i = 0; i < AP_ARRAY_MAX_NUM; i++) {
+		entry = get_entry((uint8_t *)qci.aqm, entry, sizeof(qci.aqm));
+		if (entry == -1)
+			return;
+
+		array_queue[num_queue] = entry;
+		num_queue += 1;
+	}
+
+}
+
+int ap_setup(uint8_t **ap_array, uint8_t **qn_array, uint8_t *naps, uint8_t *nqns)
+{
+	assert(!num_ap);
+
 	/*
 	 * Base AP support has no STFLE or SCLP feature bit but the
 	 * PQAP QCI support is indicated via stfle bit 12. As this
 	 * library relies on QCI we bail out if it's not available.
 	 */
 	if (!test_facility(12))
-		return false;
+		return AP_SETUP_NOINSTR;
 
-	return true;
+	/* Setup ap and queue arrays */
+	setup_info();
+
+	if (!num_ap)
+		return AP_SETUP_NOAPQN;
+
+	/* Only provide the data if the caller actually needs it. */
+	if (ap_array && qn_array && naps && nqns) {
+		*ap_array = array_ap;
+		*qn_array = array_queue;
+		*naps = num_ap;
+		*nqns = num_queue;
+	}
+
+	return AP_SETUP_READY;
 }
diff --git a/lib/s390x/ap.h b/lib/s390x/ap.h
index cda1e564..ac9e59d1 100644
--- a/lib/s390x/ap.h
+++ b/lib/s390x/ap.h
@@ -81,7 +81,22 @@ struct pqap_r2 {
 } __attribute__((packed))  __attribute__((aligned(8)));
 _Static_assert(sizeof(struct pqap_r2) == sizeof(uint64_t), "pqap_r2 size");
 
-bool ap_setup(void);
+/*
+ * Maximum number of APQNs that we keep track of.
+ *
+ * This value is already way larger than the number of APQNs a AP test
+ * is probably going to use.
+ */
+#define AP_ARRAY_MAX_NUM	16
+
+/* Return values of ap_setup() */
+enum {
+	AP_SETUP_NOINSTR,	/* AP instructions not available */
+	AP_SETUP_NOAPQN,	/* Instructions available but no APQNs */
+	AP_SETUP_READY		/* Full setup complete, at least one APQN */
+};
+
+int ap_setup(uint8_t **ap_array, uint8_t **qn_array, uint8_t *naps, uint8_t *nqns);
 int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
 		 struct pqap_r2 *r2);
 int ap_pqap_qci(struct ap_config_info *info);
diff --git a/s390x/ap.c b/s390x/ap.c
index 94f08783..32feb8db 100644
--- a/s390x/ap.c
+++ b/s390x/ap.c
@@ -292,8 +292,10 @@ static void test_priv(void)
 
 int main(void)
 {
+	int setup_rc = ap_setup(NULL, NULL, NULL, NULL);
+
 	report_prefix_push("ap");
-	if (!ap_check()) {
+	if (setup_rc == AP_SETUP_NOINSTR) {
 		report_skip("AP instructions not available");
 		goto done;
 	}
-- 
2.34.1


^ permalink raw reply related	[relevance 5%]

* [kvm-unit-tests PATCH v3 1/7] lib: s390x: Add ap library
  @ 2023-11-17 15:19  4% ` Janosch Frank
  2023-11-17 15:19  5% ` [kvm-unit-tests PATCH v3 3/7] lib: s390x: ap: Add proper ap setup code Janosch Frank
  1 sibling, 0 replies; 200+ results
From: Janosch Frank @ 2023-11-17 15:19 UTC (permalink / raw)
  To: kvm; +Cc: linux-s390, imbrenda, thuth, david, nsg, nrb

Add functions and definitions needed to test the Adjunct
Processor (AP) crypto interface.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 lib/s390x/ap.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/s390x/ap.h | 88 +++++++++++++++++++++++++++++++++++++++++++++++
 s390x/Makefile |  1 +
 3 files changed, 182 insertions(+)
 create mode 100644 lib/s390x/ap.c
 create mode 100644 lib/s390x/ap.h

diff --git a/lib/s390x/ap.c b/lib/s390x/ap.c
new file mode 100644
index 00000000..17a32d66
--- /dev/null
+++ b/lib/s390x/ap.c
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * AP crypto functions
+ *
+ * Some parts taken from the Linux AP driver.
+ *
+ * Copyright IBM Corp. 2023
+ * Author: Janosch Frank <frankja@linux.ibm.com>
+ *	   Tony Krowiak <akrowia@linux.ibm.com>
+ *	   Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *	   Harald Freudenberger <freude@de.ibm.com>
+ */
+
+#include <libcflat.h>
+#include <interrupt.h>
+#include <ap.h>
+#include <asm/time.h>
+#include <asm/facility.h>
+
+int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
+		 struct pqap_r2 *r2)
+{
+	struct pqap_r0 r0 = {
+		.ap = ap,
+		.qn = qn,
+		.fc = PQAP_TEST_APQ
+	};
+	int cc;
+
+	/*
+	 * Test AP Queue
+	 *
+	 * Writes AP configuration information to the memory pointed
+	 * at by GR2.
+	 *
+	 * Inputs: GR0
+	 * Outputs: GR1 (APQSW), GR2 (tapq data)
+	 * Synchronous
+	 */
+	asm volatile(
+		"	lgr	0,%[r0]\n"
+		"	.insn	rre,0xb2af0000,0,0\n" /* PQAP */
+		"	stg	1,%[apqsw]\n"
+		"	stg	2,%[r2]\n"
+		"	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		: [apqsw] "=&T" (*apqsw), [r2] "=&T" (*r2), [cc] "=&d" (cc)
+		: [r0] "d" (r0));
+
+	return cc;
+}
+
+int ap_pqap_qci(struct ap_config_info *info)
+{
+	struct pqap_r0 r0 = { .fc = PQAP_QUERY_AP_CONF_INFO };
+	int cc;
+
+	/*
+	 * Query AP Configuration Information
+	 *
+	 * Writes AP configuration information to the memory pointed
+	 * at by GR2.
+	 *
+	 * Inputs: GR0, GR2 (QCI block address)
+	 * Outputs: memory at GR2 address
+	 * Synchronous
+	 */
+	asm volatile(
+		"	lgr	0,%[r0]\n"
+		"	lgr	2,%[info]\n"
+		"	.insn	rre,0xb2af0000,0,0\n" /* PQAP */
+		"	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		: [cc] "=&d" (cc)
+		: [r0] "d" (r0), [info] "d" (info)
+		: "cc", "memory", "0", "2");
+
+	return cc;
+}
+
+/* Will later be extended to a proper setup function */
+bool ap_setup(void)
+{
+	/*
+	 * Base AP support has no STFLE or SCLP feature bit but the
+	 * PQAP QCI support is indicated via stfle bit 12. As this
+	 * library relies on QCI we bail out if it's not available.
+	 */
+	if (!test_facility(12))
+		return false;
+
+	return true;
+}
diff --git a/lib/s390x/ap.h b/lib/s390x/ap.h
new file mode 100644
index 00000000..cda1e564
--- /dev/null
+++ b/lib/s390x/ap.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * AP definitions
+ *
+ * Some parts taken from the Linux AP driver.
+ *
+ * Copyright IBM Corp. 2023
+ * Author: Janosch Frank <frankja@linux.ibm.com>
+ *	   Tony Krowiak <akrowia@linux.ibm.com>
+ *	   Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *	   Harald Freudenberger <freude@de.ibm.com>
+ */
+
+#ifndef _S390X_AP_H_
+#define _S390X_AP_H_
+
+enum PQAP_FC {
+	PQAP_TEST_APQ,
+	PQAP_RESET_APQ,
+	PQAP_ZEROIZE_APQ,
+	PQAP_QUEUE_INT_CONTRL,
+	PQAP_QUERY_AP_CONF_INFO,
+	PQAP_QUERY_AP_COMP_TYPE,
+	PQAP_BEST_AP,
+};
+
+struct ap_queue_status {
+	uint32_t pad0;			/* Ignored padding for zArch  */
+	uint32_t empty		: 1;
+	uint32_t replies_waiting: 1;
+	uint32_t full		: 1;
+	uint32_t pad1		: 4;
+	uint32_t irq_enabled	: 1;
+	uint32_t rc		: 8;
+	uint32_t pad2		: 16;
+} __attribute__((packed))  __attribute__((aligned(8)));
+_Static_assert(sizeof(struct ap_queue_status) == sizeof(uint64_t), "APQSW size");
+
+struct ap_config_info {
+	uint8_t apsc	 : 1;	/* S bit */
+	uint8_t apxa	 : 1;	/* N bit */
+	uint8_t qact	 : 1;	/* C bit */
+	uint8_t rc8a	 : 1;	/* R bit */
+	uint8_t l	 : 1;	/* L bit */
+	uint8_t lext	 : 3;	/* Lext bits */
+	uint8_t reserved2[3];
+	uint8_t Na;		/* max # of APs - 1 */
+	uint8_t Nd;		/* max # of Domains - 1 */
+	uint8_t reserved6[10];
+	uint32_t apm[8];	/* AP ID mask */
+	uint32_t aqm[8];	/* AP (usage) queue mask */
+	uint32_t adm[8];	/* AP (control) domain mask */
+	uint8_t _reserved4[16];
+} __attribute__((aligned(8))) __attribute__ ((__packed__));
+_Static_assert(sizeof(struct ap_config_info) == 128, "PQAP QCI size");
+
+struct pqap_r0 {
+	uint32_t pad0;
+	uint8_t fc;
+	uint8_t t : 1;		/* Test facilities (TAPQ)*/
+	uint8_t pad1 : 7;
+	uint8_t ap;
+	uint8_t qn;
+} __attribute__((packed))  __attribute__((aligned(8)));
+
+struct pqap_r2 {
+	uint8_t s : 1;		/* Special Command facility */
+	uint8_t m : 1;		/* AP4KM */
+	uint8_t c : 1;		/* AP4KC */
+	uint8_t cop : 1;	/* AP is in coprocessor mode */
+	uint8_t acc : 1;	/* AP is in accelerator mode */
+	uint8_t xcp : 1;	/* AP is in XCP-mode */
+	uint8_t n : 1;		/* AP extended addressing facility */
+	uint8_t pad_0 : 1;
+	uint8_t pad_1[3];
+	uint8_t at;
+	uint8_t nd;
+	uint8_t pad_6;
+	uint8_t pad_7 : 4;
+	uint8_t qd : 4;
+} __attribute__((packed))  __attribute__((aligned(8)));
+_Static_assert(sizeof(struct pqap_r2) == sizeof(uint64_t), "pqap_r2 size");
+
+bool ap_setup(void);
+int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
+		 struct pqap_r2 *r2);
+int ap_pqap_qci(struct ap_config_info *info);
+#endif
diff --git a/s390x/Makefile b/s390x/Makefile
index f79fd009..2199eeba 100644
--- a/s390x/Makefile
+++ b/s390x/Makefile
@@ -110,6 +110,7 @@ cflatobjs += lib/s390x/malloc_io.o
 cflatobjs += lib/s390x/uv.o
 cflatobjs += lib/s390x/sie.o
 cflatobjs += lib/s390x/fault.o
+cflatobjs += lib/s390x/ap.o
 
 OBJDIRS += lib/s390x
 
-- 
2.34.1


^ permalink raw reply related	[relevance 4%]

* [PATCH v6 15/16] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  @ 2023-11-17  7:51  4% ` Zhao Liu
  0 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2023-11-17  7:51 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Michael S . Tsirkin,
	Richard Henderson, Paolo Bonzini, Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Zhuocheng Ding, Babu Moger,
	Yongwei Ma, Zhao Liu

From: Zhao Liu <zhao1.liu@intel.com>

The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
for cpuid 0x8000001D") adds the cache topology for AMD CPUs by encoding
the number of sharing threads directly.

From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

From the description above, the calculation of this field should be the same
as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
APIC ID to calculate this field.

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
     Information
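
As a minimal sketch of the rule quoted above (the helper below is
illustrative only, not QEMU code):

  /* ShareId = LocalApicId >> log2(NumSharingCache + 1), with the
   * divisor rounded up to the next power of two when needed. */
  static unsigned int share_id(unsigned int local_apic_id,
                               unsigned int num_sharing_cache)
  {
          unsigned int n = num_sharing_cache + 1;

          while (n & (n - 1))     /* round up to a power of two */
                  n++;

          return local_apic_id >> __builtin_ctz(n);  /* ctz(n) == log2(n) */
  }

For example, an L3 shared by 16 threads encodes EAX[25:14] = 15, so APIC IDs
0x23 and 0x2f both yield ShareId 2 and therefore share that L3.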

Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
Changes since v3:
 * Rewrite the subject. (Babu)
 * Delete the original "comment/help" expression, as this behavior is
   confirmed for AMD CPUs. (Babu)
 * Rename "num_apic_ids" (v3) to "num_sharing_cache" to match spec
   definition. (Babu)

Changes since v1:
 * Rename "l3_threads" to "num_apic_ids" in
   encode_cache_cpuid8000001d(). (Yanan)
 * Add the description of the original commit and add Cc.
---
 target/i386/cpu.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 86445e833a2d..3ddedd6de120 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -483,7 +483,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
                                        uint32_t *eax, uint32_t *ebx,
                                        uint32_t *ecx, uint32_t *edx)
 {
-    uint32_t l3_threads;
+    uint32_t num_sharing_cache;
     assert(cache->size == cache->line_size * cache->associativity *
                           cache->partitions * cache->sets);
 
@@ -492,13 +492,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
 
     /* L3 is shared among multiple cores */
     if (cache->level == 3) {
-        l3_threads = topo_info->modules_per_die *
-                     topo_info->cores_per_module *
-                     topo_info->threads_per_core;
-        *eax |= (l3_threads - 1) << 14;
+        num_sharing_cache = 1 << apicid_die_offset(topo_info);
     } else {
-        *eax |= ((topo_info->threads_per_core - 1) << 14);
+        num_sharing_cache = 1 << apicid_core_offset(topo_info);
     }
+    *eax |= (num_sharing_cache - 1) << 14;
 
     assert(cache->line_size > 0);
     assert(cache->partitions > 0);
-- 
2.34.1


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH v2] s390/vfio-ap: fix sysfs status attribute for AP queue devices
  2023-11-08 20:11  4% [PATCH v2] s390/vfio-ap: fix sysfs status attribute for AP queue devices Tony Krowiak
@ 2023-11-08 20:21  0% ` Tony Krowiak
  0 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-11-08 20:21 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, pasic, borntraeger, frankja, imbrenda, david,
	Harald Freudenberger

Christian,
Can this be pushed with the Acks by Halil and Harald?

On 11/8/23 15:11, Tony Krowiak wrote:
> The 'status' attribute for AP queue devices bound to the vfio_ap device
> driver displays incorrect status when the mediated device is attached to a
> guest, but the queue device is not passed through. In the current
> implementation, the status displayed is 'in_use' which is not correct; it
> should be 'assigned'. This can happen if one of the queue devices
> associated with a given adapter is not bound to the vfio_ap device driver.
> For example:
> 
> Queues listed in /sys/bus/ap/drivers/vfio_ap:
> 14.0005
> 14.0006
> 14.000d
> 16.0006
> 16.000d
> 
> Queues listed in /sys/devices/vfio_ap/matrix/$UUID/matrix
> 14.0005
> 14.0006
> 14.000d
> 16.0005
> 16.0006
> 16.000d
> 
> Queues listed in /sys/devices/vfio_ap/matrix/$UUID/guest_matrix
> 14.0005
> 14.0006
> 14.000d
> 
> The reason no queues for adapter 0x16 are listed in the guest_matrix is
> because queue 16.0005 is not bound to the vfio_ap device driver, so no
> queue associated with the adapter is passed through to the guest;
> therefore, each queue device for adapter 0x16 should display 'assigned'
> instead of 'in_use', because those queues are not in use by a guest, but
> only assigned to the mediated device.
> 
> Let's check the AP configuration for the guest to determine whether a
> queue device is passed through before displaying a status of 'in_use'.
> 
> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> Acked-by: Halil Pasic <pasic@linux.ibm.com>
> Acked-by: Harald Freudenberger <freude@linux.ibm.com>
> ---
>   drivers/s390/crypto/vfio_ap_ops.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index 4db538a55192..6e0a79086656 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1976,6 +1976,7 @@ static ssize_t status_show(struct device *dev,
>   {
>   	ssize_t nchars = 0;
>   	struct vfio_ap_queue *q;
> +	unsigned long apid, apqi;
>   	struct ap_matrix_mdev *matrix_mdev;
>   	struct ap_device *apdev = to_ap_dev(dev);
>   
> @@ -1983,8 +1984,21 @@ static ssize_t status_show(struct device *dev,
>   	q = dev_get_drvdata(&apdev->device);
>   	matrix_mdev = vfio_ap_mdev_for_queue(q);
>   
> +	/* If the queue is assigned to the matrix mediated device, then
> +	 * determine whether it is passed through to a guest; otherwise,
> +	 * indicate that it is unassigned.
> +	 */
>   	if (matrix_mdev) {
> -		if (matrix_mdev->kvm)
> +		apid = AP_QID_CARD(q->apqn);
> +		apqi = AP_QID_QUEUE(q->apqn);
> +		/*
> +		 * If the queue is passed through to the guest, then indicate
> +		 * that it is in use; otherwise, indicate that it is
> +		 * merely assigned to a matrix mediated device.
> +		 */
> +		if (matrix_mdev->kvm &&
> +		    test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
> +		    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
>   			nchars = scnprintf(buf, PAGE_SIZE, "%s\n",
>   					   AP_QUEUE_IN_USE);
>   		else

^ permalink raw reply	[relevance 0%]

* [PATCH v2] s390/vfio-ap: fix sysfs status attribute for AP queue devices
@ 2023-11-08 20:11  4% Tony Krowiak
  2023-11-08 20:21  0% ` Tony Krowiak
  0 siblings, 1 reply; 200+ results
From: Tony Krowiak @ 2023-11-08 20:11 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, pasic, borntraeger, frankja, imbrenda, david,
	Harald Freudenberger

The 'status' attribute for AP queue devices bound to the vfio_ap device
driver displays incorrect status when the mediated device is attached to a
guest, but the queue device is not passed through. In the current
implementation, the status displayed is 'in_use' which is not correct; it
should be 'assigned'. This can happen if one of the queue devices
associated with a given adapter is not bound to the vfio_ap device driver.
For example:

Queues listed in /sys/bus/ap/drivers/vfio_ap:
14.0005
14.0006
14.000d
16.0006
16.000d

Queues listed in /sys/devices/vfio_ap/matrix/$UUID/matrix
14.0005
14.0006
14.000d
16.0005
16.0006
16.000d

Queues listed in /sys/devices/vfio_ap/matrix/$UUID/guest_matrix
14.0005
14.0006
14.000d

The reason no queues for adapter 0x16 are listed in the guest_matrix is
because queue 16.0005 is not bound to the vfio_ap device driver, so no
queue associated with the adapter is passed through to the guest;
therefore, each queue device for adapter 0x16 should display 'assigned'
instead of 'in_use', because those queues are not in use by a guest, but
only assigned to the mediated device.

Let's check the AP configuration for the guest to determine whether a
queue device is passed through before displaying a status of 'in_use'.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Acked-by: Harald Freudenberger <freude@linux.ibm.com>
---
 drivers/s390/crypto/vfio_ap_ops.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 4db538a55192..6e0a79086656 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1976,6 +1976,7 @@ static ssize_t status_show(struct device *dev,
 {
 	ssize_t nchars = 0;
 	struct vfio_ap_queue *q;
+	unsigned long apid, apqi;
 	struct ap_matrix_mdev *matrix_mdev;
 	struct ap_device *apdev = to_ap_dev(dev);
 
@@ -1983,8 +1984,21 @@ static ssize_t status_show(struct device *dev,
 	q = dev_get_drvdata(&apdev->device);
 	matrix_mdev = vfio_ap_mdev_for_queue(q);
 
+	/* If the queue is assigned to the matrix mediated device, then
+	 * determine whether it is passed through to a guest; otherwise,
+	 * indicate that it is unassigned.
+	 */
 	if (matrix_mdev) {
-		if (matrix_mdev->kvm)
+		apid = AP_QID_CARD(q->apqn);
+		apqi = AP_QID_QUEUE(q->apqn);
+		/*
+		 * If the queue is passed through to the guest, then indicate
+		 * that it is in use; otherwise, indicate that it is
+		 * merely assigned to a matrix mediated device.
+		 */
+		if (matrix_mdev->kvm &&
+		    test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
+		    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
 			nchars = scnprintf(buf, PAGE_SIZE, "%s\n",
 					   AP_QUEUE_IN_USE);
 		else
-- 
2.41.0


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH v10 06/50] x86/sev: Add the host SEV-SNP initialization support
  2023-11-08  8:21  4%       ` Jeremi Piotrowski
@ 2023-11-08 15:19  4%         ` Tom Lendacky
  0 siblings, 0 replies; 200+ results
From: Tom Lendacky @ 2023-11-08 15:19 UTC (permalink / raw)
  To: Jeremi Piotrowski, Borislav Petkov, Michael Roth
  Cc: kvm, linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, hpa, ardb, pbonzini, seanjc, vkuznets, jmattson,
	luto, dave.hansen, slp, pgonda, peterz, srinivas.pandruvada,
	rientjes, dovmurik, tobin, vbabka, kirill, ak, tony.luck,
	marcorr, sathyanarayanan.kuppuswamy, alpergun, jarkko,
	ashish.kalra, nikunj.dadhania, pankaj.gupta, liam.merwick,
	zhi.a.wang, Brijesh Singh

On 11/8/23 02:21, Jeremi Piotrowski wrote:
> On 07/11/2023 19:32, Tom Lendacky wrote:
>> On 11/7/23 10:31, Borislav Petkov wrote:
>>>
>>> And the stuff that needs to happen once, needs to be called once too.
>>>
>>>> +
>>>> +    return snp_get_rmptable_info(&rmp_base, &rmp_size);
>>>> +}
>>>> +
>>>>    static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>>>    {
>>>>        u64 msr;
>>>> @@ -659,6 +674,9 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>>>            if (!(msr & MSR_K7_HWCR_SMMLOCK))
>>>>                goto clear_sev;
>>>>    +        if (cpu_has(c, X86_FEATURE_SEV_SNP) && !early_rmptable_check())
>>>> +            goto clear_snp;
>>>> +
>>>>            return;
>>>>      clear_all:
>>>> @@ -666,6 +684,7 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>>>    clear_sev:
>>>>            setup_clear_cpu_cap(X86_FEATURE_SEV);
>>>>            setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
>>>> +clear_snp:
>>>>            setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>>>>        }
>>>>    }
>>>
>>> ...
>>>
>>>> +bool snp_get_rmptable_info(u64 *start, u64 *len)
>>>> +{
>>>> +    u64 max_rmp_pfn, calc_rmp_sz, rmp_sz, rmp_base, rmp_end;
>>>> +
>>>> +    rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
>>>> +    rdmsrl(MSR_AMD64_RMP_END, rmp_end);
>>>> +
>>>> +    if (!(rmp_base & RMP_ADDR_MASK) || !(rmp_end & RMP_ADDR_MASK)) {
>>>> +        pr_err("Memory for the RMP table has not been reserved by BIOS\n");
>>>> +        return false;
>>>> +    }
>>>
>>> If you're masking off bits 0-12 above...
>>
>> Because the RMP_END MSR, most specifically, has a default value of 0x1fff, where bits [12:0] are reserved. So to specifically check if the MSR has been set to a non-zero end value, the bit are masked off. However, ...
>>
> 
> Do you have a source for this? Because the APM vol. 2, table A.7 says the reset value of RMP_END is all zeros.

Ah, good catch. Let me work on getting the APM updated.

Thanks,
Tom

> 
> Thanks,
> Jeremi
> 

^ permalink raw reply	[relevance 4%]

* Re: [PATCH] s390/vfio-ap: fix sysfs status attribute for AP queue devices
  2023-11-07  8:07  0%   ` Harald Freudenberger
@ 2023-11-08 14:44  0%     ` Tony Krowiak
  0 siblings, 0 replies; 200+ results
From: Tony Krowiak @ 2023-11-08 14:44 UTC (permalink / raw)
  To: freude
  Cc: linux-s390, linux-kernel, kvm, jjherne, pasic, borntraeger,
	frankja, imbrenda, david, stable



On 11/7/23 03:07, Harald Freudenberger wrote:
> On 2023-11-06 17:03, Tony Krowiak wrote:
>> PING
>> This patch is pretty straightforward; does anyone see a reason why
>> this shouldn't be integrated?
>>
>> On 10/20/23 16:48, Tony Krowiak wrote:
>>> The 'status' attribute for AP queue devices bound to the vfio_ap device
>>> driver displays incorrect status when the mediated device is attached 
>>> to a
>>> guest, but the queue device is not passed through. In the current
>>> implementation, the status displayed is 'in_use' which is not 
>>> correct; it
>>> should be 'assigned'. This can happen if one of the queue devices
>>> associated with a given adapter is not bound to the vfio_ap device 
>>> driver.
>>> For example:
>>>
>>> Queues listed in /sys/bus/ap/drivers/vfio_ap:
>>> 14.0005
>>> 14.0006
>>> 14.000d
>>> 16.0006
>>> 16.000d
>>>
>>> Queues listed in /sys/devices/vfio_ap/matrix/$UUID/matrix
>>> 14.0005
>>> 14.0006
>>> 14.000d
>>> 16.0005
>>> 16.0006
>>> 16.000d
>>>
>>> Queues listed in /sys/devices/vfio_ap/matrix/$UUID/guest_matrix
>>> 14.0005
>>> 14.0006
>>> 14.000d
>>>
>>> The reason no queues for adapter 0x16 are listed in the guest_matrix is
>>> because queue 16.0005 is not bound to the vfio_ap device driver, so no
>>> queue associated with the adapter is passed through to the guest;
>>> therefore, each queue device for adapter 0x16 should display 'assigned'
>>> instead of 'in_use', because those queues are not in use by a guest, but
>>> only assigned to the mediated device.
>>>
>>> Let's check the AP configuration for the guest to determine whether a
>>> queue device is passed through before displaying a status of 'in_use'.
>>>
>>> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
>>> Fixes: f139862b92cf ("s390/vfio-ap: add status attribute to AP queue 
>>> device's sysfs dir")
>>> Cc: stable@vger.kernel.org
>>> ---
>>>   drivers/s390/crypto/vfio_ap_ops.c | 7 ++++++-
>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/s390/crypto/vfio_ap_ops.c 
>>> b/drivers/s390/crypto/vfio_ap_ops.c
>>> index 4db538a55192..871c14a6921f 100644
>>> --- a/drivers/s390/crypto/vfio_ap_ops.c
>>> +++ b/drivers/s390/crypto/vfio_ap_ops.c
>>> @@ -1976,6 +1976,7 @@ static ssize_t status_show(struct device *dev,
>>>   {
>>>       ssize_t nchars = 0;
>>>       struct vfio_ap_queue *q;
>>> +    unsigned long apid, apqi;
>>>       struct ap_matrix_mdev *matrix_mdev;
>>>       struct ap_device *apdev = to_ap_dev(dev);
>>>   @@ -1984,7 +1985,11 @@ static ssize_t status_show(struct device *dev,
>>>       matrix_mdev = vfio_ap_mdev_for_queue(q);
>>>         if (matrix_mdev) {
>>> -        if (matrix_mdev->kvm)
>>> +        apid = AP_QID_CARD(q->apqn);
>>> +        apqi = AP_QID_QUEUE(q->apqn);
>>> +        if (matrix_mdev->kvm &&
>>> +            test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
>>> +            test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
>>>               nchars = scnprintf(buf, PAGE_SIZE, "%s\n",
>>>                          AP_QUEUE_IN_USE);
>>>           else
> 
> I can give you an
> Acked-by: Harald Freudenberger <freude@linux.ibm.com>
> for this. Your explanation sounds sane to me and fixes a wrong
> display. However, I am not familiar with the code, so I can't tell
> if that's correct.
> Just a remark: How can it happen that one queue is not bound to the vfio 
> dd?
> Didn't we actively remove the unbind possibility from the sysfs for devices
> assigned to the vfio dd?

A device bound to the vfio_ap device driver can be manually unbound; 
however, it can not be unbound by the AP bus via a change to the 
apmask/aqmask attributes if it is assigned to a mediated device.

At one point I wanted to remove the unbind sysfs attribute, but as I 
recall I was talked out of it.


^ permalink raw reply	[relevance 0%]

* Re: [PATCH v10 06/50] x86/sev: Add the host SEV-SNP initialization support
  @ 2023-11-08  8:21  4%       ` Jeremi Piotrowski
  2023-11-08 15:19  4%         ` Tom Lendacky
  0 siblings, 1 reply; 200+ results
From: Jeremi Piotrowski @ 2023-11-08  8:21 UTC (permalink / raw)
  To: Tom Lendacky, Borislav Petkov, Michael Roth
  Cc: kvm, linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, hpa, ardb, pbonzini, seanjc, vkuznets, jmattson,
	luto, dave.hansen, slp, pgonda, peterz, srinivas.pandruvada,
	rientjes, dovmurik, tobin, vbabka, kirill, ak, tony.luck,
	marcorr, sathyanarayanan.kuppuswamy, alpergun, jarkko,
	ashish.kalra, nikunj.dadhania, pankaj.gupta, liam.merwick,
	zhi.a.wang, Brijesh Singh

On 07/11/2023 19:32, Tom Lendacky wrote:
> On 11/7/23 10:31, Borislav Petkov wrote:
>>
>> And the stuff that needs to happen once, needs to be called once too.
>>
>>> +
>>> +    return snp_get_rmptable_info(&rmp_base, &rmp_size);
>>> +}
>>> +
>>>   static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>>   {
>>>       u64 msr;
>>> @@ -659,6 +674,9 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>>           if (!(msr & MSR_K7_HWCR_SMMLOCK))
>>>               goto clear_sev;
>>>   +        if (cpu_has(c, X86_FEATURE_SEV_SNP) && !early_rmptable_check())
>>> +            goto clear_snp;
>>> +
>>>           return;
>>>     clear_all:
>>> @@ -666,6 +684,7 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>>   clear_sev:
>>>           setup_clear_cpu_cap(X86_FEATURE_SEV);
>>>           setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
>>> +clear_snp:
>>>           setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>>>       }
>>>   }
>>
>> ...
>>
>>> +bool snp_get_rmptable_info(u64 *start, u64 *len)
>>> +{
>>> +    u64 max_rmp_pfn, calc_rmp_sz, rmp_sz, rmp_base, rmp_end;
>>> +
>>> +    rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
>>> +    rdmsrl(MSR_AMD64_RMP_END, rmp_end);
>>> +
>>> +    if (!(rmp_base & RMP_ADDR_MASK) || !(rmp_end & RMP_ADDR_MASK)) {
>>> +        pr_err("Memory for the RMP table has not been reserved by BIOS\n");
>>> +        return false;
>>> +    }
>>
>> If you're masking off bits 0-12 above...
> 
> Because the RMP_END MSR, most specifically, has a default value of 0x1fff, where bits [12:0] are reserved. So to specifically check if the MSR has been set to a non-zero end value, the bits are masked off. However, ...
> 

Do you have a source for this? Because the APM vol. 2, table A.7 says the reset value of RMP_END is all zeros.

Thanks,
Jeremi


^ permalink raw reply	[relevance 4%]

* Re: [PATCH v10 06/50] x86/sev: Add the host SEV-SNP initialization support
  2023-11-07 19:00  0%     ` Kalra, Ashish
@ 2023-11-07 19:19  0%       ` Borislav Petkov
  2023-12-08 17:09  0%       ` Jeremi Piotrowski
  1 sibling, 0 replies; 200+ results
From: Borislav Petkov @ 2023-11-07 19:19 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Michael Roth, kvm, linux-coco, linux-mm, linux-crypto, x86,
	linux-kernel, tglx, mingo, jroedel, thomas.lendacky, hpa, ardb,
	pbonzini, seanjc, vkuznets, jmattson, luto, dave.hansen, slp,
	pgonda, peterz, srinivas.pandruvada, rientjes, dovmurik, tobin,
	vbabka, kirill, ak, tony.luck, marcorr,
	sathyanarayanan.kuppuswamy, alpergun, jarkko, nikunj.dadhania,
	pankaj.gupta, liam.merwick, zhi.a.wang, Brijesh Singh

On Tue, Nov 07, 2023 at 01:00:00PM -0600, Kalra, Ashish wrote:
> > First of all, use the APM bit name here pls: MtrrFixDramModEn.
> > 
> > And then, for the life of me, I can't find any mention in the APM why
> > this bit is needed. Neither in "15.36.2 Enabling SEV-SNP" nor in
> > "15.34.3 Enabling SEV".
> > 
> > Looking at the bit defintions of WrMem an RdMem - read and write
> > requests get directed to system memory instead of MMIO so I guess you
> > don't want to be able to write MMIO for certain physical ranges when SNP
> > is enabled but it'll be good to have this properly explained instead of
> > a "this must happen" information-less sentence.
> 
> This is a pre-requisite for SNP_INIT as per the SNP Firmware ABI
> specifications, section 8.8.2:

Did you even read the text you're responding to?

> > This looks backwards. AFAICT, the IOMMU code should call arch code to
> > enable SNP at the right time, not the other way around - arch code
> > calling driver code.
> > 
> > Especially if the SNP table enablement depends on some exact IOMMU
> > init_state:
> > 
> >          if (init_state > IOMMU_ENABLED) {
> > 		pr_err("SNP: Too late to enable SNP for IOMMU.\n");
> > 
> > 
> 
> This is again as per the SNP_INIT requirements: SNP support must be enabled in
> the IOMMU before SNP_INIT is done. The above function snp_rmptable_init()
> only calls the IOMMU driver to enable SNP support when it has detected and
> enabled platform support for SNP.
>
> It is not that the IOMMU driver has to call the arch code to enable SNP at the
> right time; it is the other way around: once platform support for SNP
> is enabled, the IOMMU driver has to be called to enable the same for the
> IOMMU, and this needs to be done before the CCP driver is loaded and does
> SNP_INIT.

You again didn't read the text you're responding to.

Arch code does not call drivers - arch code sets up the arch and
provides facilities which the drivers use.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v10 06/50] x86/sev: Add the host SEV-SNP initialization support
  2023-11-07 16:31  4%   ` Borislav Petkov
  @ 2023-11-07 19:00  0%     ` Kalra, Ashish
  2023-11-07 19:19  0%       ` Borislav Petkov
  2023-12-08 17:09  0%       ` Jeremi Piotrowski
  1 sibling, 2 replies; 200+ results
From: Kalra, Ashish @ 2023-11-07 19:00 UTC (permalink / raw)
  To: Borislav Petkov, Michael Roth
  Cc: kvm, linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, vbabka, kirill,
	ak, tony.luck, marcorr, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, nikunj.dadhania, pankaj.gupta, liam.merwick, zhi.a.wang,
	Brijesh Singh

Hello Boris,

Addressing some of the remaining comments:

On 11/7/2023 10:31 AM, Borislav Petkov wrote:
> On Mon, Oct 16, 2023 at 08:27:35AM -0500, Michael Roth wrote:
>> +static bool early_rmptable_check(void)
>> +{
>> +	u64 rmp_base, rmp_size;
>> +
>> +	/*
>> +	 * For early BSP initialization, max_pfn won't be set up yet, wait until
>> +	 * it is set before performing the RMP table calculations.
>> +	 */
>> +	if (!max_pfn)
>> +		return true;
> 
> This already says that this is called at the wrong point during init.
> 
> Right now we have
> 
> early_identify_cpu -> early_init_amd -> early_detect_mem_encrypt
> 
> which runs only on the BSP but then early_init_amd() is called in
> init_amd() too so that it takes care of the APs too.
> 
> Which ends up doing a lot of unnecessary work on each AP in
> early_detect_mem_encrypt() like calculating the RMP size on each AP
> unnecessarily where this needs to happen exactly once.
> 
> Is there any reason why this function cannot be moved to init_amd()
> where it'll do the normal, per-AP init?
> 
> And the stuff that needs to happen once, needs to be called once too.
> 
>> +
>> +	return snp_get_rmptable_info(&rmp_base, &rmp_size);
>> +}
>> +
>>   static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>   {
>>   	u64 msr;
>> @@ -659,6 +674,9 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>   		if (!(msr & MSR_K7_HWCR_SMMLOCK))
>>   			goto clear_sev;
>>   
>> +		if (cpu_has(c, X86_FEATURE_SEV_SNP) && !early_rmptable_check())
>> +			goto clear_snp;
>> +
>>   		return;
>>   
>>   clear_all:
>> @@ -666,6 +684,7 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>>   clear_sev:
>>   		setup_clear_cpu_cap(X86_FEATURE_SEV);
>>   		setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
>> +clear_snp:
>>   		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>>   	}
>>   }
> 
> ...
> 
>> +bool snp_get_rmptable_info(u64 *start, u64 *len)
>> +{
>> +	u64 max_rmp_pfn, calc_rmp_sz, rmp_sz, rmp_base, rmp_end;
>> +
>> +	rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
>> +	rdmsrl(MSR_AMD64_RMP_END, rmp_end);
>> +
>> +	if (!(rmp_base & RMP_ADDR_MASK) || !(rmp_end & RMP_ADDR_MASK)) {
>> +		pr_err("Memory for the RMP table has not been reserved by BIOS\n");
>> +		return false;
>> +	}
> 
> If you're masking off bits 0-12 above...
> 
>> +
>> +	if (rmp_base > rmp_end) {
> 
> ... why aren't you using the masked out vars further on?
> 
> I know, the hw will say, yeah, those bits are 0 but still. IOW, do:
> 
> 	rmp_base &= RMP_ADDR_MASK;
> 	rmp_end  &= RMP_ADDR_MASK;
> 
> after reading them.
> 
>> +		pr_err("RMP configuration not valid: base=%#llx, end=%#llx\n", rmp_base, rmp_end);
>> +		return false;
>> +	}
>> +
>> +	rmp_sz = rmp_end - rmp_base + 1;
>> +
>> +	/*
>> +	 * Calculate the amount the memory that must be reserved by the BIOS to
>> +	 * address the whole RAM, including the bookkeeping area. The RMP itself
>> +	 * must also be covered.
>> +	 */
>> +	max_rmp_pfn = max_pfn;
>> +	if (PHYS_PFN(rmp_end) > max_pfn)
>> +		max_rmp_pfn = PHYS_PFN(rmp_end);
>> +
>> +	calc_rmp_sz = (max_rmp_pfn << 4) + RMPTABLE_CPU_BOOKKEEPING_SZ;
>> +
>> +	if (calc_rmp_sz > rmp_sz) {
>> +		pr_err("Memory reserved for the RMP table does not cover full system RAM (expected 0x%llx got 0x%llx)\n",
>> +		       calc_rmp_sz, rmp_sz);
>> +		return false;
>> +	}
>> +
>> +	*start = rmp_base;
>> +	*len = rmp_sz;
>> +
>> +	return true;
>> +}
>> +
>> +static __init int __snp_rmptable_init(void)
>> +{
>> +	u64 rmp_base, rmp_size;
>> +	void *rmp_start;
>> +	u64 val;
>> +
>> +	if (!snp_get_rmptable_info(&rmp_base, &rmp_size))
>> +		return 1;
>> +
>> +	pr_info("RMP table physical address [0x%016llx - 0x%016llx]\n",
> 
> That's "RMP table physical range"
> 
>> +		rmp_base, rmp_base + rmp_size - 1);
>> +
>> +	rmp_start = memremap(rmp_base, rmp_size, MEMREMAP_WB);
>> +	if (!rmp_start) {
>> +		pr_err("Failed to map RMP table addr 0x%llx size 0x%llx\n", rmp_base, rmp_size);
> 
> No need to dump rmp_base and rmp_size again here - you're dumping them
> above.
> 
>> +		return 1;
>> +	}
>> +
>> +	/*
>> +	 * Check if SEV-SNP is already enabled, this can happen in case of
>> +	 * kexec boot.
>> +	 */
>> +	rdmsrl(MSR_AMD64_SYSCFG, val);
>> +	if (val & MSR_AMD64_SYSCFG_SNP_EN)
>> +		goto skip_enable;
>> +
>> +	/* Initialize the RMP table to zero */
> 
> Again: useless comment.
> 
>> +	memset(rmp_start, 0, rmp_size);
>> +
>> +	/* Flush the caches to ensure that data is written before SNP is enabled. */
>> +	wbinvd_on_all_cpus();
>> +
>> +	/* MFDM must be enabled on all the CPUs prior to enabling SNP. */
> 
> First of all, use the APM bit name here pls: MtrrFixDramModEn.
> 
> And then, for the life of me, I can't find any mention in the APM why
> this bit is needed. Neither in "15.36.2 Enabling SEV-SNP" nor in
> "15.34.3 Enabling SEV".
> 
> Looking at the bit definitions of WrMem and RdMem - read and write
> requests get directed to system memory instead of MMIO so I guess you
> don't want to be able to write MMIO for certain physical ranges when SNP
> is enabled but it'll be good to have this properly explained instead of
> a "this must happen" information-less sentence.

This is a pre-requisite for SNP_INIT as per the SNP Firmware ABI
specifications, section 8.8.2:

From the SNP FW ABI specs:

If INIT_RMP is 1, then the firmware ensures the following system
requirements are met:
• SYSCFG[MemoryEncryptionModEn] must be set to 1 across all cores. (SEV
must be enabled.)
• SYSCFG[SecureNestedPagingEn] must be set to 1 across all cores.
• SYSCFG[VMPLEn] must be set to 1 across all cores.
• SYSCFG[MFDM] must be set to 1 across all cores.
• VM_HSAVE_PA (MSR C001_0117) must be set to 0h across all cores.
• HWCR[SmmLock] (MSR C001_0015) must be set to 1 across all cores.

So, this platform enabling code for SNP needs to ensure that these 
conditions are met before SNP_INIT is called.
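
As a minimal sketch of how the SYSCFG[MFDM] requirement is typically met on
each core (illustrative only; the helper name follows the quoted patch, but
the exact MSR bit define is an assumption here):

	static void mfd_enable(void *arg)
	{
		u64 val;

		/* Set MtrrFixDramModEn (MFDM) before SNP is enabled. */
		rdmsrl(MSR_AMD64_SYSCFG, val);
		wrmsrl(MSR_AMD64_SYSCFG, val | MSR_AMD64_SYSCFG_MFDM);
	}

	/* then, as in the quoted patch: on_each_cpu(mfd_enable, NULL, 1); */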

> 
>> +	on_each_cpu(mfd_enable, NULL, 1);
>> +
>> +	/* Enable SNP on all CPUs. */
> 
> Useless comment.
> 
>> +	on_each_cpu(snp_enable, NULL, 1);
>> +
>> +skip_enable:
>> +	rmp_start += RMPTABLE_CPU_BOOKKEEPING_SZ;
>> +	rmp_size -= RMPTABLE_CPU_BOOKKEEPING_SZ;
>> +
>> +	rmptable_start = (struct rmpentry *)rmp_start;
>> +	rmptable_max_pfn = rmp_size / sizeof(struct rmpentry) - 1;
>> +
>> +	return 0;
>> +}
>> +
>> +static int __init snp_rmptable_init(void)
>> +{
>> +	int family, model;
>> +
>> +	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>> +		return 0;
>> +
>> +	family = boot_cpu_data.x86;
>> +	model  = boot_cpu_data.x86_model;
> 
> Looks useless - just use boot_cpu_data directly below.
> 
> As mentioned here already https://lore.kernel.org/all/Y9ubi0i4Z750gdMm@zn.tnic/
> 
> And I already mentioned that for v9:
> 
> https://lore.kernel.org/r/20230621094236.GZZJLGDAicp1guNPvD@fat_crate.local
> 
> Next time I'm NAKing this patch until you incorporate all review
> comments or you give a technical reason why you disagree with them.
> 
>> +	/*
>> +	 * RMP table entry format is not architectural and it can vary by processor and
>> +	 * is defined by the per-processor PPR. Restrict SNP support on the known CPU
>> +	 * model and family for which the RMP table entry format is currently defined for.
>> +	 */
>> +	if (family != 0x19 || model > 0xaf)
>> +		goto nosnp;
>> +
>> +	if (amd_iommu_snp_enable())
>> +		goto nosnp;
>> +
>> +	if (__snp_rmptable_init())
>> +		goto nosnp;
>> +
>> +	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
>> +
>> +	return 0;
>> +
>> +nosnp:
>> +	setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>> +	return -ENOSYS;
>> +}
>> +
>> +/*
>> + * This must be called after the PCI subsystem. This is because amd_iommu_snp_enable()
>> + * is called to ensure the IOMMU supports the SEV-SNP feature, which can only be
>> + * called after subsys_initcall().
>> + *
>> + * NOTE: IOMMU is enforced by SNP to ensure that hypervisor cannot program DMA
>> + * directly into guest private memory. In case of SNP, the IOMMU ensures that
>> + * the page(s) used for DMA are hypervisor owned.
>> + */
>> +fs_initcall(snp_rmptable_init);
> 
> This looks backwards. AFAICT, the IOMMU code should call arch code to
> enable SNP at the right time, not the other way around - arch code
> calling driver code.
> 
> Especially if the SNP table enablement depends on some exact IOMMU
> init_state:
> 
>          if (init_state > IOMMU_ENABLED) {
> 		pr_err("SNP: Too late to enable SNP for IOMMU.\n");
> 
> 

This is again as per the SNP_INIT requirements: SNP support must be
enabled in the IOMMU before SNP_INIT is done. The above function
snp_rmptable_init() only calls the IOMMU driver to enable SNP support
when it has detected and enabled platform support for SNP.

It is not that the IOMMU driver has to call the arch code to enable SNP
at the right time; it is the other way around: once platform support for
SNP is enabled, the IOMMU driver has to be called to enable the same for
the IOMMU, and this needs to be done before the CCP driver is loaded and
does SNP_INIT.

Thanks,
Ashish

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v10 06/50] x86/sev: Add the host SEV-SNP initialization support
  @ 2023-11-07 16:31  4%   ` Borislav Petkov
    2023-11-07 19:00  0%     ` Kalra, Ashish
  0 siblings, 2 replies; 200+ results
From: Borislav Petkov @ 2023-11-07 16:31 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, vbabka, kirill,
	ak, tony.luck, marcorr, sathyanarayanan.kuppuswamy, alpergun,
	jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, zhi.a.wang, Brijesh Singh

On Mon, Oct 16, 2023 at 08:27:35AM -0500, Michael Roth wrote:
> +static bool early_rmptable_check(void)
> +{
> +	u64 rmp_base, rmp_size;
> +
> +	/*
> +	 * For early BSP initialization, max_pfn won't be set up yet, wait until
> +	 * it is set before performing the RMP table calculations.
> +	 */
> +	if (!max_pfn)
> +		return true;

This already says that this is called at the wrong point during init.

Right now we have

early_identify_cpu -> early_init_amd -> early_detect_mem_encrypt

which runs only on the BSP but then early_init_amd() is called in
init_amd() too so that it takes care of the APs too.

Which ends up doing a lot of unnecessary work on each AP in
early_detect_mem_encrypt() like calculating the RMP size on each AP
unnecessarily where this needs to happen exactly once.

Is there any reason why this function cannot be moved to init_amd()
where it'll do the normal, per-AP init?

And the stuff that needs to happen once, needs to be called once too.
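
Something along these lines, as a rough sketch (illustrative only; the
BSP-only hook and the helper name are assumptions, not part of the
posted patch):

	/* Runs once on the BSP, not in the per-AP init path. */
	static void bsp_determine_snp(struct cpuinfo_x86 *c)
	{
		u64 rmp_base, rmp_size;

		if (cpu_has(c, X86_FEATURE_SEV_SNP) &&
		    !snp_get_rmptable_info(&rmp_base, &rmp_size))
			setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
	}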

> +
> +	return snp_get_rmptable_info(&rmp_base, &rmp_size);
> +}
> +
>  static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>  {
>  	u64 msr;
> @@ -659,6 +674,9 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>  		if (!(msr & MSR_K7_HWCR_SMMLOCK))
>  			goto clear_sev;
>  
> +		if (cpu_has(c, X86_FEATURE_SEV_SNP) && !early_rmptable_check())
> +			goto clear_snp;
> +
>  		return;
>  
>  clear_all:
> @@ -666,6 +684,7 @@ static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
>  clear_sev:
>  		setup_clear_cpu_cap(X86_FEATURE_SEV);
>  		setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
> +clear_snp:
>  		setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
>  	}
>  }

...

> +bool snp_get_rmptable_info(u64 *start, u64 *len)
> +{
> +	u64 max_rmp_pfn, calc_rmp_sz, rmp_sz, rmp_base, rmp_end;
> +
> +	rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
> +	rdmsrl(MSR_AMD64_RMP_END, rmp_end);
> +
> +	if (!(rmp_base & RMP_ADDR_MASK) || !(rmp_end & RMP_ADDR_MASK)) {
> +		pr_err("Memory for the RMP table has not been reserved by BIOS\n");
> +		return false;
> +	}

If you're masking off bits 0-12 above...

> +
> +	if (rmp_base > rmp_end) {

... why aren't you using the masked out vars further on?

I know, the hw will say, yeah, those bits are 0 but still. IOW, do:

	rmp_base &= RMP_ADDR_MASK;
	rmp_end  &= RMP_ADDR_MASK;

after reading them.
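
I.e., roughly (a minimal sketch, keeping the "reserved by BIOS" check on
the raw MSR values before masking):

	rdmsrl(MSR_AMD64_RMP_BASE, rmp_base);
	rdmsrl(MSR_AMD64_RMP_END, rmp_end);

	if (!(rmp_base & RMP_ADDR_MASK) || !(rmp_end & RMP_ADDR_MASK)) {
		pr_err("Memory for the RMP table has not been reserved by BIOS\n");
		return false;
	}

	rmp_base &= RMP_ADDR_MASK;
	rmp_end  &= RMP_ADDR_MASK;

	if (rmp_base > rmp_end) {
		...
	}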

> +		pr_err("RMP configuration not valid: base=%#llx, end=%#llx\n", rmp_base, rmp_end);
> +		return false;
> +	}
> +
> +	rmp_sz = rmp_end - rmp_base + 1;
> +
> +	/*
> +	 * Calculate the amount the memory that must be reserved by the BIOS to
> +	 * address the whole RAM, including the bookkeeping area. The RMP itself
> +	 * must also be covered.
> +	 */
> +	max_rmp_pfn = max_pfn;
> +	if (PHYS_PFN(rmp_end) > max_pfn)
> +		max_rmp_pfn = PHYS_PFN(rmp_end);
> +
> +	calc_rmp_sz = (max_rmp_pfn << 4) + RMPTABLE_CPU_BOOKKEEPING_SZ;
> +
> +	if (calc_rmp_sz > rmp_sz) {
> +		pr_err("Memory reserved for the RMP table does not cover full system RAM (expected 0x%llx got 0x%llx)\n",
> +		       calc_rmp_sz, rmp_sz);
> +		return false;
> +	}
> +
> +	*start = rmp_base;
> +	*len = rmp_sz;
> +
> +	return true;
> +}
> +
> +static __init int __snp_rmptable_init(void)
> +{
> +	u64 rmp_base, rmp_size;
> +	void *rmp_start;
> +	u64 val;
> +
> +	if (!snp_get_rmptable_info(&rmp_base, &rmp_size))
> +		return 1;
> +
> +	pr_info("RMP table physical address [0x%016llx - 0x%016llx]\n",

That's "RMP table physical range"

> +		rmp_base, rmp_base + rmp_size - 1);
> +
> +	rmp_start = memremap(rmp_base, rmp_size, MEMREMAP_WB);
> +	if (!rmp_start) {
> +		pr_err("Failed to map RMP table addr 0x%llx size 0x%llx\n", rmp_base, rmp_size);

No need to dump rmp_base and rmp_size again here - you're dumping them
above.

> +		return 1;
> +	}
> +
> +	/*
> +	 * Check if SEV-SNP is already enabled, this can happen in case of
> +	 * kexec boot.
> +	 */
> +	rdmsrl(MSR_AMD64_SYSCFG, val);
> +	if (val & MSR_AMD64_SYSCFG_SNP_EN)
> +		goto skip_enable;
> +
> +	/* Initialize the RMP table to zero */

Again: useless comment.

> +	memset(rmp_start, 0, rmp_size);
> +
> +	/* Flush the caches to ensure that data is written before SNP is enabled. */
> +	wbinvd_on_all_cpus();
> +
> +	/* MFDM must be enabled on all the CPUs prior to enabling SNP. */

First of all, use the APM bit name here pls: MtrrFixDramModEn.

And then, for the life of me, I can't find any mention in the APM why
this bit is needed. Neither in "15.36.2 Enabling SEV-SNP" nor in
"15.34.3 Enabling SEV".

Looking at the bit definitions of WrMem and RdMem - read and write
requests get directed to system memory instead of MMIO so I guess you
don't want to be able to write MMIO for certain physical ranges when SNP
is enabled but it'll be good to have this properly explained instead of
a "this must happen" information-less sentence.

> +	on_each_cpu(mfd_enable, NULL, 1);
> +
> +	/* Enable SNP on all CPUs. */

Useless comment.

> +	on_each_cpu(snp_enable, NULL, 1);
> +
> +skip_enable:
> +	rmp_start += RMPTABLE_CPU_BOOKKEEPING_SZ;
> +	rmp_size -= RMPTABLE_CPU_BOOKKEEPING_SZ;
> +
> +	rmptable_start = (struct rmpentry *)rmp_start;
> +	rmptable_max_pfn = rmp_size / sizeof(struct rmpentry) - 1;
> +
> +	return 0;
> +}
> +
> +static int __init snp_rmptable_init(void)
> +{
> +	int family, model;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> +		return 0;
> +
> +	family = boot_cpu_data.x86;
> +	model  = boot_cpu_data.x86_model;

Looks useless - just use boot_cpu_data directly below.
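
IOW, simply (sketch):

	if (boot_cpu_data.x86 != 0x19 || boot_cpu_data.x86_model > 0xaf)
		goto nosnp;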

As mentioned here already https://lore.kernel.org/all/Y9ubi0i4Z750gdMm@zn.tnic/

And I already mentioned that for v9:

https://lore.kernel.org/r/20230621094236.GZZJLGDAicp1guNPvD@fat_crate.local

Next time I'm NAKing this patch until you incorporate all review
comments or you give a technical reason why you disagree with them.

> +	/*
> +	 * RMP table entry format is not architectural and it can vary by processor and
> +	 * is defined by the per-processor PPR. Restrict SNP support on the known CPU
> +	 * model and family for which the RMP table entry format is currently defined for.
> +	 */
> +	if (family != 0x19 || model > 0xaf)
> +		goto nosnp;
> +
> +	if (amd_iommu_snp_enable())
> +		goto nosnp;
> +
> +	if (__snp_rmptable_init())
> +		goto nosnp;
> +
> +	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/rmptable_init:online", __snp_enable, NULL);
> +
> +	return 0;
> +
> +nosnp:
> +	setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
> +	return -ENOSYS;
> +}
> +
> +/*
> + * This must be called after the PCI subsystem. This is because amd_iommu_snp_enable()
> + * is called to ensure the IOMMU supports the SEV-SNP feature, which can only be
> + * called after subsys_initcall().
> + *
> + * NOTE: IOMMU is enforced by SNP to ensure that hypervisor cannot program DMA
> + * directly into guest private memory. In case of SNP, the IOMMU ensures that
> + * the page(s) used for DMA are hypervisor owned.
> + */
> +fs_initcall(snp_rmptable_init);

This looks backwards. AFAICT, the IOMMU code should call arch code to
enable SNP at the right time, not the other way around - arch code
calling driver code.

Especially if the SNP table enablement depends on some exact IOMMU
init_state:

        if (init_state > IOMMU_ENABLED) {
		pr_err("SNP: Too late to enable SNP for IOMMU.\n");


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[relevance 4%]

* Re: [PATCH] s390/vfio-ap: fix sysfs status attribute for AP queue devices
  2023-11-06 16:03  0% ` Tony Krowiak
@ 2023-11-07  8:07  0%   ` Harald Freudenberger
  2023-11-08 14:44  0%     ` Tony Krowiak
  0 siblings, 1 reply; 200+ results
From: Harald Freudenberger @ 2023-11-07  8:07 UTC (permalink / raw)
  To: Tony Krowiak
  Cc: linux-s390, linux-kernel, kvm, jjherne, pasic, borntraeger,
	frankja, imbrenda, david, stable

On 2023-11-06 17:03, Tony Krowiak wrote:
> PING
> This patch is pretty straightforward, does anyone see a reason why
> this shouldn't be integrated?
> 
> On 10/20/23 16:48, Tony Krowiak wrote:
>> The 'status' attribute for AP queue devices bound to the vfio_ap 
>> device
>> driver displays incorrect status when the mediated device is attached 
>> to a
>> guest, but the queue device is not passed through. In the current
>> implementation, the status displayed is 'in_use' which is not correct; 
>> it
>> should be 'assigned'. This can happen if one of the queue devices
>> associated with a given adapter is not bound to the vfio_ap device 
>> driver.
>> For example:
>> 
>> Queues listed in /sys/bus/ap/drivers/vfio_ap:
>> 14.0005
>> 14.0006
>> 14.000d
>> 16.0006
>> 16.000d
>> 
>> Queues listed in /sys/devices/vfio_ap/matrix/$UUID/matrix
>> 14.0005
>> 14.0006
>> 14.000d
>> 16.0005
>> 16.0006
>> 16.000d
>> 
>> Queues listed in /sys/devices/vfio_ap/matrix/$UUID/guest_matrix
>> 14.0005
>> 14.0006
>> 14.000d
>> 
>> The reason no queues for adapter 0x16 are listed in the guest_matrix 
>> is
>> because queue 16.0005 is not bound to the vfio_ap device driver, so no
>> queue associated with the adapter is passed through to the guest;
>> therefore, each queue device for adapter 0x16 should display 
>> 'assigned'
>> instead of 'in_use', because those queues are not in use by a guest, 
>> but
>> only assigned to the mediated device.
>> 
>> Let's check the AP configuration for the guest to determine whether a
>> queue device is passed through before displaying a status of 'in_use'.
>> 
>> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
>> Fixes: f139862b92cf ("s390/vfio-ap: add status attribute to AP queue 
>> device's sysfs dir")
>> Cc: stable@vger.kernel.org
>> ---
>>   drivers/s390/crypto/vfio_ap_ops.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/s390/crypto/vfio_ap_ops.c 
>> b/drivers/s390/crypto/vfio_ap_ops.c
>> index 4db538a55192..871c14a6921f 100644
>> --- a/drivers/s390/crypto/vfio_ap_ops.c
>> +++ b/drivers/s390/crypto/vfio_ap_ops.c
>> @@ -1976,6 +1976,7 @@ static ssize_t status_show(struct device *dev,
>>   {
>>   	ssize_t nchars = 0;
>>   	struct vfio_ap_queue *q;
>> +	unsigned long apid, apqi;
>>   	struct ap_matrix_mdev *matrix_mdev;
>>   	struct ap_device *apdev = to_ap_dev(dev);
>>   @@ -1984,7 +1985,11 @@ static ssize_t status_show(struct device 
>> *dev,
>>   	matrix_mdev = vfio_ap_mdev_for_queue(q);
>>     	if (matrix_mdev) {
>> -		if (matrix_mdev->kvm)
>> +		apid = AP_QID_CARD(q->apqn);
>> +		apqi = AP_QID_QUEUE(q->apqn);
>> +		if (matrix_mdev->kvm &&
>> +		    test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
>> +		    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
>>   			nchars = scnprintf(buf, PAGE_SIZE, "%s\n",
>>   					   AP_QUEUE_IN_USE);
>>   		else

I can give you an
Acked-by: Harald Freudenberger <freude@linux.ibm.com>
for this. Your explanation sounds sane to me and fixes a wrong
display. However, I am not familiar with the code, so I can't tell
if that's correct.
Just a remark: How can it happen that one queue is not bound to the vfio 
dd?
Didn't we actively remove the unbind possibility from the sysfs for 
devices
assigned to the vfio dd?

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] s390/vfio-ap: fix sysfs status attribute for AP queue devices
  2023-10-20 20:48  4% [PATCH] s390/vfio-ap: fix sysfs status attribute for AP queue devices Tony Krowiak
@ 2023-11-06 16:03  0% ` Tony Krowiak
  2023-11-07  8:07  0%   ` Harald Freudenberger
  0 siblings, 1 reply; 200+ results
From: Tony Krowiak @ 2023-11-06 16:03 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, pasic, borntraeger, frankja, imbrenda, david, stable

PING
This patch is pretty straightforward, does anyone see a reason why this 
shouldn't be integrated?

On 10/20/23 16:48, Tony Krowiak wrote:
> The 'status' attribute for AP queue devices bound to the vfio_ap device
> driver displays incorrect status when the mediated device is attached to a
> guest, but the queue device is not passed through. In the current
> implementation, the status displayed is 'in_use' which is not correct; it
> should be 'assigned'. This can happen if one of the queue devices
> associated with a given adapter is not bound to the vfio_ap device driver.
> For example:
> 
> Queues listed in /sys/bus/ap/drivers/vfio_ap:
> 14.0005
> 14.0006
> 14.000d
> 16.0006
> 16.000d
> 
> Queues listed in /sys/devices/vfio_ap/matrix/$UUID/matrix
> 14.0005
> 14.0006
> 14.000d
> 16.0005
> 16.0006
> 16.000d
> 
> Queues listed in /sys/devices/vfio_ap/matrix/$UUID/guest_matrix
> 14.0005
> 14.0006
> 14.000d
> 
> The reason no queues for adapter 0x16 are listed in the guest_matrix is
> because queue 16.0005 is not bound to the vfio_ap device driver, so no
> queue associated with the adapter is passed through to the guest;
> therefore, each queue device for adapter 0x16 should display 'assigned'
> instead of 'in_use', because those queues are not in use by a guest, but
> only assigned to the mediated device.
> 
> Let's check the AP configuration for the guest to determine whether a
> queue device is passed through before displaying a status of 'in_use'.
> 
> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> Fixes: f139862b92cf ("s390/vfio-ap: add status attribute to AP queue device's sysfs dir")
> Cc: stable@vger.kernel.org
> ---
>   drivers/s390/crypto/vfio_ap_ops.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index 4db538a55192..871c14a6921f 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1976,6 +1976,7 @@ static ssize_t status_show(struct device *dev,
>   {
>   	ssize_t nchars = 0;
>   	struct vfio_ap_queue *q;
> +	unsigned long apid, apqi;
>   	struct ap_matrix_mdev *matrix_mdev;
>   	struct ap_device *apdev = to_ap_dev(dev);
>   
> @@ -1984,7 +1985,11 @@ static ssize_t status_show(struct device *dev,
>   	matrix_mdev = vfio_ap_mdev_for_queue(q);
>   
>   	if (matrix_mdev) {
> -		if (matrix_mdev->kvm)
> +		apid = AP_QID_CARD(q->apqn);
> +		apqi = AP_QID_QUEUE(q->apqn);
> +		if (matrix_mdev->kvm &&
> +		    test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
> +		    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
>   			nchars = scnprintf(buf, PAGE_SIZE, "%s\n",
>   					   AP_QUEUE_IN_USE);
>   		else

^ permalink raw reply	[relevance 0%]

* [PULL 45/60] system/cpus: Fix CPUState.nr_cores' calculation
    2023-11-06 11:03  6% ` [PULL 44/60] tests/unit: Rename test-x86-cpuid.c to test-x86-topo.c Philippe Mathieu-Daudé
@ 2023-11-06 11:03  2% ` Philippe Mathieu-Daudé
  1 sibling, 0 replies; 200+ results
From: Philippe Mathieu-Daudé @ 2023-11-06 11:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, qemu-block, qemu-riscv, qemu-ppc, qemu-arm,
	Zhuocheng Ding, Zhao Liu, Babu Moger, Yongwei Ma,
	Michael S . Tsirkin, Philippe Mathieu-Daudé,
	Paolo Bonzini, Richard Henderson

From: Zhuocheng Ding <zhuocheng.ding@intel.com>

From CPUState.nr_cores' comment, it represents "number of cores within
this CPU package".

After 003f230e37d7 ("machine: Tweak the order of topology members in
struct CpuTopology"), the meaning of smp.cores changed to "the number of
cores in one die", but that commit missed updating CPUState.nr_cores'
calculation, so CPUState.nr_cores became wrong: it no longer accounts
for the numbers of clusters and dies.

At present, only i386 is using CPUState.nr_cores.

But as for i386, which supports die level, the uses of CPUState.nr_cores
are very confusing:

Early uses are based on the meaning of "cores per package" (before die
is introduced into i386), and later uses are based on "cores per die"
(after die's introduction).

This difference arose because commit a94e1428991f ("target/i386: Add
CPUID.1F generation support for multi-dies PCMachine") misunderstood
CPUState.nr_cores as "cores per die" when calculating
CPUID.1FH.01H:EBX. After that, the changes in i386 all followed this
wrong understanding.

With the influence of 003f230e37d7 and a94e1428991f, for i386 currently
the result of CPUState.nr_cores is "cores per die", thus the original
uses of CPUState.cores based on the meaning of "cores per package" are
wrong when multiple dies exist:
1. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.01H:EBX[bits 23:16] is
   incorrect because it expects "cpus per package" but now the
   result is "cpus per die".
2. In cpu_x86_cpuid() of target/i386/cpu.c, for all leaves of CPUID.04H:
   EAX[bits 31:26] is incorrect because they expect "cpus per package"
   but now the result is "cpus per die". The error not only impacts the
   EAX calculation in cache_info_passthrough case, but also impacts other
   cases of setting cache topology for Intel CPU according to cpu
   topology (specifically, the incoming parameter "num_cores" expects
   "cores per package" in encode_cache_cpuid4()).
3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.0BH.01H:EBX[bits
   15:00] is incorrect because the EBX of 0BH.01H (core level) expects
   "cpus per package", which may be different with 1FH.01H (The reason
   is 1FH can support more levels. For QEMU, 1FH also supports die,
   1FH.01H:EBX[bits 15:00] expects "cpus per die").
4. In cpu_x86_cpuid() of target/i386/cpu.c, when CPUID.80000001H is
   calculated, here "cpus per package" is expected to be checked, but in
   fact, now it checks "cpus per die". Though "cpus per die" also works
   for this code logic, this isn't consistent with AMD's APM.
5. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.80000008H:ECX expects
   "cpus per package" but it obtains "cpus per die".
6. In simulate_rdmsr() of target/i386/hvf/x86_emu.c, in
   kvm_rdmsr_core_thread_count() of target/i386/kvm/kvm.c, and in
   helper_rdmsr() of target/i386/tcg/sysemu/misc_helper.c,
   MSR_CORE_THREAD_COUNT expects "cpus per package" and "cores per
   package", but in these functions, it obtains "cpus per die" and
   "cores per die".

On the other hand, these uses are correct now (they are added in/after
a94e1428991f):
1. In cpu_x86_cpuid() of target/i386/cpu.c, topo_info.cores_per_die
   meets the actual meaning of CPUState.nr_cores ("cores per die").
2. In cpu_x86_cpuid() of target/i386/cpu.c, vcpus_per_socket (in CPUID.
   04H's calculation) considers number of dies, so it's correct.
3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.1FH.01H:EBX[bits
   15:00] needs "cpus per die" and it gets the correct result, and
   CPUID.1FH.02H:EBX[bits 15:00] gets correct "cpus per package".

Once CPUState.nr_cores is correctly changed back to "cores per package",
the above errors will be fixed without extra work, but the "currently"
correct cases will go wrong and need special handling to obtain the
correct "cpus/cores per die" values they want.

Fix CPUState.nr_cores' calculation to fit the original meaning "cores
per package", as well as changing calculation of topo_info.cores_per_die,
vcpus_per_socket and CPUID.1FH.
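
As an illustrative example (the -smp numbers here are made up, not taken
from the patch): with -smp sockets=1,dies=2,cores=4,threads=2, smp.cores
is 4 ("cores per die"), so after this fix cs->nr_cores becomes
2 * 4 = 8 ("cores per package"), topo_info.cores_per_die is 8 / 2 = 4,
CPUID.1FH.01H:EBX[15:0] is 4 * 2 = 8 (cpus per die) and
CPUID.1FH.02H:EBX[15:0] is 8 * 2 = 16 (cpus per package).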

Fixes: a94e1428991f ("target/i386: Add CPUID.1F generation support for multi-dies PCMachine")
Fixes: 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology")
Signed-off-by: Zhuocheng Ding <zhuocheng.ding@intel.com>
Co-developed-by: Zhao Liu <zhao1.liu@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20231024090323.1859210-4-zhao1.liu@linux.intel.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
---
 system/cpus.c     | 2 +-
 target/i386/cpu.c | 9 ++++-----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/system/cpus.c b/system/cpus.c
index 952f15868c..a444a747f0 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -631,7 +631,7 @@ void qemu_init_vcpu(CPUState *cpu)
 {
     MachineState *ms = MACHINE(qdev_get_machine());
 
-    cpu->nr_cores = ms->smp.cores;
+    cpu->nr_cores = machine_topo_get_cores_per_socket(ms);
     cpu->nr_threads =  ms->smp.threads;
     cpu->stopped = true;
     cpu->random_seed = qemu_guest_random_seed_thread_part1();
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index fc8484cb5e..358d9c0a65 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6019,7 +6019,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
     X86CPUTopoInfo topo_info;
 
     topo_info.dies_per_pkg = env->nr_dies;
-    topo_info.cores_per_die = cs->nr_cores;
+    topo_info.cores_per_die = cs->nr_cores / env->nr_dies;
     topo_info.threads_per_core = cs->nr_threads;
 
     /* Calculate & apply limits for different index ranges */
@@ -6095,8 +6095,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
              */
             if (*eax & 31) {
                 int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
-                int vcpus_per_socket = env->nr_dies * cs->nr_cores *
-                                       cs->nr_threads;
+                int vcpus_per_socket = cs->nr_cores * cs->nr_threads;
                 if (cs->nr_cores > 1) {
                     *eax &= ~0xFC000000;
                     *eax |= (pow2ceil(cs->nr_cores) - 1) << 26;
@@ -6273,12 +6272,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
             break;
         case 1:
             *eax = apicid_die_offset(&topo_info);
-            *ebx = cs->nr_cores * cs->nr_threads;
+            *ebx = topo_info.cores_per_die * topo_info.threads_per_core;
             *ecx |= CPUID_TOPOLOGY_LEVEL_CORE;
             break;
         case 2:
             *eax = apicid_pkg_offset(&topo_info);
-            *ebx = env->nr_dies * cs->nr_cores * cs->nr_threads;
+            *ebx = cs->nr_cores * cs->nr_threads;
             *ecx |= CPUID_TOPOLOGY_LEVEL_DIE;
             break;
         default:
-- 
2.41.0


^ permalink raw reply related	[relevance 2%]

* [PULL 44/60] tests/unit: Rename test-x86-cpuid.c to test-x86-topo.c
  @ 2023-11-06 11:03  6% ` Philippe Mathieu-Daudé
  2023-11-06 11:03  2% ` [PULL 45/60] system/cpus: Fix CPUState.nr_cores' calculation Philippe Mathieu-Daudé
  1 sibling, 0 replies; 200+ results
From: Philippe Mathieu-Daudé @ 2023-11-06 11:03 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, qemu-block, qemu-riscv, qemu-ppc, qemu-arm,
	Zhao Liu, Philippe Mathieu-Daudé,
	Babu Moger, Yongwei Ma, Michael S . Tsirkin, Thomas Huth,
	Marcel Apfelbaum

From: Zhao Liu <zhao1.liu@intel.com>

The tests in this file actually test the APIC ID combinations.
Rename to test-x86-topo.c to make its name more in line with its
actual content.

Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20231024090323.1859210-3-zhao1.liu@linux.intel.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
---
 MAINTAINERS                                      | 2 +-
 tests/unit/{test-x86-cpuid.c => test-x86-topo.c} | 2 +-
 tests/unit/meson.build                           | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)
 rename tests/unit/{test-x86-cpuid.c => test-x86-topo.c} (99%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8e8a7d5be5..126cddd285 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1772,7 +1772,7 @@ F: include/hw/southbridge/ich9.h
 F: include/hw/southbridge/piix.h
 F: hw/isa/apm.c
 F: include/hw/isa/apm.h
-F: tests/unit/test-x86-cpuid.c
+F: tests/unit/test-x86-topo.c
 F: tests/qtest/test-x86-cpuid-compat.c
 
 PC Chipset
diff --git a/tests/unit/test-x86-cpuid.c b/tests/unit/test-x86-topo.c
similarity index 99%
rename from tests/unit/test-x86-cpuid.c
rename to tests/unit/test-x86-topo.c
index bfabc0403a..2b104f86d7 100644
--- a/tests/unit/test-x86-cpuid.c
+++ b/tests/unit/test-x86-topo.c
@@ -1,5 +1,5 @@
 /*
- *  Test code for x86 CPUID and Topology functions
+ *  Test code for x86 APIC ID and Topology functions
  *
  *  Copyright (c) 2012 Red Hat Inc.
  *
diff --git a/tests/unit/meson.build b/tests/unit/meson.build
index f33ae64b8d..0dbe32ba9b 100644
--- a/tests/unit/meson.build
+++ b/tests/unit/meson.build
@@ -21,8 +21,8 @@ tests = {
   'test-opts-visitor': [testqapi],
   'test-visitor-serialization': [testqapi],
   'test-bitmap': [],
-  # all code tested by test-x86-cpuid is inside topology.h
-  'test-x86-cpuid': [],
+  # all code tested by test-x86-topo is inside topology.h
+  'test-x86-topo': [],
   'test-cutils': [],
   'test-div128': [],
   'test-shift128': [],
-- 
2.41.0


^ permalink raw reply related	[relevance 6%]

* Re: [PATCH 5/9] KVM: SVM: Save shadow stack host state on VMRUN
  @ 2023-11-02 18:07  4%   ` Maxim Levitsky
  2024-02-26 16:56  0%     ` John Allen
  0 siblings, 1 reply; 200+ results
From: Maxim Levitsky @ 2023-11-02 18:07 UTC (permalink / raw)
  To: John Allen, kvm
  Cc: linux-kernel, pbonzini, weijiang.yang, rick.p.edgecombe, seanjc,
	x86, thomas.lendacky, bp

On Tue, 2023-10-10 at 20:02 +0000, John Allen wrote:
> When running as an SEV-ES guest, the PL0_SSP, PL1_SSP, PL2_SSP, PL3_SSP,
> and U_CET fields in the VMCB save area are type B, meaning the host
> state is automatically loaded on a VMEXIT, but is not saved on a VMRUN.
> The other shadow stack MSRs, S_CET, SSP, and ISST_ADDR are type A,
> meaning they are loaded on VMEXIT and saved on VMRUN. PL0_SSP, PL1_SSP,
> and PL2_SSP are currently unused. Manually save the other type B host
> MSR values before VMRUN.
> 
> Signed-off-by: John Allen <john.allen@amd.com>
> ---
>  arch/x86/kvm/svm/sev.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index b9a0a939d59f..bb4b18baa6f7 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3098,6 +3098,15 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
>  		hostsa->dr2_addr_mask = amd_get_dr_addr_mask(2);
>  		hostsa->dr3_addr_mask = amd_get_dr_addr_mask(3);
>  	}
> +
> +	if (boot_cpu_has(X86_FEATURE_SHSTK)) {
> +		/*
> +		 * MSR_IA32_U_CET and MSR_IA32_PL3_SSP are restored on VMEXIT,
> +		 * save the current host values.
> +		 */
> +		rdmsrl(MSR_IA32_U_CET, hostsa->u_cet);
> +		rdmsrl(MSR_IA32_PL3_SSP, hostsa->pl3_ssp);
> +	}
>  }


Do we actually need this patch?

Host FPU state is not encrypted, and as far as I understood from the common CET patch series,
these MSRs will be restored on return to userspace.

Best regards,
	Maxim Levitsky


PS: AMD's APM is silent on how 'S_CET, SSP, and ISST_ADDR' are saved/restored for non encrypted guests.
Are they also type A?

Can the VMSA table be trusted in general to provide the same swap type as for the non encrypted guests?


>  
>  void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)



^ permalink raw reply	[relevance 4%]

* Re: [PATCH 4/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp
  2023-10-10 20:02  4% ` [PATCH 4/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp John Allen
@ 2023-11-02 18:06  4%   ` Maxim Levitsky
  0 siblings, 0 replies; 200+ results
From: Maxim Levitsky @ 2023-11-02 18:06 UTC (permalink / raw)
  To: John Allen, kvm
  Cc: linux-kernel, pbonzini, weijiang.yang, rick.p.edgecombe, seanjc,
	x86, thomas.lendacky, bp

On Tue, 2023-10-10 at 20:02 +0000, John Allen wrote:
> Rename SEV-ES save area SSP fields to be consistent with the APM.
> 
> Signed-off-by: John Allen <john.allen@amd.com>
> ---
>  arch/x86/include/asm/svm.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 19bf955b67e0..568d97084e44 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -361,10 +361,10 @@ struct sev_es_save_area {
>  	struct vmcb_seg ldtr;
>  	struct vmcb_seg idtr;
>  	struct vmcb_seg tr;
> -	u64 vmpl0_ssp;
> -	u64 vmpl1_ssp;
> -	u64 vmpl2_ssp;
> -	u64 vmpl3_ssp;
> +	u64 pl0_ssp;
> +	u64 pl1_ssp;
> +	u64 pl2_ssp;
> +	u64 pl3_ssp;
>  	u64 u_cet;
>  	u8 reserved_0xc8[2];
>  	u8 vmpl;

Matches the APM.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[relevance 4%]

* [PATCH v5 00/14] Add Secure TSC support for SNP guests
@ 2023-10-30  6:36  3% Nikunj A Dadhania
  0 siblings, 0 replies; 200+ results
From: Nikunj A Dadhania @ 2023-10-30  6:36 UTC (permalink / raw)
  To: linux-kernel, thomas.lendacky, x86, kvm
  Cc: bp, mingo, tglx, dave.hansen, dionnaglaze, pgonda, seanjc,
	pbonzini, nikunj

Secure TSC allows guests to securely use RDTSC/RDTSCP instructions as the
parameters being used cannot be changed by hypervisor once the guest is
launched. More details in the AMD64 APM Vol 2, Section "Secure TSC".

During the boot-up of the secondary CPUs, SecureTSC-enabled guests need to
query TSC info from the AMD Security Processor. This communication channel
is encrypted between the AMD Security Processor and the guest; the
hypervisor is just the conduit that delivers the guest messages to the AMD
Security Processor. Each message is protected with an AEAD (AES-256 GCM).
See the "SEV Secure Nested Paging Firmware ABI Specification" document
(currently at https://www.amd.com/system/files/TechDocs/56860.pdf),
section "TSC Info"

Use a minimal GCM library, which is available at early boot, to
encrypt/decrypt SNP Guest messages for communicating with the AMD
Security Processor.
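
A rough sketch of that kind of library usage (the aesgcm_* helper names,
their argument order, and the buffer/length names below are assumptions
about the minimal GCM library, not the final API):

	struct aesgcm_ctx ctx;

	/* Derive the AEAD context once from the guest's VMPCK. */
	if (aesgcm_expandkey(&ctx, vmpck, key_len, authtag_len))
		return -EINVAL;

	/* Seal one guest message before handing it to the hypervisor. */
	aesgcm_encrypt(&ctx, ciphertext, plaintext, len,
		       aad, aad_len, iv, authtag);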

The SEV-guest driver has the implementation for guest and AMD Security
Processor communication. As the TSC_INFO needs to be initialized during
early boot, before the SMP CPUs are started, move most of the sev-guest
driver code to kernel/sev.c and provide well-defined APIs that the
sev-guest driver can use, avoiding code duplication.

Patches:
01-07: Preparation and movement of sev-guest driver code
08-14: SecureTSC enablement patches.

Testing SecureTSC
-----------------

SecureTSC hypervisor patches based on top of SEV-SNP UPM series:
https://github.com/nikunjad/linux/tree/snp-host-latest-securetsc_v5

QEMU changes:
https://github.com/nikunjad/qemu/tree/snp_securetsc_v5

QEMU commandline SEV-SNP-UPM with SecureTSC:

  qemu-system-x86_64 -cpu EPYC-Milan-v2,+secure-tsc -smp 4 \
    -object memory-backend-memfd-private,id=ram1,size=1G,share=true \
    -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,secure-tsc=on \
    -machine q35,confidential-guest-support=sev0,memory-backend=ram1,kvm-type=snp \
    ...

Changelog:
----------
v5:
* Rebased on v6.6 kernel
* Dropped link tag in first patch
* Dropped get_ctx_authsize() as it was redundant

v4:
* Drop handle_guest_request() and handle_guest_request_ext()
* Drop NULL check for key
* Corrected commit subject
* Added Reviewed-by from Tom

https://lore.kernel.org/lkml/20230814055222.1056404-1-nikunj@amd.com/

v3:
* Updated commit messages
* Made snp_setup_psp_messaging() generic that is accessed by both the
  kernel and the driver
* Moved most of the context information to sev.c, sev-guest driver
  does not need to know the secrets page layout anymore
* Add CC_ATTR_GUEST_SECURE_TSC early in the series therefore it can be
  used in later patches.
* Removed data_gpa and data_npages from struct snp_req_data, as certs_data
  and its size is passed to handle_guest_request_ext()
* Make vmpck_id as unsigned int
* Dropped unnecessary usage of memzero_explicit()
* Cache secrets_pa instead of remapping the cc_blob always
* Rebase on top of v6.4 kernel
https://lore.kernel.org/lkml/20230722111909.15166-1-nikunj@amd.com/

v2:
* Rebased on top of v6.3-rc3 that has Boris's sev-guest cleanup series
  https://lore.kernel.org/r/20230307192449.24732-1-bp@alien8.de/

v1: https://lore.kernel.org/r/20230130120327.977460-1-nikunj@amd.com/

Nikunj A Dadhania (14):
  virt: sev-guest: Use AES GCM crypto library
  virt: sev-guest: Move mutex to SNP guest device structure
  virt: sev-guest: Replace dev_dbg with pr_debug
  virt: sev-guest: Add SNP guest request structure
  virt: sev-guest: Add vmpck_id to snp_guest_dev struct
  x86/sev: Cache the secrets page address
  x86/sev: Move and reorganize sev guest request api
  x86/mm: Add generic guest initialization hook
  x86/sev: Add Secure TSC support for SNP guests
  x86/sev: Change TSC MSR behavior for Secure TSC enabled guests
  x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled
    guests
  x86/kvmclock: Skip kvmclock when Secure TSC is available
  x86/tsc: Mark Secure TSC as reliable clocksource
  x86/sev: Enable Secure TSC for SNP guests

 arch/x86/Kconfig                        |   1 +
 arch/x86/boot/compressed/sev.c          |   3 +-
 arch/x86/coco/core.c                    |   3 +
 arch/x86/include/asm/sev-guest.h        | 175 +++++++
 arch/x86/include/asm/sev.h              |  20 +-
 arch/x86/include/asm/svm.h              |   6 +-
 arch/x86/include/asm/x86_init.h         |   2 +
 arch/x86/kernel/kvmclock.c              |   2 +-
 arch/x86/kernel/sev-shared.c            |   7 +
 arch/x86/kernel/sev.c                   | 631 +++++++++++++++++++++--
 arch/x86/kernel/tsc.c                   |   2 +-
 arch/x86/kernel/x86_init.c              |   2 +
 arch/x86/mm/mem_encrypt.c               |  13 +-
 arch/x86/mm/mem_encrypt_amd.c           |   6 +
 drivers/virt/coco/sev-guest/Kconfig     |   3 -
 drivers/virt/coco/sev-guest/sev-guest.c | 646 +++---------------------
 drivers/virt/coco/sev-guest/sev-guest.h |  63 ---
 include/linux/cc_platform.h             |   8 +
 18 files changed, 877 insertions(+), 716 deletions(-)
 create mode 100644 arch/x86/include/asm/sev-guest.h
 delete mode 100644 drivers/virt/coco/sev-guest/sev-guest.h


base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
-- 
2.34.1


^ permalink raw reply	[relevance 3%]

* [PATCH v5 19/20] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
    2023-10-24  9:03  6% ` [PATCH v5 02/20] tests: Rename test-x86-cpuid.c to test-x86-topo.c Zhao Liu
  2023-10-24  9:03  2% ` [PATCH v5 03/20] softmmu: Fix CPUSTATE.nr_cores' calculation Zhao Liu
@ 2023-10-24  9:03  4% ` Zhao Liu
  2 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2023-10-24  9:03 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Xiaoyao Li, Babu Moger, Yongwei Ma,
	Zhao Liu

From: Zhao Liu <zhao1.liu@intel.com>

The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
the number of sharing threads directly.

From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

From the description above, the calculation of this field should be the
same as for CPUID[4].EAX[bits 25:14] on Intel CPUs. So also use the APIC
ID offsets to calculate this field.

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
     Information
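
An illustrative example (the numbers are assumed, not from the patch): if
a die holds 8 logical processors sharing the L3, apicid_die_offset() is 3,
so num_sharing_cache = 1 << 3 = 8 and EAX[bits 25:14] is encoded as
8 - 1 = 7; ShareId = LocalApicId >> log2(8) then groups exactly those 8
logical processors onto the same L3.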

Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
Changes since v3:
 * Rewrite the subject. (Babu)
 * Delete the original "comment/help" expression, as this behavior is
   confirmed for AMD CPUs. (Babu)
 * Rename "num_apic_ids" (v3) to "num_sharing_cache" to match spec
   definition. (Babu)

Changes since v1:
 * Rename "l3_threads" to "num_apic_ids" in
   encode_cache_cpuid8000001d(). (Yanan)
 * Add the description of the original commit and add Cc.
---
 target/i386/cpu.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 90393396a0a4..226c5be6ea95 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -483,7 +483,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
                                        uint32_t *eax, uint32_t *ebx,
                                        uint32_t *ecx, uint32_t *edx)
 {
-    uint32_t l3_threads;
+    uint32_t num_sharing_cache;
     assert(cache->size == cache->line_size * cache->associativity *
                           cache->partitions * cache->sets);
 
@@ -492,13 +492,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
 
     /* L3 is shared among multiple cores */
     if (cache->level == 3) {
-        l3_threads = topo_info->modules_per_die *
-                     topo_info->cores_per_module *
-                     topo_info->threads_per_core;
-        *eax |= (l3_threads - 1) << 14;
+        num_sharing_cache = 1 << apicid_die_offset(topo_info);
     } else {
-        *eax |= ((topo_info->threads_per_core - 1) << 14);
+        num_sharing_cache = 1 << apicid_core_offset(topo_info);
     }
+    *eax |= (num_sharing_cache - 1) << 14;
 
     assert(cache->line_size > 0);
     assert(cache->partitions > 0);
-- 
2.34.1


^ permalink raw reply related	[relevance 4%]

* [PATCH v5 03/20] softmmu: Fix CPUSTATE.nr_cores' calculation
    2023-10-24  9:03  6% ` [PATCH v5 02/20] tests: Rename test-x86-cpuid.c to test-x86-topo.c Zhao Liu
@ 2023-10-24  9:03  2% ` Zhao Liu
  2023-10-24  9:03  4% ` [PATCH v5 19/20] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
  2 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2023-10-24  9:03 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Xiaoyao Li, Babu Moger, Yongwei Ma,
	Zhao Liu, Zhuocheng Ding

From: Zhuocheng Ding <zhuocheng.ding@intel.com>

From CPUState.nr_cores' comment, it represents "number of cores within
this CPU package".

After 003f230e37d7 ("machine: Tweak the order of topology members in
struct CpuTopology"), the meaning of smp.cores changed to "the number of
cores in one die", but that commit missed updating CPUState.nr_cores'
calculation, so CPUState.nr_cores became wrong: it no longer accounts
for the numbers of clusters and dies.

At present, only i386 is using CPUState.nr_cores.

But as for i386, which supports die level, the uses of CPUState.nr_cores
are very confusing:

Early uses are based on the meaning of "cores per package" (before die
is introduced into i386), and later uses are based on "cores per die"
(after die's introduction).

This difference arose because commit a94e1428991f ("target/i386: Add
CPUID.1F generation support for multi-dies PCMachine") misunderstood
CPUState.nr_cores as "cores per die" when calculating
CPUID.1FH.01H:EBX. After that, the changes in i386 all followed this
wrong understanding.

Under the combined influence of 003f230e37d7 and a94e1428991f, on i386
the result of CPUState.nr_cores is currently "cores per die", thus the
original uses of CPUState.nr_cores based on the meaning of "cores per
package" are wrong when multiple dies exist:
1. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.01H:EBX[bits 23:16] is
   incorrect because it expects "cpus per package" but now the
   result is "cpus per die".
2. In cpu_x86_cpuid() of target/i386/cpu.c, for all leaves of CPUID.04H:
   EAX[bits 31:26] is incorrect because they expect "cpus per package"
   but now the result is "cpus per die". The error not only impacts the
   EAX calculation in cache_info_passthrough case, but also impacts other
   cases of setting cache topology for Intel CPU according to cpu
   topology (specifically, the incoming parameter "num_cores" expects
   "cores per package" in encode_cache_cpuid4()).
3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.0BH.01H:EBX[bits
   15:00] is incorrect because the EBX of 0BH.01H (core level) expects
   "cpus per package", which may be different with 1FH.01H (The reason
   is 1FH can support more levels. For QEMU, 1FH also supports die,
   1FH.01H:EBX[bits 15:00] expects "cpus per die").
4. In cpu_x86_cpuid() of target/i386/cpu.c, when CPUID.80000001H is
   calculated, here "cpus per package" is expected to be checked, but in
   fact, now it checks "cpus per die". Though "cpus per die" also works
   for this code logic, this isn't consistent with AMD's APM.
5. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.80000008H:ECX expects
   "cpus per package" but it obtains "cpus per die".
6. In simulate_rdmsr() of target/i386/hvf/x86_emu.c, in
   kvm_rdmsr_core_thread_count() of target/i386/kvm/kvm.c, and in
   helper_rdmsr() of target/i386/tcg/sysemu/misc_helper.c,
   MSR_CORE_THREAD_COUNT expects "cpus per package" and "cores per
   package", but in these functions, it obtains "cpus per die" and
   "cores per die".

On the other hand, these uses are correct now (they were added in or
after a94e1428991f):
1. In cpu_x86_cpuid() of target/i386/cpu.c, topo_info.cores_per_die
   meets the actual meaning of CPUState.nr_cores ("cores per die").
2. In cpu_x86_cpuid() of target/i386/cpu.c, vcpus_per_socket (in CPUID.
   04H's calculation) considers number of dies, so it's correct.
3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.1FH.01H:EBX[bits
   15:00] needs "cpus per die" and it gets the correct result, and
   CPUID.1FH.02H:EBX[bits 15:00] gets correct "cpus per package".

Once CPUState.nr_cores is correctly changed back to "cores per package",
the above errors will be fixed without extra work, but the "currently"
correct cases will go wrong and will need special handling to obtain the
correct "cpus/cores per die" values they want.

Fix CPUState.nr_cores' calculation to match the original meaning, "cores
per package", and adjust the calculations of topo_info.cores_per_die,
vcpus_per_socket and CPUID.1FH accordingly.
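
As a quick illustration of the relationship being restored (an editorial
sketch, not code taken from QEMU; cores_per_package() below stands in for
machine_topo_get_cores_per_socket(), and folding clusters into it is an
assumption), the expected arithmetic is:

  #include <assert.h>

  /* "cores" is smp.cores, i.e. cores per die, as described above. */
  struct smp_topo {
      unsigned dies, clusters, cores, threads;
  };

  static unsigned cores_per_package(const struct smp_topo *smp)
  {
      return smp->dies * smp->clusters * smp->cores;
  }

  int main(void)
  {
      struct smp_topo smp = { .dies = 2, .clusters = 1, .cores = 4, .threads = 2 };
      unsigned nr_cores = cores_per_package(&smp);  /* what qemu_init_vcpu() should set */

      assert(nr_cores == 8);                 /* cores per package */
      assert(nr_cores / smp.dies == 4);      /* topo_info.cores_per_die */
      assert(nr_cores * smp.threads == 16);  /* vcpus per package, CPUID.1FH.02H:EBX */
      return 0;
  }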

Fixes: a94e1428991f ("target/i386: Add CPUID.1F generation support for multi-dies PCMachine")
Fixes: 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology")
Signed-off-by: Zhuocheng Ding <zhuocheng.ding@intel.com>
Co-developed-by: Zhao Liu <zhao1.liu@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
Changes since v3:
 * Describe changes in imperative mood. (Babu)
 * Fix spelling typo. (Babu)
 * Split the comment change into a separate patch. (Xiaoyao)

Changes since v2:
 * Use wrapped helper to get cores per socket in qemu_init_vcpu().

Changes since v1:
 * Add comment for nr_dies in CPUX86State. (Yanan)
---
 system/cpus.c     | 2 +-
 target/i386/cpu.c | 9 ++++-----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/system/cpus.c b/system/cpus.c
index 0848e0dbdb3f..fa8239c217ff 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -624,7 +624,7 @@ void qemu_init_vcpu(CPUState *cpu)
 {
     MachineState *ms = MACHINE(qdev_get_machine());
 
-    cpu->nr_cores = ms->smp.cores;
+    cpu->nr_cores = machine_topo_get_cores_per_socket(ms);
     cpu->nr_threads =  ms->smp.threads;
     cpu->stopped = true;
     cpu->random_seed = qemu_guest_random_seed_thread_part1();
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index bdca901dfaa4..36214c84ec14 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6019,7 +6019,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
     X86CPUTopoInfo topo_info;
 
     topo_info.dies_per_pkg = env->nr_dies;
-    topo_info.cores_per_die = cs->nr_cores;
+    topo_info.cores_per_die = cs->nr_cores / env->nr_dies;
     topo_info.threads_per_core = cs->nr_threads;
 
     /* Calculate & apply limits for different index ranges */
@@ -6095,8 +6095,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
              */
             if (*eax & 31) {
                 int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
-                int vcpus_per_socket = env->nr_dies * cs->nr_cores *
-                                       cs->nr_threads;
+                int vcpus_per_socket = cs->nr_cores * cs->nr_threads;
                 if (cs->nr_cores > 1) {
                     *eax &= ~0xFC000000;
                     *eax |= (pow2ceil(cs->nr_cores) - 1) << 26;
@@ -6273,12 +6272,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
             break;
         case 1:
             *eax = apicid_die_offset(&topo_info);
-            *ebx = cs->nr_cores * cs->nr_threads;
+            *ebx = topo_info.cores_per_die * topo_info.threads_per_core;
             *ecx |= CPUID_TOPOLOGY_LEVEL_CORE;
             break;
         case 2:
             *eax = apicid_pkg_offset(&topo_info);
-            *ebx = env->nr_dies * cs->nr_cores * cs->nr_threads;
+            *ebx = cs->nr_cores * cs->nr_threads;
             *ecx |= CPUID_TOPOLOGY_LEVEL_DIE;
             break;
         default:
-- 
2.34.1


^ permalink raw reply related	[relevance 2%]

* [PATCH v5 02/20] tests: Rename test-x86-cpuid.c to test-x86-topo.c
  @ 2023-10-24  9:03  6% ` Zhao Liu
  2023-10-24  9:03  2% ` [PATCH v5 03/20] softmmu: Fix CPUSTATE.nr_cores' calculation Zhao Liu
  2023-10-24  9:03  4% ` [PATCH v5 19/20] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
  2 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2023-10-24  9:03 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Xiaoyao Li, Babu Moger, Yongwei Ma,
	Zhao Liu

From: Zhao Liu <zhao1.liu@intel.com>

The tests in this file actually test the APIC ID combinations.
Rename to test-x86-topo.c to make its name more in line with its
actual content.

Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
Changes since v3:
 * Modify the description in commit message to emphasize this file tests
   APIC ID combinations. (Babu)

Changes since v1:
 * Rename test-x86-apicid.c to test-x86-topo.c. (Yanan)
---
 MAINTAINERS                                      | 2 +-
 tests/unit/meson.build                           | 4 ++--
 tests/unit/{test-x86-cpuid.c => test-x86-topo.c} | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)
 rename tests/unit/{test-x86-cpuid.c => test-x86-topo.c} (99%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 7f9912baa094..60a69822c262 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1748,7 +1748,7 @@ F: include/hw/southbridge/ich9.h
 F: include/hw/southbridge/piix.h
 F: hw/isa/apm.c
 F: include/hw/isa/apm.h
-F: tests/unit/test-x86-cpuid.c
+F: tests/unit/test-x86-topo.c
 F: tests/qtest/test-x86-cpuid-compat.c
 
 PC Chipset
diff --git a/tests/unit/meson.build b/tests/unit/meson.build
index f33ae64b8dc6..0dbe32ba9b83 100644
--- a/tests/unit/meson.build
+++ b/tests/unit/meson.build
@@ -21,8 +21,8 @@ tests = {
   'test-opts-visitor': [testqapi],
   'test-visitor-serialization': [testqapi],
   'test-bitmap': [],
-  # all code tested by test-x86-cpuid is inside topology.h
-  'test-x86-cpuid': [],
+  # all code tested by test-x86-topo is inside topology.h
+  'test-x86-topo': [],
   'test-cutils': [],
   'test-div128': [],
   'test-shift128': [],
diff --git a/tests/unit/test-x86-cpuid.c b/tests/unit/test-x86-topo.c
similarity index 99%
rename from tests/unit/test-x86-cpuid.c
rename to tests/unit/test-x86-topo.c
index bfabc0403a1a..2b104f86d7c2 100644
--- a/tests/unit/test-x86-cpuid.c
+++ b/tests/unit/test-x86-topo.c
@@ -1,5 +1,5 @@
 /*
- *  Test code for x86 CPUID and Topology functions
+ *  Test code for x86 APIC ID and Topology functions
  *
  *  Copyright (c) 2012 Red Hat Inc.
  *
-- 
2.34.1


^ permalink raw reply related	[relevance 6%]

* Re: [PATCH v2] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
  2023-10-18 19:20  3% [PATCH v2] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled Sean Christopherson
@ 2023-10-22 14:59  0% ` Santosh Shukla
  0 siblings, 0 replies; 200+ results
From: Santosh Shukla @ 2023-10-22 14:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Maxim Levitsky



On 10/19/2023 12:50 AM, Sean Christopherson wrote:
> When vNMI is enabled, rely entirely on hardware to correctly handle NMI
> blocking, i.e. don't intercept IRET to detect when NMIs are no longer
> blocked.  KVM already correctly ignores svm->nmi_masked when vNMI is
> enabled, so the effect of the bug is essentially an unnecessary VM-Exit.
> 
> KVM intercepts IRET for two reasons:
>  - To track NMI masking to be able to know at any point of time if NMI
>    is masked.
>  - To track NMI windows (to inject another NMI after the guest executes
>    IRET, i.e. unblocks NMIs)
> 
> When vNMI is enabled, both cases are handled by hardware:
> - NMI masking state resides in int_ctl.V_NMI_BLOCKING and can be read by
>   KVM at will.
> - Hardware automatically "injects" pending virtual NMIs when virtual NMIs
>   become unblocked.
> 
> However, even though pending a virtual NMI for hardware to handle is the
> most common way to synthesize a guest NMI, KVM may still directly inject
> an NMI when KVM is handling two "simultaneous" NMIs (see comments in
> process_nmi() for details on KVM's simultaneous NMI handling).  Per AMD's
> APM, hardware sets the BLOCKING flag when software directly injects an NMI
> as well, i.e. KVM doesn't need to manually mark vNMIs as blocked:
> 
>   If Event Injection is used to inject an NMI when NMI Virtualization is
>   enabled, VMRUN sets V_NMI_MASK in the guest state.
> 
> Note, it's still possible that KVM could trigger a spurious IRET VM-Exit.
> When running a nested guest, KVM disables vNMI for L2 and thus will enable
> IRET interception (in both vmcb01 and vmcb02) while running L2.  If
> a nested VM-Exit happens before L2 executes IRET, KVM can end up running
> L1 with vNMI enabled and IRET intercepted.  This is also a benign bug, and
> even less likely to happen, i.e. can be safely punted to a future fix.
> 
> Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI")
> Link: https://lore.kernel.org/all/ZOdnuDZUd4mevCqe@google.como
> Cc: Santosh Shukla <santosh.shukla@amd.com>
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> 
> v2: Expand changelog to explain the various behaviors and combos. [Maxim]
> 
> v1: https://lore.kernel.org/all/20231009212919.221810-1-seanjc@google.com
> 

Tested-by: Santosh Shukla <santosh.shukla@amd.com>

Thanks,


^ permalink raw reply	[relevance 0%]

* [PATCH] s390/vfio-ap: fix sysfs status attribute for AP queue devices
@ 2023-10-20 20:48  4% Tony Krowiak
  2023-11-06 16:03  0% ` Tony Krowiak
  0 siblings, 1 reply; 200+ results
From: Tony Krowiak @ 2023-10-20 20:48 UTC (permalink / raw)
  To: linux-s390, linux-kernel, kvm
  Cc: jjherne, pasic, borntraeger, frankja, imbrenda, david, stable

The 'status' attribute for AP queue devices bound to the vfio_ap device
driver displays incorrect status when the mediated device is attached to a
guest, but the queue device is not passed through. In the current
implementation, the status displayed is 'in_use' which is not correct; it
should be 'assigned'. This can happen if one of the queue devices
associated with a given adapter is not bound to the vfio_ap device driver.
For example:

Queues listed in /sys/bus/ap/drivers/vfio_ap:
14.0005
14.0006
14.000d
16.0006
16.000d

Queues listed in /sys/devices/vfio_ap/matrix/$UUID/matrix
14.0005
14.0006
14.000d
16.0005
16.0006
16.000d

Queues listed in /sys/devices/vfio_ap/matrix/$UUID/guest_matrix
14.0005
14.0006
14.000d

The reason no queues for adapter 0x16 are listed in the guest_matrix is
that queue 16.0005 is not bound to the vfio_ap device driver, so no
queue associated with the adapter is passed through to the guest;
therefore, each queue device for adapter 0x16 should display 'assigned'
instead of 'in_use', because those queues are not in use by a guest, but
only assigned to the mediated device.

Let's check the AP configuration for the guest to determine whether a
queue device is passed through before displaying a status of 'in_use'.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Fixes: f139862b92cf ("s390/vfio-ap: add status attribute to AP queue device's sysfs dir")
Cc: stable@vger.kernel.org
---
 drivers/s390/crypto/vfio_ap_ops.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 4db538a55192..871c14a6921f 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1976,6 +1976,7 @@ static ssize_t status_show(struct device *dev,
 {
 	ssize_t nchars = 0;
 	struct vfio_ap_queue *q;
+	unsigned long apid, apqi;
 	struct ap_matrix_mdev *matrix_mdev;
 	struct ap_device *apdev = to_ap_dev(dev);
 
@@ -1984,7 +1985,11 @@ static ssize_t status_show(struct device *dev,
 	matrix_mdev = vfio_ap_mdev_for_queue(q);
 
 	if (matrix_mdev) {
-		if (matrix_mdev->kvm)
+		apid = AP_QID_CARD(q->apqn);
+		apqi = AP_QID_QUEUE(q->apqn);
+		if (matrix_mdev->kvm &&
+		    test_bit_inv(apid, matrix_mdev->shadow_apcb.apm) &&
+		    test_bit_inv(apqi, matrix_mdev->shadow_apcb.aqm))
 			nchars = scnprintf(buf, PAGE_SIZE, "%s\n",
 					   AP_QUEUE_IN_USE);
 		else
-- 
2.41.0


^ permalink raw reply related	[relevance 4%]

* [PATCH 1/2] Revert "nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB"
  @ 2023-10-18 19:41  5% ` Sean Christopherson
  2023-11-20 11:41  0%   ` Maxim Levitsky
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2023-10-18 19:41 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Stefan Sterz

Revert KVM's made-up consistency check on SVM's TLB control.  The APM says
that unsupported encodings are reserved, but the APM doesn't state that
VMRUN checks for a supported encoding.  Unless something is called out
in "Canonicalization and Consistency Checks" or listed as MBZ (Must Be
Zero), AMD behavior is typically to let software shoot itself in the foot.

This reverts commit 174a921b6975ef959dd82ee9e8844067a62e3ec1.

Fixes: 174a921b6975 ("nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB")
Reported-by: Stefan Sterz <s.sterz@proxmox.com>
Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/nested.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 3fea8c47679e..60891b9ce25f 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -247,18 +247,6 @@ static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size)
 	    kvm_vcpu_is_legal_gpa(vcpu, addr + size - 1);
 }
 
-static bool nested_svm_check_tlb_ctl(struct kvm_vcpu *vcpu, u8 tlb_ctl)
-{
-	/* Nested FLUSHBYASID is not supported yet.  */
-	switch(tlb_ctl) {
-		case TLB_CONTROL_DO_NOTHING:
-		case TLB_CONTROL_FLUSH_ALL_ASID:
-			return true;
-		default:
-			return false;
-	}
-}
-
 static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
 					 struct vmcb_ctrl_area_cached *control)
 {
@@ -278,9 +266,6 @@ static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
 					   IOPM_SIZE)))
 		return false;
 
-	if (CC(!nested_svm_check_tlb_ctl(vcpu, control->tlb_ctl)))
-		return false;
-
 	if (CC((control->int_ctl & V_NMI_ENABLE_MASK) &&
 	       !vmcb12_is_intercept(control, INTERCEPT_NMI))) {
 		return false;
-- 
2.42.0.655.g421f12c284-goog


^ permalink raw reply related	[relevance 5%]

* [PATCH v2] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
@ 2023-10-18 19:20  3% Sean Christopherson
  2023-10-22 14:59  0% ` Santosh Shukla
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2023-10-18 19:20 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Santosh Shukla, Maxim Levitsky

When vNMI is enabled, rely entirely on hardware to correctly handle NMI
blocking, i.e. don't intercept IRET to detect when NMIs are no longer
blocked.  KVM already correctly ignores svm->nmi_masked when vNMI is
enabled, so the effect of the bug is essentially an unnecessary VM-Exit.

KVM intercepts IRET for two reasons:
 - To track NMI masking to be able to know at any point of time if NMI
   is masked.
 - To track NMI windows (to inject another NMI after the guest executes
   IRET, i.e. unblocks NMIs)

When vNMI is enabled, both cases are handled by hardware:
- NMI masking state resides in int_ctl.V_NMI_BLOCKING and can be read by
  KVM at will.
- Hardware automatically "injects" pending virtual NMIs when virtual NMIs
  become unblocked.

However, even though pending a virtual NMI for hardware to handle is the
most common way to synthesize a guest NMI, KVM may still directly inject
an NMI when KVM is handling two "simultaneous" NMIs (see comments in
process_nmi() for details on KVM's simultaneous NMI handling).  Per AMD's
APM, hardware sets the BLOCKING flag when software directly injects an NMI
as well, i.e. KVM doesn't need to manually mark vNMIs as blocked:

  If Event Injection is used to inject an NMI when NMI Virtualization is
  enabled, VMRUN sets V_NMI_MASK in the guest state.

Note, it's still possible that KVM could trigger a spurious IRET VM-Exit.
When running a nested guest, KVM disables vNMI for L2 and thus will enable
IRET interception (in both vmcb01 and vmcb02) while running L2.  If
a nested VM-Exit happens before L2 executes IRET, KVM can end up running
L1 with vNMI enabled and IRET intercepted.  This is also a benign bug, and
even less likely to happen, i.e. can be safely punted to a future fix.

Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI")
Link: https://lore.kernel.org/all/ZOdnuDZUd4mevCqe@google.como
Cc: Santosh Shukla <santosh.shukla@amd.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---

v2: Expand changelog to explain the various behaviors and combos. [Maxim]

v1: https://lore.kernel.org/all/20231009212919.221810-1-seanjc@google.com

 arch/x86/kvm/svm/svm.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1785de7dc98b..517a12e0f1fd 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3568,8 +3568,15 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
 	if (svm->nmi_l1_to_l2)
 		return;
 
-	svm->nmi_masked = true;
-	svm_set_iret_intercept(svm);
+	/*
+	 * No need to manually track NMI masking when vNMI is enabled, hardware
+	 * automatically sets V_NMI_BLOCKING_MASK as appropriate, including the
+	 * case where software directly injects an NMI.
+	 */
+	if (!is_vnmi_enabled(svm)) {
+		svm->nmi_masked = true;
+		svm_set_iret_intercept(svm);
+	}
 	++vcpu->stat.nmi_injections;
 }
 

base-commit: 437bba5ad2bba00c2056c896753a32edf80860cc
-- 
2.42.0.655.g421f12c284-goog


^ permalink raw reply related	[relevance 3%]

* Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event
  @ 2023-10-18  0:01  4%                         ` Mingwei Zhang
  0 siblings, 0 replies; 200+ results
From: Mingwei Zhang @ 2023-10-18  0:01 UTC (permalink / raw)
  To: Manali Shukla
  Cc: Peter Zijlstra, Sean Christopherson, Ingo Molnar, Dapeng Mi,
	Paolo Bonzini, Arnaldo Carvalho de Melo, Kan Liang, Like Xu,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, kvm, linux-perf-users, linux-kernel,
	Zhenyu Wang, Zhang Xiong, Lv Zhiyuan, Yang Weijiang, Dapeng Mi,
	David Dunn, Thomas Gleixner, Jim Mattson, Like Xu

On Tue, Oct 17, 2023 at 9:58 AM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Tue, Oct 17, 2023 at 3:24 AM Manali Shukla <manali.shukla@amd.com> wrote:
> >
> > On 10/11/2023 7:45 PM, Peter Zijlstra wrote:
> > > On Mon, Oct 09, 2023 at 10:33:41PM +0530, Manali Shukla wrote:
> > >> Hi all,
> > >>
> > >> I would like to add following things to the discussion just for the awareness of
> > >> everyone.
> > >>
> > >> Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
> > >> working on prototyping it.
> > >>
> > >> As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
> > >> type C: guest PMC states are loaded at VMRUN automatically but host PMC states
> > >> are not saved by hardware.
> > >
> > > Per the previous discussion, doing this while host has active counters
> > > that do not have ::exclude_guest=1 is invalid and must result in an
> > > error.
> > >
> >
> > Yeah, exclude_guest should be enforced on host, while host has active PMC
> > counters and VPMC is enabled.
> >
> > > Also, I'm assuming it is all optional, a host can still profile a guest
> > > if all is configured just so?
> > >
> >
> > Correct, host should be able to profile guest, if VPMC is not enabled.
> >
> > >> If hypervisor is using the performance counters, it
> > >> is hypervisor's responsibility to save PERF_CTL registers to host save area
> > >> prior to VMRUN and restore them after VMEXIT.
> > >
> > > Does VMEXIT clear global_ctrl at least?
> > >
> >
> > global_ctrl will be initialized to reset value(0x3F) during VMEXIT. Similarly,
> > all the perf_ctl and perf_ctr are initialized to reset values(0) at VMEXIT.
>
> Are these guest values (automatically) saved in the guest area of VMCB
> on VMEXIT?
>

Never mind on this one. So, if both values are in Type C, then they
should be saved to the guest area of VMCB on VMEXIT (according to APM
vol 2 Table B-3). So, this means KVM does not need to intercept these
MSRs in a pass-through implementation.
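
To make the "Swap type C" behavior above concrete, here is a rough
editorial sketch (not from this thread; the hook points and helper names
are made up, while MSR_F15H_PERF_CTL0 and rdmsrl()/wrmsrl() are the usual
kernel definitions) of the host-side save/restore a hypervisor would do
around VMRUN:

  #include <asm/msr.h>

  #define NR_AMD_CORE_PMCS	6

  static u64 host_perf_ctl[NR_AMD_CORE_PMCS];

  static void save_host_perf_ctl(void)
  {
  	int i;

  	/* PERF_CTLn MSRs live at MSR_F15H_PERF_CTL0 + 2 * n */
  	for (i = 0; i < NR_AMD_CORE_PMCS; i++)
  		rdmsrl(MSR_F15H_PERF_CTL0 + 2 * i, host_perf_ctl[i]);
  }

  static void restore_host_perf_ctl(void)
  {
  	int i;

  	for (i = 0; i < NR_AMD_CORE_PMCS; i++)
  		wrmsrl(MSR_F15H_PERF_CTL0 + 2 * i, host_perf_ctl[i]);
  }

  /* save_host_perf_ctl(); ...VMRUN loads guest PERF_CTL...; restore_host_perf_ctl(); */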

Thanks.
-Mingwei


^ permalink raw reply	[relevance 4%]

* [PATCH v10 35/50] KVM: SEV: Add support to handle RMP nested page faults
      2023-10-16 13:28  4% ` [PATCH v10 31/50] KVM: SEV: Add KVM_EXIT_VMGEXIT Michael Roth
@ 2023-10-16 13:28  2% ` Michael Roth
  2 siblings, 0 replies; 200+ results
From: Michael Roth @ 2023-10-16 13:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, marcorr, sathyanarayanan.kuppuswamy,
	alpergun, jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, zhi.a.wang, Brijesh Singh

From: Brijesh Singh <brijesh.singh@amd.com>

When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When the
hardware encounters an RMP check failure caused by a guest memory access,
it raises a #NPF. The error code contains additional information on
the access type. See the APM volume 2 for additional information.

When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/sev-common.h |  3 +
 arch/x86/kvm/svm/sev.c            | 92 +++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c            | 21 +++++--
 arch/x86/kvm/svm/svm.h            |  1 +
 4 files changed, 113 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 9febc1474a30..15d8e9805963 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -188,6 +188,9 @@ struct snp_psc_desc {
 /* RMUPDATE detected 4K page and 2MB page overlap. */
 #define RMPUPDATE_FAIL_OVERLAP		4
 
+/* PSMASH failed due to concurrent access by another CPU */
+#define PSMASH_FAIL_INUSE		3
+
 /* RMP page size */
 #define RMP_PG_SIZE_4K			0
 #define RMP_PG_SIZE_2M			1
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 0287fadeae76..0a45031386c2 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3270,6 +3270,13 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_rmptable_psmash(kvm_pfn_t pfn)
+{
+	pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+	return psmash(pfn);
+}
+
 static int snp_complete_psc_msr_protocol(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -3816,3 +3823,88 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return p;
 }
+
+void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = vcpu->kvm;
+	int order, rmp_level, ret;
+	bool assigned;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+
+	gfn = gpa >> PAGE_SHIFT;
+
+	/*
+	 * The only time RMP faults occur for shared pages is when the guest is
+	 * triggering an RMP fault for an implicit page-state change from
+	 * shared->private. Implicit page-state changes are forwarded to
+	 * userspace via KVM_EXIT_MEMORY_FAULT events, however, so RMP faults
+	 * for shared pages should not end up here.
+	 */
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, size-mismatch for non-private GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &order);
+	if (ret) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no private backing page for GPA 0x%llx\n",
+				    gpa);
+		return;
+	}
+
+	ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+	if (ret || !assigned) {
+		pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
+				    gpa, pfn, ret);
+		goto out;
+	}
+
+	/*
+	 * There are 2 cases where a PSMASH may be needed to resolve an #NPF
+	 * with PFERR_GUEST_RMP_BIT set:
+	 *
+	 * 1) RMPADJUST/PVALIDATE can trigger an #NPF with PFERR_GUEST_SIZEM
+	 *    bit set if the guest issues them with a smaller granularity than
+	 *    what is indicated by the page-size bit in the 2MB-aligned RMP
+	 *    entry for the PFN that backs the GPA.
+	 *
+	 * 2) Guest access via NPT can trigger an #NPF if the NPT mapping is
+	 *    smaller than what is indicated by the 2MB-aligned RMP entry for
+	 *    the PFN that backs the GPA.
+	 *
+	 * In both these cases, the corresponding 2M RMP entry needs to
+	 * be PSMASH'd to 512 4K RMP entries.  If the RMP entry is already
+	 * split into 4K RMP entries, then this is likely a spurious case which
+	 * can occur when there are concurrent accesses by the guest to a 2MB
+	 * GPA range that is backed by a 2MB-aligned PFN whose RMP entry is in
+	 * the process of being PSMASH'd into 4K entries. These cases should
+	 * resolve automatically on subsequent accesses, so just ignore them
+	 * here.
+	 */
+	if (rmp_level == PG_LEVEL_4K) {
+		pr_debug_ratelimited("%s: Spurious RMP fault for GPA 0x%llx, error_code 0x%llx",
+				     __func__, gpa, error_code);
+		goto out;
+	}
+
+	pr_debug_ratelimited("%s: Splitting 2M RMP entry for GPA 0x%llx, error_code 0x%llx",
+			     __func__, gpa, error_code);
+	ret = snp_rmptable_psmash(pfn);
+	if (ret && ret != PSMASH_FAIL_INUSE)
+		pr_err_ratelimited("SEV: Unable to split RMP entry for GPA 0x%llx PFN 0x%llx ret %d\n",
+				   gpa, pfn, ret);
+
+	kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+out:
+	put_page(pfn_to_page(pfn));
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 8e4ef0cd968a..563c9839428d 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2046,15 +2046,28 @@ static int pf_interception(struct kvm_vcpu *vcpu)
 static int npf_interception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	int rc;
 
 	u64 fault_address = svm->vmcb->control.exit_info_2;
 	u64 error_code = svm->vmcb->control.exit_info_1;
 
 	trace_kvm_page_fault(vcpu, fault_address, error_code);
-	return kvm_mmu_page_fault(vcpu, fault_address, error_code,
-			static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
-			svm->vmcb->control.insn_bytes : NULL,
-			svm->vmcb->control.insn_len);
+	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+				svm->vmcb->control.insn_bytes : NULL,
+				svm->vmcb->control.insn_len);
+
+	/*
+	 * rc == 0 indicates a userspace exit is needed to handle page
+	 * transitions, so do that first before updating the RMP table.
+	 */
+	if (error_code & PFERR_GUEST_RMP_MASK) {
+		if (rc == 0)
+			return rc;
+		handle_rmp_page_fault(vcpu, fault_address, error_code);
+	}
+
+	return rc;
 }
 
 static int db_interception(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index c4449a88e629..c3a37136fa30 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -715,6 +715,7 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
 void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
 void sev_es_unmap_ghcb(struct vcpu_svm *svm);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
+void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 
 /* vmenter.S */
 
-- 
2.25.1


^ permalink raw reply related	[relevance 2%]

* [PATCH v10 31/50] KVM: SEV: Add KVM_EXIT_VMGEXIT
    @ 2023-10-16 13:28  4% ` Michael Roth
  2023-10-16 13:28  2% ` [PATCH v10 35/50] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
  2 siblings, 0 replies; 200+ results
From: Michael Roth @ 2023-10-16 13:28 UTC (permalink / raw)
  To: kvm
  Cc: linux-coco, linux-mm, linux-crypto, x86, linux-kernel, tglx,
	mingo, jroedel, thomas.lendacky, hpa, ardb, pbonzini, seanjc,
	vkuznets, jmattson, luto, dave.hansen, slp, pgonda, peterz,
	srinivas.pandruvada, rientjes, dovmurik, tobin, bp, vbabka,
	kirill, ak, tony.luck, marcorr, sathyanarayanan.kuppuswamy,
	alpergun, jarkko, ashish.kalra, nikunj.dadhania, pankaj.gupta,
	liam.merwick, zhi.a.wang

For private memslots, GHCB page state change requests will be forwarded
to userspace for processing. Define a new KVM_EXIT_VMGEXIT for exits of
this type, which can also cover other potential userspace handling of
VMGEXITs in the future.
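
As a rough editorial sketch (not part of this patch), this is the shape
of the corresponding userspace handling, using only the kvm_run fields
added below; the three helpers are made-up placeholders for the GHCB
page state change processing described in the documentation hunk:

  #include <stdbool.h>
  #include <linux/kvm.h>	/* struct kvm_run, after this patch */

  /* Made-up placeholders for the VMM's PSC processing. */
  bool ghcb_msr_is_psc_request(__u64 ghcb_msr);
  __u64 handle_msr_psc_request(__u64 ghcb_msr);
  __u64 handle_ghcb_psc_request(__u64 ghcb_gpa);

  static void handle_vmgexit(struct kvm_run *run)
  {
  	if (run->exit_reason != KVM_EXIT_VMGEXIT)
  		return;

  	if (ghcb_msr_is_psc_request(run->vmgexit.ghcb_msr))
  		/* MSR protocol: the reply goes back through ghcb_msr */
  		run->vmgexit.ghcb_msr = handle_msr_psc_request(run->vmgexit.ghcb_msr);
  	else
  		/* GHCB page: 'ret' becomes SW_EXITINFO2 for the guest */
  		run->vmgexit.ret = handle_ghcb_psc_request(run->vmgexit.ghcb_msr);
  }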

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 Documentation/virt/kvm/api.rst | 34 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       |  6 ++++++
 2 files changed, 40 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5e08f2a157ef..e84c62423ab7 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6847,6 +6847,40 @@ Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
+::
+
+		/* KVM_EXIT_VMGEXIT */
+		struct {
+			__u64 ghcb_msr; /* GHCB MSR contents */
+			__u64 ret;      /* user -> kernel return value */
+		} vmgexit;
+
+If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest has
+issued a VMGEXIT instruction (as documented by the AMD Architecture
+Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
+userspace. This is generally handled via the Guest-Hypervisor Communication
+Block (GHCB) specification. The value of 'ghcb_msr' will be the contents of
+the GHCB MSR register at the time of the VMGEXIT, which can either be the GPA
+of the GHCB page for page-based GHCB requests, or an encoding of an MSR-based
+GHCB request. The mechanism to distinguish between these two and determine the
+type of request is the same as what is documented in the GHCB specification.
+
+Not all VMGEXITs or GHCB requests will be forwarded to userspace. Currently
+this will only be the case for "SNP Page State Change" requests (PSCs), and
+only for the subset of these which involve actual shared <-> private
+transition. Userspace is expected to process these requests in accordance
+with the GHCB specification and issue KVM_SET_MEMORY_ATTRIBUTE ioctls to
+perform the shared/private transitions.
+
+GHCB page-based PSC requests require returning a 64-bit return value to the
+guest via the SW_EXITINFO2 field of the vCPU's VMCB structure, as documented
+in the GHCB. Userspace must set 'ret' to what the GHCB specification documents
+the SW_EXITINFO2 VMCB field should be set to after processing a PSC request.
+
+For MSR-based PSC requests, userspace must set the value of 'ghcb_msr' to be
+the same as what the GHCB specification documents the actual GHCB MSR register
+should be set to after processing a PSC request.
+
 
 6. Capabilities that can be enabled on vCPUs
 ============================================
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6f7b44b32497..3af546adb962 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -279,6 +279,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_MEMORY_FAULT     38
+#define KVM_EXIT_VMGEXIT          50
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -525,6 +526,11 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_VMGEXIT */
+		struct {
+			__u64 ghcb_msr; /* GHCB MSR contents */
+			__u64 ret; /* user -> kernel */
+		} vmgexit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
  2023-10-14 10:16  0%     ` Santosh Shukla
@ 2023-10-14 14:49  0%       ` Santosh Shukla
  0 siblings, 0 replies; 200+ results
From: Santosh Shukla @ 2023-10-14 14:49 UTC (permalink / raw)
  To: Sean Christopherson, Maxim Levitsky; +Cc: Paolo Bonzini, kvm, linux-kernel



On 10/14/2023 3:46 PM, Santosh Shukla wrote:
> 
> 
> On 10/10/2023 8:16 PM, Sean Christopherson wrote:
>> On Tue, Oct 10, 2023, Maxim Levitsky wrote:
>>> On Mon, 2023-10-09 at 14:29 -0700, Sean Christopherson wrote:
>>>> Note, per the APM, hardware sets the BLOCKING flag when software
>>>> directly injects an NMI:
>>>>
>>>>   If Event Injection is used to inject an NMI when NMI Virtualization is
>>>>   enabled, VMRUN sets V_NMI_MASK in the guest state.
>>>
>>> I think that this comment is not needed in the commit message. It describes
>>> a different unrelated concern and can be put somewhere in the code but
>>> not in the commit message.
>>
>> I strongly disagree, this blurb in the APM directly affects the patch.  If hardware
>> didn't set V_NMI_MASK, then the patch would need to be at least this:
>>
HW sets the V_NMI_MASK.

>> --
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index b7472ad183b9..d34ee3b8293e 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -3569,8 +3569,12 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
>>  	if (svm->nmi_l1_to_l2)
>>  		return;
>>  
>> -	svm->nmi_masked = true;
>> -	svm_set_iret_intercept(svm);
>> +	if (is_vnmi_enabled(svm)) {
>> +		svm->vmcb->control.int_ctl |= V_NMI_BLOCKING_MASK;
>> +	} else {
>> +		svm->nmi_masked = true;
>> +		svm_set_iret_intercept(svm);
>> +	}
>>  	++vcpu->stat.nmi_injections;
>>  }
>>  
>>
> 
> quick testing worked fine, KUT test ran fine and tested for non-nested mode so far.
> Will do more nested testing and share the feedback.
> 

Sean - I have tested original patch[1] for nested and KUT, worked fine.

Thanks,
Santosh 

[1] https://lore.kernel.org/r/20231009212919.221810-1-seanjc@google.com 

> Thanks,
> Santosh
> 
>> base-commit: 86701e115030e020a052216baa942e8547e0b487
> 


^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
  2023-10-10 14:46  4%   ` Sean Christopherson
  2023-10-10 16:06  4%     ` Maxim Levitsky
@ 2023-10-14 10:16  0%     ` Santosh Shukla
  2023-10-14 14:49  0%       ` Santosh Shukla
  1 sibling, 1 reply; 200+ results
From: Santosh Shukla @ 2023-10-14 10:16 UTC (permalink / raw)
  To: Sean Christopherson, Maxim Levitsky; +Cc: Paolo Bonzini, kvm, linux-kernel



On 10/10/2023 8:16 PM, Sean Christopherson wrote:
> On Tue, Oct 10, 2023, Maxim Levitsky wrote:
>> On Mon, 2023-10-09 at 14:29 -0700, Sean Christopherson wrote:
>>> Note, per the APM, hardware sets the BLOCKING flag when software
>>> directly injects an NMI:
>>>
>>>   If Event Injection is used to inject an NMI when NMI Virtualization is
>>>   enabled, VMRUN sets V_NMI_MASK in the guest state.
>>
>> I think that this comment is not needed in the commit message. It describes
>> a different unrelated concern and can be put somewhere in the code but
>> not in the commit message.
> 
> I strongly disagree, this blurb in the APM directly affects the patch.  If hardware
> didn't set V_NMI_MASK, then the patch would need to be at least this:
> 
> --
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index b7472ad183b9..d34ee3b8293e 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3569,8 +3569,12 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
>  	if (svm->nmi_l1_to_l2)
>  		return;
>  
> -	svm->nmi_masked = true;
> -	svm_set_iret_intercept(svm);
> +	if (is_vnmi_enabled(svm)) {
> +		svm->vmcb->control.int_ctl |= V_NMI_BLOCKING_MASK;
> +	} else {
> +		svm->nmi_masked = true;
> +		svm_set_iret_intercept(svm);
> +	}
>  	++vcpu->stat.nmi_injections;
>  }
>  
>

quick testing worked fine, KUT test ran fine and tested for non-nested mode so far.
Will do more nested testing and share the feedback.

Thanks,
Santosh

> base-commit: 86701e115030e020a052216baa942e8547e0b487


^ permalink raw reply	[relevance 0%]

* [PATCH 4/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp
  @ 2023-10-10 20:02  4% ` John Allen
  2023-11-02 18:06  4%   ` Maxim Levitsky
    1 sibling, 1 reply; 200+ results
From: John Allen @ 2023-10-10 20:02 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, pbonzini, weijiang.yang, rick.p.edgecombe, seanjc,
	x86, thomas.lendacky, bp, John Allen

Rename SEV-ES save area SSP fields to be consistent with the APM.

Signed-off-by: John Allen <john.allen@amd.com>
---
 arch/x86/include/asm/svm.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 19bf955b67e0..568d97084e44 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -361,10 +361,10 @@ struct sev_es_save_area {
 	struct vmcb_seg ldtr;
 	struct vmcb_seg idtr;
 	struct vmcb_seg tr;
-	u64 vmpl0_ssp;
-	u64 vmpl1_ssp;
-	u64 vmpl2_ssp;
-	u64 vmpl3_ssp;
+	u64 pl0_ssp;
+	u64 pl1_ssp;
+	u64 pl2_ssp;
+	u64 pl3_ssp;
 	u64 u_cet;
 	u8 reserved_0xc8[2];
 	u8 vmpl;
-- 
2.40.1


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
  2023-10-10 16:06  4%     ` Maxim Levitsky
@ 2023-10-10 17:50  0%       ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2023-10-10 17:50 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Paolo Bonzini, kvm, linux-kernel, Santosh Shukla

On Tue, Oct 10, 2023, Maxim Levitsky wrote:
> On Tue, 2023-10-10 at 07:46 -0700, Sean Christopherson wrote:
> > On Tue, Oct 10, 2023, Maxim Levitsky wrote:
> > > On Mon, 2023-10-09 at 14:29 -0700, Sean Christopherson wrote:
> > > > Note, per the APM, hardware sets the BLOCKING flag when software
> > > > directly injects an NMI:
> > > > 
> > > >   If Event Injection is used to inject an NMI when NMI Virtualization is
> > > >   enabled, VMRUN sets V_NMI_MASK in the guest state.
> > > 
> > > I think that this comment is not needed in the commit message. It describes
> > > a different unrelated concern and can be put somewhere in the code but
> > > not in the commit message.
> > 
> > I strongly disagree, this blurb in the APM directly affects the patch.  If hardware
> > didn't set V_NMI_MASK, then the patch would need to be at least this:
> 
> I don't see how 'the blurb in the APM' relates to the removal of the 
> IRET intercept, which is what this patch is about.

No, it's not *just* about IRET interception.  This patch also guards:

	svm->nmi_masked = true;

If the reader doesn't already know that hardware sets V_NMI_BLOCK_MASK on direct
injection, as was the case for me when I stumbled upon this issue, it's not at
all obvious that not doing something analogous to setting nmi_masked is correct.

I mentioned only IRET interception in the shortlog because that's the only practical
impact of the change.  I can massage the shortlog if it's confusing/misleading,
but I really don't want to drop the reference to hardware setting V_NMI_BLOCK_MASK.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
  2023-10-10 14:46  4%   ` Sean Christopherson
@ 2023-10-10 16:06  4%     ` Maxim Levitsky
  2023-10-10 17:50  0%       ` Sean Christopherson
  2023-10-14 10:16  0%     ` Santosh Shukla
  1 sibling, 1 reply; 200+ results
From: Maxim Levitsky @ 2023-10-10 16:06 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Santosh Shukla

On Tue, 2023-10-10 at 07:46 -0700, Sean Christopherson wrote:
> On Tue, Oct 10, 2023, Maxim Levitsky wrote:
> > On Mon, 2023-10-09 at 14:29 -0700, Sean Christopherson wrote:
> > > Note, per the APM, hardware sets the BLOCKING flag when software
> > > directly injects an NMI:
> > > 
> > >   If Event Injection is used to inject an NMI when NMI Virtualization is
> > >   enabled, VMRUN sets V_NMI_MASK in the guest state.
> > 
> > I think that this comment is not needed in the commit message. It describes
> > a different unrelated concern and can be put somewhere in the code but
> > not in the commit message.
> 
> I strongly disagree, this blurb in the APM directly affects the patch.  If hardware
> didn't set V_NMI_MASK, then the patch would need to be at least this:

I don't see how 'the blurb in the APM' relates to the removal of the 
IRET intercept, which is what this patch is about.

If the hardware did not set the V_NMI_BLOCKING_MASK during EVENTINJ NMI injection,
we would have a bigger problem, one which would have to be addressed
before this patch, because KVM reads back the V_NMI_BLOCKING_MASK
(see: svm_get_nmi_mask()) to check if NMI is blocked, something that
has no relation to the IRET interception.


Best regards,
	Maxim Levitsky


> 
> --
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index b7472ad183b9..d34ee3b8293e 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3569,8 +3569,12 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
>  	if (svm->nmi_l1_to_l2)
>  		return;
>  
> -	svm->nmi_masked = true;
> -	svm_set_iret_intercept(svm);
> +	if (is_vnmi_enabled(svm)) {
> +		svm->vmcb->control.int_ctl |= V_NMI_BLOCKING_MASK;
> +	} else {
> +		svm->nmi_masked = true;
> +		svm_set_iret_intercept(svm);
> +	}
>  	++vcpu->stat.nmi_injections;
>  }
>  
> 
> base-commit: 86701e115030e020a052216baa942e8547e0b487



^ permalink raw reply	[relevance 4%]

* Re: [PATCH] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
  2023-10-10 12:03  0% ` Maxim Levitsky
@ 2023-10-10 14:46  4%   ` Sean Christopherson
  2023-10-10 16:06  4%     ` Maxim Levitsky
  2023-10-14 10:16  0%     ` Santosh Shukla
  0 siblings, 2 replies; 200+ results
From: Sean Christopherson @ 2023-10-10 14:46 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: Paolo Bonzini, kvm, linux-kernel, Santosh Shukla

On Tue, Oct 10, 2023, Maxim Levitsky wrote:
> On Mon, 2023-10-09 at 14:29 -0700, Sean Christopherson wrote:
> > Note, per the APM, hardware sets the BLOCKING flag when software
> > directly injects an NMI:
> > 
> >   If Event Injection is used to inject an NMI when NMI Virtualization is
> >   enabled, VMRUN sets V_NMI_MASK in the guest state.
> 
> I think that this comment is not needed in the commit message. It describes
> a different unrelated concern and can be put somewhere in the code but
> not in the commit message.

I strongly disagree, this blurb in the APM directly affects the patch.  If hardware
didn't set V_NMI_MASK, then the patch would need to be at least this:

--
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b7472ad183b9..d34ee3b8293e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3569,8 +3569,12 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
 	if (svm->nmi_l1_to_l2)
 		return;
 
-	svm->nmi_masked = true;
-	svm_set_iret_intercept(svm);
+	if (is_vnmi_enabled(svm)) {
+		svm->vmcb->control.int_ctl |= V_NMI_BLOCKING_MASK;
+	} else {
+		svm->nmi_masked = true;
+		svm_set_iret_intercept(svm);
+	}
 	++vcpu->stat.nmi_injections;
 }
 

base-commit: 86701e115030e020a052216baa942e8547e0b487
-- 

and maybe even more to deal with canceled injection.

^ permalink raw reply related	[relevance 4%]

* Re: [PATCH] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
  2023-10-09 21:29  4% [PATCH] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled Sean Christopherson
@ 2023-10-10 12:03  0% ` Maxim Levitsky
  2023-10-10 14:46  4%   ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Maxim Levitsky @ 2023-10-10 12:03 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Santosh Shukla

On Mon, 2023-10-09 at 14:29 -0700, Sean Christopherson wrote:
> When vNMI is enabled, rely entirely on hardware to correctly handle NMI
> blocking, i.e. don't intercept IRET to detect when NMIs are no longer
> blocked.  KVM already correctly ignores svm->nmi_masked when vNMI is
> enabled, so the effect of the bug is essentially an unnecessary VM-Exit.

I would re-phrase this like that:

KVM intercepts IRET for two reasons:
- To track NMI masking to be able to know at any point of time if NMI is masked.
- To track NMI window (to inject another NMI after IRET finishes executing).

When L1 uses vNMI, both cases are fulfilled by the vNMI hardware:
- NMI masking state resides in V_NMI_BLOCKING bit of int_ctl and can be read by KVM
  at will.
- vNMI hardware injects the NMIs autonomously every time NMI is unblocked.

Thus there is no need to intercept IRET while vNMI is active.

However, even when vNMI is active in L1, svm_inject_nmi() can still
be called to do a direct NMI injection to support the case where KVM is
trying to inject two NMIs simultaneously.

In this case there is no need to enable IRET interception.

Note that the effect of this bug is essentially an unnecessary VM-Exit.

Also note that even when vNMI is supported and used, running a nested guest
disables vNMI of the L1 guest, thus IRET will still be intercepted.
In this case if the nested VM exit happens before the NMI is delivered,
an unnecessary VM exit can still happen but this is even less likely.

> 
> Note, per the APM, hardware sets the BLOCKING flag when software
> directly injects an NMI:
> 
>   If Event Injection is used to inject an NMI when NMI Virtualization is
>   enabled, VMRUN sets V_NMI_MASK in the guest state.

I think that this comment is not needed in the commit message. It describes
a different unrelated concern and can be put somewhere in the code but
not in the commit message.

> 
> Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI")
> Link: https://lore.kernel.org/all/ZOdnuDZUd4mevCqe@google.como
> Cc: Santosh Shukla <santosh.shukla@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> 
> Santosh, can you verify that I didn't break vNMI?  I don't have access to the
> right hardware.  Thanks!
> 
>  arch/x86/kvm/svm/svm.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index b7472ad183b9..4f22d12b5d60 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3569,8 +3569,15 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
>  	if (svm->nmi_l1_to_l2)
>  		return;
>  
> -	svm->nmi_masked = true;
> -	svm_set_iret_intercept(svm);
> +	/*
> +	 * No need to manually track NMI masking when vNMI is enabled, hardware
> +	 * automatically sets V_NMI_BLOCKING_MASK as appropriate, including the
> +	 * case where software directly injects an NMI.
> +	 */
> +	if (!is_vnmi_enabled(svm)) {
> +		svm->nmi_masked = true;
> +		svm_set_iret_intercept(svm);
> +	}
>  	++vcpu->stat.nmi_injections;
>  }
>  
> 
> base-commit: 86701e115030e020a052216baa942e8547e0b487


Note that while nested, 'is_vnmi_enabled()' will return false because L1's vNMI is indeed disabled
(I wonder if is_vnmi_enabled() should be renamed is_l1_vnmi_enabled() to clarify this).

So when a nested VM exit happens, that intercept can still be active,
which should not cause an issue but is still something to keep in mind.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[relevance 0%]

* [kvm-unit-tests PATCH v2 3/7] lib: s390x: ap: Add ap_setup
    2023-10-10  8:49  4% ` [kvm-unit-tests PATCH v2 1/7] lib: s390x: Add ap library Janosch Frank
@ 2023-10-10  8:49  6% ` Janosch Frank
  1 sibling, 0 replies; 200+ results
From: Janosch Frank @ 2023-10-10  8:49 UTC (permalink / raw)
  To: kvm; +Cc: imbrenda, thuth, david, nsg, nrb

For the next tests we need a valid queue, which means we need to grab
the QCI info and search for the first set bit in the AP and AQ masks.

Let's move from the ap_check function to a proper setup function that
also returns arrays for the aps and qns.
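
A minimal usage sketch (editorial, not part of this patch): how a test
could consume the arrays returned by ap_setup() and TAPQ every APQN.
report() is the standard kvm-unit-tests helper; everything else comes
from the ap library below.

  #include <libcflat.h>
  #include <ap.h>

  static void tapq_all_apqns(void)
  {
  	uint8_t *aps, *qns, naps, nqns;
  	struct ap_queue_status apqsw;
  	struct pqap_r2 r2;
  	int i, j;

  	if (ap_setup(&aps, &qns, &naps, &nqns) != AP_SETUP_READY)
  		return;

  	for (i = 0; i < naps; i++)
  		for (j = 0; j < nqns; j++)
  			if (!ap_pqap_tapq(aps[i], qns[j], &apqsw, &r2))
  				report(!apqsw.rc, "tapq %02x.%04x", aps[i], qns[j]);
  }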

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 lib/s390x/ap.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++--
 lib/s390x/ap.h | 17 +++++++++-
 s390x/ap.c     |  4 ++-
 3 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/lib/s390x/ap.c b/lib/s390x/ap.c
index 4af3cdee..9ba5a037 100644
--- a/lib/s390x/ap.c
+++ b/lib/s390x/ap.c
@@ -13,10 +13,18 @@
 
 #include <libcflat.h>
 #include <interrupt.h>
+#include <alloc.h>
+#include <bitops.h>
 #include <ap.h>
 #include <asm/time.h>
 #include <asm/facility.h>
 
+static uint8_t num_ap;
+static uint8_t num_queue;
+static uint8_t *array_ap;
+static uint8_t *array_queue;
+static struct ap_config_info qci;
+
 int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
 		 struct pqap_r2 *r2)
 {
@@ -78,15 +86,90 @@ int ap_pqap_qci(struct ap_config_info *info)
 	return cc;
 }
 
-bool ap_check(void)
+static int get_entry(uint8_t *ptr, int i, size_t len)
 {
+	/* Skip over the last entry */
+	if (i)
+		i++;
+
+	for (; i < 8 * len; i++) {
+		/* Do we even need to check the byte? */
+		if (!ptr[i / 8]) {
+			i += 7;
+			continue;
+		}
+
+		/* Check the bit in big-endian order */
+		if (ptr[i / 8] & BIT(7 - (i % 8)))
+			return i;
+	}
+	return -1;
+}
+
+static void setup_info(void)
+{
+	int i, entry = 0;
+
+	ap_pqap_qci(&qci);
+
+	for (i = 0; i < AP_ARRAY_MAX_NUM; i++) {
+		entry = get_entry((uint8_t *)qci.apm, entry, sizeof(qci.apm));
+		if (entry == -1)
+			break;
+
+		if (!num_ap) {
+			/*
+			 * We have at least one AP, time to
+			 * allocate the queue arrays
+			 */
+			array_ap = malloc(AP_ARRAY_MAX_NUM);
+			array_queue = malloc(AP_ARRAY_MAX_NUM);
+		}
+		array_ap[num_ap] = entry;
+		num_ap += 1;
+	}
+
+	/* Without an AP we don't even need to look at the queues */
+	if (!num_ap)
+		return;
+
+	entry = 0;
+	for (i = 0; i < AP_ARRAY_MAX_NUM; i++) {
+		entry = get_entry((uint8_t *)qci.aqm, entry, sizeof(qci.aqm));
+		if (entry == -1)
+			return;
+
+		array_queue[num_queue] = entry;
+		num_queue += 1;
+	}
+
+}
+
+int ap_setup(uint8_t **ap_array, uint8_t **qn_array, uint8_t *naps, uint8_t *nqns)
+{
+	assert(!num_ap);
+
 	/*
 	 * Base AP support has no STFLE or SCLP feature bit but the
 	 * PQAP QCI support is indicated via stfle bit 12. As this
 	 * library relies on QCI we bail out if it's not available.
 	 */
 	if (!test_facility(12))
-		return false;
+		return AP_SETUP_NOINSTR;
 
-	return true;
+	/* Setup ap and queue arrays */
+	setup_info();
+
+	if (!num_ap)
+		return AP_SETUP_NOAPQN;
+
+	/* Only provide the data if the caller actually needs it. */
+	if (ap_array && qn_array && naps && nqns) {
+		*ap_array = array_ap;
+		*qn_array = array_queue;
+		*naps = num_ap;
+		*nqns = num_queue;
+	}
+
+	return AP_SETUP_READY;
 }
diff --git a/lib/s390x/ap.h b/lib/s390x/ap.h
index 411591f2..ac9e59d1 100644
--- a/lib/s390x/ap.h
+++ b/lib/s390x/ap.h
@@ -81,7 +81,22 @@ struct pqap_r2 {
 } __attribute__((packed))  __attribute__((aligned(8)));
 _Static_assert(sizeof(struct pqap_r2) == sizeof(uint64_t), "pqap_r2 size");
 
-bool ap_check(void);
+/*
+ * Maximum number of APQNs that we keep track of.
+ *
+ * This value is already way larger than the number of APQNs a AP test
+ * is probably going to use.
+ */
+#define AP_ARRAY_MAX_NUM	16
+
+/* Return values of ap_setup() */
+enum {
+	AP_SETUP_NOINSTR,	/* AP instructions not available */
+	AP_SETUP_NOAPQN,	/* Instructions available but no APQNs */
+	AP_SETUP_READY		/* Full setup complete, at least one APQN */
+};
+
+int ap_setup(uint8_t **ap_array, uint8_t **qn_array, uint8_t *naps, uint8_t *nqns);
 int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
 		 struct pqap_r2 *r2);
 int ap_pqap_qci(struct ap_config_info *info);
diff --git a/s390x/ap.c b/s390x/ap.c
index 94f08783..32feb8db 100644
--- a/s390x/ap.c
+++ b/s390x/ap.c
@@ -292,8 +292,10 @@ static void test_priv(void)
 
 int main(void)
 {
+	int setup_rc = ap_setup(NULL, NULL, NULL, NULL);
+
 	report_prefix_push("ap");
-	if (!ap_check()) {
+	if (setup_rc == AP_SETUP_NOINSTR) {
 		report_skip("AP instructions not available");
 		goto done;
 	}
-- 
2.34.1


^ permalink raw reply related	[relevance 6%]

* [kvm-unit-tests PATCH v2 1/7] lib: s390x: Add ap library
  @ 2023-10-10  8:49  4% ` Janosch Frank
  2023-10-10  8:49  6% ` [kvm-unit-tests PATCH v2 3/7] lib: s390x: ap: Add ap_setup Janosch Frank
  1 sibling, 0 replies; 200+ results
From: Janosch Frank @ 2023-10-10  8:49 UTC (permalink / raw)
  To: kvm; +Cc: imbrenda, thuth, david, nsg, nrb

Add functions and definitions needed to test the Adjunct
Processor (AP) crypto interface.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 lib/s390x/ap.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/s390x/ap.h | 88 +++++++++++++++++++++++++++++++++++++++++++++++
 s390x/Makefile |  1 +
 3 files changed, 181 insertions(+)
 create mode 100644 lib/s390x/ap.c
 create mode 100644 lib/s390x/ap.h

diff --git a/lib/s390x/ap.c b/lib/s390x/ap.c
new file mode 100644
index 00000000..4af3cdee
--- /dev/null
+++ b/lib/s390x/ap.c
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * AP crypto functions
+ *
+ * Some parts taken from the Linux AP driver.
+ *
+ * Copyright IBM Corp. 2023
+ * Author: Janosch Frank <frankja@linux.ibm.com>
+ *	   Tony Krowiak <akrowia@linux.ibm.com>
+ *	   Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *	   Harald Freudenberger <freude@de.ibm.com>
+ */
+
+#include <libcflat.h>
+#include <interrupt.h>
+#include <ap.h>
+#include <asm/time.h>
+#include <asm/facility.h>
+
+int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
+		 struct pqap_r2 *r2)
+{
+	struct pqap_r0 r0 = {
+		.ap = ap,
+		.qn = qn,
+		.fc = PQAP_TEST_APQ
+	};
+	int cc;
+
+	/*
+	 * Test AP Queue
+	 *
+	 * Returns the AP queue status word in GR1 and the TAPQ
+	 * info in GR2.
+	 *
+	 * Inputs: GR0
+	 * Outputs: GR1 (APQSW), GR2 (tapq data)
+	 * Synchronous
+	 */
+	asm volatile(
+		"	lgr	0,%[r0]\n"
+		"	.insn	rre,0xb2af0000,0,0\n" /* PQAP */
+		"	stg	1,%[apqsw]\n"
+		"	stg	2,%[r2]\n"
+		"	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		: [apqsw] "=&T" (*apqsw), [r2] "=&T" (*r2), [cc] "=&d" (cc)
+		: [r0] "d" (r0));
+
+	return cc;
+}
+
+int ap_pqap_qci(struct ap_config_info *info)
+{
+	struct pqap_r0 r0 = { .fc = PQAP_QUERY_AP_CONF_INFO };
+	int cc;
+
+	/*
+	 * Query AP Configuration Information
+	 *
+	 * Writes AP configuration information to the memory pointed
+	 * at by GR2.
+	 *
+	 * Inputs: GR0, GR2 (QCI block address)
+	 * Outputs: memory at GR2 address
+	 * Synchronous
+	 */
+	asm volatile(
+		"	lgr	0,%[r0]\n"
+		"	lgr	2,%[info]\n"
+		"	.insn	rre,0xb2af0000,0,0\n" /* PQAP */
+		"	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		: [cc] "=&d" (cc)
+		: [r0] "d" (r0), [info] "d" (info)
+		: "cc", "memory", "0", "2");
+
+	return cc;
+}
+
+bool ap_check(void)
+{
+	/*
+	 * Base AP support has no STFLE or SCLP feature bit but the
+	 * PQAP QCI support is indicated via stfle bit 12. As this
+	 * library relies on QCI we bail out if it's not available.
+	 */
+	if (!test_facility(12))
+		return false;
+
+	return true;
+}
diff --git a/lib/s390x/ap.h b/lib/s390x/ap.h
new file mode 100644
index 00000000..411591f2
--- /dev/null
+++ b/lib/s390x/ap.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * AP definitions
+ *
+ * Some parts taken from the Linux AP driver.
+ *
+ * Copyright IBM Corp. 2023
+ * Author: Janosch Frank <frankja@linux.ibm.com>
+ *	   Tony Krowiak <akrowia@linux.ibm.com>
+ *	   Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *	   Harald Freudenberger <freude@de.ibm.com>
+ */
+
+#ifndef _S390X_AP_H_
+#define _S390X_AP_H_
+
+enum PQAP_FC {
+	PQAP_TEST_APQ,
+	PQAP_RESET_APQ,
+	PQAP_ZEROIZE_APQ,
+	PQAP_QUEUE_INT_CONTRL,
+	PQAP_QUERY_AP_CONF_INFO,
+	PQAP_QUERY_AP_COMP_TYPE,
+	PQAP_BEST_AP,
+};
+
+struct ap_queue_status {
+	uint32_t pad0;			/* Ignored padding for zArch  */
+	uint32_t empty		: 1;
+	uint32_t replies_waiting: 1;
+	uint32_t full		: 1;
+	uint32_t pad1		: 4;
+	uint32_t irq_enabled	: 1;
+	uint32_t rc		: 8;
+	uint32_t pad2		: 16;
+} __attribute__((packed))  __attribute__((aligned(8)));
+_Static_assert(sizeof(struct ap_queue_status) == sizeof(uint64_t), "APQSW size");
+
+struct ap_config_info {
+	uint8_t apsc	 : 1;	/* S bit */
+	uint8_t apxa	 : 1;	/* N bit */
+	uint8_t qact	 : 1;	/* C bit */
+	uint8_t rc8a	 : 1;	/* R bit */
+	uint8_t l	 : 1;	/* L bit */
+	uint8_t lext	 : 3;	/* Lext bits */
+	uint8_t reserved2[3];
+	uint8_t Na;		/* max # of APs - 1 */
+	uint8_t Nd;		/* max # of Domains - 1 */
+	uint8_t reserved6[10];
+	uint32_t apm[8];	/* AP ID mask */
+	uint32_t aqm[8];	/* AP (usage) queue mask */
+	uint32_t adm[8];	/* AP (control) domain mask */
+	uint8_t _reserved4[16];
+} __attribute__((aligned(8))) __attribute__ ((__packed__));
+_Static_assert(sizeof(struct ap_config_info) == 128, "PQAP QCI size");
+
+struct pqap_r0 {
+	uint32_t pad0;
+	uint8_t fc;
+	uint8_t t : 1;		/* Test facilities (TAPQ)*/
+	uint8_t pad1 : 7;
+	uint8_t ap;
+	uint8_t qn;
+} __attribute__((packed))  __attribute__((aligned(8)));
+
+struct pqap_r2 {
+	uint8_t s : 1;		/* Special Command facility */
+	uint8_t m : 1;		/* AP4KM */
+	uint8_t c : 1;		/* AP4KC */
+	uint8_t cop : 1;	/* AP is in coprocessor mode */
+	uint8_t acc : 1;	/* AP is in accelerator mode */
+	uint8_t xcp : 1;	/* AP is in XCP-mode */
+	uint8_t n : 1;		/* AP extended addressing facility */
+	uint8_t pad_0 : 1;
+	uint8_t pad_1[3];
+	uint8_t at;
+	uint8_t nd;
+	uint8_t pad_6;
+	uint8_t pad_7 : 4;
+	uint8_t qd : 4;
+} __attribute__((packed))  __attribute__((aligned(8)));
+_Static_assert(sizeof(struct pqap_r2) == sizeof(uint64_t), "pqap_r2 size");
+
+bool ap_check(void);
+int ap_pqap_tapq(uint8_t ap, uint8_t qn, struct ap_queue_status *apqsw,
+		 struct pqap_r2 *r2);
+int ap_pqap_qci(struct ap_config_info *info);
+#endif
diff --git a/s390x/Makefile b/s390x/Makefile
index 6e967194..d9abe5c1 100644
--- a/s390x/Makefile
+++ b/s390x/Makefile
@@ -109,6 +109,7 @@ cflatobjs += lib/s390x/malloc_io.o
 cflatobjs += lib/s390x/uv.o
 cflatobjs += lib/s390x/sie.o
 cflatobjs += lib/s390x/fault.o
+cflatobjs += lib/s390x/ap.o
 
 OBJDIRS += lib/s390x
 
-- 
2.34.1


^ permalink raw reply related	[relevance 4%]
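
A short sketch of how the two PQAP wrappers above are intended to be used
together; the reporting below is illustrative only and not part of the patch:

	struct ap_config_info qci = {};

	/* ap_check() tests stfle bit 12, which PQAP QCI depends on. */
	if (!ap_check())
		return;

	/* A non-zero cc means the QCI query itself failed. */
	if (ap_pqap_qci(&qci))
		return;

	/* Na and Nd hold the maximum AP and domain index, i.e. count - 1. */
	report_info("max AP index %d, max domain index %d", qci.Na, qci.Nd);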

* [PATCH] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
@ 2023-10-09 21:29  4% Sean Christopherson
  2023-10-10 12:03  0% ` Maxim Levitsky
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2023-10-09 21:29 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Santosh Shukla

When vNMI is enabled, rely entirely on hardware to correctly handle NMI
blocking, i.e. don't intercept IRET to detect when NMIs are no longer
blocked.  KVM already correctly ignores svm->nmi_masked when vNMI is
enabled, so the effect of the bug is essentially an unnecessary VM-Exit.

Note, per the APM, hardware sets the BLOCKING flag when software directly
injects an NMI:

  If Event Injection is used to inject an NMI when NMI Virtualization is
  enabled, VMRUN sets V_NMI_MASK in the guest state.

Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI")
Link: https://lore.kernel.org/all/ZOdnuDZUd4mevCqe@google.como
Cc: Santosh Shukla <santosh.shukla@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---

Santosh, can you verify that I didn't break vNMI?  I don't have access to the
right hardware.  Thanks!

 arch/x86/kvm/svm/svm.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b7472ad183b9..4f22d12b5d60 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3569,8 +3569,15 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
 	if (svm->nmi_l1_to_l2)
 		return;
 
-	svm->nmi_masked = true;
-	svm_set_iret_intercept(svm);
+	/*
+	 * No need to manually track NMI masking when vNMI is enabled, hardware
+	 * automatically sets V_NMI_BLOCKING_MASK as appropriate, including the
+	 * case where software directly injects an NMI.
+	 */
+	if (!is_vnmi_enabled(svm)) {
+		svm->nmi_masked = true;
+		svm_set_iret_intercept(svm);
+	}
 	++vcpu->stat.nmi_injections;
 }
 

base-commit: 86701e115030e020a052216baa942e8547e0b487
-- 
2.42.0.609.gbb76f46606-goog


^ permalink raw reply related	[relevance 4%]

* [PATCH] x86: KVM: Add feature flag for AMD's FsGsKernelGsBaseNonSerializing
@ 2023-10-04  0:20  4% Jim Mattson
  0 siblings, 0 replies; 200+ results
From: Jim Mattson @ 2023-10-04  0:20 UTC (permalink / raw)
  To: kvm, linux-kernel, Pawan Gupta, Jiaxi Chen, Kim Phillips,
	Paolo Bonzini, Sean Christopherson, H. Peter Anvin, x86,
	Dave Hansen, Borislav Petkov, Ingo Molnar, Thomas Gleixner
  Cc: Jim Mattson

Define an X86_FEATURE_* flag for
CPUID.80000021H:EAX.FsGsKernelGsBaseNonSerializing[bit 1], and
advertise the feature to userspace via KVM_GET_SUPPORTED_CPUID.

This feature is not yet documented in the APM. See AMD's "Processor
Programming Reference (PPR) for AMD Family 19h Model 61h, Revision B1
Processors (56713-B1-PUB)."

Signed-off-by: Jim Mattson <jmattson@google.com>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/kvm/cpuid.c               | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 58cb9495e40f..b53951c83d1d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -443,6 +443,7 @@
 
 /* AMD-defined Extended Feature 2 EAX, CPUID level 0x80000021 (EAX), word 20 */
 #define X86_FEATURE_NO_NESTED_DATA_BP	(20*32+ 0) /* "" No Nested Data Breakpoints */
+#define X86_FEATURE_BASES_NON_SERIAL	(20*32+ 1) /* "" FSBASE, GSBASE, and KERNELGSBASE are non-serializing */
 #define X86_FEATURE_LFENCE_RDTSC	(20*32+ 2) /* "" LFENCE always serializing / synchronizes RDTSC */
 #define X86_FEATURE_NULL_SEL_CLR_BASE	(20*32+ 6) /* "" Null Selector Clears Base */
 #define X86_FEATURE_AUTOIBRS		(20*32+ 8) /* "" Automatic IBRS */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0544e30b4946..5e776e8619be 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -761,7 +761,8 @@ void kvm_set_cpu_caps(void)
 
 	kvm_cpu_cap_mask(CPUID_8000_0021_EAX,
 		F(NO_NESTED_DATA_BP) | F(LFENCE_RDTSC) | 0 /* SmmPgCfgLock */ |
-		F(NULL_SEL_CLR_BASE) | F(AUTOIBRS) | 0 /* PrefetchCtlMsr */
+		F(NULL_SEL_CLR_BASE) | F(AUTOIBRS) | 0 /* PrefetchCtlMsr */ |
+		F(BASES_NON_SERIAL)
 	);
 
 	if (cpu_feature_enabled(X86_FEATURE_SRSO_NO))
-- 
2.42.0.582.g8ccd20d70d-goog


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH v2 2/4] x86: KVM: SVM: add support for Invalid IPI Vector interception
  2023-09-28 17:33  3% ` [PATCH v2 2/4] x86: KVM: SVM: add support for Invalid IPI Vector interception Maxim Levitsky
@ 2023-09-29  0:42  0%   ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2023-09-29  0:42 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, iommu, H. Peter Anvin, Paolo Bonzini, Thomas Gleixner,
	Borislav Petkov, Joerg Roedel, x86, Suravee Suthikulpanit,
	linux-kernel, Dave Hansen, Will Deacon, Ingo Molnar,
	Robin Murphy, stable

On Thu, Sep 28, 2023, Maxim Levitsky wrote:
> In later revisions of AMD's APM, there is a new 'incomplete IPI' exit code:
> 
> "Invalid IPI Vector - The vector for the specified IPI was set to an
> illegal value (VEC < 16)"
> 
> Note that tests on Zen2 machine show that this VM exit doesn't happen and
> instead AVIC just does nothing.
> 
> Add support for this exit code by doing nothing, instead of filling
> the kernel log with errors.
> 
> Also replace an unthrottled 'pr_err()' if another unknown incomplete
> IPI exit happens with vcpu_unimpl()
> 
> (e.g in case AMD adds yet another 'Invalid IPI' exit reason)
> 
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---

Reviewed-by: Sean Christopherson <seanjc@google.com>

^ permalink raw reply	[relevance 0%]

* [PATCH v2 2/4] x86: KVM: SVM: add support for Invalid IPI Vector interception
  @ 2023-09-28 17:33  3% ` Maxim Levitsky
  2023-09-29  0:42  0%   ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Maxim Levitsky @ 2023-09-28 17:33 UTC (permalink / raw)
  To: kvm
  Cc: iommu, H. Peter Anvin, Sean Christopherson, Maxim Levitsky,
	Paolo Bonzini, Thomas Gleixner, Borislav Petkov, Joerg Roedel,
	x86, Suravee Suthikulpanit, linux-kernel, Dave Hansen,
	Will Deacon, Ingo Molnar, Robin Murphy, stable

In later revisions of AMD's APM, there is a new 'incomplete IPI' exit code:

"Invalid IPI Vector - The vector for the specified IPI was set to an
illegal value (VEC < 16)"

Note that tests on a Zen2 machine show that this VM exit doesn't happen
and instead AVIC just does nothing.

Add support for this exit code by doing nothing, instead of filling
the kernel log with errors.

Also, if another unknown incomplete IPI exit happens, report it with
vcpu_unimpl() instead of an unthrottled 'pr_err()'

(e.g. in case AMD adds yet another 'Invalid IPI' exit reason)

Cc: <stable@vger.kernel.org>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/include/asm/svm.h | 1 +
 arch/x86/kvm/svm/avic.c    | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 19bf955b67e0da0..3ac0ffc4f3e202b 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -268,6 +268,7 @@ enum avic_ipi_failure_cause {
 	AVIC_IPI_FAILURE_TARGET_NOT_RUNNING,
 	AVIC_IPI_FAILURE_INVALID_TARGET,
 	AVIC_IPI_FAILURE_INVALID_BACKING_PAGE,
+	AVIC_IPI_FAILURE_INVALID_IPI_VECTOR,
 };
 
 #define AVIC_PHYSICAL_MAX_INDEX_MASK	GENMASK_ULL(8, 0)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 2092db892d7d052..4b74ea91f4e6bb6 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -529,8 +529,11 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
 	case AVIC_IPI_FAILURE_INVALID_BACKING_PAGE:
 		WARN_ONCE(1, "Invalid backing page\n");
 		break;
+	case AVIC_IPI_FAILURE_INVALID_IPI_VECTOR:
+		/* Invalid IPI with vector < 16 */
+		break;
 	default:
-		pr_err("Unknown IPI interception\n");
+		vcpu_unimpl(vcpu, "Unknown avic incomplete IPI interception\n");
 	}
 
 	return 1;
-- 
2.26.3


^ permalink raw reply related	[relevance 3%]

* Re: [PATCH 2/5] x86: KVM: SVM: add support for Invalid IPI Vector interception
  2023-09-28 15:04  3% ` [PATCH 2/5] x86: KVM: SVM: add support for Invalid IPI Vector interception Maxim Levitsky
@ 2023-09-28 15:46  0%   ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2023-09-28 15:46 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Will Deacon, Borislav Petkov, Dave Hansen,
	Suravee Suthikulpanit, Thomas Gleixner, Paolo Bonzini, x86,
	Robin Murphy, iommu, Ingo Molnar, Joerg Roedel, H. Peter Anvin,
	linux-kernel, stable

On Thu, Sep 28, 2023, Maxim Levitsky wrote:
> In later revisions of AMD's APM, there is a new 'incomplete IPI' exit code:
> 
> "Invalid IPI Vector - The vector for the specified IPI was set to an
> illegal value (VEC < 16)"
> 
> Note that tests on Zen2 machine show that this VM exit doesn't happen and
> instead AVIC just does nothing.
> 
> Add support for this exit code by doing nothing, instead of filling
> the kernel log with errors.
> 
> Also replace an unthrottled 'pr_err()' if another unknown incomplete
> IPI exit happens with WARN_ON_ONCE()
> 
> (e.g in case AMD adds yet another 'Invalid IPI' exit reason)
> 
> Cc: <stable@vger.kernel.org>
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/include/asm/svm.h | 1 +
>  arch/x86/kvm/svm/avic.c    | 5 ++++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 19bf955b67e0da0..3ac0ffc4f3e202b 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -268,6 +268,7 @@ enum avic_ipi_failure_cause {
>  	AVIC_IPI_FAILURE_TARGET_NOT_RUNNING,
>  	AVIC_IPI_FAILURE_INVALID_TARGET,
>  	AVIC_IPI_FAILURE_INVALID_BACKING_PAGE,
> +	AVIC_IPI_FAILURE_INVALID_IPI_VECTOR,
>  };
>  
>  #define AVIC_PHYSICAL_MAX_INDEX_MASK	GENMASK_ULL(8, 0)
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 2092db892d7d052..c44b65af494e3ff 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -529,8 +529,11 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
>  	case AVIC_IPI_FAILURE_INVALID_BACKING_PAGE:
>  		WARN_ONCE(1, "Invalid backing page\n");
>  		break;
> +	case AVIC_IPI_FAILURE_INVALID_IPI_VECTOR:
> +		/* Invalid IPI with vector < 16 */
> +		break;
>  	default:
> -		pr_err("Unknown IPI interception\n");
> +		WARN_ONCE(1, "Unknown avic incomplete IPI interception\n");

Hrm, I'm not sure KVM should WARN here.  E.g. if someone runs with panic_on_warn=1,
running on new hardware might crash the host.  I hope that AMD is smart enough to
make any future failure types "optional" in the sense that they're either opt-in,
or are largely informational-only (like AVIC_IPI_FAILURE_INVALID_IPI_VECTOR).

I think switching to vcpu_unimpl(), or maybe even pr_err_once(), is more appropriate.

^ permalink raw reply	[relevance 0%]

* [PATCH 2/5] x86: KVM: SVM: add support for Invalid IPI Vector interception
  @ 2023-09-28 15:04  3% ` Maxim Levitsky
  2023-09-28 15:46  0%   ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Maxim Levitsky @ 2023-09-28 15:04 UTC (permalink / raw)
  To: kvm
  Cc: Will Deacon, Borislav Petkov, Dave Hansen, Suravee Suthikulpanit,
	Thomas Gleixner, Paolo Bonzini, x86, Robin Murphy, iommu,
	Ingo Molnar, Joerg Roedel, Sean Christopherson, H. Peter Anvin,
	linux-kernel, Maxim Levitsky, stable

In later revisions of AMD's APM, there is a new 'incomplete IPI' exit code:

"Invalid IPI Vector - The vector for the specified IPI was set to an
illegal value (VEC < 16)"

Note that tests on a Zen2 machine show that this VM exit doesn't happen
and instead AVIC just does nothing.

Add support for this exit code by doing nothing, instead of filling
the kernel log with errors.

Also, if another unknown incomplete IPI exit happens, report it with
WARN_ON_ONCE() instead of an unthrottled 'pr_err()'

(e.g. in case AMD adds yet another 'Invalid IPI' exit reason)

Cc: <stable@vger.kernel.org>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/include/asm/svm.h | 1 +
 arch/x86/kvm/svm/avic.c    | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 19bf955b67e0da0..3ac0ffc4f3e202b 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -268,6 +268,7 @@ enum avic_ipi_failure_cause {
 	AVIC_IPI_FAILURE_TARGET_NOT_RUNNING,
 	AVIC_IPI_FAILURE_INVALID_TARGET,
 	AVIC_IPI_FAILURE_INVALID_BACKING_PAGE,
+	AVIC_IPI_FAILURE_INVALID_IPI_VECTOR,
 };
 
 #define AVIC_PHYSICAL_MAX_INDEX_MASK	GENMASK_ULL(8, 0)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 2092db892d7d052..c44b65af494e3ff 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -529,8 +529,11 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
 	case AVIC_IPI_FAILURE_INVALID_BACKING_PAGE:
 		WARN_ONCE(1, "Invalid backing page\n");
 		break;
+	case AVIC_IPI_FAILURE_INVALID_IPI_VECTOR:
+		/* Invalid IPI with vector < 16 */
+		break;
 	default:
-		pr_err("Unknown IPI interception\n");
+		WARN_ONCE(1, "Unknown avic incomplete IPI interception\n");
 	}
 
 	return 1;
-- 
2.26.3


^ permalink raw reply related	[relevance 3%]

* Re: [PATCH V2] KVM: x86/pmu: Disable vPMU if EVENTSEL_GUESTONLY bit doesn't exist
  2023-09-26  3:49  0%       ` Like Xu
@ 2023-09-26 15:44  0%         ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2023-09-26 15:44 UTC (permalink / raw)
  To: Like Xu; +Cc: Paolo Bonzini, kvm, linux-kernel, Ravi Bangoria, Manali Shukla

On Tue, Sep 26, 2023, Like Xu wrote:
> On 26/9/2023 7:31 am, Sean Christopherson wrote:
> > On Thu, Sep 14, 2023, Like Xu wrote:
> > > On 7/4/2023 11:37 pm, Sean Christopherson wrote:
> > > > On Fri, Apr 07, 2023, Like Xu wrote:
> > > /*
> > >   * The guest vPMU counter emulation depends on the EVENTSEL_GUESTONLY bit.
> > >   * If this bit is present on the host, the host needs to support at least
> > > the PERFCTR_CORE.
> > >   */
> > 
> > ...
> > 
> > > > 	/*
> > > > 	 * KVM requires guest-only event support in order to isolate guest PMCs
> > > > 	 * from host PMCs.  SVM doesn't provide a way to atomically load MSRs
> > > > 	 * on VMRUN, and manually adjusting counts before/after VMRUN is not
> > > > 	 * accurate enough to properly virtualize a PMU.
> > > > 	 */
> > > > 
> > > > But now I'm really confused, because if I'm reading the code correctly, perf
> > > > invokes amd_core_hw_config() for legacy PMUs, i.e. even if PERFCTR_CORE isn't
> > > > supported.  And the APM documents the host/guest bits only for "Core Performance
> > > > Event-Select Registers".
> > > > 
> > > > So either (a) GUESTONLY isn't supported on legacy CPUs and perf is relying on AMD
> > > > CPUs ignoring reserved bits or (b) GUESTONLY _is_ supported on legacy PMUs and
> > > > pmu_has_guestonly_mode() is checking the wrong MSR when running on older CPUs.
> > > > 
> > > > And if (a) is true, then how on earth does KVM support vPMU when running on a
> > > > legacy PMU?  Is vPMU on AMD just wildly broken?  Am I missing something?
> > > > 
> > > 
> > > (a) It's true and AMD guest vPMU have only been implemented accurately with
> > > the help of this GUESTONLY bit.
> > > 
> > > There are two other scenarios worth discussing here: one is support L2 vPMU
> > > on the PERFCTR_CORE+ host and this proposal is disabling it; and the other
> > > case is to support AMD legacy vPMU on the PERFCTR_CORE+ host.
> > 
> > Oooh, so the really problematic case is when PERFCTR_CORE+ is supported but
> > GUESTONLY is not, in which case KVM+perf *think* they can use GUESTONLY (and
> > HOSTONLY).
> > 
> > That's a straight up KVM (as L0) bug, no?  I don't see anything in the APM that
> > suggests those bits are optional, i.e. KVM is blatantly violating AMD's architecture
> > by ignoring those bits.
> 
> For L2 guest, it often doesn't see all the cpu features corresponding to the
> cpu model because KVM and VMM filter some of the capabilities. We can't say
> that the absence of these features violates spec, can we ?

Yes, KVM hides features via architectural means.  This is enumerating a feature,
PERFCTR_CORE, and not providing all its functionality.  The two things are
distinctly different.

> I treat it as a KVM flaw or a lack of emulation capability.

A.k.a. a bug.

> > I would rather fix KVM (as L0).  It doesn't seem _that_ hard to support, e.g.
> > modify reprogram_counter() to disable the counter if it's supposed to be silent
> > for the current mode, and reprogram all counters if EFER.SVME is toggled, and on
> > all nested transitions.
> 
> I thought about that too, setting up EFER.SVME and VMRUN is still a little
> bit far away, and more micro-testing is needed to correct the behavior
> of the emulation here, considering KVM also has to support emulated ins.
> 
> It's safe to say that there are no real user scenarios using vPMU in a nested
> guest, so I'm more inclined to disable it provisionally (for the sake of more
> stable tree users), enabling this feature is honestly at the end of my to-do list.

I don't see a way to do that without violating AMD's architecture while still
exposing the PMU to L1.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH V2] KVM: x86/pmu: Disable vPMU if EVENTSEL_GUESTONLY bit doesn't exist
  2023-09-25 23:31  4%     ` Sean Christopherson
@ 2023-09-26  3:49  0%       ` Like Xu
  2023-09-26 15:44  0%         ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Like Xu @ 2023-09-26  3:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Ravi Bangoria, Manali Shukla

On 26/9/2023 7:31 am, Sean Christopherson wrote:
> On Thu, Sep 14, 2023, Like Xu wrote:
>> On 7/4/2023 11:37 pm, Sean Christopherson wrote:
>>> On Fri, Apr 07, 2023, Like Xu wrote:
>> /*
>>   * The guest vPMU counter emulation depends on the EVENTSEL_GUESTONLY bit.
>>   * If this bit is present on the host, the host needs to support at least
>> the PERFCTR_CORE.
>>   */
> 
> ...
> 
>>> 	/*
>>> 	 * KVM requires guest-only event support in order to isolate guest PMCs
>>> 	 * from host PMCs.  SVM doesn't provide a way to atomically load MSRs
>>> 	 * on VMRUN, and manually adjusting counts before/after VMRUN is not
>>> 	 * accurate enough to properly virtualize a PMU.
>>> 	 */
>>>
>>> But now I'm really confused, because if I'm reading the code correctly, perf
>>> invokes amd_core_hw_config() for legacy PMUs, i.e. even if PERFCTR_CORE isn't
>>> supported.  And the APM documents the host/guest bits only for "Core Performance
>>> Event-Select Registers".
>>>
>>> So either (a) GUESTONLY isn't supported on legacy CPUs and perf is relying on AMD
>>> CPUs ignoring reserved bits or (b) GUESTONLY _is_ supported on legacy PMUs and
>>> pmu_has_guestonly_mode() is checking the wrong MSR when running on older CPUs.
>>>
>>> And if (a) is true, then how on earth does KVM support vPMU when running on a
>>> legacy PMU?  Is vPMU on AMD just wildly broken?  Am I missing something?
>>>
>>
>> (a) It's true and AMD guest vPMU have only been implemented accurately with
>> the help of this GUESTONLY bit.
>>
>> There are two other scenarios worth discussing here: one is support L2 vPMU
>> on the PERFCTR_CORE+ host and this proposal is disabling it; and the other
>> case is to support AMD legacy vPMU on the PERFCTR_CORE+ host.
> 
> Oooh, so the really problematic case is when PERFCTR_CORE+ is supported but
> GUESTONLY is not, in which case KVM+perf *think* they can use GUESTONLY (and
> HOSTONLY).
> 
> That's a straight up KVM (as L0) bug, no?  I don't see anything in the APM that
> suggests those bits are optional, i.e. KVM is blatantly violating AMD's architecture
> by ignoring those bits.

An L2 guest often doesn't see all the CPU features corresponding to its CPU
model, because KVM and the VMM filter some of the capabilities. We can't say
that the absence of these features violates the spec, can we?

I treat it as a KVM flaw or a lack of emulation capability.

> 
> I would rather fix KVM (as L0).  It doesn't seem _that_ hard to support, e.g.
> modify reprogram_counter() to disable the counter if it's supposed to be silent
> for the current mode, and reprogram all counters if EFER.SVME is toggled, and on
> all nested transitions.

I thought about that too; setting up EFER.SVME and VMRUN is still a little
bit far away, and more micro-testing is needed to get the emulated behavior
right here, considering KVM also has to support emulated instructions.

It's safe to say that there are no real user scenarios using vPMU in a nested
guest, so I'm more inclined to disable it provisionally (for the sake of
stable tree users); enabling this feature is honestly at the end of my to-do list.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v4 19/21] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  2023-09-22 19:27  0%   ` Moger, Babu
@ 2023-09-26  3:10  0%     ` Zhao Liu
  0 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2023-09-26  3:10 UTC (permalink / raw)
  To: Babu Moger
  Cc: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Marcelo Tosatti, qemu-devel, kvm, Zhenyu Wang,
	Xiaoyao Li, Zhao Liu

On Fri, Sep 22, 2023 at 02:27:18PM -0500, Moger, Babu wrote:
> Date: Fri, 22 Sep 2023 14:27:18 -0500
> From: "Moger, Babu" <bmoger@amd.com>
> Subject: Re: [PATCH v4 19/21] i386: Use offsets get NumSharingCache for
>  CPUID[0x8000001D].EAX[bits 25:14]
> 
> 
> On 9/14/2023 2:21 AM, Zhao Liu wrote:
> > From: Zhao Liu <zhao1.liu@intel.com>
> > 
> > The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
> > for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
> > the number of sharing threads directly.
> > 
> >  From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
> > means [1]:
> > 
> > The number of logical processors sharing this cache is the value of
> > this field incremented by 1. To determine which logical processors are
> > sharing a cache, determine a Share Id for each processor as follows:
> > 
> > ShareId = LocalApicId >> log2(NumSharingCache+1)
> > 
> > Logical processors with the same ShareId then share a cache. If
> > NumSharingCache+1 is not a power of two, round it up to the next power
> > of two.
> > 
> >  From the description above, the calculation of this field should be same
> > as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
> > APIC ID to calculate this field.
> > 
> > [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
> >       Information
> > 
> > Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
> Reviewed-by: Babu Moger <babu.moger@amd.com>

Thanks Babu!

-Zhao

> > ---
> > Changes since v3:
> >   * Rewrite the subject. (Babu)
> >   * Delete the original "comment/help" expression, as this behavior is
> >     confirmed for AMD CPUs. (Babu)
> >   * Rename "num_apic_ids" (v3) to "num_sharing_cache" to match spec
> >     definition. (Babu)
> > 
> > Changes since v1:
> >   * Rename "l3_threads" to "num_apic_ids" in
> >     encode_cache_cpuid8000001d(). (Yanan)
> >   * Add the description of the original commit and add Cc.
> > ---
> >   target/i386/cpu.c | 10 ++++------
> >   1 file changed, 4 insertions(+), 6 deletions(-)
> > 
> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > index 5d066107d6ce..bc28c59df089 100644
> > --- a/target/i386/cpu.c
> > +++ b/target/i386/cpu.c
> > @@ -482,7 +482,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
> >                                          uint32_t *eax, uint32_t *ebx,
> >                                          uint32_t *ecx, uint32_t *edx)
> >   {
> > -    uint32_t l3_threads;
> > +    uint32_t num_sharing_cache;
> >       assert(cache->size == cache->line_size * cache->associativity *
> >                             cache->partitions * cache->sets);
> > @@ -491,13 +491,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
> >       /* L3 is shared among multiple cores */
> >       if (cache->level == 3) {
> > -        l3_threads = topo_info->modules_per_die *
> > -                     topo_info->cores_per_module *
> > -                     topo_info->threads_per_core;
> > -        *eax |= (l3_threads - 1) << 14;
> > +        num_sharing_cache = 1 << apicid_die_offset(topo_info);
> >       } else {
> > -        *eax |= ((topo_info->threads_per_core - 1) << 14);
> > +        num_sharing_cache = 1 << apicid_core_offset(topo_info);
> >       }
> > +    *eax |= (num_sharing_cache - 1) << 14;
> >       assert(cache->line_size > 0);
> >       assert(cache->partitions > 0);

^ permalink raw reply	[relevance 0%]

* Re: [PATCH V2] KVM: x86/pmu: Disable vPMU if EVENTSEL_GUESTONLY bit doesn't exist
  2023-09-14  8:23  0%   ` Like Xu
@ 2023-09-25 23:31  4%     ` Sean Christopherson
  2023-09-26  3:49  0%       ` Like Xu
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2023-09-25 23:31 UTC (permalink / raw)
  To: Like Xu; +Cc: Paolo Bonzini, kvm, linux-kernel, Ravi Bangoria, Manali Shukla

On Thu, Sep 14, 2023, Like Xu wrote:
> On 7/4/2023 11:37 pm, Sean Christopherson wrote:
> > On Fri, Apr 07, 2023, Like Xu wrote:
> /*
>  * The guest vPMU counter emulation depends on the EVENTSEL_GUESTONLY bit.
>  * If this bit is present on the host, the host needs to support at least
> the PERFCTR_CORE.
>  */

...

> > 	/*
> > 	 * KVM requires guest-only event support in order to isolate guest PMCs
> > 	 * from host PMCs.  SVM doesn't provide a way to atomically load MSRs
> > 	 * on VMRUN, and manually adjusting counts before/after VMRUN is not
> > 	 * accurate enough to properly virtualize a PMU.
> > 	 */
> > 
> > But now I'm really confused, because if I'm reading the code correctly, perf
> > invokes amd_core_hw_config() for legacy PMUs, i.e. even if PERFCTR_CORE isn't
> > supported.  And the APM documents the host/guest bits only for "Core Performance
> > Event-Select Registers".
> > 
> > So either (a) GUESTONLY isn't supported on legacy CPUs and perf is relying on AMD
> > CPUs ignoring reserved bits or (b) GUESTONLY _is_ supported on legacy PMUs and
> > pmu_has_guestonly_mode() is checking the wrong MSR when running on older CPUs.
> > 
> > And if (a) is true, then how on earth does KVM support vPMU when running on a
> > legacy PMU?  Is vPMU on AMD just wildly broken?  Am I missing something?
> > 
> 
> (a) It's true and AMD guest vPMU have only been implemented accurately with
> the help of this GUESTONLY bit.
> 
> There are two other scenarios worth discussing here: one is support L2 vPMU
> on the PERFCTR_CORE+ host and this proposal is disabling it; and the other
> case is to support AMD legacy vPMU on the PERFCTR_CORE+ host.

Oooh, so the really problematic case is when PERFCTR_CORE+ is supported but
GUESTONLY is not, in which case KVM+perf *think* they can use GUESTONLY (and
HOSTONLY).

That's a straight up KVM (as L0) bug, no?  I don't see anything in the APM that
suggests those bits are optional, i.e. KVM is blatantly violating AMD's architecture
by ignoring those bits.

I would rather fix KVM (as L0).  It doesn't seem _that_ hard to support, e.g.
modify reprogram_counter() to disable the counter if it's supposed to be silent
for the current mode, and reprogram all counters if EFER.SVME is toggled, and on
all nested transitions.

^ permalink raw reply	[relevance 4%]
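
To make the reprogram_counter() idea above a bit more concrete, the per-counter
check could look roughly like the sketch below. pmc_is_silenced() is a made-up
helper name and this is not actual KVM code; the other half of the idea is that
all counters get reprogrammed when EFER.SVME is toggled and on every nested
transition so this decision is re-evaluated:

/* Sketch only: not an existing KVM symbol. */
static bool pmc_is_silenced(struct kvm_pmc *pmc)
{
	struct kvm_vcpu *vcpu = pmc->vcpu;

	/* A GUESTONLY counter must stay silent while L1 is running... */
	if ((pmc->eventsel & AMD64_EVENTSEL_GUESTONLY) && !is_guest_mode(vcpu))
		return true;

	/* ...and a HOSTONLY counter while L2 (the nested guest) is running. */
	return (pmc->eventsel & AMD64_EVENTSEL_HOSTONLY) && is_guest_mode(vcpu);
}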

* Re: [PATCH v4 19/21] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
  2023-09-14  7:21  5% ` [PATCH v4 19/21] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
@ 2023-09-22 19:27  0%   ` Moger, Babu
  2023-09-26  3:10  0%     ` Zhao Liu
  0 siblings, 1 reply; 200+ results
From: Moger, Babu @ 2023-09-22 19:27 UTC (permalink / raw)
  To: Zhao Liu, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Xiaoyao Li, Babu Moger, Zhao Liu


On 9/14/2023 2:21 AM, Zhao Liu wrote:
> From: Zhao Liu <zhao1.liu@intel.com>
>
> The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
> for cpuid 0x8000001D") adds the cache topology for AMD CPU by encoding
> the number of sharing threads directly.
>
>  From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
> means [1]:
>
> The number of logical processors sharing this cache is the value of
> this field incremented by 1. To determine which logical processors are
> sharing a cache, determine a Share Id for each processor as follows:
>
> ShareId = LocalApicId >> log2(NumSharingCache+1)
>
> Logical processors with the same ShareId then share a cache. If
> NumSharingCache+1 is not a power of two, round it up to the next power
> of two.
>
>  From the description above, the calculation of this field should be same
> as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of
> APIC ID to calculate this field.
>
> [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
>       Information
>
> Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
> ---
> Changes since v3:
>   * Rewrite the subject. (Babu)
>   * Delete the original "comment/help" expression, as this behavior is
>     confirmed for AMD CPUs. (Babu)
>   * Rename "num_apic_ids" (v3) to "num_sharing_cache" to match spec
>     definition. (Babu)
>
> Changes since v1:
>   * Rename "l3_threads" to "num_apic_ids" in
>     encode_cache_cpuid8000001d(). (Yanan)
>   * Add the description of the original commit and add Cc.
> ---
>   target/i386/cpu.c | 10 ++++------
>   1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 5d066107d6ce..bc28c59df089 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -482,7 +482,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
>                                          uint32_t *eax, uint32_t *ebx,
>                                          uint32_t *ecx, uint32_t *edx)
>   {
> -    uint32_t l3_threads;
> +    uint32_t num_sharing_cache;
>       assert(cache->size == cache->line_size * cache->associativity *
>                             cache->partitions * cache->sets);
>   
> @@ -491,13 +491,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
>   
>       /* L3 is shared among multiple cores */
>       if (cache->level == 3) {
> -        l3_threads = topo_info->modules_per_die *
> -                     topo_info->cores_per_module *
> -                     topo_info->threads_per_core;
> -        *eax |= (l3_threads - 1) << 14;
> +        num_sharing_cache = 1 << apicid_die_offset(topo_info);
>       } else {
> -        *eax |= ((topo_info->threads_per_core - 1) << 14);
> +        num_sharing_cache = 1 << apicid_core_offset(topo_info);
>       }
> +    *eax |= (num_sharing_cache - 1) << 14;
>   
>       assert(cache->line_size > 0);
>       assert(cache->partitions > 0);

^ permalink raw reply	[relevance 0%]
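
A worked example of the APM formula quoted above: with NumSharingCache = 7 in
CPUID[0x8000001D].EAX[bits 25:14], eight logical processors share the cache, so

	ShareId = LocalApicId >> log2(7 + 1) = LocalApicId >> 3

i.e. APIC IDs 0x00-0x07 get ShareId 0, 0x08-0x0f get ShareId 1, and so on,
which is exactly the grouping that falls out of deriving the field from the
apicid_*_offset() helpers as the patch does.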

* Re: [PATCH v4 03/21] softmmu: Fix CPUSTATE.nr_cores' calculation
  2023-09-14  7:31  0%   ` Philippe Mathieu-Daudé
@ 2023-09-15  7:39  0%     ` Zhao Liu
  0 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2023-09-15  7:39 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Eduardo Habkost, Marcel Apfelbaum, Yanan Wang,
	Michael S . Tsirkin, Richard Henderson, Paolo Bonzini,
	Marcelo Tosatti, qemu-devel, kvm, Zhenyu Wang, Xiaoyao Li,
	Babu Moger, Zhao Liu, Zhuocheng Ding

Hi Philippe,

On Thu, Sep 14, 2023 at 09:31:52AM +0200, Philippe Mathieu-Daudé wrote:
> Date: Thu, 14 Sep 2023 09:31:52 +0200
> From: Philippe Mathieu-Daudé <philmd@linaro.org>
> Subject: Re: [PATCH v4 03/21] softmmu: Fix CPUSTATE.nr_cores' calculation
> 
> Hi,
> 
> On 14/9/23 09:21, Zhao Liu wrote:
> > From: Zhuocheng Ding <zhuocheng.ding@intel.com>
> > 
> >  From CPUState.nr_cores' comment, it represents "number of cores within
> > this CPU package".
> > 
> > After 003f230e37d7 ("machine: Tweak the order of topology members in
> > struct CpuTopology"), the meaning of smp.cores changed to "the number of
> > cores in one die", but this commit missed to change CPUState.nr_cores'
> > calculation, so that CPUState.nr_cores became wrong and now it
> > misses to consider numbers of clusters and dies.
> > 
> > At present, only i386 is using CPUState.nr_cores.
> > 
> > But as for i386, which supports die level, the uses of CPUState.nr_cores
> > are very confusing:
> > 
> > Early uses are based on the meaning of "cores per package" (before die
> > is introduced into i386), and later uses are based on "cores per die"
> > (after die's introduction).
> > 
> > This difference is due to that commit a94e1428991f ("target/i386: Add
> > CPUID.1F generation support for multi-dies PCMachine") misunderstood
> > that CPUState.nr_cores means "cores per die" when calculated
> > CPUID.1FH.01H:EBX. After that, the changes in i386 all followed this
> > wrong understanding.
> > 
> > With the influence of 003f230e37d7 and a94e1428991f, for i386 currently
> > the result of CPUState.nr_cores is "cores per die", thus the original
> > uses of CPUState.cores based on the meaning of "cores per package" are
> > wrong when multiple dies exist:
> > 1. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.01H:EBX[bits 23:16] is
> >     incorrect because it expects "cpus per package" but now the
> >     result is "cpus per die".
> > 2. In cpu_x86_cpuid() of target/i386/cpu.c, for all leaves of CPUID.04H:
> >     EAX[bits 31:26] is incorrect because they expect "cpus per package"
> >     but now the result is "cpus per die". The error not only impacts the
> >     EAX calculation in cache_info_passthrough case, but also impacts other
> >     cases of setting cache topology for Intel CPU according to cpu
> >     topology (specifically, the incoming parameter "num_cores" expects
> >     "cores per package" in encode_cache_cpuid4()).
> > 3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.0BH.01H:EBX[bits
> >     15:00] is incorrect because the EBX of 0BH.01H (core level) expects
> >     "cpus per package", which may be different with 1FH.01H (The reason
> >     is 1FH can support more levels. For QEMU, 1FH also supports die,
> >     1FH.01H:EBX[bits 15:00] expects "cpus per die").
> > 4. In cpu_x86_cpuid() of target/i386/cpu.c, when CPUID.80000001H is
> >     calculated, here "cpus per package" is expected to be checked, but in
> >     fact, now it checks "cpus per die". Though "cpus per die" also works
> >     for this code logic, this isn't consistent with AMD's APM.
> > 5. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.80000008H:ECX expects
> >     "cpus per package" but it obtains "cpus per die".
> > 6. In simulate_rdmsr() of target/i386/hvf/x86_emu.c, in
> >     kvm_rdmsr_core_thread_count() of target/i386/kvm/kvm.c, and in
> >     helper_rdmsr() of target/i386/tcg/sysemu/misc_helper.c,
> >     MSR_CORE_THREAD_COUNT expects "cpus per package" and "cores per
> >     package", but in these functions, it obtains "cpus per die" and
> >     "cores per die".
> > 
> > On the other hand, these uses are correct now (they are added in/after
> > a94e1428991f):
> > 1. In cpu_x86_cpuid() of target/i386/cpu.c, topo_info.cores_per_die
> >     meets the actual meaning of CPUState.nr_cores ("cores per die").
> > 2. In cpu_x86_cpuid() of target/i386/cpu.c, vcpus_per_socket (in CPUID.
> >     04H's calculation) considers number of dies, so it's correct.
> > 3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.1FH.01H:EBX[bits
> >     15:00] needs "cpus per die" and it gets the correct result, and
> >     CPUID.1FH.02H:EBX[bits 15:00] gets correct "cpus per package".
> > 
> > When CPUState.nr_cores is correctly changed to "cores per package" again
> > , the above errors will be fixed without extra work, but the "currently"
> > correct cases will go wrong and need special handling to pass correct
> > "cpus/cores per die" they want.
> > 
> > Fix CPUState.nr_cores' calculation to fit the original meaning "cores
> > per package", as well as changing calculation of topo_info.cores_per_die,
> > vcpus_per_socket and CPUID.1FH.
> 
> What a pain. Can we split this patch in 2, first the x86 part
> and then the common part (softmmu/cpus.c)?

Since x86 uses this nr_cores to calculate many things, if the first
patch just fixes nr_cores without changing the x86-related code, then that
first patch will break many places (there will be many incorrect CPUIDs).

So I think it’s more appropriate to make these changes into one patch.

Thanks,
Zhao

> 
> > Fixes: a94e1428991f ("target/i386: Add CPUID.1F generation support for multi-dies PCMachine")
> > Fixes: 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology")
> > Signed-off-by: Zhuocheng Ding <zhuocheng.ding@intel.com>
> > Co-developed-by: Zhao Liu <zhao1.liu@intel.com>
> > Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
> > ---
> > Changes since v3:
> >   * Describe changes in imperative mood. (Babu)
> >   * Fix spelling typo. (Babu)
> >   * Split the comment change into a separate patch. (Xiaoyao)
> > 
> > Changes since v2:
> >   * Use wrapped helper to get cores per socket in qemu_init_vcpu().
> > 
> > Changes since v1:
> >   * Add comment for nr_dies in CPUX86State. (Yanan)
> > ---
> >   softmmu/cpus.c    | 2 +-
> >   target/i386/cpu.c | 9 ++++-----
> >   2 files changed, 5 insertions(+), 6 deletions(-)
> > 
> > diff --git a/softmmu/cpus.c b/softmmu/cpus.c
> > index 0848e0dbdb3f..fa8239c217ff 100644
> > --- a/softmmu/cpus.c
> > +++ b/softmmu/cpus.c
> > @@ -624,7 +624,7 @@ void qemu_init_vcpu(CPUState *cpu)
> >   {
> >       MachineState *ms = MACHINE(qdev_get_machine());
> > -    cpu->nr_cores = ms->smp.cores;
> > +    cpu->nr_cores = machine_topo_get_cores_per_socket(ms);
> >       cpu->nr_threads =  ms->smp.threads;
> >       cpu->stopped = true;
> >       cpu->random_seed = qemu_guest_random_seed_thread_part1();
> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > index 24ee67b42d05..709c055c8468 100644
> > --- a/target/i386/cpu.c
> > +++ b/target/i386/cpu.c
> > @@ -6015,7 +6015,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> >       X86CPUTopoInfo topo_info;
> >       topo_info.dies_per_pkg = env->nr_dies;
> > -    topo_info.cores_per_die = cs->nr_cores;
> > +    topo_info.cores_per_die = cs->nr_cores / env->nr_dies;
> >       topo_info.threads_per_core = cs->nr_threads;
> >       /* Calculate & apply limits for different index ranges */
> > @@ -6091,8 +6091,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> >                */
> >               if (*eax & 31) {
> >                   int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
> > -                int vcpus_per_socket = env->nr_dies * cs->nr_cores *
> > -                                       cs->nr_threads;
> > +                int vcpus_per_socket = cs->nr_cores * cs->nr_threads;
> >                   if (cs->nr_cores > 1) {
> >                       *eax &= ~0xFC000000;
> >                       *eax |= (pow2ceil(cs->nr_cores) - 1) << 26;
> > @@ -6270,12 +6269,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> >               break;
> >           case 1:
> >               *eax = apicid_die_offset(&topo_info);
> > -            *ebx = cs->nr_cores * cs->nr_threads;
> > +            *ebx = topo_info.cores_per_die * topo_info.threads_per_core;
> >               *ecx |= CPUID_TOPOLOGY_LEVEL_CORE;
> >               break;
> >           case 2:
> >               *eax = apicid_pkg_offset(&topo_info);
> > -            *ebx = env->nr_dies * cs->nr_cores * cs->nr_threads;
> > +            *ebx = cs->nr_cores * cs->nr_threads;
> >               *ecx |= CPUID_TOPOLOGY_LEVEL_DIE;
> >               break;
> >           default:
> 

^ permalink raw reply	[relevance 0%]
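
A concrete illustration of what the fix changes, using a hypothetical
configuration not taken from the series: with
-smp 16,sockets=1,dies=2,cores=4,threads=2, smp.cores is 4, so before the fix
cs->nr_cores is 4 ("cores per die") and CPUID.01H:EBX[bits 23:16] reports only
4 * 2 = 8 logical processors. With nr_cores taken from
machine_topo_get_cores_per_socket() it becomes 2 * 4 = 8 cores, that leaf
reports the full 16 logical processors in the package, and CPUID.1FH still
gets its per-die count from topo_info.cores_per_die.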

* Re: [PATCH V2] KVM: x86/pmu: Disable vPMU if EVENTSEL_GUESTONLY bit doesn't exist
  @ 2023-09-14  8:23  0%   ` Like Xu
  2023-09-25 23:31  4%     ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Like Xu @ 2023-09-14  8:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Ravi Bangoria, Manali Shukla

On 7/4/2023 11:37 pm, Sean Christopherson wrote:
> On Fri, Apr 07, 2023, Like Xu wrote:
>> From: Like Xu <likexu@tencent.com>
>>
>> Unlike Intel's MSR atomic_switch mechanism, AMD supports guest pmu
>> basic counter feature by setting the GUESTONLY bit on the host, so the
>> presence or absence of this bit determines whether vPMU is emulatable
>> (e.g. in nested virtualization). Since on AMD, writing reserved bits of
>> EVENTSEL register does not bring #GP, KVM needs to update the global
>> enable_pmu value by checking the persistence of this GUESTONLY bit.
> 
> This is looking more and more like a bug fix, i.e. needs a Fixes:, no?

Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")?
Fixes: 4b6e4dca7011 ("KVM: SVM: enable nested svm by default")?

As far as I can tell, the current KVM code makes the PMU counter of AMD nested
VMs have ridiculous values. One conservative fix is to disable nested PMU on AMD.

> 
>> Cc: Ravi Bangoria <ravi.bangoria@amd.com>
>> Signed-off-by: Like Xu <likexu@tencent.com>
>> ---
>> V1:
>> https://lore.kernel.org/kvm/20230307113819.34089-1-likexu@tencent.com
>> V1 -> V2 Changelog:
>> - Preemption needs to be disabled to ensure a stable CPU; (Sean)
>> - KVM should be restoring the original value too; (Sean)
>> - Disable vPMU once guest_only mode is not supported; (Sean)
> 
> Please respond to my questions, don't just send a new version.  When I asked
> 
>   : Why does lack of AMD64_EVENTSEL_GUESTONLY disable the PMU, but if and only if
>   : X86_FEATURE_PERFCTR_CORE?  E.g. why does the behavior not also apply to legacy
>   : perfmon support?
> 
> I wanted an actual answer because I genuinely do not know what the correct
> behavior is.

As far as I know, only AMD CPUs that support PERFCTR_CORE+ will have
the GUESTONLY bit, if the BIOS doesn't disable it.

The "GO/HO bit" was first introduced for perf-kvm usage (2011), before
support for AMD guest vPMU was added (2015).

As the guest_only name implies, this bit can be used by SVM to emulate
core counters for AMD guests. So if the host does not have this bit (e.g.
on an L1 VM), the L2 VM's vPMU will work abnormally.

> 
>> - Appreciate any better way to probe for GUESTONLY support;
> 
> Again, wait for discussion in previous versions to resolve before posting a new
> version.  If your answer is "not as far as I know", that's totally fine, but
> sending a new version without responding makes it unnecessarily difficult to
> track down your "answer".  E.g. instead of seeing a very direct "I don't know",
> I had to discover that answer by finding a hint buried in the ignored section of
> a new patch.

Okay, then. Nice rule to follow.

> 
>>   arch/x86/kvm/svm/svm.c | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 7584eb85410b..1ab885596510 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -4884,6 +4884,20 @@ static __init void svm_adjust_mmio_mask(void)
>>   	kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK);
>>   }
>>   
>> +static __init bool pmu_has_guestonly_mode(void)
>> +{
>> +	u64 original, value;
>> +
>> +	preempt_disable();
>> +	rdmsrl(MSR_F15H_PERF_CTL0, original);
> 
> What guarantees this MSR actually exists?  In v1, it was guarded by enable_pmu=%true,
> but that's longer the case.  And KVM does
> 
> 	case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTR5:
> 		if (!guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE))
> 			return NULL;
> 
> which very strongly suggests this MSR doesn't exist if the CPU supports only the
> "legacy" PMU.

Yes, the check for boot_cpu_has(X86_FEATURE_PERFCTR_CORE) is missing here.

> 
>> +	wrmsrl(MSR_F15H_PERF_CTL0, AMD64_EVENTSEL_GUESTONLY);
>> +	rdmsrl(MSR_F15H_PERF_CTL0, value);
>> +	wrmsrl(MSR_F15H_PERF_CTL0, original);
>> +	preempt_enable();
>> +
>> +	return value == AMD64_EVENTSEL_GUESTONLY;
>> +}
>> +
>>   static __init void svm_set_cpu_caps(void)
>>   {
>>   	kvm_set_cpu_caps();
>> @@ -4928,6 +4942,9 @@ static __init void svm_set_cpu_caps(void)
>>   	    boot_cpu_has(X86_FEATURE_AMD_SSBD))
>>   		kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
>>   
>> +	/* Probe for AMD64_EVENTSEL_GUESTONLY support */
> 
> I've said this several times recently: use comments to explain _why_ and to call
> out subtleties.  The code quite obviously is probing for guest-only support, what's
> not obvious is why guest-only support is mandatory for vPMU support.  It may be
> obvious to you, but pease try to view all of this code from the perspective of
> someone who has only passing knowledge of the various components, i.e. doesn't
> know the gory details of exactly what KVM supports.

How about:

/*
  * The guest vPMU counter emulation depends on the EVENTSEL_GUESTONLY bit.
  * If this bit is present on the host, the host needs to support at least
  * the PERFCTR_CORE.
  */

> 
> Poking around, I see that pmc_reprogram_counter() unconditionally does
> 
> 	.exclude_host = 1,
> 
> and amd_core_hw_config()
> 
> 	if (event->attr.exclude_host && event->attr.exclude_guest)
> 		/*
> 		 * When HO == GO == 1 the hardware treats that as GO == HO == 0
> 		 * and will count in both modes. We don't want to count in that
> 		 * case so we emulate no-counting by setting US = OS = 0.
> 		 */
> 		event->hw.config &= ~(ARCH_PERFMON_EVENTSEL_USR |
> 				      ARCH_PERFMON_EVENTSEL_OS);
> 	else if (event->attr.exclude_host)
> 		event->hw.config |= AMD64_EVENTSEL_GUESTONLY;
> 	else if (event->attr.exclude_guest)
> 		event->hw.config |= AMD64_EVENTSEL_HOSTONLY;
> 
> and so something like this seems appropriate
> 
> 	/*
> 	 * KVM requires guest-only event support in order to isolate guest PMCs
> 	 * from host PMCs.  SVM doesn't provide a way to atomically load MSRs
> 	 * on VMRUN, and manually adjusting counts before/after VMRUN is not
> 	 * accurate enough to properly virtualize a PMU.
> 	 */
> 
> But now I'm really confused, because if I'm reading the code correctly, perf
> invokes amd_core_hw_config() for legacy PMUs, i.e. even if PERFCTR_CORE isn't
> supported.  And the APM documents the host/guest bits only for "Core Performance
> Event-Select Registers".
> 
> So either (a) GUESTONLY isn't supported on legacy CPUs and perf is relying on AMD
> CPUs ignoring reserved bits or (b) GUESTONLY _is_ supported on legacy PMUs and
> pmu_has_guestonly_mode() is checking the wrong MSR when running on older CPUs.
> 
> And if (a) is true, then how on earth does KVM support vPMU when running on a
> legacy PMU?  Is vPMU on AMD just wildly broken?  Am I missing something?
> 

(a) It's true, and AMD guest vPMU has only been implemented accurately with
the help of this GUESTONLY bit.

There are two other scenarios worth discussing here: one is supporting an L2
vPMU on a PERFCTR_CORE+ host, which this proposal disables; the other is
supporting the AMD legacy vPMU on a PERFCTR_CORE+ host.

Please let me know if you have more concerns.

^ permalink raw reply	[relevance 0%]
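
Putting the two points above together (the probe itself plus the missing
PERFCTR_CORE check), the helper would presumably end up looking roughly like
the sketch below; this is only an illustration of the discussion, not the code
that was eventually merged:

static __init bool pmu_has_guestonly_mode(void)
{
	u64 original, value;

	/* MSR_F15H_PERF_CTL0 only exists with PERFCTR_CORE. */
	if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
		return false;

	preempt_disable();
	rdmsrl(MSR_F15H_PERF_CTL0, original);
	wrmsrl(MSR_F15H_PERF_CTL0, AMD64_EVENTSEL_GUESTONLY);
	rdmsrl(MSR_F15H_PERF_CTL0, value);
	wrmsrl(MSR_F15H_PERF_CTL0, original);
	preempt_enable();

	return value == AMD64_EVENTSEL_GUESTONLY;
}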

* Re: [PATCH v4 03/21] softmmu: Fix CPUSTATE.nr_cores' calculation
  2023-09-14  7:21  2% ` [PATCH v4 03/21] softmmu: Fix CPUSTATE.nr_cores' calculation Zhao Liu
@ 2023-09-14  7:31  0%   ` Philippe Mathieu-Daudé
  2023-09-15  7:39  0%     ` Zhao Liu
  0 siblings, 1 reply; 200+ results
From: Philippe Mathieu-Daudé @ 2023-09-14  7:31 UTC (permalink / raw)
  To: Zhao Liu, Eduardo Habkost, Marcel Apfelbaum, Yanan Wang,
	Michael S . Tsirkin, Richard Henderson, Paolo Bonzini,
	Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Xiaoyao Li, Babu Moger, Zhao Liu,
	Zhuocheng Ding

Hi,

On 14/9/23 09:21, Zhao Liu wrote:
> From: Zhuocheng Ding <zhuocheng.ding@intel.com>
> 
>  From CPUState.nr_cores' comment, it represents "number of cores within
> this CPU package".
> 
> After 003f230e37d7 ("machine: Tweak the order of topology members in
> struct CpuTopology"), the meaning of smp.cores changed to "the number of
> cores in one die", but this commit missed to change CPUState.nr_cores'
> calculation, so that CPUState.nr_cores became wrong and now it
> misses to consider numbers of clusters and dies.
> 
> At present, only i386 is using CPUState.nr_cores.
> 
> But as for i386, which supports die level, the uses of CPUState.nr_cores
> are very confusing:
> 
> Early uses are based on the meaning of "cores per package" (before die
> is introduced into i386), and later uses are based on "cores per die"
> (after die's introduction).
> 
> This difference is due to that commit a94e1428991f ("target/i386: Add
> CPUID.1F generation support for multi-dies PCMachine") misunderstood
> that CPUState.nr_cores means "cores per die" when calculated
> CPUID.1FH.01H:EBX. After that, the changes in i386 all followed this
> wrong understanding.
> 
> With the influence of 003f230e37d7 and a94e1428991f, for i386 currently
> the result of CPUState.nr_cores is "cores per die", thus the original
> uses of CPUState.cores based on the meaning of "cores per package" are
> wrong when multiple dies exist:
> 1. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.01H:EBX[bits 23:16] is
>     incorrect because it expects "cpus per package" but now the
>     result is "cpus per die".
> 2. In cpu_x86_cpuid() of target/i386/cpu.c, for all leaves of CPUID.04H:
>     EAX[bits 31:26] is incorrect because they expect "cpus per package"
>     but now the result is "cpus per die". The error not only impacts the
>     EAX calculation in cache_info_passthrough case, but also impacts other
>     cases of setting cache topology for Intel CPU according to cpu
>     topology (specifically, the incoming parameter "num_cores" expects
>     "cores per package" in encode_cache_cpuid4()).
> 3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.0BH.01H:EBX[bits
>     15:00] is incorrect because the EBX of 0BH.01H (core level) expects
>     "cpus per package", which may be different with 1FH.01H (The reason
>     is 1FH can support more levels. For QEMU, 1FH also supports die,
>     1FH.01H:EBX[bits 15:00] expects "cpus per die").
> 4. In cpu_x86_cpuid() of target/i386/cpu.c, when CPUID.80000001H is
>     calculated, here "cpus per package" is expected to be checked, but in
>     fact, now it checks "cpus per die". Though "cpus per die" also works
>     for this code logic, this isn't consistent with AMD's APM.
> 5. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.80000008H:ECX expects
>     "cpus per package" but it obtains "cpus per die".
> 6. In simulate_rdmsr() of target/i386/hvf/x86_emu.c, in
>     kvm_rdmsr_core_thread_count() of target/i386/kvm/kvm.c, and in
>     helper_rdmsr() of target/i386/tcg/sysemu/misc_helper.c,
>     MSR_CORE_THREAD_COUNT expects "cpus per package" and "cores per
>     package", but in these functions, it obtains "cpus per die" and
>     "cores per die".
> 
> On the other hand, these uses are correct now (they are added in/after
> a94e1428991f):
> 1. In cpu_x86_cpuid() of target/i386/cpu.c, topo_info.cores_per_die
>     meets the actual meaning of CPUState.nr_cores ("cores per die").
> 2. In cpu_x86_cpuid() of target/i386/cpu.c, vcpus_per_socket (in CPUID.
>     04H's calculation) considers number of dies, so it's correct.
> 3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.1FH.01H:EBX[bits
>     15:00] needs "cpus per die" and it gets the correct result, and
>     CPUID.1FH.02H:EBX[bits 15:00] gets correct "cpus per package".
> 
> When CPUState.nr_cores is changed back to the correct meaning of "cores
> per package", the above errors will be fixed without extra work, but the
> "currently" correct cases will go wrong and will need special handling
> to pass the correct "cpus/cores per die" they want.
> 
> Fix CPUState.nr_cores' calculation to fit the original meaning of "cores
> per package", as well as changing the calculations of topo_info.cores_per_die,
> vcpus_per_socket and CPUID.1FH.

What a pain. Can we split this patch in 2, first the x86 part
and then the common part (softmmu/cpus.c)?

> Fixes: a94e1428991f ("target/i386: Add CPUID.1F generation support for multi-dies PCMachine")
> Fixes: 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology")
> Signed-off-by: Zhuocheng Ding <zhuocheng.ding@intel.com>
> Co-developed-by: Zhao Liu <zhao1.liu@intel.com>
> Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
> ---
> Changes since v3:
>   * Describe changes in imperative mood. (Babu)
>   * Fix spelling typo. (Babu)
>   * Split the comment change into a separate patch. (Xiaoyao)
> 
> Changes since v2:
>   * Use wrapped helper to get cores per socket in qemu_init_vcpu().
> 
> Changes since v1:
>   * Add comment for nr_dies in CPUX86State. (Yanan)
> ---
>   softmmu/cpus.c    | 2 +-
>   target/i386/cpu.c | 9 ++++-----
>   2 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/softmmu/cpus.c b/softmmu/cpus.c
> index 0848e0dbdb3f..fa8239c217ff 100644
> --- a/softmmu/cpus.c
> +++ b/softmmu/cpus.c
> @@ -624,7 +624,7 @@ void qemu_init_vcpu(CPUState *cpu)
>   {
>       MachineState *ms = MACHINE(qdev_get_machine());
>   
> -    cpu->nr_cores = ms->smp.cores;
> +    cpu->nr_cores = machine_topo_get_cores_per_socket(ms);
>       cpu->nr_threads =  ms->smp.threads;
>       cpu->stopped = true;
>       cpu->random_seed = qemu_guest_random_seed_thread_part1();
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 24ee67b42d05..709c055c8468 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -6015,7 +6015,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>       X86CPUTopoInfo topo_info;
>   
>       topo_info.dies_per_pkg = env->nr_dies;
> -    topo_info.cores_per_die = cs->nr_cores;
> +    topo_info.cores_per_die = cs->nr_cores / env->nr_dies;
>       topo_info.threads_per_core = cs->nr_threads;
>   
>       /* Calculate & apply limits for different index ranges */
> @@ -6091,8 +6091,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>                */
>               if (*eax & 31) {
>                   int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
> -                int vcpus_per_socket = env->nr_dies * cs->nr_cores *
> -                                       cs->nr_threads;
> +                int vcpus_per_socket = cs->nr_cores * cs->nr_threads;
>                   if (cs->nr_cores > 1) {
>                       *eax &= ~0xFC000000;
>                       *eax |= (pow2ceil(cs->nr_cores) - 1) << 26;
> @@ -6270,12 +6269,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>               break;
>           case 1:
>               *eax = apicid_die_offset(&topo_info);
> -            *ebx = cs->nr_cores * cs->nr_threads;
> +            *ebx = topo_info.cores_per_die * topo_info.threads_per_core;
>               *ecx |= CPUID_TOPOLOGY_LEVEL_CORE;
>               break;
>           case 2:
>               *eax = apicid_pkg_offset(&topo_info);
> -            *ebx = env->nr_dies * cs->nr_cores * cs->nr_threads;
> +            *ebx = cs->nr_cores * cs->nr_threads;
>               *ecx |= CPUID_TOPOLOGY_LEVEL_DIE;
>               break;
>           default:


^ permalink raw reply	[relevance 0%]

* [PATCH v4 19/21] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
    2023-09-14  7:21  6% ` [PATCH v4 02/21] tests: Rename test-x86-cpuid.c to test-x86-topo.c Zhao Liu
  2023-09-14  7:21  2% ` [PATCH v4 03/21] softmmu: Fix CPUSTATE.nr_cores' calculation Zhao Liu
@ 2023-09-14  7:21  5% ` Zhao Liu
  2023-09-22 19:27  0%   ` Moger, Babu
  2 siblings, 1 reply; 200+ results
From: Zhao Liu @ 2023-09-14  7:21 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Xiaoyao Li, Babu Moger, Zhao Liu

From: Zhao Liu <zhao1.liu@intel.com>

Commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information
for cpuid 0x8000001D") added the cache topology for AMD CPUs by encoding
the number of sharing threads directly.

From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14])
means [1]:

The number of logical processors sharing this cache is the value of
this field incremented by 1. To determine which logical processors are
sharing a cache, determine a Share Id for each processor as follows:

ShareId = LocalApicId >> log2(NumSharingCache+1)

Logical processors with the same ShareId then share a cache. If
NumSharingCache+1 is not a power of two, round it up to the next power
of two.

From the description above, the calculation of this field should be the same
as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the offsets of the
APIC ID to calculate this field.

[1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology
     Information
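
As a self-contained illustration of the ShareId arithmetic above (plain C with
made-up NumSharingCache and APIC ID values; this is not QEMU code):

#include <stdint.h>
#include <stdio.h>

/* Round v up to the next power of two, as the APM text requires. */
static uint32_t next_pow2(uint32_t v)
{
    uint32_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

int main(void)
{
    uint32_t num_sharing_cache = 11; /* EAX[25:14]: 12 logical processors share the cache */
    uint32_t local_apic_id = 0x23;   /* APIC ID of one of those processors */
    uint32_t rounded = next_pow2(num_sharing_cache + 1); /* 12 -> 16 */
    uint32_t share_id = local_apic_id / rounded;         /* same as >> log2(rounded) */

    printf("ShareId = %u\n", share_id);
    return 0;
}

Deriving the field from the APIC ID offsets, as the diff below does, yields a
power-of-two sharing domain directly, so no rounding is needed at decode time.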

Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
---
Changes since v3:
 * Rewrite the subject. (Babu)
 * Delete the original "comment/help" expression, as this behavior is
   confirmed for AMD CPUs. (Babu)
 * Rename "num_apic_ids" (v3) to "num_sharing_cache" to match spec
   definition. (Babu)

Changes since v1:
 * Rename "l3_threads" to "num_apic_ids" in
   encode_cache_cpuid8000001d(). (Yanan)
 * Add the description of the original commit and add Cc.
---
 target/i386/cpu.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 5d066107d6ce..bc28c59df089 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -482,7 +482,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
                                        uint32_t *eax, uint32_t *ebx,
                                        uint32_t *ecx, uint32_t *edx)
 {
-    uint32_t l3_threads;
+    uint32_t num_sharing_cache;
     assert(cache->size == cache->line_size * cache->associativity *
                           cache->partitions * cache->sets);
 
@@ -491,13 +491,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache,
 
     /* L3 is shared among multiple cores */
     if (cache->level == 3) {
-        l3_threads = topo_info->modules_per_die *
-                     topo_info->cores_per_module *
-                     topo_info->threads_per_core;
-        *eax |= (l3_threads - 1) << 14;
+        num_sharing_cache = 1 << apicid_die_offset(topo_info);
     } else {
-        *eax |= ((topo_info->threads_per_core - 1) << 14);
+        num_sharing_cache = 1 << apicid_core_offset(topo_info);
     }
+    *eax |= (num_sharing_cache - 1) << 14;
 
     assert(cache->line_size > 0);
     assert(cache->partitions > 0);
-- 
2.34.1


^ permalink raw reply related	[relevance 5%]

* [PATCH v4 03/21] softmmu: Fix CPUSTATE.nr_cores' calculation
    2023-09-14  7:21  6% ` [PATCH v4 02/21] tests: Rename test-x86-cpuid.c to test-x86-topo.c Zhao Liu
@ 2023-09-14  7:21  2% ` Zhao Liu
  2023-09-14  7:31  0%   ` Philippe Mathieu-Daudé
  2023-09-14  7:21  5% ` [PATCH v4 19/21] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
  2 siblings, 1 reply; 200+ results
From: Zhao Liu @ 2023-09-14  7:21 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Xiaoyao Li, Babu Moger, Zhao Liu,
	Zhuocheng Ding

From: Zhuocheng Ding <zhuocheng.ding@intel.com>

From CPUState.nr_cores' comment, it represents "number of cores within
this CPU package".

After 003f230e37d7 ("machine: Tweak the order of topology members in
struct CpuTopology"), the meaning of smp.cores changed to "the number of
cores in one die", but this commit missed to change CPUState.nr_cores'
calculation, so that CPUState.nr_cores became wrong and now it
misses to consider numbers of clusters and dies.

At present, only i386 is using CPUState.nr_cores.

But as for i386, which supports die level, the uses of CPUState.nr_cores
are very confusing:

Early uses are based on the meaning of "cores per package" (before die
is introduced into i386), and later uses are based on "cores per die"
(after die's introduction).

This difference arose because commit a94e1428991f ("target/i386: Add
CPUID.1F generation support for multi-dies PCMachine") misunderstood
CPUState.nr_cores as "cores per die" when calculating
CPUID.1FH.01H:EBX. After that, the i386 changes all followed this
wrong understanding.

With the influence of 003f230e37d7 and a94e1428991f, for i386 currently
the result of CPUState.nr_cores is "cores per die", thus the original
uses of CPUState.nr_cores based on the meaning of "cores per package" are
wrong when multiple dies exist:
1. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.01H:EBX[bits 23:16] is
   incorrect because it expects "cpus per package" but now the
   result is "cpus per die".
2. In cpu_x86_cpuid() of target/i386/cpu.c, for all leaves of CPUID.04H:
   EAX[bits 31:26] is incorrect because they expect "cpus per package"
   but now the result is "cpus per die". The error not only impacts the
   EAX calculation in cache_info_passthrough case, but also impacts other
   cases of setting cache topology for Intel CPU according to cpu
   topology (specifically, the incoming parameter "num_cores" expects
   "cores per package" in encode_cache_cpuid4()).
3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.0BH.01H:EBX[bits
   15:00] is incorrect because the EBX of 0BH.01H (core level) expects
   "cpus per package", which may be different with 1FH.01H (The reason
   is 1FH can support more levels. For QEMU, 1FH also supports die,
   1FH.01H:EBX[bits 15:00] expects "cpus per die").
4. In cpu_x86_cpuid() of target/i386/cpu.c, when CPUID.80000001H is
   calculated, here "cpus per package" is expected to be checked, but in
   fact, now it checks "cpus per die". Though "cpus per die" also works
   for this code logic, this isn't consistent with AMD's APM.
5. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.80000008H:ECX expects
   "cpus per package" but it obtains "cpus per die".
6. In simulate_rdmsr() of target/i386/hvf/x86_emu.c, in
   kvm_rdmsr_core_thread_count() of target/i386/kvm/kvm.c, and in
   helper_rdmsr() of target/i386/tcg/sysemu/misc_helper.c,
   MSR_CORE_THREAD_COUNT expects "cpus per package" and "cores per
   package", but in these functions, it obtains "cpus per die" and
   "cores per die".

On the other hand, these uses are correct now (they are added in/after
a94e1428991f):
1. In cpu_x86_cpuid() of target/i386/cpu.c, topo_info.cores_per_die
   meets the actual meaning of CPUState.nr_cores ("cores per die").
2. In cpu_x86_cpuid() of target/i386/cpu.c, vcpus_per_socket (in CPUID.
   04H's calculation) considers number of dies, so it's correct.
3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.1FH.01H:EBX[bits
   15:00] needs "cpus per die" and it gets the correct result, and
   CPUID.1FH.02H:EBX[bits 15:00] gets correct "cpus per package".

When CPUState.nr_cores is changed back to the correct meaning of "cores
per package", the above errors will be fixed without extra work, but the
"currently" correct cases will go wrong and will need special handling
to pass the correct "cpus/cores per die" they want.

Fix CPUState.nr_cores' calculation to fit the original meaning of "cores
per package", as well as changing the calculations of topo_info.cores_per_die,
vcpus_per_socket and CPUID.1FH.

Fixes: a94e1428991f ("target/i386: Add CPUID.1F generation support for multi-dies PCMachine")
Fixes: 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology")
Signed-off-by: Zhuocheng Ding <zhuocheng.ding@intel.com>
Co-developed-by: Zhao Liu <zhao1.liu@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
---
Changes since v3:
 * Describe changes in imperative mood. (Babu)
 * Fix spelling typo. (Babu)
 * Split the comment change into a separate patch. (Xiaoyao)

Changes since v2:
 * Use wrapped helper to get cores per socket in qemu_init_vcpu().

Changes since v1:
 * Add comment for nr_dies in CPUX86State. (Yanan)
---
 softmmu/cpus.c    | 2 +-
 target/i386/cpu.c | 9 ++++-----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 0848e0dbdb3f..fa8239c217ff 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -624,7 +624,7 @@ void qemu_init_vcpu(CPUState *cpu)
 {
     MachineState *ms = MACHINE(qdev_get_machine());
 
-    cpu->nr_cores = ms->smp.cores;
+    cpu->nr_cores = machine_topo_get_cores_per_socket(ms);
     cpu->nr_threads =  ms->smp.threads;
     cpu->stopped = true;
     cpu->random_seed = qemu_guest_random_seed_thread_part1();
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 24ee67b42d05..709c055c8468 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6015,7 +6015,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
     X86CPUTopoInfo topo_info;
 
     topo_info.dies_per_pkg = env->nr_dies;
-    topo_info.cores_per_die = cs->nr_cores;
+    topo_info.cores_per_die = cs->nr_cores / env->nr_dies;
     topo_info.threads_per_core = cs->nr_threads;
 
     /* Calculate & apply limits for different index ranges */
@@ -6091,8 +6091,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
              */
             if (*eax & 31) {
                 int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
-                int vcpus_per_socket = env->nr_dies * cs->nr_cores *
-                                       cs->nr_threads;
+                int vcpus_per_socket = cs->nr_cores * cs->nr_threads;
                 if (cs->nr_cores > 1) {
                     *eax &= ~0xFC000000;
                     *eax |= (pow2ceil(cs->nr_cores) - 1) << 26;
@@ -6270,12 +6269,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
             break;
         case 1:
             *eax = apicid_die_offset(&topo_info);
-            *ebx = cs->nr_cores * cs->nr_threads;
+            *ebx = topo_info.cores_per_die * topo_info.threads_per_core;
             *ecx |= CPUID_TOPOLOGY_LEVEL_CORE;
             break;
         case 2:
             *eax = apicid_pkg_offset(&topo_info);
-            *ebx = env->nr_dies * cs->nr_cores * cs->nr_threads;
+            *ebx = cs->nr_cores * cs->nr_threads;
             *ecx |= CPUID_TOPOLOGY_LEVEL_DIE;
             break;
         default:
-- 
2.34.1
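
For reference, a sketch of what machine_topo_get_cores_per_socket() is assumed
to compute for the softmmu/cpus.c hunk above (the helper presumably lives in
include/hw/boards.h; field names follow MachineState.smp):

static inline unsigned int machine_topo_get_cores_per_socket(const MachineState *ms)
{
    /* smp.cores counts cores per cluster, so scale back up to the package. */
    return ms->smp.cores * ms->smp.clusters * ms->smp.dies;
}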


^ permalink raw reply related	[relevance 2%]

* [PATCH v4 02/21] tests: Rename test-x86-cpuid.c to test-x86-topo.c
  @ 2023-09-14  7:21  6% ` Zhao Liu
  2023-09-14  7:21  2% ` [PATCH v4 03/21] softmmu: Fix CPUSTATE.nr_cores' calculation Zhao Liu
  2023-09-14  7:21  5% ` [PATCH v4 19/21] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
  2 siblings, 0 replies; 200+ results
From: Zhao Liu @ 2023-09-14  7:21 UTC (permalink / raw)
  To: Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Michael S . Tsirkin, Richard Henderson,
	Paolo Bonzini, Marcelo Tosatti
  Cc: qemu-devel, kvm, Zhenyu Wang, Xiaoyao Li, Babu Moger, Zhao Liu,
	Yongwei Ma

From: Zhao Liu <zhao1.liu@intel.com>

The tests in this file actually test the APIC ID combinations.
Rename to test-x86-topo.c to make its name more in line with its
actual content.

Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
Changes since v3:
 * Modify the description in commit message to emphasize this file tests
   APIC ID combinations. (Babu)

Changes since v1:
 * Rename test-x86-apicid.c to test-x86-topo.c. (Yanan)
---
 MAINTAINERS                                      | 2 +-
 tests/unit/meson.build                           | 4 ++--
 tests/unit/{test-x86-cpuid.c => test-x86-topo.c} | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)
 rename tests/unit/{test-x86-cpuid.c => test-x86-topo.c} (99%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 00562f924f7a..eaaa041804b2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1721,7 +1721,7 @@ F: include/hw/southbridge/ich9.h
 F: include/hw/southbridge/piix.h
 F: hw/isa/apm.c
 F: include/hw/isa/apm.h
-F: tests/unit/test-x86-cpuid.c
+F: tests/unit/test-x86-topo.c
 F: tests/qtest/test-x86-cpuid-compat.c
 
 PC Chipset
diff --git a/tests/unit/meson.build b/tests/unit/meson.build
index 0299ef6906cc..bb6f8177341d 100644
--- a/tests/unit/meson.build
+++ b/tests/unit/meson.build
@@ -21,8 +21,8 @@ tests = {
   'test-opts-visitor': [testqapi],
   'test-visitor-serialization': [testqapi],
   'test-bitmap': [],
-  # all code tested by test-x86-cpuid is inside topology.h
-  'test-x86-cpuid': [],
+  # all code tested by test-x86-topo is inside topology.h
+  'test-x86-topo': [],
   'test-cutils': [],
   'test-div128': [],
   'test-shift128': [],
diff --git a/tests/unit/test-x86-cpuid.c b/tests/unit/test-x86-topo.c
similarity index 99%
rename from tests/unit/test-x86-cpuid.c
rename to tests/unit/test-x86-topo.c
index bfabc0403a1a..2b104f86d7c2 100644
--- a/tests/unit/test-x86-cpuid.c
+++ b/tests/unit/test-x86-topo.c
@@ -1,5 +1,5 @@
 /*
- *  Test code for x86 CPUID and Topology functions
+ *  Test code for x86 APIC ID and Topology functions
  *
  *  Copyright (c) 2012 Red Hat Inc.
  *
-- 
2.34.1


^ permalink raw reply related	[relevance 6%]

* Re: [PATCH 00/13] Implement support for IBS virtualization
  2023-09-06 19:56  0%     ` Peter Zijlstra
@ 2023-09-07 15:49  0%       ` Manali Shukla
  0 siblings, 0 replies; 200+ results
From: Manali Shukla @ 2023-09-07 15:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, seanjc, linux-doc, linux-perf-users, x86, pbonzini, bp,
	santosh.shukla, ravi.bangoria, thomas.lendacky, nikunj

Hi Peter,

On 9/7/2023 1:26 AM, Peter Zijlstra wrote:
> On Wed, Sep 06, 2023 at 09:08:25PM +0530, Manali Shukla wrote:
>> Hi Peter,
>>
>> Thank you for looking into this.
>>
>> On 9/5/2023 9:17 PM, Peter Zijlstra wrote:
>>> On Mon, Sep 04, 2023 at 09:53:34AM +0000, Manali Shukla wrote:
>>>
>>>> Note that, since IBS registers are swap type C [2], the hypervisor is
>>>> responsible for saving and restoring of IBS host state. Hypervisor
>>>> does so only when IBS is active on the host to avoid unnecessary
>>>> rdmsrs/wrmsrs. Hypervisor needs to disable host IBS before saving the
>>>> state and enter the guest. After a guest exit, the hypervisor needs to
>>>> restore host IBS state and re-enable IBS.
>>>
>>> Why do you think it is OK for a guest to disable the host IBS when
>>> entering a guest? Perhaps the host was wanting to profile the guest.
>>>
>>
>> 1. Since IBS registers are of swap type C [1], only guest state is saved
>> and restored by the hardware. Host state needs to be saved and restored by
>> hypervisor. In order to save IBS registers correctly, IBS needs to be
>> disabled before saving the IBS registers.
>>
>> 2. As per APM [2],
>> "When a VMRUN is executed to an SEV-ES guest with IBS virtualization enabled, the
>> IbsFetchCtl[IbsFetchEn] and IbsOpCtl[IbsOpEn] MSR bits must be 0. If either of 
>> these bits are not 0, the VMRUN will fail with a VMEXIT_INVALID error code."
>> This is enforced by hardware on SEV-ES guests when VIBS is enabled on SEV-ES
>> guests.
> 
> I'm not sure I'm fluent in virt speak (in fact, I'm sure I'm not). Is
> the above saying that a host can never IBS profile a guest?

The host can profile a guest with IBS if VIBS is disabled for the guest, which
is the default behavior. The host cannot profile the guest if VIBS is enabled
for that guest.

> 
> Does the current IBS thing assert perf_event_attr::exclude_guest is set?

Unlike the AMD core PMU, IBS doesn't have host/guest filtering capability, so
perf_event_open() fails if exclude_guest is set for an IBS event.
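
A minimal userspace sketch of that behavior (assuming an IBS-capable AMD host;
the sysfs file provides the dynamic PMU type for ibs_op):

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	FILE *f = fopen("/sys/bus/event_source/devices/ibs_op/type", "r");
	int type, fd;

	if (!f || fscanf(f, "%d", &type) != 1)
		return 1;	/* no ibs_op PMU registered on this host */
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;		/* ibs_op PMU */
	attr.sample_period = 100000;
	attr.exclude_guest = 1;		/* IBS has no host/guest filtering */

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		perror("perf_event_open (expected to fail, e.g. EINVAL)");
	else
		close(fd);
	return 0;
}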

> 
> I can't quickly find anything :-(

Thank you,
Manali


^ permalink raw reply	[relevance 0%]

* Re: [PATCH 2/5] nSVM: Check for optional commands and reserved encodings of TLB_CONTROL in nested VMCB
  2023-09-06 15:59  0%     ` Stefan Sterz
@ 2023-09-06 20:40  4%       ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2023-09-06 20:40 UTC (permalink / raw)
  To: Stefan Sterz
  Cc: Paolo Bonzini, Krish Sadhukhan, kvm, jmattson, vkuznets, wanpengli, joro

On Wed, Sep 06, 2023, Stefan Sterz wrote:
> On 28.09.21 18:55, Paolo Bonzini wrote:
> > On 21/09/21 01:51, Krish Sadhukhan wrote:
> >> +{
> >> +    switch(tlb_ctl) {
> >> +        case TLB_CONTROL_DO_NOTHING:
> >> +        case TLB_CONTROL_FLUSH_ALL_ASID:
> >> +            return true;
> >> +        case TLB_CONTROL_FLUSH_ASID:
> >> +        case TLB_CONTROL_FLUSH_ASID_LOCAL:
> >> +            if (guest_cpuid_has(vcpu, X86_FEATURE_FLUSHBYASID))
> >> +                return true;
> >> +            fallthrough;
> >
> > Since nested FLUSHBYASID is not supported yet, this second set of case
> > labels can go away.
> >
> > Queued with that change, thanks.
> >
> > Paolo
> >
> 
> Are there any plans to support FLUSHBYASID in the future? It seems
> VMWare Workstation and ESXi require this feature to run on top of KVM
> [1]. This means that after the introduction of this check these VMs fail
> to boot and report missing features. Hence, upgrading to a newer kernel
> version is not an option for some users.

IIUC, there's two different issues.  KVM "broke" Workstation 16 by adding a bogus
consistency check.  And Workstation 17 managed to make things worse by trying to
do the right thing (assert that a feature is supported instead of blindly using it).

I say the above consistency check is bogus because I can't find anything in the APM
that states that TLB_CONTROL is actually checked.  Unless something is called out
in "Canonicalization and Consistency Checks" or listed as MBZ (Must Be Zero), AMD
behavior is typically to let software shoot itself in the foot.

So unless I'm missing something, commit 174a921b6975 ("nSVM: Check for reserved
encodings of TLB_CONTROL in nested VMCB") should be reverted.

However, that doesn't help Workstation 17.  On the other hand, I don't see how
Workstation 17 could ever have worked on KVM, since KVM has never advertised
FLUSHBYASID to L1.

That said, "supporting" FLUSHBYASID is trivial, KVM just needs to advertise the
bit to userspace.  That'll work because KVM's TLB flush "handling" for nSVM is
to just flush everything on every transition (it's been a TODO for a long, long
time).
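
Concretely, that would be on the order of a one-liner in svm_set_cpu_caps()
(an untested sketch; the exact placement alongside the other nested caps is an
assumption):

	/* KVM flushes everything on nested transitions, so FLUSHBYASID is "free". */
	kvm_cpu_cap_set(X86_FEATURE_FLUSHBYASID);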

> Sorry if I misunderstood something or if this is not the right place to ask.
> 
> [1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2008583

^ permalink raw reply	[relevance 4%]

* Re: [PATCH 00/13] Implement support for IBS virtualization
  2023-09-06 15:38  4%   ` Manali Shukla
@ 2023-09-06 19:56  0%     ` Peter Zijlstra
  2023-09-07 15:49  0%       ` Manali Shukla
  0 siblings, 1 reply; 200+ results
From: Peter Zijlstra @ 2023-09-06 19:56 UTC (permalink / raw)
  To: Manali Shukla
  Cc: kvm, seanjc, linux-doc, linux-perf-users, x86, pbonzini, bp,
	santosh.shukla, ravi.bangoria, thomas.lendacky, nikunj

On Wed, Sep 06, 2023 at 09:08:25PM +0530, Manali Shukla wrote:
> Hi Peter,
> 
> Thank you for looking into this.
> 
> On 9/5/2023 9:17 PM, Peter Zijlstra wrote:
> > On Mon, Sep 04, 2023 at 09:53:34AM +0000, Manali Shukla wrote:
> > 
> >> Note that, since IBS registers are swap type C [2], the hypervisor is
> >> responsible for saving and restoring of IBS host state. Hypervisor
> >> does so only when IBS is active on the host to avoid unnecessary
> >> rdmsrs/wrmsrs. Hypervisor needs to disable host IBS before saving the
> >> state and enter the guest. After a guest exit, the hypervisor needs to
> >> restore host IBS state and re-enable IBS.
> > 
> > Why do you think it is OK for a guest to disable the host IBS when
> > entering a guest? Perhaps the host was wanting to profile the guest.
> > 
> 
> 1. Since IBS registers are of swap type C [1], only guest state is saved
> and restored by the hardware. Host state needs to be saved and restored by
> hypervisor. In order to save IBS registers correctly, IBS needs to be
> disabled before saving the IBS registers.
> 
> 2. As per APM [2],
> "When a VMRUN is executed to an SEV-ES guest with IBS virtualization enabled, the
> IbsFetchCtl[IbsFetchEn] and IbsOpCtl[IbsOpEn] MSR bits must be 0. If either of 
> these bits are not 0, the VMRUN will fail with a VMEXIT_INVALID error code."
> This is enforced by hardware on SEV-ES guests when VIBS is enabled on SEV-ES
> guests.

I'm not sure I'm fluent in virt speak (in fact, I'm sure I'm not). Is
the above saying that a host can never IBS profile a guest?

Does the current IBS thing assert perf_event_attr::exclude_guest is set?

I can't quickly find anything :-(

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 2/5] nSVM: Check for optional commands and reserved encodings of TLB_CONTROL in nested VMCB
  @ 2023-09-06 15:59  0%     ` Stefan Sterz
  2023-09-06 20:40  4%       ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Stefan Sterz @ 2023-09-06 15:59 UTC (permalink / raw)
  To: Paolo Bonzini, Krish Sadhukhan, kvm
  Cc: jmattson, seanjc, vkuznets, wanpengli, joro

On 28.09.21 18:55, Paolo Bonzini wrote:
> On 21/09/21 01:51, Krish Sadhukhan wrote:
>> According to section "TLB Flush" in APM vol 2,
>>
>>      "Support for TLB_CONTROL commands other than the first two, is
>>       optional and is indicated by CPUID Fn8000_000A_EDX[FlushByAsid].
>>
>>       All encodings of TLB_CONTROL not defined in the APM are reserved."
>>
>> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
>> ---
>>   arch/x86/kvm/svm/nested.c | 19 +++++++++++++++++++
>>   1 file changed, 19 insertions(+)
>>
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index 5e13357da21e..028cc2a1f028 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -235,6 +235,22 @@ static bool nested_svm_check_bitmap_pa(struct
>> kvm_vcpu *vcpu, u64 pa, u32 size)
>>           kvm_vcpu_is_legal_gpa(vcpu, addr + size - 1);
>>   }
>>   +static bool nested_svm_check_tlb_ctl(struct kvm_vcpu *vcpu, u8
>> tlb_ctl)
>> +{
>> +    switch(tlb_ctl) {
>> +        case TLB_CONTROL_DO_NOTHING:
>> +        case TLB_CONTROL_FLUSH_ALL_ASID:
>> +            return true;
>> +        case TLB_CONTROL_FLUSH_ASID:
>> +        case TLB_CONTROL_FLUSH_ASID_LOCAL:
>> +            if (guest_cpuid_has(vcpu, X86_FEATURE_FLUSHBYASID))
>> +                return true;
>> +            fallthrough;
>
> Since nested FLUSHBYASID is not supported yet, this second set of case
> labels can go away.
>
> Queued with that change, thanks.
>
> Paolo
>

Are there any plans to support FLUSHBYASID in the future? It seems
VMWare Workstation and ESXi require this feature to run on top of KVM
[1]. This means that after the introduction of this check these VMs fail
to boot and report missing features. Hence, upgrading to a newer kernel
version is not an option for some users.

Sorry if I misunderstood something or if this is not the right place to
ask.

[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2008583


^ permalink raw reply	[relevance 0%]

* Re: [PATCH 00/13] Implement support for IBS virtualization
  @ 2023-09-06 15:38  4%   ` Manali Shukla
  2023-09-06 19:56  0%     ` Peter Zijlstra
  0 siblings, 1 reply; 200+ results
From: Manali Shukla @ 2023-09-06 15:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, seanjc, linux-doc, linux-perf-users, x86, pbonzini, bp,
	santosh.shukla, ravi.bangoria, thomas.lendacky, nikunj

Hi Peter,

Thank you for looking into this.

On 9/5/2023 9:17 PM, Peter Zijlstra wrote:
> On Mon, Sep 04, 2023 at 09:53:34AM +0000, Manali Shukla wrote:
> 
>> Note that, since IBS registers are swap type C [2], the hypervisor is
>> responsible for saving and restoring of IBS host state. Hypervisor
>> does so only when IBS is active on the host to avoid unnecessary
>> rdmsrs/wrmsrs. Hypervisor needs to disable host IBS before saving the
>> state and enter the guest. After a guest exit, the hypervisor needs to
>> restore host IBS state and re-enable IBS.
> 
> Why do you think it is OK for a guest to disable the host IBS when
> entering a guest? Perhaps the host was wanting to profile the guest.
> 

1. Since IBS registers are of swap type C [1], only guest state is saved
and restored by the hardware. Host state needs to be saved and restored by
hypervisor. In order to save IBS registers correctly, IBS needs to be
disabled before saving the IBS registers.

2. As per APM [2],
"When a VMRUN is executed to an SEV-ES guest with IBS virtualization enabled, the
IbsFetchCtl[IbsFetchEn] and IbsOpCtl[IbsOpEn] MSR bits must be 0. If either of 
these bits are not 0, the VMRUN will fail with a VMEXIT_INVALID error code."
This is enforced by hardware on SEV-ES guests when VIBS is enabled on SEV-ES
guests.

3. VIBS is not enabled by default. It can be enabled by an explicit
QEMU command line option "-cpu +ibs". The guest should be invoked without
this option when the host wants to profile the guest.

[1] https://bugzilla.kernel.org/attachment.cgi?id=304653
    AMD64 Architecture Programmer’s Manual, Vol 2, Appendix B. Layout
    of VMCB,
    Table B-2. VMCB Layout, State Save Area 
    Table B-4. VMSA Layout, State Save Area for SEV-ES
    
[2] https://bugzilla.kernel.org/attachment.cgi?id=304653
    AMD64 Architecture Programmer’s Manual, Vol 2, Section 15.38,
    Instruction-Based Sampling Virtualization


> Only when perf_event_attr::exclude_guest is set is this allowed,
> otherwise you have to respect the host running IBS and you're not
> allowed to touch it.
> 
> Host trumps guest etc..


- Manali

^ permalink raw reply	[relevance 4%]

* Re: [PATCH 2/2] KVM: x86: Clear X2APIC_ICR_UNUSED_12 after APIC-write VM-exit
  2023-09-05 23:03  4%   ` Sean Christopherson
@ 2023-09-06  5:07  0%     ` Tao Su
  0 siblings, 0 replies; 200+ results
From: Tao Su @ 2023-09-06  5:07 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, pbonzini, chao.gao, guang.zeng, yi1.lai

On Tue, Sep 05, 2023 at 04:03:36PM -0700, Sean Christopherson wrote:
> +Suravee
> 
> On Mon, Sep 04, 2023, Tao Su wrote:
> > When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR
> > MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi()
> > thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay,
> > but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit.
> > 
> > Since bit12 of the ICR is no longer the BUSY bit but an UNUSED bit in
> > x2APIC mode, and the SDM has no detail about how hardware handles the
> > UNUSED bit12 being set, we tested on Intel CPUs (SRF/GNR) with IPI
> > virtualization and found the UNUSED bit12 was also cleared by hardware
> > without a #GP. Therefore, the clearing of bit12 should still be kept,
> > to stay consistent with the hardware behavior.
> 
> I'm confused.  If hardware clears the bit, then why is it set in the vAPIC page
> after a trap-like APIC-write VM-Exit?  In other words, how is this not a ucode
> or hardware bug?

Sorry, I didn't describe it clearly.

On bare metal, bit12 of the ICR MSR is cleared by hardware after software sets it.

If bit12 is set in the guest, the bit is not cleared in the vAPIC page after the
APIC-write VM-Exit. So whether to clear bit12 in the vAPIC page needs to be
considered.

> 
> Suravee, can you confirm what happens on AMD with x2AVIC?  Does hardware *always*
> clear the busy bit if it's set by the guest?  If so, then we could "optimize"
> avic_incomplete_ipi_interception() to skip the busy check, e.g.
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index cfc8ab773025..4bf0bb250147 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -513,7 +513,7 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
>                  * in which case KVM needs to emulate the ICR write as well in
>                  * order to clear the BUSY flag.
>                  */
> -               if (icrl & APIC_ICR_BUSY)
> +               if (!apic_x2apic_mode(apic) && (icrl & APIC_ICR_BUSY))
>                         kvm_apic_write_nodecode(vcpu, APIC_ICR);
>                 else
>                         kvm_apic_send_ipi(apic, icrl, icrh);
> 
> > Fixes: 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode")
> > Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> > Tested-by: Yi Lai <yi1.lai@intel.com>
> > ---
> >  arch/x86/kvm/lapic.c | 27 ++++++++++++++++++++-------
> >  1 file changed, 20 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index a983a16163b1..09a376aeb4a0 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1482,8 +1482,17 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
> >  {
> >  	struct kvm_lapic_irq irq;
> >  
> > -	/* KVM has no delay and should always clear the BUSY/PENDING flag. */
> > -	WARN_ON_ONCE(icr_low & APIC_ICR_BUSY);
> > +	/*
> > +	 * In non-x2apic mode, KVM has no delay and should always clear the
> > +	 * BUSY/PENDING flag. In x2apic mode, KVM should clear the unused bit12
> > +	 * of ICR since hardware will also clear this bit. Although
> > +	 * APIC_ICR_BUSY and X2APIC_ICR_UNUSED_12 are same, they mean different
> > +	 * things in different modes.
> > +	 */
> > +	if (!apic_x2apic_mode(apic))
> > +		WARN_ON_ONCE(icr_low & APIC_ICR_BUSY);
> > +	else
> > +		WARN_ON_ONCE(icr_low & X2APIC_ICR_UNUSED_12);
> 
> NAK to the new name, KVM is absolutely not going to zero an arbitrary "unused"
> bit.  If Intel wants to reclaim bit 12 for something useful in the future, then
> Intel can ship CPUs that don't touch the "reserved" bit, and deal with all the
> fun of finding and updating all software that unnecessarily sets the busy bit in
> x2apic mode.
> 
> If we really want to pretend that Intel has more than a snowball's chance in hell
> of doing something useful with bit 12, then the right thing to do in KVM is to
> ignore the bit entirely and let the guest keep the pieces, e.g.

Yes, agreed. Currently bit12 is unused and is cleared by hardware on bare metal,
so clearing bit12 in the guest keeps the same behavior as bare metal. But KVM
should not zero the unused bit until the SDM reclaims it for something, so
ignoring the bit in KVM is better.

Thanks,
Tao

> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 113ca9661ab2..36ec195a3339 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1473,8 +1473,13 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
>  {
>         struct kvm_lapic_irq irq;
>  
> -       /* KVM has no delay and should always clear the BUSY/PENDING flag. */
> -       WARN_ON_ONCE(icr_low & APIC_ICR_BUSY);
> +       /*
> +        * KVM has no delay and should always clear the BUSY/PENDING flag.
> +        * The flag doesn't exist in x2APIC mode; both the SDM and APM state
> +        * that the flag "Must Be Zero", but neither Intel nor AMD enforces
> +        * that (or any other reserved bits in ICR).
> +        */
> +       WARN_ON_ONCE(!apic_x2apic_mode(apic) && (icr_low & APIC_ICR_BUSY));
>  
>         irq.vector = icr_low & APIC_VECTOR_MASK;
>         irq.delivery_mode = icr_low & APIC_MODE_MASK;
> @@ -3113,8 +3118,6 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr)
>  
>  int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data)
>  {
> -       data &= ~APIC_ICR_BUSY;
> -
>         kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32));
>         kvm_lapic_set_reg64(apic, APIC_ICR, data);
>         trace_kvm_apic_write(APIC_ICR, data);
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index cfc8ab773025..4bf0bb250147 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -513,7 +513,7 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
>                  * in which case KVM needs to emulate the ICR write as well in
>                  * order to clear the BUSY flag.
>                  */
> -               if (icrl & APIC_ICR_BUSY)
> +               if (!apic_x2apic_mode(apic) && (icrl & APIC_ICR_BUSY))
>                         kvm_apic_write_nodecode(vcpu, APIC_ICR);
>                 else
>                         kvm_apic_send_ipi(apic, icrl, icrh);
> 

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 2/2] KVM: x86: Clear X2APIC_ICR_UNUSED_12 after APIC-write VM-exit
  @ 2023-09-05 23:03  4%   ` Sean Christopherson
  2023-09-06  5:07  0%     ` Tao Su
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2023-09-05 23:03 UTC (permalink / raw)
  To: Tao Su; +Cc: kvm, pbonzini, chao.gao, guang.zeng, yi1.lai

+Suravee

On Mon, Sep 04, 2023, Tao Su wrote:
> When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR
> MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi()
> thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay,
> but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit.
> 
> Since bit12 of the ICR is no longer the BUSY bit but an UNUSED bit in
> x2APIC mode, and the SDM has no detail about how hardware handles the
> UNUSED bit12 being set, we tested on Intel CPUs (SRF/GNR) with IPI
> virtualization and found the UNUSED bit12 was also cleared by hardware
> without a #GP. Therefore, the clearing of bit12 should still be kept,
> to stay consistent with the hardware behavior.

I'm confused.  If hardware clears the bit, then why is it set in the vAPIC page
after a trap-like APIC-write VM-Exit?  In other words, how is this not a ucode
or hardware bug?

Suravee, can you confirm what happens on AMD with x2AVIC?  Does hardware *always*
clear the busy bit if it's set by the guest?  If so, then we could "optimize"
avic_incomplete_ipi_interception() to skip the busy check, e.g.

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index cfc8ab773025..4bf0bb250147 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -513,7 +513,7 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
                 * in which case KVM needs to emulate the ICR write as well in
                 * order to clear the BUSY flag.
                 */
-               if (icrl & APIC_ICR_BUSY)
+               if (!apic_x2apic_mode(apic) && (icrl & APIC_ICR_BUSY))
                        kvm_apic_write_nodecode(vcpu, APIC_ICR);
                else
                        kvm_apic_send_ipi(apic, icrl, icrh);

> Fixes: 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode")
> Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> Tested-by: Yi Lai <yi1.lai@intel.com>
> ---
>  arch/x86/kvm/lapic.c | 27 ++++++++++++++++++++-------
>  1 file changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index a983a16163b1..09a376aeb4a0 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1482,8 +1482,17 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
>  {
>  	struct kvm_lapic_irq irq;
>  
> -	/* KVM has no delay and should always clear the BUSY/PENDING flag. */
> -	WARN_ON_ONCE(icr_low & APIC_ICR_BUSY);
> +	/*
> +	 * In non-x2apic mode, KVM has no delay and should always clear the
> +	 * BUSY/PENDING flag. In x2apic mode, KVM should clear the unused bit12
> +	 * of ICR since hardware will also clear this bit. Although
> +	 * APIC_ICR_BUSY and X2APIC_ICR_UNUSED_12 are same, they mean different
> +	 * things in different modes.
> +	 */
> +	if (!apic_x2apic_mode(apic))
> +		WARN_ON_ONCE(icr_low & APIC_ICR_BUSY);
> +	else
> +		WARN_ON_ONCE(icr_low & X2APIC_ICR_UNUSED_12);

NAK to the new name, KVM is absolutely not going to zero an arbitrary "unused"
bit.  If Intel wants to reclaim bit 12 for something useful in the future, then
Intel can ship CPUs that don't touch the "reserved" bit, and deal with all the
fun of finding and updating all software that unnecessarily sets the busy bit in
x2apic mode.

If we really want to pretend that Intel has more than a snowball's chance in hell
of doing something useful with bit 12, then the right thing to do in KVM is to
ignore the bit entirely and let the guest keep the pieces, e.g.

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 113ca9661ab2..36ec195a3339 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1473,8 +1473,13 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
 {
        struct kvm_lapic_irq irq;
 
-       /* KVM has no delay and should always clear the BUSY/PENDING flag. */
-       WARN_ON_ONCE(icr_low & APIC_ICR_BUSY);
+       /*
+        * KVM has no delay and should always clear the BUSY/PENDING flag.
+        * The flag doesn't exist in x2APIC mode; both the SDM and APM state
+        * that the flag "Must Be Zero", but neither Intel nor AMD enforces
+        * that (or any other reserved bits in ICR).
+        */
+       WARN_ON_ONCE(!apic_x2apic_mode(apic) && (icr_low & APIC_ICR_BUSY));
 
        irq.vector = icr_low & APIC_VECTOR_MASK;
        irq.delivery_mode = icr_low & APIC_MODE_MASK;
@@ -3113,8 +3118,6 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr)
 
 int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data)
 {
-       data &= ~APIC_ICR_BUSY;
-
        kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32));
        kvm_lapic_set_reg64(apic, APIC_ICR, data);
        trace_kvm_apic_write(APIC_ICR, data);
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index cfc8ab773025..4bf0bb250147 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -513,7 +513,7 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
                 * in which case KVM needs to emulate the ICR write as well in
                 * order to clear the BUSY flag.
                 */
-               if (icrl & APIC_ICR_BUSY)
+               if (!apic_x2apic_mode(apic) && (icrl & APIC_ICR_BUSY))
                        kvm_apic_write_nodecode(vcpu, APIC_ICR);
                else
                        kvm_apic_send_ipi(apic, icrl, icrh);


^ permalink raw reply related	[relevance 4%]

* Re: [RFC PATCH v4 07/10] KVM: x86: Add gmem hook for initializing private memory
  @ 2023-08-26  0:59  3%         ` Michael Roth
  0 siblings, 0 replies; 200+ results
From: Michael Roth @ 2023-08-26  0:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Kai Huang, Zhi Wang,
	chen.bo, linux-coco, Chao Peng, Ackerley Tng, Vishal Annapurve,
	Yuan Yao, David.Kaplan

On Fri, Aug 18, 2023 at 03:27:47PM -0700, Sean Christopherson wrote:
> Sorry for responding so late, lost track of this and only found it against when
> reviewing the next version :-/
> 
> On Fri, Jul 21, 2023, Michael Roth wrote:
> > On Fri, Jul 21, 2023 at 07:25:58AM -0700, Sean Christopherson wrote:
> > > Practically speaking, hooking the fault path will result in undesirable behavior.
> > > Just because KVM *maps* at 4KiB granularity doesn't mean the RMP must be assigned
> > > at 4KiB granularity, e.g. if userspace chooses to *not* PUNCH_HOLE when the guest
> > > shares a single 4KiB page in a 2MiB chunk.   Dirty logging is another case where
> > > the RMP can stay at 2MiB.  Or as a very silly example, imagine userspace pulls a
> > > stupid and covers a single 2MiB chunk of gmem with 512 memslots.
> > 
>> Unfortunately I don't think things are quite that flexible with SNP. If
> > RMP entry is 2MB, and you map a sub-page as 4K in the NPT, you'll immediately
> > get a PFERR_GUEST_SIZEM on the first access (presumably when the guest tries
> > to PVALIDATE it before use). The RMP fault handler will then subsequently
> > need to PSMASH the 2MB entry into 4K before that guest can access it. So you
> > get an extra page fault for every 2MB page that's mapped this way.
> > (APM Volume 2, Section 15.36.10).
> 
> Well that's just bloody stupid.  Why is that a restriction?  Obviously creating
> an NPT mapping that's larger would be annoying to handle, e.g. would require
> locking multiple entries or something, so I can understand why that's disallowed.
> But why can't software map at a finer granularity?
> 
> Note, I'm expecting a spec change, just expressing a bit of disbelief.

I asked David Kaplan about this and one of the reasons is that it allows
hardware to reduce the number of RMP table accesses when doing table
walks. E.g. if the mapping is 4K in the NPT, then only that RMP entry
needs to be checked, the hardware doesn't need to also then check the
2MB-aligned RMP entry before determining whether to generate a #NPF.

(note: the 'assigned' bit does get set on each individual 4K entry when
RMPUPDATE uses the 'page-size' bit to create a 2MB RMP entry, but if you
look at the pseudo-code in the APM for RMPUPDATE it actually leaves
ASID=0, so those accesses always generate an #NPF(RMP) and aren't actually
usable for 4K NPT mappings of sub-pages for the 2MB range corresponding
to that 2MB RMP entry (and not #NPF+RMP+SIZEM as I incorrectly stated
above))

> 
> Anyways, IMO, we should eat the extra #NPF in that scenario and optimize for much,
> much more likely scenario of the RMP and NPT both being able to use 2MiB pages.

That seems fair, I only see this particular case occur once or twice
during boot.

> And that means not inserting into the RMP when handling #NPFs, e.g. so that userspace
> can do fallocate() to prep the memory before mapping, and so that if the SPTEs

Makes sense.

> get zapped for whatever reason, faulting them back in doesn't trigger useless
> RMP updates.

We wouldn't issue an RMPUPDATE if we saw that the page was already set to
private in the RMP table. So what you actually save on is just accesses to
the memory range containing the RMP entry to do that check, but the approach
still makes sense.
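
For reference, that check is on the order of the following sketch, where
snp_lookup_rmpentry()'s signature follows the SNP host patches and is an
assumption rather than settled API:

	bool assigned;
	int level;

	if (!snp_lookup_rmpentry(pfn, &assigned, &level) && assigned)
		return 0;	/* already private in the RMP, skip the RMPUPDATE */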

> 
> I guess the other way to optimze things would be for userspace to use the ioctl()
> to map memory into the guest[*].  But even then, initializing when allocating
> seems cleaner, especially for TDX.
> 
> [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com
> 
> > That might not be a big deal for guests that are at least somewhat optimized
> > to make use of 2MB pages, but another situation is:
> > 
> >   - gmem allocates 2MB page
> >   - guest issues PVALIDATE on 2MB page
> >   - guest later converts a subpage to shared but doesn't holepunch
> >   - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
> >   - KVM MMU splits NPT mapping to 4K
> >   - guest converts that shared page back to private
> > 
> > At this point there are no mixed attributes, and KVM would normally
> > allow for 2MB NPT mappings again, but this is actually not allowed
> > because the RMP table mappings are validated/4K and cannot be promoted on
> > the hypervisor, so the NPT mappings must still be limited to 4K to
> > match this, otherwise we hit the reverse of the PFERR_GUEST_SIZEM
> > scenario above, where the NPT mapping level is *larger* than the RMP
> > entry level. Unfortunately that does not result in a PFERR_GUEST_SIZEM
> > where we can fix things up in response, but instead it's a general
> > RMP fault that would be tricky to distinguish from an RMP fault
> > resulting from an implicit page conversion or some other guest weirdness
> > without doing RMP table checks every time we get a general RMP fault.
> 
> This seems like a bug in the SNP code.  (a) why would KVM/SNP PSMASH in that
> scenario and (b) why can't it zap/split the NPT before PSMASH?

a) A PSMASH will be needed at some point, since, as detailed above, the 4K
   NPT mapping requires the RMP entries for the pages it maps to be
   limited to 4K RMP entries, but...
b) What would happen normally[1] is the guest would issue PVALIDATE to
   *rescind* the validated status of that 4K GPA before issuing the GHCB
   request to convert it to shared. This would cause an
   #NPF(RMP+SIZE_MISMATCH) and handle_rmp_page_fault() would PSMASH the RMP
   entry so the PVALIDATE can succeed.

   So KVM doesn't even really have the option of deferring the PSMASH, it
   happens before the SET_MEMORY_ATTRIBUTES is even issued to zap the 2MB
   NPT mapping and switch the 2M range to 'mixed'. Because of that, we also
   need a hook in the KVM MMU code to clamp the max mapping level based
   on RMP entry size. Currently the kvm_gmem_prepare() in this patch
   doubles as the handler for that clamping, so we would still need a similar
   hook for that if we move the RMP initialization stuff to allocation
   time.

[1] This handling is recommended for 'well-behaved' guests according to
GHCB, but I don't see it documented as a hard requirement anywhere, so there
is a possibility that we have to deal with a guest that doesn't do this.
What would happen then is the PVALIDATE wouldn't trigger the #NPF(RMP+SIZEM),
and instead the SET_MEMORY_ATTRIBUTES would zap the 2MB mapping, install
4K entries on next #NPF(NOT_PRESENT), and at *that* point we would get
an #NPF(RMP) without the SIZEM bit set, due to the behavior described in
the beginning of this email.

handle_rmp_page_fault() can do the corresponding PSMASH to deal with that,
but it is a little unfortunate since we can't differentiate that case from
spurious/unexpected RMP faults, so we would need to attempt a PSMASH in all
cases, sometimes failing.

gmem itself could also trigger this case if the lpage_info_slot() tracking
ever became more granular than what the guest expected (which I don't
think would happen normally, but I may have hit one case where it does, and
haven't had a chance to debug whether that's on the lpage_info_slot() side or
something else on the SNP end).

> 
> > So for all intents and purposes it does sort of end up being the case
> > that the mapping size and RMP entry size are sort of intertwined and
> > can't totally be decoupled, and if you don't take that into account
> > when updating the RMP entry, you'll have to do some extra PSMASH's
> > in response to PFERR_GUEST_SIZEM RMP faults later.
> > 
> > > 
> > > That likely means KVM will need an additional hook to clamp the max_level at the
> > > RMP, but that shouldn't be a big deal, e.g. if we convert on allocation, then KVM
> > > should be able to blindly do the conversion because it would be a kernel bug if
> > > the page is already assigned to an ASID in the RMP, i.e. the additional hook
> > > shouldn't incur an extra RMP lookup.
> > 
> > Yah we'd still need a hook in roughly this same place for clamping
> > max_level. Previous versions of SNP hypervisor patches all had a separate
> > hook for handling these cases, but since the work of updating the RMP table
> > prior to mapping isn't too dissimilar from the work of determining max
> > mapping size, I combined both of them into the kvm_x86_gmem_prepare()
> > hook/implementation.
> > 
> > But I don't see any major issue with moving RMPUPDATE handling to an
> > allocation-time hook. As mentioned above we'd get additional
> > PFERR_GUEST_SIZEM faults by not taking MMU mapping level into account, but
> > I previously had it implemented similarly via a hook in kvm_gmem_get_pfn()
> > (because we need the GPA) and didn't notice anything major. But I'm not
> > sure exactly where you're suggesting we do it now, so could use some
> > clarify on that if kvm_gmem_get_pfn() isn't what you had in mind.
> 
> I was thinking kvm_gmem_get_folio().  If userspace doesn't preallocate, then
> kvm_gmem_get_pfn() will still eat the cost of the conversion, but it at least
> gives userspace the option and handles the zap case.  Tracking which folios have
> been assigned (HKID or RMP) should be pretty straightforward.

This is a little tricky. kvm_gmem_get_pfn() can handle RMP entry
initialization, no problem, because it has the GPA we are fetching the
PFN for.

But kvm_gmem_get_folio() doesn't. As long as the preallocation
happens after kvm_gmem_bind() and before kvm_gmem_unbind(), we should be
able to use the slot to figure out what GPA a particular FD offset
corresponds to. I'll look into this more.
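
As a rough sketch of that offset->GFN mapping (field names such as
slot->gmem.pgoff follow the gmem series at this point and are assumptions,
not a final implementation):

static gfn_t gmem_file_index_to_gfn(struct kvm_memory_slot *slot, pgoff_t index)
{
	/* index is the page offset within the gmem file backing the slot. */
	return slot->base_gfn + index - slot->gmem.pgoff;
}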

> 
> I'm not totally opposed to doing more work at map time, e.g. to avoid faults or
> to play nice with PSMASH, but I would rather that not bleed into gmem.   And I
> think/hope if we hook kvm_gmem_get_folio(), then TDX and SNP can use that as a
> common base, i.e. whatever extra shenanigans are needed for SNP can be carried
> in the SNP series.  For me, having such "quirks" in the vendor specific series
> would be quite helpful as I'm having a hell of a time keeping track of all the
> wrinkles of TDX and SNP.

Seems reasonable. I'll work on re-working the hooks using this approach.

Thanks,

Mike

^ permalink raw reply	[relevance 3%]

* Re: Question about AMD SVM's virtual NMI support in Linux kernel mainline
  2023-08-25  9:06  4%       ` Santosh Shukla
@ 2023-08-25 14:26  0%         ` Sean Christopherson
  0 siblings, 0 replies; 200+ results
From: Sean Christopherson @ 2023-08-25 14:26 UTC (permalink / raw)
  To: Santosh Shukla
  Cc: 陈培鸿(乘鸿), mlevitsk, kvm, linux-kernel

On Fri, Aug 25, 2023, Santosh Shukla wrote:
> Hi Sean,
> >> Yes, HW does set BLOCKING_MASK when HW takes the pending vNMI event.
> > 
> > I'm not asking about the pending vNMI case, which is clearly spelled out in the
> > APM.  I'm asking about directly injecting an NMI via:
> > 
> > 	svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
> >
> 
> Yes. This is documented in APM as well.
> https://www.amd.com/system/files/TechDocs/24593.pdf : "15.21.10 NMI Virtualization"
> 
> "
> If Event Injection is used to inject an NMI when NMI Virtualization is enabled,
> VMRUN sets V_NMI_MASK in the guest state.
> "

Awesome, thank you!

^ permalink raw reply	[relevance 0%]

* Re: Question about AMD SVM's virtual NMI support in Linux kernel mainline
  2023-08-24 14:22  4%     ` Question about AMD SVM's virtual NMI support in Linux kernel mainline Sean Christopherson
@ 2023-08-25  9:06  4%       ` Santosh Shukla
  2023-08-25 14:26  0%         ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Santosh Shukla @ 2023-08-25  9:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: 陈培鸿(乘鸿), mlevitsk, kvm, linux-kernel

Hi Sean,


On 8/24/2023 7:52 PM, Sean Christopherson wrote:
> +kvm and lkml, didn't realize they weren't Cc'd in the original mail.
> 
> On Thu, Aug 24, 2023, Santosh Shukla wrote:
>> Hi Sean,
>>
>> On 8/23/2023 8:33 PM, Sean Christopherson wrote:
>>> On Fri, Aug 18, 2023, 陈培鸿(乘鸿) wrote:
>>>> According to the results, I found that in the case of concurrent NMIs, some
>>>> NMIs are still injected through eventinj instead of vNMI.
>>>
>>> Key word is "some", having two NMIs arrive "simultaneously" is uncommon.  In quotes
>>> because if a vCPU is scheduled out or otherwise delayed, two NMIs can be recognized
>>> by KVM at the same time even if there was a sizeable delay between when they were
>>> sent.
>>>
>>>> Based on the above explanations, I summarize my questions as follows:
>>>> 1. According to the specification of AMD SVM vNMI, with vNMI enabled, will
>>>> some NMIs be injected through eventinj under the condition of concurrent
>>>> NMIs?
>>>
>>> Yes.
>>>
>>>> 2. If yes, what is the role of vNMI? Is it just an addition to eventinj? What
>>>> benefits is it designed to expect? Is there any benchmark data support?
>>>
>>> Manually injecting NMIs isn't problematic from a performance perspective.  KVM
>>> takes control of the vCPU, i.e. forces a VM-Exit, to pend a virtual NMI, so there's
>>> no extra VM-Exit.
>>>
>>> The value added by vNMI support is that KVM doesn't need to manually track/detect
>>> when NMIs become unblocked in the guest.  SVM doesn't provide a hardware-supported
>>> NMI-window exiting, so KVM has to intercept _and_ single-step IRET, which adds two
>>> VM-Exits for _every_ NMI when vNMI isn't available (and it's a complete mess for
>>> things like SEV-ES).
>>>
>>>> 3. If not, does it mean that the current SVM's vNMI support scheme in the
>>>> Linux mainline code is flawed? How should it be fixed?
>>>
>>> The approach as a whole isn't flawed, it's the best KVM can do given SVM's vNMI
>>> architecture and KVM's ABI with respect to "concurrent" NMIs.
>>>
>>> Hrm, though there does appear to be a bug in the injecting path.  KVM doesn't
>>> manually set V_NMI_BLOCKING_MASK, and will unnecessarily enable IRET interception
>>> when manually injecting a vNMI.  Intercepting IRET should be unnecessary because
>>> hardware will automatically accept the pending vNMI when NMIs become unblocked.
>>> And I don't see anything in the APM that suggests hardware will set V_NMI_BLOCKING_MASK
>>> when software directly injects an NMI.
>>>
>>> So I think we need:
>>>
>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>> index d381ad424554..c956a9f500a2 100644
>>> --- a/arch/x86/kvm/svm/svm.c
>>> +++ b/arch/x86/kvm/svm/svm.c
>>> @@ -3476,6 +3476,11 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
>>>         if (svm->nmi_l1_to_l2)
>>>                 return;
>>>  
>>> +       if (is_vnmi_enabled(svm)) {
>>> +               svm->vmcb->control.int_ctl |= V_NMI_BLOCKING_MASK;
>>> +               return;
>>> +       }
>>> +
>>>         svm->nmi_masked = true;
>>>         svm_set_iret_intercept(svm);
>>>         ++vcpu->stat.nmi_injections;
>>> --
>>>
>>> or if hardware does set V_NMI_BLOCKING_MASK in this case, just:
>>>
>>
>> Yes, HW does set BLOCKING_MASK when HW takes the pending vNMI event.
> 
> I'm not asking about the pending vNMI case, which is clearly spelled out in the
> APM.  I'm asking about directly injecting an NMI via:
> 
> 	svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
>

Yes. This is documented in the APM as well.
https://www.amd.com/system/files/TechDocs/24593.pdf : "15.21.10 NMI Virtualization"

"
If Event Injection is used to inject an NMI when NMI Virtualization is enabled,
VMRUN sets V_NMI_MASK in the guest state.
"
 
>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>> index d381ad424554..201a1a33ecd2 100644
>>> --- a/arch/x86/kvm/svm/svm.c
>>> +++ b/arch/x86/kvm/svm/svm.c
>>> @@ -3473,7 +3473,7 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
>>>  
>>>         svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
>>>  
>>> -       if (svm->nmi_l1_to_l2)
>>> +       if (svm->nmi_l1_to_l2 || is_vnmi_enabled(svm))
>>>                 return;
>>>  
>>>         svm->nmi_masked = true;
>>> --
>>>
>>
>> The above proposal makes sense to me. I was reviewing the source code flow
>> for the scenario where two consecutive NMIs need to be delivered to the guest.
>> For example, process_nmi will pend the first NMI and then the second NMI will
>> be injected through EVTINJ, because (kvm_x86_inject_nmi)
>> will get called and that will set the _iret_intercept.
>>
>> With your proposal, inject_nmi will set the evt_inj NMI w/o the IRET.
>> I think we could check for "is_vnmi_enabled" before programming
>> the "evt_inj"?
> 
> No, because the whole point of this path is to directly inject an NMI when NMIs
> are NOT blocked in the guest AND there is already a pending vNMI.

I agree, your patch looks fine.

I'll do the testing on the above patch and share the test results.

Thanks,
Santosh

^ permalink raw reply	[relevance 4%]

* Re: Question about AMD SVM's virtual NMI support in Linux kernel mainline
       [not found]       ` <174aa0da-0b05-a2dc-7884-4f7b57abcc37@amd.com>
@ 2023-08-24 14:22  4%     ` Sean Christopherson
  2023-08-25  9:06  4%       ` Santosh Shukla
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2023-08-24 14:22 UTC (permalink / raw)
  To: Santosh Shukla
  Cc: 陈培鸿(乘鸿), mlevitsk, kvm, linux-kernel

+kvm and lkml, didn't realize they weren't Cc'd in the original mail.

On Thu, Aug 24, 2023, Santosh Shukla wrote:
> Hi Sean,
> 
> On 8/23/2023 8:33 PM, Sean Christopherson wrote:
> > On Fri, Aug 18, 2023, 陈培鸿(乘鸿) wrote:
> >> According to the results, I found that in the case of concurrent NMIs, some
> >> NMIs are still injected through eventinj instead of vNMI.
> > 
> > Key word is "some", having two NMIs arrive "simultaneously" is uncommon.  In quotes
> > because if a vCPU is scheduled out or otherwise delayed, two NMIs can be recognized
> > by KVM at the same time even if there was a sizeable delay between when they were
> > sent.
> > 
> >> Based on the above explanations, I summarize my questions as follows:
> >> 1. According to the specification of AMD SVM vNMI, with vNMI enabled, will
> >> some NMIs be injected through eventinj under the condition of concurrent
> >> NMIs?
> > 
> > Yes.
> > 
> >> 2. If yes, what is the role of vNMI? Is it just an addition to eventinj? What
> >> benefits is it designed to expect? Is there any benchmark data support?
> > 
> > Manually injecting NMIs isn't problematic from a performance perspective.  KVM
> > takes control of the vCPU, i.e. forces a VM-Exit, to pend a virtual NMI, so there's
> > no extra VM-Exit.
> > 
> > The value added by vNMI support is that KVM doesn't need to manually track/detect
> > when NMIs become unblocked in the guest.  SVM doesn't provide a hardware-supported
> > NMI-window exiting, so KVM has to intercept _and_ single-step IRET, which adds two
> > VM-Exits for _every_ NMI when vNMI isn't available (and it's a complete mess for
> > things like SEV-ES).
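
As a rough sketch of the "intercept _and_ single-step IRET" dance described
above (function names are approximations and enable_single_step() is an
assumed helper; treat this as illustration, not the exact mainline code):

	/* Simplified sketch of legacy (non-vNMI) NMI-window tracking. */
	static void legacy_inject_nmi(struct vcpu_svm *svm)
	{
		svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
		svm->nmi_masked = true;
		/* VM-Exit #1: intercept the guest's IRET to learn when NMIs may unblock. */
		svm_set_intercept(svm, INTERCEPT_IRET);
	}

	static void legacy_iret_intercepted(struct vcpu_svm *svm)
	{
		/*
		 * The IRET has not retired yet, so NMIs are still blocked.
		 * Single-step over it (VM-Exit #2) before clearing nmi_masked
		 * and injecting the next pending NMI.
		 */
		svm_clr_intercept(svm, INTERCEPT_IRET);
		enable_single_step(svm);	/* assumed helper, e.g. via RFLAGS.TF */
	}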
> > 
> >> 3. If not, does it mean that the current SVM's vNMI support scheme in the
> >> Linux mainline code is flawed? How should it be fixed?
> > 
> > The approach as a whole isn't flawed, it's the best KVM can do given SVM's vNMI
> > architecture and KVM's ABI with respect to "concurrent" NMIs.
> > 
> > Hrm, though there does appear to be a bug in the injecting path.  KVM doesn't
> > manually set V_NMI_BLOCKING_MASK, and will unnecessarily enable IRET interception
> > when manually injecting a vNMI.  Intercepting IRET should be unnecessary because
> > hardware will automatically accept the pending vNMI when NMIs become unblocked.
> > And I don't see anything in the APM that suggests hardware will set V_NMI_BLOCKING_MASK
> > when software directly injects an NMI.
> > 
> > So I think we need:
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index d381ad424554..c956a9f500a2 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -3476,6 +3476,11 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
> >         if (svm->nmi_l1_to_l2)
> >                 return;
> >  
> > +       if (is_vnmi_enabled(svm)) {
> > +               svm->vmcb->control.int_ctl |= V_NMI_BLOCKING_MASK;
> > +               return;
> > +       }
> > +
> >         svm->nmi_masked = true;
> >         svm_set_iret_intercept(svm);
> >         ++vcpu->stat.nmi_injections;
> > --
> > 
> > or if hardware does set V_NMI_BLOCKING_MASK in this case, just:
> > 
> 
> Yes, HW does set BLOCKING_MASK when HW takes the pending vNMI event.

I'm not asking about the pending vNMI case, which is clearly spelled out in the
APM.  I'm asking about directly injecting an NMI via:

	svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;

> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index d381ad424554..201a1a33ecd2 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -3473,7 +3473,7 @@ static void svm_inject_nmi(struct kvm_vcpu *vcpu)
> >  
> >         svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
> >  
> > -       if (svm->nmi_l1_to_l2)
> > +       if (svm->nmi_l1_to_l2 || is_vnmi_enabled(svm))
> >                 return;
> >  
> >         svm->nmi_masked = true;
> > --
> >
> 
> The above proposal makes sense to me. I was reviewing the source code flow
> for the scenario where two consecutive NMIs need to be delivered to the guest.
> For example, process_nmi will pend the first NMI and then the second NMI will
> be injected through EVTINJ, because (kvm_x86_inject_nmi)
> will get called and that will set the _iret_intercept.
> 
> With your proposal, inject_nmi will set the evt_inj NMI w/o the IRET.
> I think we could check for "is_vnmi_enabled" before programming
> the "evt_inj"?

No, because the whole point of this path is to directly inject an NMI when NMIs
are NOT blocked in the guest AND there is already a pending vNMI.
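
To make that split concrete, here is a minimal sketch of how two
concurrently-pending NMIs end up being delivered when vNMI is enabled
(the V_NMI_PENDING_MASK naming follows the vNMI patches; this is an
illustration, not verbatim mainline code):

	/* Illustrative only: delivering two concurrent NMIs with vNMI enabled. */
	static void deliver_two_pending_nmis(struct vcpu_svm *svm)
	{
		u32 *int_ctl = &svm->vmcb->control.int_ctl;

		if (!(*int_ctl & V_NMI_PENDING_MASK)) {
			/* NMI #1: hardware delivers it once NMIs become unblocked. */
			*int_ctl |= V_NMI_PENDING_MASK;
		} else {
			/*
			 * NMI #2: the vNMI "pending" slot is already taken, so
			 * inject directly via EVTINJ.  Per the APM, VMRUN then
			 * sets V_NMI_MASK, so no IRET interception is needed.
			 */
			svm->vmcb->control.event_inj = SVM_EVTINJ_VALID |
						       SVM_EVTINJ_TYPE_NMI;
		}
	}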

^ permalink raw reply	[relevance 4%]

* [GIT PULL 12/22] s390/vfio-ap: store entire AP queue status word with the queue object
  @ 2023-08-24 12:43  9% ` Janosch Frank
  0 siblings, 0 replies; 200+ results
From: Janosch Frank @ 2023-08-24 12:43 UTC (permalink / raw)
  To: pbonzini
  Cc: kvm, frankja, david, borntraeger, cohuck, linux-s390, imbrenda,
	hca, mihajlov, seiden, akrowiak

From: Tony Krowiak <akrowiak@linux.ibm.com>

Store the entire AP queue status word returned from the ZAPQ command with
the struct vfio_ap_queue object instead of just the response code field.
The other information contained in the status word is needed by the
apq_reset_check function to display a proper message to indicate that the
vfio_ap driver is waiting for the ZAPQ to complete because the queue is
not empty or IRQs are still enabled.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Tested-by: Viktor Mihajlovski <mihajlov@linux.ibm.com>
Link: https://lore.kernel.org/r/20230815184333.6554-7-akrowiak@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 drivers/s390/crypto/vfio_ap_ops.c     | 27 +++++++++++++++------------
 drivers/s390/crypto/vfio_ap_private.h |  4 ++--
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 2517868aad56..43224f7a40ea 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -674,7 +674,7 @@ static bool vfio_ap_mdev_filter_matrix(unsigned long *apm, unsigned long *aqm,
 			 */
 			apqn = AP_MKQID(apid, apqi);
 			q = vfio_ap_mdev_get_queue(matrix_mdev, apqn);
-			if (!q || q->reset_rc) {
+			if (!q || q->reset_status.response_code) {
 				clear_bit_inv(apid,
 					      matrix_mdev->shadow_apcb.apm);
 				break;
@@ -1628,6 +1628,7 @@ static int apq_reset_check(struct vfio_ap_queue *q)
 	int ret = -EBUSY, elapsed = 0;
 	struct ap_queue_status status;
 
+	memcpy(&status, &q->reset_status, sizeof(status));
 	while (true) {
 		msleep(AP_RESET_INTERVAL);
 		elapsed += AP_RESET_INTERVAL;
@@ -1643,20 +1644,20 @@ static int apq_reset_check(struct vfio_ap_queue *q)
 					      status.queue_empty,
 					      status.irq_enabled);
 		} else {
-			if (q->reset_rc == AP_RESPONSE_RESET_IN_PROGRESS ||
-			    q->reset_rc == AP_RESPONSE_BUSY) {
+			if (q->reset_status.response_code == AP_RESPONSE_RESET_IN_PROGRESS ||
+			    q->reset_status.response_code == AP_RESPONSE_BUSY) {
 				status = ap_zapq(q->apqn, 0);
-				q->reset_rc = status.response_code;
+				memcpy(&q->reset_status, &status, sizeof(status));
 				continue;
 			}
 			/*
-			 * When an AP adapter is deconfigured, the associated
-			 * queues are reset, so let's set the status response
-			 * code to 0 so the queue may be passed through (i.e.,
-			 * not filtered).
+			 * When an AP adapter is deconfigured, the
+			 * associated queues are reset, so let's set the
+			 * status response code to 0 so the queue may be
+			 * passed through (i.e., not filtered)
 			 */
-			if (q->reset_rc == AP_RESPONSE_DECONFIGURED)
-				q->reset_rc = 0;
+			if (status.response_code == AP_RESPONSE_DECONFIGURED)
+				q->reset_status.response_code = 0;
 			if (q->saved_isc != VFIO_AP_ISC_INVALID)
 				vfio_ap_free_aqic_resources(q);
 			break;
@@ -1673,7 +1674,7 @@ static int vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q)
 	if (!q)
 		return 0;
 	status = ap_zapq(q->apqn, 0);
-	q->reset_rc = status.response_code;
+	memcpy(&q->reset_status, &status, sizeof(status));
 	switch (status.response_code) {
 	case AP_RESPONSE_NORMAL:
 	case AP_RESPONSE_RESET_IN_PROGRESS:
@@ -1688,7 +1689,8 @@ static int vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q)
 		 * so the queue may be passed through (i.e., not filtered) and
 		 * return a value indicating the reset completed successfully.
 		 */
-		q->reset_rc = 0;
+		q->reset_status.response_code = 0;
+		ret = 0;
 		vfio_ap_free_aqic_resources(q);
 		break;
 	default:
@@ -2042,6 +2044,7 @@ int vfio_ap_mdev_probe_queue(struct ap_device *apdev)
 
 	q->apqn = to_ap_queue(&apdev->device)->qid;
 	q->saved_isc = VFIO_AP_ISC_INVALID;
+	memset(&q->reset_status, 0, sizeof(q->reset_status));
 	matrix_mdev = get_update_locks_by_apqn(q->apqn);
 
 	if (matrix_mdev) {
diff --git a/drivers/s390/crypto/vfio_ap_private.h b/drivers/s390/crypto/vfio_ap_private.h
index 4642bbdbd1b2..d6eb3527e056 100644
--- a/drivers/s390/crypto/vfio_ap_private.h
+++ b/drivers/s390/crypto/vfio_ap_private.h
@@ -133,7 +133,7 @@ struct ap_matrix_mdev {
  * @apqn: the APQN of the AP queue device
  * @saved_isc: the guest ISC registered with the GIB interface
  * @mdev_qnode: allows the vfio_ap_queue struct to be added to a hashtable
- * @reset_rc: the status response code from the last reset of the queue
+ * @reset_status: the status from the last reset of the queue
  */
 struct vfio_ap_queue {
 	struct ap_matrix_mdev *matrix_mdev;
@@ -142,7 +142,7 @@ struct vfio_ap_queue {
 #define VFIO_AP_ISC_INVALID 0xff
 	unsigned char saved_isc;
 	struct hlist_node mdev_qnode;
-	unsigned int reset_rc;
+	struct ap_queue_status reset_status;
 };
 
 int vfio_ap_mdev_register(void);
-- 
2.41.0
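
For reference, the "entire AP queue status word" stored above is a packed
bitfield; an abbreviated sketch showing only the fields the reset path
consults (not the full or exact layout, see arch/s390/include/asm/ap.h):

	/* Abbreviated sketch; the real struct has additional fields and padding. */
	struct ap_queue_status {
		unsigned int queue_empty	: 1;	/* used in the "waiting for ZAPQ" message */
		unsigned int irq_enabled	: 1;	/* likewise */
		unsigned int response_code	: 8;	/* previously the only field that was saved */
		/* ... remaining bits elided ... */
	};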


^ permalink raw reply related	[relevance 9%]

2021-09-20 23:51     [PATCH 0/5] nSVM: Check for optional commands and reserved encodins of TLB_CONTROL in nested guests Krish Sadhukhan
2021-09-20 23:51     ` [PATCH 2/5] nSVM: Check for optional commands and reserved encodings of TLB_CONTROL in nested VMCB Krish Sadhukhan
2021-09-28 16:55       ` Paolo Bonzini
2023-09-06 15:59  0%     ` Stefan Sterz
2023-09-06 20:40  4%       ` Sean Christopherson
2023-04-07  8:56     [PATCH V2] KVM: x86/pmu: Disable vPMU if EVENTSEL_GUESTONLY bit doesn't exist Like Xu
2023-04-07 15:37     ` Sean Christopherson
2023-09-14  8:23  0%   ` Like Xu
2023-09-25 23:31  4%     ` Sean Christopherson
2023-09-26  3:49  0%       ` Like Xu
2023-09-26 15:44  0%         ` Sean Christopherson
2023-07-20 23:32     [RFC PATCH v4 00/10] KVM: guest_memfd(), X86: Common base for SNP and TDX (was KVM: guest memory: Misc enhancement) isaku.yamahata
2023-07-20 23:32     ` [RFC PATCH v4 04/10] KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private isaku.yamahata
2023-07-21 14:11       ` Sean Christopherson
2023-07-22  0:52         ` Isaku Yamahata
2024-02-22  2:05  4%       ` Sean Christopherson
2023-07-20 23:32     ` [RFC PATCH v4 07/10] KVM: x86: Add gmem hook for initializing private memory isaku.yamahata
2023-07-21 14:25       ` Sean Christopherson
2023-07-22  0:34         ` Michael Roth
2023-08-18 22:27           ` Sean Christopherson
2023-08-26  0:59  3%         ` Michael Roth
2023-08-24 12:43     [GIT PULL 00/22] KVM: s390: Changes for 6.6 Janosch Frank
2023-08-24 12:43  9% ` [GIT PULL 12/22] s390/vfio-ap: store entire AP queue status word with the queue object Janosch Frank
     [not found]     <67fba65c-ba2a-4681-a9bc-2a6e8f0bcb92.chenpeihong.cph@alibaba-inc.com>
     [not found]     ` <ZOYfxgSy/SxCn0Wq@google.com>
     [not found]       ` <174aa0da-0b05-a2dc-7884-4f7b57abcc37@amd.com>
2023-08-24 14:22  4%     ` Question about AMD SVM's virtual NMI support in Linux kernel mainline Sean Christopherson
2023-08-25  9:06  4%       ` Santosh Shukla
2023-08-25 14:26  0%         ` Sean Christopherson
2023-09-04  1:35     [PATCH 0/2] KVM: x86: Fix a WARN in kvm_apic_send_ipi() Tao Su
2023-09-04  1:35     ` [PATCH 2/2] KVM: x86: Clear X2APIC_ICR_UNUSED_12 after APIC-write VM-exit Tao Su
2023-09-05 23:03  4%   ` Sean Christopherson
2023-09-06  5:07  0%     ` Tao Su
2023-09-04  9:53     [PATCH 00/13] Implement support for IBS virtualization Manali Shukla
2023-09-05 15:47     ` Peter Zijlstra
2023-09-06 15:38  4%   ` Manali Shukla
2023-09-06 19:56  0%     ` Peter Zijlstra
2023-09-07 15:49  0%       ` Manali Shukla
2023-09-14  7:21     [PATCH v4 00/21] Support smp.clusters for x86 in QEMU Zhao Liu
2023-09-14  7:21  6% ` [PATCH v4 02/21] tests: Rename test-x86-cpuid.c to test-x86-topo.c Zhao Liu
2023-09-14  7:21  2% ` [PATCH v4 03/21] softmmu: Fix CPUSTATE.nr_cores' calculation Zhao Liu
2023-09-14  7:31  0%   ` Philippe Mathieu-Daudé
2023-09-15  7:39  0%     ` Zhao Liu
2023-09-14  7:21  5% ` [PATCH v4 19/21] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
2023-09-22 19:27  0%   ` Moger, Babu
2023-09-26  3:10  0%     ` Zhao Liu
2023-09-28 15:04     [PATCH 0/5] AVIC bugfixes and workarounds Maxim Levitsky
2023-09-28 15:04  3% ` [PATCH 2/5] x86: KVM: SVM: add support for Invalid IPI Vector interception Maxim Levitsky
2023-09-28 15:46  0%   ` Sean Christopherson
2023-09-28 17:33     [PATCH v2 0/4] AVIC bugfixes and workarounds Maxim Levitsky
2023-09-28 17:33  3% ` [PATCH v2 2/4] x86: KVM: SVM: add support for Invalid IPI Vector interception Maxim Levitsky
2023-09-29  0:42  0%   ` Sean Christopherson
2023-10-02 20:40     [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event Peter Zijlstra
2023-10-03  0:56     ` Sean Christopherson
2023-10-03  8:16       ` Peter Zijlstra
2023-10-03 15:23         ` Sean Christopherson
2023-10-04 11:21           ` Peter Zijlstra
2023-10-04 19:51             ` Mingwei Zhang
2023-10-04 21:50               ` Sean Christopherson
2023-10-04 22:05                 ` Sean Christopherson
2023-10-08 10:04                   ` Like Xu
2023-10-09 17:03                     ` Manali Shukla
2023-10-11 14:15                       ` Peter Zijlstra
2023-10-17 10:24                         ` Manali Shukla
2023-10-17 16:58                           ` Mingwei Zhang
2023-10-18  0:01  4%                         ` Mingwei Zhang
2023-10-04  0:20  4% [PATCH] x86: KVM: Add feature flag for AMD's FsGsKernelGsBaseNonSerializing Jim Mattson
2023-10-09 21:29  4% [PATCH] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled Sean Christopherson
2023-10-10 12:03  0% ` Maxim Levitsky
2023-10-10 14:46  4%   ` Sean Christopherson
2023-10-10 16:06  4%     ` Maxim Levitsky
2023-10-10 17:50  0%       ` Sean Christopherson
2023-10-14 10:16  0%     ` Santosh Shukla
2023-10-14 14:49  0%       ` Santosh Shukla
2023-10-10  8:49     [kvm-unit-tests PATCH v2 0/7] s390x: Add base AP support Janosch Frank
2023-10-10  8:49  4% ` [kvm-unit-tests PATCH v2 1/7] lib: s390x: Add ap library Janosch Frank
2023-10-10  8:49  6% ` [kvm-unit-tests PATCH v2 3/7] lib: s390x: ap: Add ap_setup Janosch Frank
2023-10-10 20:02     [PATCH 0/9] SVM guest shadow stack support John Allen
2023-10-10 20:02  4% ` [PATCH 4/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp John Allen
2023-11-02 18:06  4%   ` Maxim Levitsky
2023-10-10 20:02     ` [PATCH 5/9] KVM: SVM: Save shadow stack host state on VMRUN John Allen
2023-11-02 18:07  4%   ` Maxim Levitsky
2024-02-26 16:56  0%     ` John Allen
2023-10-16 11:50     [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
2023-10-16 11:50     ` [PATCH RFC gmem v1 8/8] KVM: x86: Determine shared/private faults based on vm_type Michael Roth
2024-01-31  1:13       ` Sean Christopherson
2024-02-08  0:24         ` Michael Roth
2024-02-08 17:27  4%       ` Sean Christopherson
2023-10-16 13:27     [PATCH v10 00/50] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Michael Roth
2023-10-16 13:27     ` [PATCH v10 06/50] x86/sev: Add the host SEV-SNP initialization support Michael Roth
2023-11-07 16:31  4%   ` Borislav Petkov
2023-11-07 18:32         ` Tom Lendacky
2023-11-08  8:21  4%       ` Jeremi Piotrowski
2023-11-08 15:19  4%         ` Tom Lendacky
2023-11-07 19:00  0%     ` Kalra, Ashish
2023-11-07 19:19  0%       ` Borislav Petkov
2023-12-08 17:09  0%       ` Jeremi Piotrowski
2023-10-16 13:28  4% ` [PATCH v10 31/50] KVM: SEV: Add KVM_EXIT_VMGEXIT Michael Roth
2023-10-16 13:28  2% ` [PATCH v10 35/50] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
2023-10-18 19:20  3% [PATCH v2] KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled Sean Christopherson
2023-10-22 14:59  0% ` Santosh Shukla
2023-10-18 19:41     [PATCH 0/2] KVM: nSVM: TLB_CONTROL / FLUSHBYASID "fixes" Sean Christopherson
2023-10-18 19:41  5% ` [PATCH 1/2] Revert "nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB" Sean Christopherson
2023-11-20 11:41  0%   ` Maxim Levitsky
2023-10-20 20:48  4% [PATCH] s390/vfio-ap: fix sysfs status attribute for AP queue devices Tony Krowiak
2023-11-06 16:03  0% ` Tony Krowiak
2023-11-07  8:07  0%   ` Harald Freudenberger
2023-11-08 14:44  0%     ` Tony Krowiak
2023-10-24  9:03     [PATCH v5 00/20] Support smp.clusters for x86 in QEMU Zhao Liu
2023-10-24  9:03  6% ` [PATCH v5 02/20] tests: Rename test-x86-cpuid.c to test-x86-topo.c Zhao Liu
2023-10-24  9:03  2% ` [PATCH v5 03/20] softmmu: Fix CPUSTATE.nr_cores' calculation Zhao Liu
2023-10-24  9:03  4% ` [PATCH v5 19/20] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
2023-10-30  6:36  3% [PATCH v5 00/14] Add Secure TSC support for SNP guests Nikunj A Dadhania
2023-11-06 11:02     [PULL 00/60] Misc HW/UI patches for 2023-11-06 Philippe Mathieu-Daudé
2023-11-06 11:03  6% ` [PULL 44/60] tests/unit: Rename test-x86-cpuid.c to test-x86-topo.c Philippe Mathieu-Daudé
2023-11-06 11:03  2% ` [PULL 45/60] system/cpus: Fix CPUState.nr_cores' calculation Philippe Mathieu-Daudé
2023-11-08 20:11  4% [PATCH v2] s390/vfio-ap: fix sysfs status attribute for AP queue devices Tony Krowiak
2023-11-08 20:21  0% ` Tony Krowiak
2023-11-10  0:37     [RFC PATCH 0/4] KVM: SEV: Limit cache flush operations in sev guest memory reclaim events Jacky Li
2023-12-01 18:05     ` Sean Christopherson
2023-12-01 19:02       ` Mingwei Zhang
2023-12-01 21:30         ` Kalra, Ashish
2023-12-01 21:58           ` Mingwei Zhang
2023-12-01 22:13             ` Sean Christopherson
2023-12-01 22:22               ` Mingwei Zhang
2023-12-01 22:30  4%             ` Sean Christopherson
2023-12-02  6:21  0%               ` Mingwei Zhang
2023-11-17  7:50     [PATCH v6 00/16] Support smp.clusters for x86 in QEMU Zhao Liu
2023-11-17  7:51  4% ` [PATCH v6 15/16] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
2023-11-17 15:19     [kvm-unit-tests PATCH v3 0/7] s390x: Add base AP support Janosch Frank
2023-11-17 15:19  4% ` [kvm-unit-tests PATCH v3 1/7] lib: s390x: Add ap library Janosch Frank
2023-11-17 15:19  5% ` [kvm-unit-tests PATCH v3 3/7] lib: s390x: ap: Add proper ap setup code Janosch Frank
2023-11-28 12:59  2% [PATCH v6 00/16] Add Secure TSC support for SNP guests Nikunj A Dadhania
2023-12-06 17:46  0% ` Peter Gonda
2023-12-08 16:22     [PATCH v1 0/6] s390/vfio-ap: reset queues removed from guest's AP configuration Tony Krowiak
2023-12-08 16:22 20% ` [PATCH v1 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
2023-12-08 16:22 14% ` [PATCH v1 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
2023-12-08 16:22 13% ` [PATCH v1 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
2023-12-08 16:22 10% ` [PATCH v1 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
2023-12-08 16:22  5% ` [PATCH v1 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
2023-12-12 21:25     [PATCH v2 0/6] s390/vfio-ap: reset queues removed from guest's AP configuration Tony Krowiak
2023-12-12 21:25 20% ` [PATCH v2 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
2023-12-12 21:25 14% ` [PATCH v2 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
2023-12-12 21:25 13% ` [PATCH v2 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
2023-12-12 21:25  5% ` [PATCH v2 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
2023-12-12 21:25  5% ` [PATCH v2 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
2024-01-10 15:44  0%   ` Jason J. Herne
2023-12-20 15:13  2% [PATCH v7 00/16] Add Secure TSC support for SNP guests Nikunj A Dadhania
2024-01-25  6:08  0% ` Nikunj A. Dadhania
2024-01-26  1:00  0%   ` Dionna Amalie Glaze
2024-01-27  4:10  0%     ` Nikunj A. Dadhania
2023-12-30 16:19     [PATCH v1 00/26] Add AMD Secure Nested Paging (SEV-SNP) Initialization Support Michael Roth
2023-12-30 16:19     ` [PATCH v1 04/26] x86/sev: Add the host SEV-SNP initialization support Michael Roth
2024-01-04 11:16  3%   ` Borislav Petkov
2023-12-30 17:23     [PATCH v11 00/35] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Michael Roth
2023-12-30 17:23  3% ` [PATCH v11 21/35] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
2023-12-30 17:23  4% ` [PATCH v11 22/35] KVM: SEV: Add support to handle " Michael Roth
2023-12-30 17:23  2% ` [PATCH v11 24/35] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
2024-01-08  8:27     [PATCH v7 00/16] Support smp.clusters for x86 in QEMU Zhao Liu
2024-01-08  8:27  4% ` [PATCH v7 15/16] i386: Use offsets get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
2024-01-14 14:42  0%   ` Xiaoyao Li
2024-01-15  3:48  0%     ` Zhao Liu
2024-01-15  4:27  0%       ` Xiaoyao Li
2024-01-15 14:54  0%         ` Zhao Liu
2024-01-11 14:38     [PATCH v3 0/6] s390/vfio-ap: reset queues removed from guest's AP configuration Tony Krowiak
2024-01-11 14:38 20% ` [PATCH v3 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
2024-01-11 14:38 14% ` [PATCH v3 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
2024-01-11 14:38 13% ` [PATCH v3 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
2024-01-11 14:38  5% ` [PATCH v3 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
2024-01-11 14:38  5% ` [PATCH v3 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
2024-01-12  6:51  4% [PATCH] KVM: VMX: Correct *intr_info content and *info2 for EPT_VIOLATION in get_exit_info() Robert Hoo
2024-01-26 20:45  0% ` Sean Christopherson
2024-01-15 18:54     [PATCH v4 0/6] s390/vfio-ap: reset queues removed from guest's AP configuration Tony Krowiak
2024-01-15 18:54 20% ` [PATCH v4 1/6] s390/vfio-ap: always filter entire AP matrix Tony Krowiak
2024-01-15 18:54 14% ` [PATCH v4 2/6] s390/vfio-ap: loop over the shadow APCB when filtering guest's AP configuration Tony Krowiak
2024-01-15 18:54 12% ` [PATCH v4 4/6] s390/vfio-ap: reset queues filtered from the guest's AP config Tony Krowiak
2024-01-15 18:54  5% ` [PATCH v4 5/6] s390/vfio-ap: reset queues associated with adapter for queue unbound from driver Tony Krowiak
2024-01-15 18:54  9% ` [PATCH v4 6/6] s390/vfio-ap: do not reset queue removed from host config Tony Krowiak
2024-01-26  4:11     [PATCH v2 00/25] Add AMD Secure Nested Paging (SEV-SNP) Initialization Support Michael Roth
2024-01-26  4:11  2% ` [PATCH v2 04/25] x86/sev: Add the host SEV-SNP initialization support Michael Roth
2024-01-26  4:11  3% ` [PATCH v2 11/25] x86/sev: Adjust directmap to avoid inadvertant RMP faults Michael Roth
2024-01-26 15:34       ` Borislav Petkov
2024-01-26 17:04         ` Michael Roth
2024-01-26 18:43           ` Borislav Petkov
2024-01-26 23:54             ` Michael Roth
2024-01-27 11:42               ` Borislav Petkov
2024-01-27 15:45                 ` Michael Roth
2024-01-27 16:02  4%               ` Borislav Petkov
2024-01-29 11:59  3%                 ` Borislav Petkov
2024-01-31 10:13     [PATCH v8 00/21] Introduce smp.modules for x86 in QEMU Zhao Liu
2024-01-31 10:13  4% ` [PATCH v8 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
2024-02-02 14:59     [kvm-unit-tests PATCH v4 0/7] s390x: Add base AP support Janosch Frank
2024-02-02 14:59  4% ` [kvm-unit-tests PATCH v4 1/7] lib: s390x: Add ap library Janosch Frank
2024-02-05 18:15  0%   ` Anthony Krowiak
2024-02-06  8:48  0%     ` Harald Freudenberger
2024-02-06 15:45  0%       ` Anthony Krowiak
2024-02-02 14:59  5% ` [kvm-unit-tests PATCH v4 3/7] lib: s390x: ap: Add proper ap setup code Janosch Frank
2024-02-15 11:31  3% [PATCH v8 00/16] Add Secure TSC support for SNP guests Nikunj A Dadhania
2024-02-26 21:32     [PATCH v2 0/9] SVM guest shadow stack support John Allen
2024-02-26 21:32  4% ` [PATCH v2 5/9] KVM: SVM: Rename vmplX_ssp -> plX_ssp John Allen
2024-02-27 18:14  4%   ` Sean Christopherson
2024-02-27 19:15  0%     ` Tom Lendacky
2024-02-27 19:19  0%       ` John Allen
2024-02-27 19:23  0%         ` Sean Christopherson
2024-02-27 19:25  0%           ` John Allen
2024-02-27 10:32     [PATCH v9 00/21] Introduce smp.modules for x86 in QEMU Zhao Liu
2024-02-27 10:32  4% ` [PATCH v9 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
2024-02-29 15:13  0%   ` Moger, Babu
2024-03-09 13:41  0%   ` Xiaoyao Li
2024-02-27 20:03  4% [PATCH] KVM: SVM: Rename vmplX_ssp -> plX_ssp John Allen
2024-02-27 21:28  0% ` Sean Christopherson
2024-02-28  2:41     [PATCH 00/16] KVM: x86/mmu: Page fault and MMIO cleanups Sean Christopherson
2024-02-28  2:41  5% ` [PATCH 03/16] KVM: x86: Define more SEV+ page fault error bits/flags for #NPF Sean Christopherson
2024-02-28  4:43  0%   ` Dongli Zhang
2024-02-28 16:16  5%     ` Sean Christopherson
2024-02-28  2:41     ` [PATCH 05/16] KVM: x86/mmu: Use synthetic page fault error code to indicate private faults Sean Christopherson
2024-03-06  9:43       ` Xu Yilun
2024-03-06 14:45  3%     ` Sean Christopherson
2024-03-07  9:05  0%       ` Xu Yilun
2024-02-28  2:41  3% ` [PATCH 06/16] KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero Sean Christopherson
2024-02-29 22:11  0%   ` Huang, Kai
2024-02-29 23:07  0%     ` Sean Christopherson
2024-03-12  5:44  0%       ` Binbin Wu
2024-03-01  7:50     [PATCH] KVM: x86/svm/pmu: Set PerfMonV2 global control bits correctly Sandipan Das
2024-03-01  8:37     ` Like Xu
2024-03-01  9:00       ` Sandipan Das
2024-03-04  7:59         ` Mi, Dapeng
2024-03-04 19:46  4%       ` Sean Christopherson
2024-03-04 20:17  0%         ` Mingwei Zhang
2024-03-05  2:31  4%         ` Like Xu
2024-03-06  2:48  0%         ` Mi, Dapeng
2024-03-05 10:35  5% [PATCH v2] kvm: set guest physical bits in CPUID.0x80000008 Gerd Hoffmann
2024-03-05 16:35  3% ` Sean Christopherson
2024-03-06  5:43  0%   ` Xiaoyao Li
2024-03-11 10:41     [PATCH v3 0/2] kvm/cpuid: set proper GuestPhysBits " Gerd Hoffmann
2024-03-11 10:41  3% ` [PATCH v3 2/2] " Gerd Hoffmann
2024-03-12  2:44  0%   ` Xiaoyao Li
2024-03-13  1:06  0%   ` Tao Su
2024-03-13  1:14  0%     ` Xiaoyao Li
2024-03-13 12:58     [PATCH v4 0/2] " Gerd Hoffmann
2024-03-13 12:58  3% ` [PATCH v4 2/2] " Gerd Hoffmann
2024-03-19 16:34  3% [PATCH RFC not-to-be-merged] KVM: SVM: Workaround overly strict CR3 check by Hyper-V Vitaly Kuznetsov
2024-03-21 14:40     [PATCH v10 00/21] i386: Introduce smp.modules and clean up cache topology Zhao Liu
2024-03-21 14:40  4% ` [PATCH v10 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
2024-03-29 22:58     [PATCH v12 00/29] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Michael Roth
2024-03-29 22:58 10% ` [PATCH v12 10/29] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
2024-03-29 22:58  8%   ` Michael Roth
2024-03-29 22:58  3% ` [PATCH v12 14/29] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
2024-03-29 22:58  4% ` [PATCH v12 15/29] KVM: SEV: Add support to handle " Michael Roth
2024-03-29 22:58  2% ` [PATCH v12 17/29] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
2024-03-29 22:58  2%   ` Michael Roth
2024-03-29 22:58  2%   ` Michael Roth
2024-04-09 23:07  4% [PATCH for-9.1 v1 0/3] Add SEV/SEV-ES machine compat options for KVM_SEV_INIT2 Michael Roth
2024-04-11 17:35  0% ` Paolo Bonzini
2024-04-16 20:47     [PATCH] KVM/x86: Do not clear SIPI while in SMM Boris Ostrovsky
2024-04-16 20:53     ` Paolo Bonzini
2024-04-16 20:57  4%   ` boris.ostrovsky
2024-04-16 22:03  0%     ` Paolo Bonzini
2024-04-16 22:14  0%       ` Sean Christopherson
2024-04-16 23:02  0%         ` boris.ostrovsky
2024-04-16 22:56  0%       ` boris.ostrovsky
2024-04-16 23:17  0%         ` Sean Christopherson
2024-04-16 23:37  0%           ` boris.ostrovsky
2024-04-17 12:40  0%             ` Igor Mammedov
2024-04-17 13:58  0%               ` boris.ostrovsky
2024-04-18 19:41     [PATCH v13 00/26] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Michael Roth
2024-04-18 19:41  9% ` [PATCH v13 09/26] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
2024-04-18 19:41  3% ` [PATCH v13 13/26] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
2024-04-18 19:41  4% ` [PATCH v13 14/26] KVM: SEV: Add support to handle " Michael Roth
2024-04-18 19:41  2% ` [PATCH v13 15/26] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
2024-04-19 12:57     [kvm-unit-tests RFC PATCH 00/13] Introduce SEV-SNP Support Pavan Kumar Paluri
2024-04-19 12:57  3% ` [kvm-unit-tests RFC PATCH 09/13] x86 AMD SEV-SNP: Test Shared->Private Page State Changes using GHCB MSR Pavan Kumar Paluri
2024-04-21 18:01     [PATCH v14 00/22] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Michael Roth
2024-04-21 18:01  9% ` [PATCH v14 05/22] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
2024-04-21 18:01  3% ` [PATCH v14 09/22] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT Michael Roth
2024-04-21 18:01  4% ` [PATCH v14 10/22] KVM: SEV: Add support to handle " Michael Roth
2024-04-21 18:01  2% ` [PATCH v14 11/22] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
2024-04-24 15:49     [PATCH v11 00/21] i386: Introduce smp.modules and clean up cache topology Zhao Liu
2024-04-24 15:49  4% ` [PATCH v11 07/21] i386/cpu: Use APIC ID info get NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14] Zhao Liu
2024-04-29  6:06  3% [PATCH 0/3] x86/cpu: Add Bus Lock Detect support for AMD Ravi Bangoria
2024-04-29  6:06  4% ` [PATCH 2/3] x86/bus_lock: Add " Ravi Bangoria
2024-05-06 16:05  0%   ` Tom Lendacky
2024-05-06 16:24  0%   ` Jim Mattson
2024-05-07  3:57  0%     ` Ravi Bangoria
2024-05-01  8:51     [PATCH v15 00/20] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Michael Roth
2024-05-01  8:51  9% ` [PATCH v15 05/20] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
2024-05-01  8:52  2% ` [PATCH v15 11/20] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
2024-05-01  8:52  2% ` [PATCH v15 19/20] KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event Michael Roth
2024-05-01 14:54     [PATCH v2 0/5] Add support for the Idle HLT intercept feature Manali Shukla
2024-05-01 14:54  3% ` [PATCH v2 2/5] KVM: SVM: Add Idle HLT intercept support Manali Shukla
2024-05-07 15:58  3% [PATCH v2 00/17] KVM: x86/mmu: Page fault and MMIO cleanups Paolo Bonzini
2024-05-07 15:58  5% ` [PATCH 03/17] KVM: x86: Define more SEV+ page fault error bits/flags for #NPF Paolo Bonzini
2024-05-07 15:58  3% ` [PATCH 06/17] KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero Paolo Bonzini
2024-05-10 21:10     [PULL 00/19] KVM: Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support Michael Roth
2024-05-10 21:10 10% ` [PULL 04/19] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command Michael Roth
2024-05-10 21:10  2% ` [PULL 10/19] KVM: SEV: Add support to handle RMP nested page faults Michael Roth
2024-05-10 21:10  3% ` [PULL 18/19] KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event Michael Roth
2024-05-14 14:36     [PATCH v15 19/20] " Sean Christopherson
2024-05-15  1:25  3% ` [PATCH] KVM: SEV: Replace KVM_EXIT_VMGEXIT with KVM_EXIT_SNP_REQ_CERTS Michael Roth
