* [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature
that leverages the hypervisor to create secure execution environments
within a guest. VSM is documented as part of Microsoft's Hypervisor Top
Level Functional Specification [1]. Security features that build upon
VSM, like Windows Credential Guard, are enabled by default on Windows 11,
and are becoming a prerequisite in some industries.

This RFC series introduces the necessary infrastructure to emulate
VSM-enabled guests. It is a snapshot of the progress we've made so far,
and its main goal is to gather design feedback, specifically on the KVM
APIs we introduce. For a high-level design overview, see the
documentation in patch 33.

Additionally, this topic will be discussed as part of the KVM
Microconference at this year's Linux Plumbers Conference [2].

The series is accompanied by two repositories:
 - A PoC QEMU implementation of VSM [3].
 - VSM kvm-unit-tests [4].

Note that this isn't a full VSM implementation. For now it only supports
2 VTLs, and only runs on uniprocessor guests. It is capable of booting
Windows Server 2016/2019, but is unstable during runtime.

The series is based on the v6.6 kernel release, and depends on the
introduction of KVM memory attributes, which is being worked on
independently in "KVM: guest_memfd() and per-page attributes" [5]. A full
Linux tree is also made available [6].

Series rundown:
 - Patch 2 introduces the concept of APIC ID groups.
 - Patches 3-12 introduce the VSM capability and basic VTL awareness into
   Hyper-V emulation.
 - Patch 13 introduces vCPU polling support.
 - Patches 14-31 use KVM's memory attributes to implement VTL memory
   protections, and introduce the VTL KVM device and secure memory
   intercepts.
 - Patch 32 is a temporary implementation of
   HVCALL_TRANSLATE_VIRTUAL_ADDRESS, necessary to boot Windows Server
   2019.
 - Patch 33 introduces documentation.

Our intention is to integrate the feedback gathered through this RFC and
at LPC while we finish the VSM implementation. In the future, we will
split the series into distinct feature patch sets and upstream these
independently.

Thanks,
Nicolas

[1] https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
[2] https://lpc.events/event/17/sessions/166/#20231114
[3] https://github.com/vianpl/qemu/tree/vsm-rfc-v1
[4] https://github.com/vianpl/kvm-unit-tests/tree/vsm-rfc-v1
[5] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonzini@redhat.com/
[6] Full tree: https://github.com/vianpl/linux/tree/vsm-rfc-v1
    There are also two small dependencies with
    https://marc.info/?l=kvm&m=167887543028109&w=2 and
    https://lkml.org/lkml/2023/10/17/972




* [RFC 01/33] KVM: x86: Decouple lapic.h from hyperv.h
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

lapic.h has no dependency on hyperv.h, so don't include it there.

Additionally, cpuid.c implicitly relied on hyperv.h's inclusion through
lapic.h, so include it explicitly there.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/cpuid.c | 1 +
 arch/x86/kvm/lapic.h | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 773132c3bf5a..eabd5e9dc003 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -28,6 +28,7 @@
 #include "trace.h"
 #include "pmu.h"
 #include "xen.h"
+#include "hyperv.h"
 
 /*
  * Unlike "struct cpuinfo_x86.x86_capability", kvm_cpu_caps doesn't need to be
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 0a0ea4b5dd8c..e1021517cf04 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -6,7 +6,6 @@
 
 #include <linux/kvm_host.h>
 
-#include "hyperv.h"
 #include "smm.h"
 
 #define KVM_APIC_INIT		0
-- 
2.40.1



* [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Anel Orazgaliyeva, Nicolas Saenz Julienne

From: Anel Orazgaliyeva <anelkz@amazon.de>

Introduce KVM_CAP_APIC_ID_GROUPS. This capability splits the VM's APIC
ids in two: the lower bits, the physical APIC id, represent the part
that's exposed to the guest; the higher bits, which are private to KVM,
group APICs together. APICs in different groups are isolated from each
other, and IPIs can only be directed at APICs that share the same group
as their source. Furthermore, groups are only relevant to IPIs; anything
coming from outside the local APIC complex (the IOAPIC, MSIs, or
PV-IPIs) is targeted at the default APIC group, group 0.

When routing IPIs with physical destinations, KVM will OR the source
vCPU's APIC group with the ICR's destination ID and use the result to
resolve the target lAPIC. The APIC physical map is also made group-aware
in order to speed up this process. For the sake of simplicity, the
logical map is not built while KVM_CAP_APIC_ID_GROUPS is in use, and IPI
routing is deferred to the slower per-vCPU scan method.

This capability serves as a building block to implement virtualisation
based security features like Hyper-V's Virtual Secure Mode (VSM). VSM
introduces a para-virtualised switch that allows guest CPUs to jump into
a different execution context; this switches into a different CPU state,
lAPIC state, and set of memory protections. We model this in KVM by
using a distinct kvm_vcpu for each context. Moreover, execution contexts
are hierarchical and their APICs are meant to remain functional even
when the context isn't 'scheduled in'. For example, we have to keep
track of timer expirations, and interrupt the execution of lower
priority contexts when relevant. Hence the need to alias physical APIC
ids, while keeping the ability to target specific execution contexts.
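
As an illustration, a VMM would configure the groups roughly like this
(a minimal sketch; vm_fd and the choice of n_bits = 4 are assumptions,
includes and surrounding setup elided; the ioctl must be issued before
any vCPU is created):

	/* Reserve APIC ID bits 31:28 for the group. */
	struct kvm_apic_id_groups groups = { .n_bits = 4 };

	if (ioctl(vm_fd, KVM_SET_APIC_ID_GROUPS, &groups))
		err(1, "KVM_SET_APIC_ID_GROUPS");

A vCPU's group is then encoded in the vcpu_id it is created with: group
N for guest APIC id A maps to vcpu_id (N << 28) | A.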

Signed-off-by: Anel Orazgaliyeva <anelkz@amazon.de>
Co-developed-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++
 arch/x86/include/uapi/asm/kvm.h |  5 +++
 arch/x86/kvm/lapic.c            | 59 ++++++++++++++++++++++++++++-----
 arch/x86/kvm/lapic.h            | 33 ++++++++++++++++++
 arch/x86/kvm/x86.c              | 15 +++++++++
 include/uapi/linux/kvm.h        |  2 ++
 6 files changed, 108 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dff10051e9b6..a2f224f95404 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1298,6 +1298,9 @@ struct kvm_arch {
 	struct rw_semaphore apicv_update_lock;
 	unsigned long apicv_inhibit_reasons;
 
+	u32 apic_id_group_mask;
+	u8 apic_id_group_shift;
+
 	gpa_t wall_clock;
 
 	bool mwait_in_guest;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a448d0964fc0..f73d137784d7 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -565,4 +565,9 @@ struct kvm_pmu_event_filter {
 #define KVM_X86_DEFAULT_VM	0
 #define KVM_X86_SW_PROTECTED_VM	1
 
+/* for KVM_SET_APIC_ID_GROUPS */
+struct kvm_apic_id_groups {
+	__u8 n_bits; /* nr of bits used to represent group in the APIC ID */
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3e977dbbf993..f55d216cb2a0 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -141,7 +141,7 @@ static inline int apic_enabled(struct kvm_lapic *apic)
 
 static inline u32 kvm_x2apic_id(struct kvm_lapic *apic)
 {
-	return apic->vcpu->vcpu_id;
+	return kvm_apic_id(apic->vcpu);
 }
 
 static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu)
@@ -219,8 +219,8 @@ static int kvm_recalculate_phys_map(struct kvm_apic_map *new,
 				    bool *xapic_id_mismatch)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	u32 x2apic_id = kvm_x2apic_id(apic);
-	u32 xapic_id = kvm_xapic_id(apic);
+	u32 x2apic_id = kvm_apic_id_and_group(vcpu);
+	u32 xapic_id = kvm_apic_id_and_group(vcpu);
 	u32 physical_id;
 
 	/*
@@ -299,6 +299,13 @@ static void kvm_recalculate_logical_map(struct kvm_apic_map *new,
 	u16 mask;
 	u32 ldr;
 
+	/*
+	 * Using maps for logical destinations when KVM_CAP_APIC_ID_GROUPS is in
+	 * use isn't supported.
+	 */
+	if (kvm_apic_group(vcpu))
+		new->logical_mode = KVM_APIC_MODE_MAP_DISABLED;
+
 	if (new->logical_mode == KVM_APIC_MODE_MAP_DISABLED)
 		return;
 
@@ -370,6 +377,25 @@ enum {
 	DIRTY
 };
 
+int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
+				    struct kvm_apic_id_groups *groups)
+{
+	u8 n_bits = groups->n_bits;
+
+	if (n_bits > 32)
+		return -EINVAL;
+
+	kvm->arch.apic_id_group_mask = n_bits ? GENMASK(31, 32 - n_bits) : 0;
+	/*
+	 * Bit shifts >= the width of the type are undefined behaviour, so
+	 * set the APIC group shift to 0 when n_bits == 0. The group mask is
+	 * also 0 in that case, so the group querying functions will still
+	 * return the correct value (group 0).
+	 */
+	kvm->arch.apic_id_group_shift = n_bits ? 32 - n_bits : 0;
+	return 0;
+}
+
 void kvm_recalculate_apic_map(struct kvm *kvm)
 {
 	struct kvm_apic_map *new, *old = NULL;
@@ -414,7 +440,7 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
 
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		if (kvm_apic_present(vcpu))
-			max_id = max(max_id, kvm_x2apic_id(vcpu->arch.apic));
+			max_id = max(max_id, kvm_apic_id_and_group(vcpu));
 
 	new = kvzalloc(sizeof(struct kvm_apic_map) +
 	                   sizeof(struct kvm_lapic *) * ((u64)max_id + 1),
@@ -525,7 +551,7 @@ static inline void kvm_apic_set_x2apic_id(struct kvm_lapic *apic, u32 id)
 {
 	u32 ldr = kvm_apic_calc_x2apic_ldr(id);
 
-	WARN_ON_ONCE(id != apic->vcpu->vcpu_id);
+	WARN_ON_ONCE(id != kvm_apic_id(apic->vcpu));
 
 	kvm_lapic_set_reg(apic, APIC_ID, id);
 	kvm_lapic_set_reg(apic, APIC_LDR, ldr);
@@ -1067,6 +1093,17 @@ bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
 	struct kvm_lapic *target = vcpu->arch.apic;
 	u32 mda = kvm_apic_mda(vcpu, dest, source, target);
 
+	/*
+	 * Make sure vCPUs belong to the same APIC group, it's not possible
+	 * to send interrupts across groups.
+	 *
+	 * Non-IPIs and PV-IPIs can only be injected into the default APIC
+	 * group (group 0).
+	 */
+	if ((source && !kvm_match_apic_group(source->vcpu, vcpu)) ||
+	    kvm_apic_group(vcpu))
+		return false;
+
 	ASSERT(target);
 	switch (shorthand) {
 	case APIC_DEST_NOSHORT:
@@ -1518,6 +1555,10 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
 	else
 		irq.dest_id = GET_XAPIC_DEST_FIELD(icr_high);
 
+	if (irq.dest_mode == APIC_DEST_PHYSICAL)
+		kvm_apic_id_set_group(apic->vcpu->kvm,
+				      kvm_apic_group(apic->vcpu), &irq.dest_id);
+
 	trace_kvm_apic_ipi(icr_low, irq.dest_id);
 
 	kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
@@ -2541,7 +2582,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 	/* update jump label if enable bit changes */
 	if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE) {
 		if (value & MSR_IA32_APICBASE_ENABLE) {
-			kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
+			kvm_apic_set_xapic_id(apic, kvm_apic_id(vcpu));
 			static_branch_slow_dec_deferred(&apic_hw_disabled);
 			/* Check if there are APF page ready requests pending */
 			kvm_make_request(KVM_REQ_APF_READY, vcpu);
@@ -2553,9 +2594,9 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 
 	if ((old_value ^ value) & X2APIC_ENABLE) {
 		if (value & X2APIC_ENABLE)
-			kvm_apic_set_x2apic_id(apic, vcpu->vcpu_id);
+			kvm_apic_set_x2apic_id(apic, kvm_apic_id(vcpu));
 		else if (value & MSR_IA32_APICBASE_ENABLE)
-			kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
+			kvm_apic_set_xapic_id(apic, kvm_apic_id(vcpu));
 	}
 
 	if ((old_value ^ value) & (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE)) {
@@ -2685,7 +2726,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
 
 	/* The xAPIC ID is set at RESET even if the APIC was already enabled. */
 	if (!init_event)
-		kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
+		kvm_apic_set_xapic_id(apic, kvm_apic_id(vcpu));
 	kvm_apic_set_version(apic->vcpu);
 
 	for (i = 0; i < apic->nr_lvt_entries; i++)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index e1021517cf04..542bd208e52b 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -97,6 +97,8 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
 void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
+int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
+				    struct kvm_apic_id_groups *groups);
 void kvm_recalculate_apic_map(struct kvm *kvm);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu);
@@ -277,4 +279,35 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
 	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
 }
 
+static inline u32 kvm_apic_id(struct kvm_vcpu *vcpu)
+{
+	return vcpu->vcpu_id & ~vcpu->kvm->arch.apic_id_group_mask;
+}
+
+static inline u32 kvm_apic_id_and_group(struct kvm_vcpu *vcpu)
+{
+	return vcpu->vcpu_id;
+}
+
+static inline u32 kvm_apic_group(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	return (vcpu->vcpu_id & kvm->arch.apic_id_group_mask) >>
+	       kvm->arch.apic_id_group_shift;
+}
+
+static inline void kvm_apic_id_set_group(struct kvm *kvm, u32 group,
+					 u32 *apic_id)
+{
+	*apic_id |= ((group << kvm->arch.apic_id_group_shift) &
+		     kvm->arch.apic_id_group_mask);
+}
+
+static inline bool kvm_match_apic_group(struct kvm_vcpu *src,
+					struct kvm_vcpu *dst)
+{
+	return kvm_apic_group(src) == kvm_apic_group(dst);
+}
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e3eb608b6692..4cd3f00475c1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4526,6 +4526,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_MEMORY_FAULT_INFO:
+	case KVM_CAP_APIC_ID_GROUPS:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
@@ -7112,6 +7113,20 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 		r = kvm_vm_ioctl_set_msr_filter(kvm, &filter);
 		break;
 	}
+	case KVM_SET_APIC_ID_GROUPS: {
+		struct kvm_apic_id_groups groups;
+
+		r = -EINVAL;
+		if (kvm->created_vcpus)
+			goto out;
+
+		r = -EFAULT;
+		if (copy_from_user(&groups, argp, sizeof(groups)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_apic_id_groups(kvm, &groups);
+		break;
+	}
 	default:
 		r = -ENOTTY;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5b5820d19e71..d7a01766bf21 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1219,6 +1219,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_MEMORY_ATTRIBUTES 232
 #define KVM_CAP_GUEST_MEMFD 233
 #define KVM_CAP_VM_TYPES 234
+#define KVM_CAP_APIC_ID_GROUPS 235
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2307,4 +2308,5 @@ struct kvm_create_guest_memfd {
 
 #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
 
+#define KVM_SET_APIC_ID_GROUPS _IOW(KVMIO, 0xd7, struct kvm_apic_id_groups)
 #endif /* __LINUX_KVM_H */
-- 
2.40.1



* [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Prepare the infrastructure to be able to return data through the XMM
registers when Hyper-V hypercalls are issued in fast mode. The XMM
registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and
written back to the guest's XMM registers on successful hypercall
completion.
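
For context, a hedged sketch of the user-space side once a hypercall is
emulated there. Note that kvm_hv_is_xmm_output_hcall() still returns
false at this point in the series (a later patch flags
HVCALL_GET_VP_REGISTERS); value_lo/value_hi are placeholders and 'run'
is the vCPU's mmap'ed struct kvm_run:

	/* On KVM_EXIT_HYPERV_HCALL, after emulating a fast-mode
	 * XMM-output hypercall, place the output in the exit struct;
	 * KVM loads it back into the guest's XMM registers when the
	 * hypercall completes successfully. */
	struct kvm_hyperv_xmm_reg *xmm = run->hyperv.u.hcall.xmm;

	xmm[0].low  = value_lo;
	xmm[0].high = value_hi;
	run->hyperv.u.hcall.result = HV_STATUS_SUCCESS;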

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/include/asm/hyperv-tlfs.h |  2 +-
 arch/x86/kvm/hyperv.c              | 33 +++++++++++++++++++++++++++++-
 include/uapi/linux/kvm.h           |  6 ++++++
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 2ff26f53cd62..af594aa65307 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -49,7 +49,7 @@
 /* Support for physical CPU dynamic partitioning events is available*/
 #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE	BIT(3)
 /*
- * Support for passing hypercall input parameter block via XMM
+ * Support for passing hypercall input and output parameter block via XMM
  * registers is available
  */
 #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE		BIT(4)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 238afd7335e4..e1bc861ab3b0 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1815,6 +1815,7 @@ struct kvm_hv_hcall {
 	u16 rep_idx;
 	bool fast;
 	bool rep;
+	bool xmm_dirty;
 	sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS];
 
 	/*
@@ -2346,9 +2347,33 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
 	return ret;
 }
 
+static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
+{
+	int reg;
+
+	kvm_fpu_get();
+	for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) {
+		const sse128_t data = sse128(xmm[reg].low, xmm[reg].high);
+		_kvm_write_sse_reg(reg, &data);
+	}
+	kvm_fpu_put();
+}
+
+static bool kvm_hv_is_xmm_output_hcall(u16 code)
+{
+	return false;
+}
+
 static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
 {
-	return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result);
+	bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT);
+	u16 code = vcpu->run->hyperv.u.hcall.input & 0xffff;
+	u64 result = vcpu->run->hyperv.u.hcall.result;
+
+	if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast)
+		kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
+
+	return kvm_hv_hypercall_complete(vcpu, result);
 }
 
 static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
@@ -2623,6 +2648,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 		break;
 	}
 
+	if ((ret & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS && hc.xmm_dirty)
+		kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg *)hc.xmm);
+
 hypercall_complete:
 	return kvm_hv_hypercall_complete(vcpu, ret);
 
@@ -2632,6 +2660,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 	vcpu->run->hyperv.u.hcall.input = hc.param;
 	vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa;
 	vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa;
+	if (hc.fast)
+		memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm));
 	vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace;
 	return 0;
 }
@@ -2780,6 +2810,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
 			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
 
 			ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
+			ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
 			ent->edx |= HV_FEATURE_FREQUENCY_MSRS_AVAILABLE;
 			ent->edx |= HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d7a01766bf21..5ce06a1eee2b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -192,6 +192,11 @@ struct kvm_s390_cmma_log {
 	__u64 values;
 };
 
+struct kvm_hyperv_xmm_reg {
+	__u64 low;
+	__u64 high;
+};
+
 struct kvm_hyperv_exit {
 #define KVM_EXIT_HYPERV_SYNIC          1
 #define KVM_EXIT_HYPERV_HCALL          2
@@ -210,6 +215,7 @@ struct kvm_hyperv_exit {
 			__u64 input;
 			__u64 result;
 			__u64 params[2];
+			struct kvm_hyperv_xmm_reg xmm[6];
 		} hcall;
 		struct {
 			__u32 msr;
-- 
2.40.1



* [RFC 04/33] KVM: x86: hyper-v: Move hypercall page handling into separate function
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

The hypercall page patching is about to grow considerably, so move it
into its own function.

No functional change intended.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c | 69 ++++++++++++++++++++++++-------------------
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index e1bc861ab3b0..78d053042667 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -256,6 +256,42 @@ static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr)
 	kvm_make_request(KVM_REQ_HV_EXIT, vcpu);
 }
 
+static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u8 instructions[9];
+	int i = 0;
+	u64 addr;
+
+	/*
+	 * If Xen and Hyper-V hypercalls are both enabled, disambiguate
+	 * the same way Xen itself does, by setting the bit 31 of EAX
+	 * which is RsvdZ in the 32-bit Hyper-V hypercall ABI and just
+	 * going to be clobbered on 64-bit.
+	 */
+	if (kvm_xen_hypercall_enabled(kvm)) {
+		/* orl $0x80000000, %eax */
+		instructions[i++] = 0x0d;
+		instructions[i++] = 0x00;
+		instructions[i++] = 0x00;
+		instructions[i++] = 0x00;
+		instructions[i++] = 0x80;
+	}
+
+	/* vmcall/vmmcall */
+	static_call(kvm_x86_patch_hypercall)(vcpu, instructions + i);
+	i += 3;
+
+	/* ret */
+	((unsigned char *)instructions)[i++] = 0xc3;
+
+	addr = data & HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK;
+	if (kvm_vcpu_write_guest(vcpu, addr, instructions, i))
+		return 1;
+
+	return 0;
+}
+
 static int synic_set_msr(struct kvm_vcpu_hv_synic *synic,
 			 u32 msr, u64 data, bool host)
 {
@@ -1338,11 +1374,7 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
 		if (!hv->hv_guest_os_id)
 			hv->hv_hypercall &= ~HV_X64_MSR_HYPERCALL_ENABLE;
 		break;
-	case HV_X64_MSR_HYPERCALL: {
-		u8 instructions[9];
-		int i = 0;
-		u64 addr;
-
+	case HV_X64_MSR_HYPERCALL:
 		/* if guest os id is not set hypercall should remain disabled */
 		if (!hv->hv_guest_os_id)
 			break;
@@ -1351,34 +1383,11 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
 			break;
 		}
 
-		/*
-		 * If Xen and Hyper-V hypercalls are both enabled, disambiguate
-		 * the same way Xen itself does, by setting the bit 31 of EAX
-		 * which is RsvdZ in the 32-bit Hyper-V hypercall ABI and just
-		 * going to be clobbered on 64-bit.
-		 */
-		if (kvm_xen_hypercall_enabled(kvm)) {
-			/* orl $0x80000000, %eax */
-			instructions[i++] = 0x0d;
-			instructions[i++] = 0x00;
-			instructions[i++] = 0x00;
-			instructions[i++] = 0x00;
-			instructions[i++] = 0x80;
-		}
-
-		/* vmcall/vmmcall */
-		static_call(kvm_x86_patch_hypercall)(vcpu, instructions + i);
-		i += 3;
-
-		/* ret */
-		((unsigned char *)instructions)[i++] = 0xc3;
-
-		addr = data & HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK;
-		if (kvm_vcpu_write_guest(vcpu, addr, instructions, i))
+		if (patch_hypercall_page(vcpu, data))
 			return 1;
+
 		hv->hv_hypercall = data;
 		break;
-	}
 	case HV_X64_MSR_REFERENCE_TSC:
 		hv->hv_tsc_page = data;
 		if (hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE) {
-- 
2.40.1



* [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

VTL call/return hypercalls have their own entry points in the hypercall
page because they don't follow normal Hyper-V hypercall conventions.
Move the VTL call/return control input into ECX (32-bit) or RAX
(64-bit) and set the hypercall code into EAX (32-bit) or RCX (64-bit)
before jumping to the hypercall instruction, so the regular Hyper-V
hypercall entry point can be reused.

Guests can read an emulated code page offsets register
(HvRegisterVsmCodePageOffsets) to learn the offsets into the hypercall
page for the VTL call/return entries.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>

---

My tree has the following additional patch; we're still trying to
understand under what conditions Windows expects the offsets to be
fixed.

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 54f7f36a89bf..9f2ea8c34447 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -294,6 +294,7 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)

        /* VTL call/return entries */
        if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) {
+               i = 22;
 #ifdef CONFIG_X86_64
                if (is_64_bit_mode(vcpu)) {
                        /*
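
For the 64-bit case, with the fixed i = 22 above and Xen hypercalls
disabled, the patched page should decode roughly as follows (a
reconstruction from the bytes emitted below, offsets in hex; vmcall
becomes vmmcall on AMD):

	0x00:  0f 01 c1         vmcall
	0x03:  c3               ret
	0x04:                    (gap up to the fixed offset 22)
	0x16:  48 89 c8         mov    %rcx,%rax   # VTL call entry
	0x19:  b9 11 00 00 00   mov    $0x11,%ecx
	0x1e:  eb e0            jmp    0x0         # rel8 -0x20
	0x20:  48 89 c8         mov    %rcx,%rax   # VTL return entry
	0x23:  b9 12 00 00 00   mov    $0x12,%ecx
	0x28:  eb d6            jmp    0x0         # rel8 -0x2a

Note the rel8 displacements (0xe0, 0xd6) only land back on the
hypercall instruction when the entries start at that fixed offset.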
---
 arch/x86/include/asm/kvm_host.h   |  2 +
 arch/x86/kvm/hyperv.c             | 78 ++++++++++++++++++++++++++++++-
 include/asm-generic/hyperv-tlfs.h | 11 +++++
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a2f224f95404..00cd21b09f8c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1105,6 +1105,8 @@ struct kvm_hv {
 	u64 hv_tsc_emulation_status;
 	u64 hv_invtsc_control;
 
+	union hv_register_vsm_code_page_offsets vsm_code_page_offsets;
+
 	/* How many vCPUs have VP index != vCPU index */
 	atomic_t num_mismatched_vp_indexes;
 
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 78d053042667..d4b1b53ea63d 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -259,7 +259,8 @@ static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr)
 static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
-	u8 instructions[9];
+	struct kvm_hv *hv = to_kvm_hv(kvm);
+	u8 instructions[0x30];
 	int i = 0;
 	u64 addr;
 
@@ -285,6 +286,81 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 	/* ret */
 	((unsigned char *)instructions)[i++] = 0xc3;
 
+	/* VTL call/return entries */
+	if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) {
+#ifdef CONFIG_X86_64
+		if (is_64_bit_mode(vcpu)) {
+			/*
+			 * VTL call 64-bit entry prologue:
+			 * 	mov %rcx, %rax
+			 * 	mov $0x11, %ecx
+			 * 	jmp 0:
+			 */
+			hv->vsm_code_page_offsets.vtl_call_offset = i;
+			instructions[i++] = 0x48;
+			instructions[i++] = 0x89;
+			instructions[i++] = 0xc8;
+			instructions[i++] = 0xb9;
+			instructions[i++] = 0x11;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0xeb;
+			instructions[i++] = 0xe0;
+			/*
+			 * VTL return 64-bit entry prologue:
+			 * 	mov %rcx, %rax
+			 * 	mov $0x12, %ecx
+			 * 	jmp 0:
+			 */
+			hv->vsm_code_page_offsets.vtl_return_offset = i;
+			instructions[i++] = 0x48;
+			instructions[i++] = 0x89;
+			instructions[i++] = 0xc8;
+			instructions[i++] = 0xb9;
+			instructions[i++] = 0x12;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0xeb;
+			instructions[i++] = 0xd6;
+		} else
+#endif
+		{
+			/*
+			 * VTL call 32-bit entry prologue:
+			 * 	mov %eax, %ecx
+			 * 	mov $0x11, %eax
+			 * 	jmp 0:
+			 */
+			hv->vsm_code_page_offsets.vtl_call_offset = i;
+			instructions[i++] = 0x89;
+			instructions[i++] = 0xc1;
+			instructions[i++] = 0xb8;
+			instructions[i++] = 0x11;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0xeb;
+			instructions[i++] = 0xf3;
+			/*
+			 * VTL return 32-bit entry prologue:
+			 * 	mov %eax, %ecx
+			 * 	mov $0x12, %eax
+			 * 	jmp 0:
+			 */
+			hv->vsm_code_page_offsets.vtl_return_offset = i;
+			instructions[i++] = 0x89;
+			instructions[i++] = 0xc1;
+			instructions[i++] = 0xb8;
+			instructions[i++] = 0x12;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0x00;
+			instructions[i++] = 0xeb;
+			instructions[i++] = 0xea;
+		}
+	}
 	addr = data & HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK;
 	if (kvm_vcpu_write_guest(vcpu, addr, instructions, i))
 		return 1;
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index fdac4a1714ec..0e7643c1ef01 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -823,4 +823,15 @@ struct hv_mmio_write_input {
 	u8 data[HV_HYPERCALL_MMIO_MAX_DATA_LENGTH];
 } __packed;
 
+/*
+ * VTL call/return hypercall page offsets register
+ */
+union hv_register_vsm_code_page_offsets {
+	u64 as_u64;
+	struct {
+		u64 vtl_call_offset:12;
+		u64 vtl_return_offset:12;
+		u64 reserved:40;
+	} __packed;
+};
 #endif
-- 
2.40.1



* [RFC 06/33] KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

HVCALL_SEND_IPI and HVCALL_SEND_IPI_EX allow targeting a specific VTL.
Honour such requests.
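
For reference, the union hv_input_vtl encoding this relies on, assuming
the definition in include/asm-generic/hyperv-tlfs.h in the v6.6 tree:

	union hv_input_vtl {
		u8 as_uint8;
		struct {
			u8 target_vtl: 4;
			u8 use_target_vtl: 1;
			u8 reserved_z: 3;
		};
	} __packed;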

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c             | 24 +++++++++++++++++-------
 arch/x86/kvm/trace.h              | 20 ++++++++++++--------
 include/asm-generic/hyperv-tlfs.h |  6 ++++--
 3 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index d4b1b53ea63d..2cf430f6ddd8 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2230,7 +2230,7 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
 }
 
 static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
-				    u64 *sparse_banks, u64 valid_bank_mask)
+				    u64 *sparse_banks, u64 valid_bank_mask, int vtl)
 {
 	struct kvm_lapic_irq irq = {
 		.delivery_mode = APIC_DM_FIXED,
@@ -2245,6 +2245,9 @@ static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
 					    valid_bank_mask, sparse_banks))
 			continue;
 
+		if (kvm_hv_get_active_vtl(vcpu) != vtl)
+			continue;
+
 		/* We fail only when APIC is disabled */
 		kvm_apic_set_irq(vcpu, &irq, NULL);
 	}
@@ -2257,13 +2260,19 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
 	struct kvm *kvm = vcpu->kvm;
 	struct hv_send_ipi_ex send_ipi_ex;
 	struct hv_send_ipi send_ipi;
+	union hv_input_vtl *in_vtl;
 	u64 valid_bank_mask;
 	u32 vector;
 	bool all_cpus;
+	u8 vtl;
+
+	/* VTL is at the same offset on both IPI types */
+	in_vtl = &send_ipi.in_vtl;
+	vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl : kvm_hv_get_active_vtl(vcpu);
 
 	if (hc->code == HVCALL_SEND_IPI) {
 		if (!hc->fast) {
-			if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi,
+			if (unlikely(kvm_vcpu_read_guest(vcpu, hc->ingpa, &send_ipi,
 						    sizeof(send_ipi))))
 				return HV_STATUS_INVALID_HYPERCALL_INPUT;
 			sparse_banks[0] = send_ipi.cpu_mask;
@@ -2278,10 +2287,10 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
 		all_cpus = false;
 		valid_bank_mask = BIT_ULL(0);
 
-		trace_kvm_hv_send_ipi(vector, sparse_banks[0]);
+		trace_kvm_hv_send_ipi(vector, sparse_banks[0], vtl);
 	} else {
 		if (!hc->fast) {
-			if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi_ex,
+			if (unlikely(kvm_vcpu_read_guest(vcpu, hc->ingpa, &send_ipi_ex,
 						    sizeof(send_ipi_ex))))
 				return HV_STATUS_INVALID_HYPERCALL_INPUT;
 		} else {
@@ -2292,7 +2301,8 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
 
 		trace_kvm_hv_send_ipi_ex(send_ipi_ex.vector,
 					 send_ipi_ex.vp_set.format,
-					 send_ipi_ex.vp_set.valid_bank_mask);
+					 send_ipi_ex.vp_set.valid_bank_mask,
+					 vtl);
 
 		vector = send_ipi_ex.vector;
 		valid_bank_mask = send_ipi_ex.vp_set.valid_bank_mask;
@@ -2322,9 +2332,9 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
 		return HV_STATUS_INVALID_HYPERCALL_INPUT;
 
 	if (all_cpus)
-		kvm_hv_send_ipi_to_many(kvm, vector, NULL, 0);
+		kvm_hv_send_ipi_to_many(kvm, vector, NULL, 0, vtl);
 	else
-		kvm_hv_send_ipi_to_many(kvm, vector, sparse_banks, valid_bank_mask);
+		kvm_hv_send_ipi_to_many(kvm, vector, sparse_banks, valid_bank_mask, vtl);
 
 ret_success:
 	return HV_STATUS_SUCCESS;
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 83843379813e..ab8839c47bc7 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1606,42 +1606,46 @@ TRACE_EVENT(kvm_hv_flush_tlb_ex,
  * Tracepoints for kvm_hv_send_ipi.
  */
 TRACE_EVENT(kvm_hv_send_ipi,
-	TP_PROTO(u32 vector, u64 processor_mask),
-	TP_ARGS(vector, processor_mask),
+	TP_PROTO(u32 vector, u64 processor_mask, u8 vtl),
+	TP_ARGS(vector, processor_mask, vtl),
 
 	TP_STRUCT__entry(
 		__field(u32, vector)
 		__field(u64, processor_mask)
+		__field(u8, vtl)
 	),
 
 	TP_fast_assign(
 		__entry->vector = vector;
 		__entry->processor_mask = processor_mask;
+		__entry->vtl = vtl;
 	),
 
-	TP_printk("vector %x processor_mask 0x%llx",
-		  __entry->vector, __entry->processor_mask)
+	TP_printk("vector %x processor_mask 0x%llx vtl %d",
+		  __entry->vector, __entry->processor_mask, __entry->vtl)
 );
 
 TRACE_EVENT(kvm_hv_send_ipi_ex,
-	TP_PROTO(u32 vector, u64 format, u64 valid_bank_mask),
-	TP_ARGS(vector, format, valid_bank_mask),
+	TP_PROTO(u32 vector, u64 format, u64 valid_bank_mask, u8 vtl),
+	TP_ARGS(vector, format, valid_bank_mask, vtl),
 
 	TP_STRUCT__entry(
 		__field(u32, vector)
 		__field(u64, format)
 		__field(u64, valid_bank_mask)
+		__field(u8, vtl)
 	),
 
 	TP_fast_assign(
 		__entry->vector = vector;
 		__entry->format = format;
 		__entry->valid_bank_mask = valid_bank_mask;
+		__entry->vtl = vtl;
 	),
 
-	TP_printk("vector %x format %llx valid_bank_mask 0x%llx",
+	TP_printk("vector %x format %llx valid_bank_mask 0x%llx vtl %d",
 		  __entry->vector, __entry->format,
-		  __entry->valid_bank_mask)
+		  __entry->valid_bank_mask, __entry->vtl)
 );
 
 TRACE_EVENT(kvm_pv_tlb_flush,
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 0e7643c1ef01..40d7dc793c03 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -424,14 +424,16 @@ struct hv_vpset {
 /* HvCallSendSyntheticClusterIpi hypercall */
 struct hv_send_ipi {
 	u32 vector;
-	u32 reserved;
+	union hv_input_vtl in_vtl;
+	u8 reserved[3];
 	u64 cpu_mask;
 } __packed;
 
 /* HvCallSendSyntheticClusterIpiEx hypercall */
 struct hv_send_ipi_ex {
 	u32 vector;
-	u32 reserved;
+	union hv_input_vtl in_vtl;
+	u8 reserved[3];
 	struct hv_vpset vp_set;
 } __packed;
 
-- 
2.40.1



* [RFC 07/33] KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_VSM
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Introduce a new capability to enable Hyper-V Virtual Secure Mode (VSM)
emulation support.
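
Enabling it from user-space is the usual KVM_ENABLE_CAP sequence on the
VM file descriptor (a sketch; vm_fd is an assumption):

	struct kvm_enable_cap cap = { .cap = KVM_CAP_HYPERV_VSM };

	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
		err(1, "KVM_ENABLE_CAP(KVM_CAP_HYPERV_VSM)");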

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/hyperv.h           | 5 +++++
 arch/x86/kvm/x86.c              | 5 +++++
 include/uapi/linux/kvm.h        | 1 +
 4 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 00cd21b09f8c..7712e31b7537 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1118,6 +1118,8 @@ struct kvm_hv {
 
 	struct hv_partition_assist_pg *hv_pa_pg;
 	struct kvm_hv_syndbg hv_syndbg;
+
+	bool hv_enable_vsm;
 };
 
 struct msr_bitmap_range {
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index f83b8db72b11..2bfed69ba0db 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -238,4 +238,9 @@ static inline int kvm_hv_verify_vp_assist(struct kvm_vcpu *vcpu)
 
 int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
 
+static inline bool kvm_hv_vsm_enabled(struct kvm *kvm)
+{
+       return kvm->arch.hyperv.hv_enable_vsm;
+}
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4cd3f00475c1..b0512e433032 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4485,6 +4485,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_HYPERV_CPUID:
 	case KVM_CAP_HYPERV_ENFORCE_CPUID:
 	case KVM_CAP_SYS_HYPERV_CPUID:
+	case KVM_CAP_HYPERV_VSM:
 	case KVM_CAP_PCI_SEGMENT:
 	case KVM_CAP_DEBUGREGS:
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
@@ -6519,6 +6520,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_HYPERV_VSM:
+		kvm->arch.hyperv.hv_enable_vsm = true;
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5ce06a1eee2b..168b6ac6ebe5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1226,6 +1226,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_GUEST_MEMFD 233
 #define KVM_CAP_VM_TYPES 234
 #define KVM_CAP_APIC_ID_GROUPS 235
+#define KVM_CAP_HYPERV_VSM 237
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.40.1



* [RFC 08/33] KVM: x86: Don't use hv_timer if CAP_HYPERV_VSM enabled
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

VSM's VTLs are modeled by using a distinct vCPU per VTL. While one VTL
is running, the rest of the vCPUs are left idle. This doesn't play well
with tracking emulated timer expiration through the VMX preemption
timer: an inactive VTL's timers are still meant to run and inject
interrupts regardless of its runstate.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/lapic.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f55d216cb2a0..8cc75b24381b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -152,9 +152,10 @@ static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu)
 
 bool kvm_can_use_hv_timer(struct kvm_vcpu *vcpu)
 {
-	return kvm_x86_ops.set_hv_timer
-	       && !(kvm_mwait_in_guest(vcpu->kvm) ||
-		    kvm_can_post_timer_interrupt(vcpu));
+	return kvm_x86_ops.set_hv_timer &&
+	       !(kvm_mwait_in_guest(vcpu->kvm) ||
+		 kvm_can_post_timer_interrupt(vcpu)) &&
+	       !(kvm_hv_vsm_enabled(vcpu->kvm));
 }
 
 static bool kvm_use_posted_timer_interrupt(struct kvm_vcpu *vcpu)
-- 
2.40.1



* [RFC 09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Introduce two helper functions. The first one queries a vCPU's VTL; the
second one, given a struct kvm_vcpu and VTL pair, returns the
corresponding 'sibling' struct kvm_vcpu at that VTL.

We keep track of each VTL's state by having a distinct struct kvm_vcpu
for each level. VTL-vCPUs that belong to the same guest CPU share the
same physical APIC id, but belong to different APIC groups, where the
APIC group represents the vCPU's VTL.
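
A worked example, assuming KVM_SET_APIC_ID_GROUPS was called with
n_bits = 4 (mask 0xf0000000, shift 28) and illustrative ids:

	/* Guest CPU 1, VTL0: vcpu_id = 0x00000001 (group 0)
	 * Guest CPU 1, VTL1: vcpu_id = 0x10000001 (group 1)
	 *
	 * kvm_hv_get_vtl_vcpu(vtl0_vcpu, 1) ORs group 1 into the APIC
	 * id and resolves the sibling via kvm_get_vcpu_by_id(). */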

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index 2bfed69ba0db..5433107e7cc8 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -23,6 +23,7 @@
 
 #include <linux/kvm_host.h>
 #include "x86.h"
+#include "lapic.h"
 
 /* "Hv#1" signature */
 #define HYPERV_CPUID_SIGNATURE_EAX 0x31237648
@@ -83,6 +84,23 @@ static inline struct kvm_hv_syndbg *to_hv_syndbg(struct kvm_vcpu *vcpu)
 	return &vcpu->kvm->arch.hyperv.hv_syndbg;
 }
 
+static inline struct kvm_vcpu *kvm_hv_get_vtl_vcpu(struct kvm_vcpu *vcpu, int vtl)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u32 target_id = kvm_apic_id(vcpu);
+
+	kvm_apic_id_set_group(kvm, vtl, &target_id);
+	if (vcpu->vcpu_id == target_id)
+		return vcpu;
+
+	return kvm_get_vcpu_by_id(kvm, target_id);
+}
+
+static inline u8 kvm_hv_get_active_vtl(struct kvm_vcpu *vcpu)
+{
+	return kvm_apic_group(vcpu);
+}
+
 static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu)
 {
 	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
-- 
2.40.1



* [RFC 10/33] KVM: x86: hyper-v: Introduce KVM_HV_GET_VSM_STATE
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

HVCALL_GET_VP_REGISTERS exposes the VTL call hypercall page entry
offsets to the guest. This hypercall is implemented in user-space,
while the hypercall page patching happens in-kernel, so expose the
offsets as part of the partition-wide VSM state.

NOTE: Alternatively there is the option of sharing this information
through a VTL KVM device attribute (the device is introduced in
subsequent patches).
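
A sketch of the user-space consumer (vm_fd is an assumption; the bit
layout follows union hv_register_vsm_code_page_offsets):

	struct kvm_hv_vsm_state state;

	if (ioctl(vm_fd, KVM_HV_GET_VSM_STATE, &state))
		err(1, "KVM_HV_GET_VSM_STATE");

	/* 12-bit fields: VTL call offset, then VTL return offset */
	u16 vtl_call_offset   = state.vsm_code_page_offsets & 0xfff;
	u16 vtl_return_offset = (state.vsm_code_page_offsets >> 12) & 0xfff;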

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/include/uapi/asm/kvm.h |  5 +++++
 arch/x86/kvm/hyperv.c           |  8 ++++++++
 arch/x86/kvm/hyperv.h           |  2 ++
 arch/x86/kvm/x86.c              | 18 ++++++++++++++++++
 include/uapi/linux/kvm.h        |  4 ++++
 5 files changed, 37 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index f73d137784d7..370483d5d5fd 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -570,4 +570,9 @@ struct kvm_apic_id_groups {
 	__u8 n_bits; /* nr of bits used to represent group in the APIC ID */
 };
 
+/* for KVM_HV_GET_VSM_STATE */
+struct kvm_hv_vsm_state {
+	__u64 vsm_code_page_offsets;
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 2cf430f6ddd8..caaa859932c5 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2990,3 +2990,11 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
 
 	return 0;
 }
+
+int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *state)
+{
+	struct kvm_hv *hv = &kvm->arch.hyperv;
+
+	state->vsm_code_page_offsets = hv->vsm_code_page_offsets.as_u64;
+	return 0;
+}
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index 5433107e7cc8..b3d1113efe82 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -261,4 +261,6 @@ static inline bool kvm_hv_vsm_enabled(struct kvm *kvm)
        return kvm->arch.hyperv.hv_enable_vsm;
 }
 
+int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *state);
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0512e433032..57f9c58e1e32 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7132,6 +7132,24 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 		r = kvm_vm_ioctl_set_apic_id_groups(kvm, &groups);
 		break;
 	}
+	case KVM_HV_GET_VSM_STATE: {
+		struct kvm_hv_vsm_state vsm_state;
+
+		r = -EINVAL;
+		if (!kvm_hv_vsm_enabled(kvm))
+			goto out;
+
+		r = kvm_vm_ioctl_get_hv_vsm_state(kvm, &vsm_state);
+		if (r)
+			goto out;
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &vsm_state, sizeof(vsm_state)))
+			goto out;
+
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENOTTY;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 168b6ac6ebe5..03f5c08fd7aa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2316,4 +2316,8 @@ struct kvm_create_guest_memfd {
 #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
 
 #define KVM_SET_APIC_ID_GROUPS _IOW(KVMIO, 0xd7, struct kvm_apic_id_groups)
+
+/* Get/Set Hyper-V VSM state. Available with KVM_CAP_HYPERV_VSM */
+#define KVM_HV_GET_VSM_STATE _IOR(KVMIO, 0xd5, struct kvm_hv_vsm_state)
+
 #endif /* __LINUX_KVM_H */
-- 
2.40.1



* [RFC 11/33] KVM: x86: hyper-v: Handle GET/SET_VP_REGISTER hcall in user-space
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Let user-space handle HVCALL_GET_VP_REGISTERS and
HVCALL_SET_VP_REGISTERS through the KVM_EXIT_HYPERV_HVCALL exit reason.
Additionally, expose the HV_ACCESS_VP_REGISTERS CPUID bit.
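
On the user-space side this ends up looking roughly like the sketch
below; vmm_get_vp_registers()/vmm_set_vp_registers() are placeholders
for the VMM's own emulation, and 'run' is the vCPU's mmap'ed
struct kvm_run:

	/* KVM_EXIT_HYPERV with type KVM_EXIT_HYPERV_HCALL */
	static void handle_hv_hcall(struct kvm_run *run)
	{
		u16 code = run->hyperv.u.hcall.input & 0xffff;

		switch (code) {
		case HVCALL_GET_VP_REGISTERS:
			run->hyperv.u.hcall.result = vmm_get_vp_registers(run);
			break;
		case HVCALL_SET_VP_REGISTERS:
			run->hyperv.u.hcall.result = vmm_set_vp_registers(run);
			break;
		}
	}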

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c             | 9 +++++++++
 include/asm-generic/hyperv-tlfs.h | 1 +
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index caaa859932c5..a3970d52eef1 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2456,6 +2456,9 @@ static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
 
 static bool kvm_hv_is_xmm_output_hcall(u16 code)
 {
+	if (code == HVCALL_GET_VP_REGISTERS)
+		return true;
+
 	return false;
 }
 
@@ -2520,6 +2523,8 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc)
 	case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
 	case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
 	case HVCALL_SEND_IPI_EX:
+	case HVCALL_GET_VP_REGISTERS:
+	case HVCALL_SET_VP_REGISTERS:
 		return true;
 	}
 
@@ -2738,6 +2743,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 			break;
 		}
 		goto hypercall_userspace_exit;
+	case HVCALL_GET_VP_REGISTERS:
+	case HVCALL_SET_VP_REGISTERS:
+		goto hypercall_userspace_exit;
 	default:
 		ret = HV_STATUS_INVALID_HYPERCALL_CODE;
 		break;
@@ -2903,6 +2911,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
 			ent->ebx |= HV_POST_MESSAGES;
 			ent->ebx |= HV_SIGNAL_EVENTS;
 			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
+			ent->ebx |= HV_ACCESS_VP_REGISTERS;
 
 			ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
 			ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 40d7dc793c03..24ea699a3d8e 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -89,6 +89,7 @@
 #define HV_ACCESS_STATS				BIT(8)
 #define HV_DEBUGGING				BIT(11)
 #define HV_CPU_MANAGEMENT			BIT(12)
+#define HV_ACCESS_VP_REGISTERS			BIT(17)
 #define HV_ENABLE_EXTENDED_HYPERCALLS		BIT(20)
 #define HV_ISOLATION				BIT(22)
 
-- 
2.40.1



* [RFC 12/33] KVM: x86: hyper-v: Handle VSM hcalls in user-space
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Let user-space handle all hypercalls that fall under the AccessVsm
partition privilege flag. That is:
 - HVCALL_MODIFY_VTL_PROTECTION_MASK
 - HVCALL_ENABLE_PARTITION_VTL
 - HVCALL_ENABLE_VP_VTL
 - HVCALL_VTL_CALL
 - HVCALL_VTL_RETURN
The hypercalls are processed through the KVM_EXIT_HYPERV_HVCALL exit.
Additionally, expose the HV_ACCESS_VSM CPUID bit.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c             | 15 +++++++++++++++
 include/asm-generic/hyperv-tlfs.h |  7 ++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index a3970d52eef1..a266c5d393f5 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2462,6 +2462,11 @@ static bool kvm_hv_is_xmm_output_hcall(u16 code)
 	return false;
 }
 
+static inline bool kvm_hv_is_vtl_call_return(u16 code)
+{
+	return code == HVCALL_VTL_CALL || code == HVCALL_VTL_RETURN;
+}
+
 static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
 {
 	bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT);
@@ -2471,6 +2476,9 @@ static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
 	if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast)
 		kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
 
+	if (kvm_hv_is_vtl_call_return(code))
+		return kvm_skip_emulated_instruction(vcpu);
+
 	return kvm_hv_hypercall_complete(vcpu, result);
 }
 
@@ -2525,6 +2533,7 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc)
 	case HVCALL_SEND_IPI_EX:
 	case HVCALL_GET_VP_REGISTERS:
 	case HVCALL_SET_VP_REGISTERS:
+	case HVCALL_MODIFY_VTL_PROTECTION_MASK:
 		return true;
 	}
 
@@ -2745,6 +2754,11 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 		goto hypercall_userspace_exit;
 	case HVCALL_GET_VP_REGISTERS:
 	case HVCALL_SET_VP_REGISTERS:
+	case HVCALL_MODIFY_VTL_PROTECTION_MASK:
+	case HVCALL_ENABLE_PARTITION_VTL:
+	case HVCALL_ENABLE_VP_VTL:
+	case HVCALL_VTL_CALL:
+	case HVCALL_VTL_RETURN:
 		goto hypercall_userspace_exit;
 	default:
 		ret = HV_STATUS_INVALID_HYPERCALL_CODE;
@@ -2912,6 +2926,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
 			ent->ebx |= HV_SIGNAL_EVENTS;
 			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
 			ent->ebx |= HV_ACCESS_VP_REGISTERS;
+			ent->ebx |= HV_ACCESS_VSM;
 
 			ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
 			ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index 24ea699a3d8e..a8b5c8a84bbc 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -89,6 +89,7 @@
 #define HV_ACCESS_STATS				BIT(8)
 #define HV_DEBUGGING				BIT(11)
 #define HV_CPU_MANAGEMENT			BIT(12)
+#define HV_ACCESS_VSM				BIT(16)
 #define HV_ACCESS_VP_REGISTERS			BIT(17)
 #define HV_ENABLE_EXTENDED_HYPERCALLS		BIT(20)
 #define HV_ISOLATION				BIT(22)
@@ -147,9 +148,13 @@ union hv_reference_tsc_msr {
 /* Declare the various hypercall operations. */
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE	0x0002
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST	0x0003
-#define HVCALL_ENABLE_VP_VTL			0x000f
 #define HVCALL_NOTIFY_LONG_SPIN_WAIT		0x0008
 #define HVCALL_SEND_IPI				0x000b
+#define HVCALL_MODIFY_VTL_PROTECTION_MASK	0x000c
+#define HVCALL_ENABLE_PARTITION_VTL		0x000d
+#define HVCALL_ENABLE_VP_VTL			0x000f
+#define HVCALL_VTL_CALL				0x0011
+#define HVCALL_VTL_RETURN			0x0012
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX	0x0013
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX	0x0014
 #define HVCALL_SEND_IPI_EX			0x0015
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 13/33] KVM: Allow polling vCPUs for events
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (11 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 12/33] KVM: x86: hyper-v: Handle VSM hcalls " Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-28  7:30   ` Maxim Levitsky
  2023-11-08 11:17 ` [RFC 14/33] KVM: x86: Add VTL to the MMU role Nicolas Saenz Julienne
                   ` (21 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

A number of use cases have surfaced where it'd be beneficial to have a
vCPU stop its execution in user-space, as opposed to having it sleep
in-kernel: be it to make better use of the pCPU's time while the vCPU
is halted, or to implement security features like Hyper-V's VSM.

A problem with this approach is that user-space has no way of knowing
whether the vCPU has pending events (interrupts, timers, etc...), so we
need a new interface to query whether any are pending. poll() turned
out to be a very good fit.

So enable polling vCPUs. The poll() interface considers that a vCPU has
a pending event if it didn't enter the guest since being kicked by an
event source (being kicked forces a guest exit). Kicking a vCPU that
has pollers wakes up the polling threads.
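
A minimal sketch of the intended user-space usage, assuming 'vcpu_fd'
is the vCPU's file descriptor:

	struct pollfd pfd = {
		.fd = vcpu_fd,
		.events = POLLIN,
	};

	/* Blocks until the vCPU is kicked by an event source. A return
	 * value > 0 with POLLIN set means the vCPU has a pending event
	 * and is worth re-entering with KVM_RUN. */
	if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
		ioctl(vcpu_fd, KVM_RUN, 0);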

NOTES:
 - There is a race between the 'vcpu->kicked' check in the polling
   thread and the vCPU thread re-entering the guest. This hardly affects
   the use-cases stated above, but needs to be fixed.

 - This was tested alongside a WIP Hyper-V Virtual Trust Level
   implementation which makes ample use of the poll() interface.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/x86.c       |  2 ++
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 30 ++++++++++++++++++++++++++++++
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 57f9c58e1e32..bf4891bc044e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10788,6 +10788,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		goto cancel_injection;
 	}
 
+	WRITE_ONCE(vcpu->kicked, false);
+
 	if (req_immediate_exit) {
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
 		static_call(kvm_x86_request_immediate_exit)(vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 687589ce9f63..71e1e8cf8936 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -336,6 +336,7 @@ struct kvm_vcpu {
 #endif
 	int mode;
 	u64 requests;
+	bool kicked;
 	unsigned long guest_debug;
 
 	struct mutex mutex;
@@ -395,6 +396,7 @@ struct kvm_vcpu {
 	 */
 	struct kvm_memory_slot *last_used_slot;
 	u64 last_used_slot_gen;
+	wait_queue_head_t wqh;
 };
 
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ad9aab898a0c..fde004a0ac46 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -497,12 +497,14 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	kvm_vcpu_set_dy_eligible(vcpu, false);
 	vcpu->preempted = false;
 	vcpu->ready = false;
+	vcpu->kicked = false;
 	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
 	vcpu->last_used_slot = NULL;
 
 	/* Fill the stats id string for the vcpu */
 	snprintf(vcpu->stats_id, sizeof(vcpu->stats_id), "kvm-%d/vcpu-%d",
 		 task_pid_nr(current), id);
+	init_waitqueue_head(&vcpu->wqh);
 }
 
 static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
@@ -3970,6 +3972,10 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 		if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
 			smp_send_reschedule(cpu);
 	}
+
+	if (!cmpxchg(&vcpu->kicked, false, true))
+		wake_up_interruptible(&vcpu->wqh);
+
 out:
 	put_cpu();
 }
@@ -4174,6 +4180,29 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static __poll_t kvm_vcpu_poll(struct file *file, poll_table *wait)
+{
+	struct kvm_vcpu *vcpu = file->private_data;
+
+	poll_wait(file, &vcpu->wqh, wait);
+
+	/*
+	 * Make sure we read vcpu->kicked after adding the vcpu into
+	 * the waitqueue list. Otherwise we might have the following race:
+	 *
+	 *   READ_ONCE(vcpu->kicked)
+	 *					cmpxchg(&vcpu->kicked, false, true))
+	 *					wake_up_interruptible(&vcpu->wqh)
+	 *   list_add_tail(wait, &vcpu->wqh)
+	 */
+	smp_mb();
+	if (READ_ONCE(vcpu->kicked)) {
+		return EPOLLIN;
+	}
+
+	return 0;
+}
+
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
@@ -4186,6 +4215,7 @@ static const struct file_operations kvm_vcpu_fops = {
 	.release        = kvm_vcpu_release,
 	.unlocked_ioctl = kvm_vcpu_ioctl,
 	.mmap           = kvm_vcpu_mmap,
+	.poll		= kvm_vcpu_poll,
 	.llseek		= noop_llseek,
 	KVM_COMPAT(kvm_vcpu_compat_ioctl),
 };
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 14/33] KVM: x86: Add VTL to the MMU role
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (12 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 13/33] KVM: Allow polling vCPUs for events Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-08 17:26   ` Sean Christopherson
  2023-11-08 11:17 ` [RFC 15/33] KVM: x86/mmu: Introduce infrastructure to handle non-executable faults Nicolas Saenz Julienne
                   ` (20 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

With the upcoming introduction of per-VTL memory protections, make MMU
roles VTL-aware. This avoids sharing PTEs between vCPUs that belong to
different VTLs and therefore have distinct memory access restrictions.

Four bits are allocated to store the VTL number in the MMU role, since
the TLFS states there is a maximum of 16 levels.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/include/asm/kvm_host.h | 3 ++-
 arch/x86/kvm/hyperv.h           | 6 ++++++
 arch/x86/kvm/mmu.h              | 1 +
 arch/x86/kvm/mmu/mmu.c          | 3 +++
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7712e31b7537..1f5a85d461ce 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -338,7 +338,8 @@ union kvm_mmu_page_role {
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
 		unsigned passthrough:1;
-		unsigned :5;
+		unsigned vtl:4;
+		unsigned :1;
 
 		/*
 		 * This is left at the top of the word so that
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index b3d1113efe82..605e80b9e5eb 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -263,4 +263,10 @@ static inline bool kvm_hv_vsm_enabled(struct kvm *kvm)
 
 int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *state);
 
+static inline void kvm_mmu_role_set_hv_bits(struct kvm_vcpu *vcpu,
+					    union kvm_mmu_page_role *role)
+{
+	role->vtl = kvm_hv_get_active_vtl(vcpu);
+}
+
 #endif
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 253fb2093d5d..e170388c6da1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -304,4 +304,5 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
 		return gpa;
 	return translate_nested_gpa(vcpu, gpa, access, exception);
 }
+
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index baeba8fc1c38..2afef86863fb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -28,6 +28,7 @@
 #include "page_track.h"
 #include "cpuid.h"
 #include "spte.h"
+#include "hyperv.h"
 
 #include <linux/kvm_host.h>
 #include <linux/types.h>
@@ -5197,6 +5198,7 @@ static union kvm_cpu_role kvm_calc_cpu_role(struct kvm_vcpu *vcpu,
 	role.base.smm = is_smm(vcpu);
 	role.base.guest_mode = is_guest_mode(vcpu);
 	role.ext.valid = 1;
+	kvm_mmu_role_set_hv_bits(vcpu, &role.base);
 
 	if (!____is_cr0_pg(regs)) {
 		role.base.direct = 1;
@@ -5271,6 +5273,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	role.level = kvm_mmu_get_tdp_level(vcpu);
 	role.direct = true;
 	role.has_4_byte_gpte = false;
+	kvm_mmu_role_set_hv_bits(vcpu, &role);
 
 	return role;
 }
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 15/33] KVM: x86/mmu: Introduce infrastructure to handle non-executable faults
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (13 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 14/33] KVM: x86: Add VTL to the MMU role Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-28  7:34   ` Maxim Levitsky
  2023-11-08 11:17 ` [RFC 16/33] KVM: x86/mmu: Expose R/W/X flags during memory fault exits Nicolas Saenz Julienne
                   ` (19 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

The upcoming per-VTL memory protection support needs to fault in
non-executable memory. Introduce a new attribute in struct
kvm_page_fault, map_executable, to control whether the gfn range should
be mapped as executable.

No functional change intended.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c          | 6 +++++-
 arch/x86/kvm/mmu/mmu_internal.h | 2 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 8 ++++++--
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2afef86863fb..4e02d506cc25 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3245,6 +3245,7 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm_mmu_page *sp;
 	int ret;
 	gfn_t base_gfn = fault->gfn;
+	unsigned access = ACC_ALL;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
 
@@ -3274,7 +3275,10 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	if (WARN_ON_ONCE(it.level != fault->goal_level))
 		return -EFAULT;
 
-	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
+	if (!fault->map_executable)
+		access &= ~ACC_EXEC_MASK;
+
+	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, access,
 			   base_gfn, fault->pfn, fault);
 	if (ret == RET_PF_SPURIOUS)
 		return ret;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b66a7d47e0e4..bd62c4d5d5f1 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -239,6 +239,7 @@ struct kvm_page_fault {
 	kvm_pfn_t pfn;
 	hva_t hva;
 	bool map_writable;
+	bool map_executable;
 
 	/*
 	 * Indicates the guest is trying to write a gfn that contains one or
@@ -298,6 +299,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
 		.is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+		.map_executable = true,
 	};
 	int r;
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6cd4dd631a2f..46f3e72ab770 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -957,14 +957,18 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	u64 new_spte;
 	int ret = RET_PF_FIXED;
 	bool wrprot = false;
+	unsigned access = ACC_ALL;
 
 	if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
 		return RET_PF_RETRY;
 
+	if (!fault->map_executable)
+		access &= ~ACC_EXEC_MASK;
+
 	if (unlikely(!fault->slot))
-		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
+		new_spte = make_mmio_spte(vcpu, iter->gfn, access);
 	else
-		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
+		wrprot = make_spte(vcpu, sp, fault->slot, access, iter->gfn,
 					 fault->pfn, iter->old_spte, fault->prefetch, true,
 					 fault->map_writable, &new_spte);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 16/33] KVM: x86/mmu: Expose R/W/X flags during memory fault exits
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (14 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 15/33] KVM: x86/mmu: Introduce infrastructure to handle non-executable faults Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-28  7:36   ` Maxim Levitsky
  2023-11-08 11:17 ` [RFC 17/33] KVM: x86/mmu: Allow setting memory attributes if VSM enabled Nicolas Saenz Julienne
                   ` (18 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Include the fault's read, write and execute status when exiting to
user-space.
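
A sketch of how user-space might consume the new flags (names as
defined in this patch):

	if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		__u64 flags = run->memory_fault.flags;
		bool read  = flags & KVM_MEMORY_EXIT_FLAG_READ;
		bool write = flags & KVM_MEMORY_EXIT_FLAG_WRITE;
		bool exec  = flags & KVM_MEMORY_EXIT_FLAG_EXECUTE;

		/* run->memory_fault.gpa and .size delimit the faulting
		 * range; react based on the access type. */
	}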

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c   | 4 ++--
 include/linux/kvm_host.h | 9 +++++++--
 include/uapi/linux/kvm.h | 6 ++++++
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4e02d506cc25..feca077c0210 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4300,8 +4300,8 @@ static inline u8 kvm_max_level_for_order(int order)
 static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 					      struct kvm_page_fault *fault)
 {
-	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
-				      PAGE_SIZE, fault->write, fault->exec,
+	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT, PAGE_SIZE,
+				      fault->write, fault->exec, fault->user,
 				      fault->is_private);
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 71e1e8cf8936..631fd532c97a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2367,14 +2367,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 						 gpa_t gpa, gpa_t size,
 						 bool is_write, bool is_exec,
-						 bool is_private)
+						 bool is_read, bool is_private)
 {
 	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
 	vcpu->run->memory_fault.gpa = gpa;
 	vcpu->run->memory_fault.size = size;
 
-	/* RWX flags are not (yet) defined or communicated to userspace. */
 	vcpu->run->memory_fault.flags = 0;
+	if (is_read)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_READ;
+	if (is_write)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_WRITE;
+	if (is_exec)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_EXECUTE;
 	if (is_private)
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 03f5c08fd7aa..0ddffb8b0c99 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -533,7 +533,13 @@ struct kvm_run {
 		} notify;
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
+#define KVM_MEMORY_EXIT_FLAG_READ	(1ULL << 0)
+#define KVM_MEMORY_EXIT_FLAG_WRITE	(1ULL << 1)
+#define KVM_MEMORY_EXIT_FLAG_EXECUTE	(1ULL << 2)
 #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+#define KVM_MEMORY_EXIT_NO_ACCESS                              \
+	(KVM_MEMORY_EXIT_FLAG_READ | KVM_MEMORY_EXIT_FLAG_WRITE | \
+	 KVM_MEMORY_EXIT_FLAG_EXECUTE)
 			__u64 flags;
 			__u64 gpa;
 			__u64 size;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 17/33] KVM: x86/mmu: Allow setting memory attributes if VSM enabled
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (15 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 16/33] KVM: x86/mmu: Expose R/W/X flags during memory fault exits Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-28  7:39   ` Maxim Levitsky
  2023-11-08 11:17 ` [RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array Nicolas Saenz Julienne
                   ` (17 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

VSM is also a user of memory attributes, so let it use
kvm_set_mem_attributes().

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index feca077c0210..a1fbb905258b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7265,7 +7265,8 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 	 * Zapping SPTEs in this case ensures KVM will reassess whether or not
 	 * a hugepage can be used for affected ranges.
 	 */
-	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm) &&
+			 !kvm_hv_vsm_enabled(kvm)))
 		return false;
 
 	return kvm_unmap_gfn_range(kvm, range);
@@ -7322,7 +7323,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 	 * a range that has PRIVATE GFNs, and conversely converting a range to
 	 * SHARED may now allow hugepages.
 	 */
-	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm) &&
+			 !kvm_hv_vsm_enabled(kvm)))
 		return false;
 
 	/*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (16 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 17/33] KVM: x86/mmu: Allow setting memory attributes if VSM enabled Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-08 16:59   ` Sean Christopherson
  2023-11-28  7:41   ` Maxim Levitsky
  2023-11-08 11:17 ` [RFC 19/33] KVM: x86: Decouple kvm_range_has_memory_attributes() " Nicolas Saenz Julienne
                   ` (16 subsequent siblings)
  34 siblings, 2 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array to
allow other memory attribute sources to use the function.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c   | 5 +++--
 include/linux/kvm_host.h | 8 +++++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a1fbb905258b..96421234ca88 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7301,7 +7301,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 
 	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
 		if (hugepage_test_mixed(slot, gfn, level - 1) ||
-		    attrs != kvm_get_memory_attributes(kvm, gfn))
+		    attrs != kvm_get_memory_attributes(&kvm->mem_attr_array, gfn))
 			return false;
 	}
 	return true;
@@ -7401,7 +7401,8 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
 		 * be manually checked as the attributes may already be mixed.
 		 */
 		for (gfn = start; gfn < end; gfn += nr_pages) {
-			unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);
+			unsigned long attrs =
+				kvm_get_memory_attributes(&kvm->mem_attr_array, gfn);
 
 			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
 				hugepage_clear_mixed(slot, gfn, level);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 631fd532c97a..4242588e3dfb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2385,9 +2385,10 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 }
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
-static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
+static inline unsigned long
+kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn)
 {
-	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
+	return xa_to_value(xa_load(mem_attr_array, gfn));
 }
 
 bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
@@ -2400,7 +2401,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
-	       kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+	       kvm_get_memory_attributes(&kvm->mem_attr_array, gfn) &
+		       KVM_MEMORY_ATTRIBUTE_PRIVATE;
 }
 #else
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 19/33] KVM: x86: Decouple kvm_range_has_memory_attributes() from struct kvm's mem_attr_array
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (17 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-28  7:42   ` Maxim Levitsky
  2023-11-08 11:17 ` [RFC 20/33] KVM: x86/mmu: Decouple hugepage_has_attrs() " Nicolas Saenz Julienne
                   ` (15 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Decouple kvm_range_has_memory_attributes() from struct kvm's
mem_attr_array to allow other memory attribute sources to use the
function.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c   | 3 ++-
 include/linux/kvm_host.h | 4 ++--
 virt/kvm/kvm_main.c      | 9 +++++----
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 96421234ca88..4ace2f8660b0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7297,7 +7297,8 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
 
 	if (level == PG_LEVEL_2M)
-		return kvm_range_has_memory_attributes(kvm, start, end, attrs);
+		return kvm_range_has_memory_attributes(&kvm->mem_attr_array,
+						       start, end, attrs);
 
 	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
 		if (hugepage_test_mixed(slot, gfn, level - 1) ||
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4242588e3dfb..32cf05637647 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2391,8 +2391,8 @@ kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn)
 	return xa_to_value(xa_load(mem_attr_array, gfn));
 }
 
-bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-				     unsigned long attrs);
+bool kvm_range_has_memory_attributes(struct xarray *mem_attr_array, gfn_t start,
+				     gfn_t end, unsigned long attrs);
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fde004a0ac46..6bb23eaf7aa6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2440,10 +2440,10 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
  * Returns true if _all_ gfns in the range [@start, @end) have attributes
  * matching @attrs.
  */
-bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-				     unsigned long attrs)
+bool kvm_range_has_memory_attributes(struct xarray *mem_attr_array, gfn_t start,
+				     gfn_t end, unsigned long attrs)
 {
-	XA_STATE(xas, &kvm->mem_attr_array, start);
+	XA_STATE(xas, mem_attr_array, start);
 	unsigned long index;
 	bool has_attrs;
 	void *entry;
@@ -2582,7 +2582,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	mutex_lock(&kvm->slots_lock);
 
 	/* Nothing to do if the entire range as the desired attributes. */
-	if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
+	if (kvm_range_has_memory_attributes(&kvm->mem_attr_array, start, end,
+					    attributes))
 		goto out_unlock;
 
 	/*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 20/33] KVM: x86/mmu: Decouple hugepage_has_attrs() from struct kvm's mem_attr_array
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (18 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 19/33] KVM: x86: Decouple kvm_range_has_memory_attributes() " Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-28  7:43   ` Maxim Levitsky
  2023-11-08 11:17 ` [RFC 21/33] KVM: Pass memory attribute array as a MMU notifier argument Nicolas Saenz Julienne
                   ` (14 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Decouple hugepage_has_attrs() from struct kvm's mem_attr_array to
allow other memory attribute sources to use the function.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4ace2f8660b0..c0fd3afd6be5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7290,19 +7290,19 @@ static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
 	lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG;
 }
 
-static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
-			       gfn_t gfn, int level, unsigned long attrs)
+static bool hugepage_has_attrs(struct xarray *mem_attr_array,
+			       struct kvm_memory_slot *slot, gfn_t gfn,
+			       int level, unsigned long attrs)
 {
 	const unsigned long start = gfn;
 	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
 
 	if (level == PG_LEVEL_2M)
-		return kvm_range_has_memory_attributes(&kvm->mem_attr_array,
-						       start, end, attrs);
+		return kvm_range_has_memory_attributes(mem_attr_array, start, end, attrs);
 
 	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
 		if (hugepage_test_mixed(slot, gfn, level - 1) ||
-		    attrs != kvm_get_memory_attributes(&kvm->mem_attr_array, gfn))
+		    attrs != kvm_get_memory_attributes(mem_attr_array, gfn))
 			return false;
 	}
 	return true;
@@ -7344,7 +7344,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 			 * misaligned address regardless of memory attributes.
 			 */
 			if (gfn >= slot->base_gfn) {
-				if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+				if (hugepage_has_attrs(&kvm->mem_attr_array,
+						       slot, gfn, level, attrs))
 					hugepage_clear_mixed(slot, gfn, level);
 				else
 					hugepage_set_mixed(slot, gfn, level);
@@ -7366,7 +7367,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 		 */
 		if (gfn < range->end &&
 		    (gfn + nr_pages) <= (slot->base_gfn + slot->npages)) {
-			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+			if (hugepage_has_attrs(&kvm->mem_attr_array, slot, gfn,
+					       level, attrs))
 				hugepage_clear_mixed(slot, gfn, level);
 			else
 				hugepage_set_mixed(slot, gfn, level);
@@ -7405,7 +7407,7 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
 			unsigned long attrs =
 				kvm_get_memory_attributes(&kvm->mem_attr_array, gfn);
 
-			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+			if (hugepage_has_attrs(&kvm->mem_attr_array, slot, gfn, level, attrs))
 				hugepage_clear_mixed(slot, gfn, level);
 			else
 				hugepage_set_mixed(slot, gfn, level);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 21/33] KVM: Pass memory attribute array as a MMU notifier argument
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (19 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 20/33] KVM: x86/mmu: Decouple hugepage_has_attrs() " Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-08 17:08   ` Sean Christopherson
  2023-11-08 11:17 ` [RFC 22/33] KVM: Decouple kvm_ioctl_set_mem_attributes() from kvm's mem_attr_array Nicolas Saenz Julienne
                   ` (13 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Pass the memory attribute array through struct kvm_mmu_notifier_arg and
use it in kvm_arch_post_set_memory_attributes() instead of defaulting
to kvm->mem_attr_array.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c   | 8 ++++----
 include/linux/kvm_host.h | 5 ++++-
 virt/kvm/kvm_main.c      | 1 +
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c0fd3afd6be5..c2bec2be2ba9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7311,6 +7311,7 @@ static bool hugepage_has_attrs(struct xarray *mem_attr_array,
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range)
 {
+	struct xarray *mem_attr_array = range->arg.mem_attr_array;
 	unsigned long attrs = range->arg.attributes;
 	struct kvm_memory_slot *slot = range->slot;
 	int level;
@@ -7344,8 +7345,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 			 * misaligned address regardless of memory attributes.
 			 */
 			if (gfn >= slot->base_gfn) {
-				if (hugepage_has_attrs(&kvm->mem_attr_array,
-						       slot, gfn, level, attrs))
+				if (hugepage_has_attrs(mem_attr_array, slot,
+						       gfn, level, attrs))
 					hugepage_clear_mixed(slot, gfn, level);
 				else
 					hugepage_set_mixed(slot, gfn, level);
@@ -7367,8 +7368,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 		 */
 		if (gfn < range->end &&
 		    (gfn + nr_pages) <= (slot->base_gfn + slot->npages)) {
-			if (hugepage_has_attrs(&kvm->mem_attr_array, slot, gfn,
-					       level, attrs))
+			if (hugepage_has_attrs(mem_attr_array, slot, gfn, level, attrs))
 				hugepage_clear_mixed(slot, gfn, level);
 			else
 				hugepage_set_mixed(slot, gfn, level);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 32cf05637647..652656444c45 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,7 +256,10 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
 	pte_t pte;
-	unsigned long attributes;
+	struct {
+		unsigned long attributes;
+		struct xarray *mem_attr_array;
+	};
 };
 
 struct kvm_gfn_range {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6bb23eaf7aa6..f20dafaedc72 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2569,6 +2569,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		.start = start,
 		.end = end,
 		.arg.attributes = attributes,
+		.arg.mem_attr_array = &kvm->mem_attr_array,
 		.handler = kvm_arch_post_set_memory_attributes,
 		.on_lock = kvm_mmu_invalidate_end,
 		.may_block = true,
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 22/33] KVM: Decouple kvm_ioctl_set_mem_attributes() from kvm's mem_attr_array
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (20 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 21/33] KVM: Pass memory attribute array as a MMU notifier argument Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-08 11:17 ` [RFC 23/33] KVM: Expose memory attribute helper functions unconditionally Nicolas Saenz Julienne
                   ` (12 subsequent siblings)
  34 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

VSM will keep track of each VTL's memory protections in a separate
mem_attr_array. Access to these arrays will happen by issuing
KVM_SET_MEMORY_ATTRIBUTES ioctls to their respective KVM VTL devices
(which are also introduced in subsequent patches). Let the VTL devices
reuse kvm_ioctl_set_mem_attributes() by decoupling it from struct kvm's
mem_attr_array. The xarray, as well as the list of supported memory
attributes, is now passed in as a function argument.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 include/linux/kvm_host.h |  3 +++
 virt/kvm/kvm_main.c      | 32 ++++++++++++++++++++++----------
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 652656444c45..ad104794037f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2394,6 +2394,9 @@ kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn)
 	return xa_to_value(xa_load(mem_attr_array, gfn));
 }
 
+int kvm_ioctl_set_mem_attributes(struct kvm *kvm, struct xarray *mem_attr_array,
+				 u64 supported_attrs,
+				 struct kvm_memory_attributes *attrs);
 bool kvm_range_has_memory_attributes(struct xarray *mem_attr_array, gfn_t start,
 				     gfn_t end, unsigned long attrs);
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f20dafaedc72..74c4c42b2126 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2554,8 +2554,9 @@ static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
 }
 
 /* Set @attributes for the gfn range [@start, @end). */
-static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-				     unsigned long attributes)
+static int kvm_set_mem_attributes(struct kvm *kvm,
+				  struct xarray *mem_attr_array, gfn_t start,
+				  gfn_t end, unsigned long attributes)
 {
 	struct kvm_mmu_notifier_range pre_set_range = {
 		.start = start,
@@ -2569,7 +2570,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		.start = start,
 		.end = end,
 		.arg.attributes = attributes,
-		.arg.mem_attr_array = &kvm->mem_attr_array,
+		.arg.mem_attr_array = mem_attr_array,
 		.handler = kvm_arch_post_set_memory_attributes,
 		.on_lock = kvm_mmu_invalidate_end,
 		.may_block = true,
@@ -2583,7 +2584,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	mutex_lock(&kvm->slots_lock);
 
 	/* Nothing to do if the entire range as the desired attributes. */
-	if (kvm_range_has_memory_attributes(&kvm->mem_attr_array, start, end,
+	if (kvm_range_has_memory_attributes(mem_attr_array, start, end,
 					    attributes))
 		goto out_unlock;
 
@@ -2592,7 +2593,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	 * partway through setting the new attributes.
 	 */
 	for (i = start; i < end; i++) {
-		r = xa_reserve(&kvm->mem_attr_array, i, GFP_KERNEL_ACCOUNT);
+		r = xa_reserve(mem_attr_array, i, GFP_KERNEL_ACCOUNT);
 		if (r)
 			goto out_unlock;
 	}
@@ -2600,7 +2601,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	kvm_handle_gfn_range(kvm, &pre_set_range);
 
 	for (i = start; i < end; i++) {
-		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
+		r = xa_err(xa_store(mem_attr_array, i, entry,
 				    GFP_KERNEL_ACCOUNT));
 		KVM_BUG_ON(r, kvm);
 	}
@@ -2612,15 +2613,17 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
 	return r;
 }
-static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
-					   struct kvm_memory_attributes *attrs)
+
+int kvm_ioctl_set_mem_attributes(struct kvm *kvm, struct xarray *mem_attr_array,
+				 u64 supported_attrs,
+				 struct kvm_memory_attributes *attrs)
 {
 	gfn_t start, end;
 
 	/* flags is currently not used. */
 	if (attrs->flags)
 		return -EINVAL;
-	if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))
+	if (attrs->attributes & ~supported_attrs)
 		return -EINVAL;
 	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
 		return -EINVAL;
@@ -2637,7 +2640,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 	 */
 	BUILD_BUG_ON(sizeof(attrs->attributes) != sizeof(unsigned long));
 
-	return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
+	return kvm_set_mem_attributes(kvm, mem_attr_array, start, end,
+				      attrs->attributes);
+}
+
+static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
+					   struct kvm_memory_attributes *attrs)
+{
+	return kvm_ioctl_set_mem_attributes(kvm, &kvm->mem_attr_array,
+					    kvm_supported_mem_attributes(kvm),
+					    attrs);
 }
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 23/33] KVM: Expose memory attribute helper functions unconditionally
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (21 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 22/33] KVM: Decouple kvm_ioctl_set_mem_attributes() from kvm's mem_attr_array Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-08 11:17 ` [RFC 24/33] KVM: x86: hyper-v: Introduce KVM VTL device Nicolas Saenz Julienne
                   ` (11 subsequent siblings)
  34 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Expose memory attribute helper functions even when
CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES is disabled. Other KVM features,
like Hyper-V VSM, make use of memory attributes but don't rely on the
KVM ioctl.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h |  2 +-
 virt/kvm/kvm_main.c      | 18 +++++++++---------
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c2bec2be2ba9..a76028aa8fb3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7250,7 +7250,6 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range)
 {
@@ -7377,6 +7376,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 	return false;
 }
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
 					    struct kvm_memory_slot *slot)
 {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ad104794037f..45e3e261755d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2387,7 +2387,6 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long
 kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn)
 {
@@ -2404,6 +2403,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 74c4c42b2126..b3f4b200f438 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2435,7 +2435,6 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 /*
  * Returns true if _all_ gfns in the range [@start, @end) have attributes
  * matching @attrs.
@@ -2472,14 +2471,6 @@ bool kvm_range_has_memory_attributes(struct xarray *mem_attr_array, gfn_t start,
 	return has_attrs;
 }
 
-static u64 kvm_supported_mem_attributes(struct kvm *kvm)
-{
-	if (!kvm || kvm_arch_has_private_mem(kvm))
-		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
-
-	return 0;
-}
-
 static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 						 struct kvm_mmu_notifier_range *range)
 {
@@ -2644,6 +2635,15 @@ int kvm_ioctl_set_mem_attributes(struct kvm *kvm, struct xarray *mem_attr_array,
 				      attrs->attributes);
 }
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+static u64 kvm_supported_mem_attributes(struct kvm *kvm)
+{
+	if (!kvm || kvm_arch_has_private_mem(kvm))
+		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	return 0;
+}
+
 static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 					   struct kvm_memory_attributes *attrs)
 {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 24/33] KVM: x86: hyper-v: Introduce KVM VTL device
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (22 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 23/33] KVM: Expose memory attribute helper functions unconditionally Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-08 11:17 ` [RFC 25/33] KVM: Introduce a set of new memory attributes Nicolas Saenz Julienne
                   ` (10 subsequent siblings)
  34 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Introduce a new KVM device aimed at tracking partition-wide VTL state;
it'll be the one responsible for keeping track of each VTL's memory
protections. For now its functionality is limited: it only exposes its
VTL level through a device attribute. Additionally, the device type is
only registered if the VSM cap is enabled.
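
A sketch of the intended user-space flow, assuming KVM_CAP_HYPERV_VSM
was enabled beforehand (error handling omitted):

	__u32 vtl;
	struct kvm_create_device cd = {
		.type = KVM_DEV_TYPE_HV_VSM_VTL,
	};
	struct kvm_device_attr attr = {
		.group = KVM_DEV_HV_VTL_GROUP,
		.attr = KVM_DEV_HV_VTL_GROUP_VTLNUM,
		.addr = (__u64)(unsigned long)&vtl,
	};

	/* Each KVM_CREATE_DEVICE call instantiates the next VTL. */
	ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);
	/* Query which VTL this device represents. */
	ioctl(cd.fd, KVM_GET_DEVICE_ATTR, &attr);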

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c    | 68 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/hyperv.h    |  3 ++
 arch/x86/kvm/x86.c       |  3 ++
 include/uapi/linux/kvm.h |  5 +++
 4 files changed, 79 insertions(+)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index a266c5d393f5..0d8402dba596 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -3022,3 +3022,71 @@ int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *stat
 	state->vsm_code_page_offsets = hv->vsm_code_page_offsets.as_u64;
 	return 0;
 }
+
+struct kvm_hv_vtl_dev {
+	int vtl;
+};
+
+static int kvm_hv_vtl_get_attr(struct kvm_device *dev,
+			       struct kvm_device_attr *attr)
+{
+	struct kvm_hv_vtl_dev *vtl_dev = dev->private;
+
+	switch (attr->group) {
+	case KVM_DEV_HV_VTL_GROUP:
+		switch (attr->attr) {
+		case KVM_DEV_HV_VTL_GROUP_VTLNUM:
+			return put_user(vtl_dev->vtl, (u32 __user *)attr->addr);
+		}
+	}
+
+	return -EINVAL;
+}
+
+static void kvm_hv_vtl_release(struct kvm_device *dev)
+{
+	struct kvm_hv_vtl_dev *vtl_dev = dev->private;
+
+	kfree(vtl_dev);
+	kfree(dev); /* alloc by kvm_ioctl_create_device, free by .release */
+}
+
+static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type);
+
+static struct kvm_device_ops kvm_hv_vtl_ops = {
+	.name = "kvm-hv-vtl",
+	.create = kvm_hv_vtl_create,
+	.release = kvm_hv_vtl_release,
+	.get_attr = kvm_hv_vtl_get_attr,
+};
+
+static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type)
+{
+	struct kvm_hv_vtl_dev *vtl_dev;
+	struct kvm_device *tmp;
+	int vtl = 0;
+
+	vtl_dev = kzalloc(sizeof(*vtl_dev), GFP_KERNEL_ACCOUNT);
+	if (!vtl_dev)
+		return -ENOMEM;
+
+	/* Device creation is protected by kvm->lock */
+	list_for_each_entry(tmp, &dev->kvm->devices, vm_node)
+		if (tmp->ops == &kvm_hv_vtl_ops)
+			vtl++;
+
+	vtl_dev->vtl = vtl;
+	dev->private = vtl_dev;
+
+	return 0;
+}
+
+int kvm_hv_vtl_dev_register(void)
+{
+	return kvm_register_device_ops(&kvm_hv_vtl_ops, KVM_DEV_TYPE_HV_VSM_VTL);
+}
+
+void kvm_hv_vtl_dev_unregister(void)
+{
+	kvm_unregister_device_ops(KVM_DEV_TYPE_HV_VSM_VTL);
+}
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index 605e80b9e5eb..3cc664e144d8 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -269,4 +269,7 @@ static inline void kvm_mmu_role_set_hv_bits(struct kvm_vcpu *vcpu,
 	role->vtl = kvm_hv_get_active_vtl(vcpu);
 }
 
+int kvm_hv_vtl_dev_register(void);
+void kvm_hv_vtl_dev_unregister(void);
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bf4891bc044e..82d3b86d9c93 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6521,6 +6521,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		mutex_unlock(&kvm->lock);
 		break;
 	case KVM_CAP_HYPERV_VSM:
+		kvm_hv_vtl_dev_register();
 		kvm->arch.hyperv.hv_enable_vsm = true;
 		r = 0;
 		break;
@@ -9675,6 +9676,8 @@ void kvm_x86_vendor_exit(void)
 	mutex_lock(&vendor_module_lock);
 	kvm_x86_ops.hardware_enable = NULL;
 	mutex_unlock(&vendor_module_lock);
+
+	kvm_hv_vtl_dev_unregister();
 }
 EXPORT_SYMBOL_GPL(kvm_x86_vendor_exit);
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0ddffb8b0c99..bd97c9852142 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1471,6 +1471,9 @@ struct kvm_device_attr {
 #define   KVM_DEV_VFIO_GROUP_DEL	KVM_DEV_VFIO_FILE_DEL
 #define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
+#define KVM_DEV_HV_VTL_GROUP		1
+#define  KVM_DEV_HV_VTL_GROUP_VTLNUM		1
+
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
 #define KVM_DEV_TYPE_FSL_MPIC_20	KVM_DEV_TYPE_FSL_MPIC_20
@@ -1494,6 +1497,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_PV_TIME	KVM_DEV_TYPE_ARM_PV_TIME
 	KVM_DEV_TYPE_RISCV_AIA,
 #define KVM_DEV_TYPE_RISCV_AIA		KVM_DEV_TYPE_RISCV_AIA
+	KVM_DEV_TYPE_HV_VSM_VTL,
+#define KVM_DEV_TYPE_HV_VSM_VTL		KVM_DEV_TYPE_HV_VSM_VTL
 	KVM_DEV_TYPE_MAX,
 };
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 25/33] KVM: Introduce a set of new memory attributes
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (23 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 24/33] KVM: x86: hyper-v: Introduce KVM VTL device Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-08 12:30   ` Alexander Graf
  2023-11-08 11:17 ` [RFC 26/33] KVM: x86: hyper-vsm: Allow setting per-VTL " Nicolas Saenz Julienne
                   ` (9 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Introduce the following memory attributes:
 - KVM_MEMORY_ATTRIBUTE_READ
 - KVM_MEMORY_ATTRIBUTE_WRITE
 - KVM_MEMORY_ATTRIBUTE_EXECUTE
 - KVM_MEMORY_ATTRIBUTE_NO_ACCESS

Note that NO_ACCESS is necessary in order to distinguish between the
lack of attributes for a gfn, which defaults to the memory protections
of the backing memory, and explicitly prohibiting any access to that
gfn.
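
For instance, on the ioctl side (a sketch; 'attrs' is the struct
kvm_memory_attributes handed to KVM_SET_MEMORY_ATTRIBUTES):

	/* No attributes: the gfn range falls back to the protections
	 * of the backing memory. */
	attrs.attributes = 0;

	/* Explicitly revoke all access to the gfn range. */
	attrs.attributes = KVM_MEMORY_ATTRIBUTE_NO_ACCESS;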

These new memory attributes will, for now, only be made available
through the VSM KVM device (which we introduce in subsequent patches).

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 include/uapi/linux/kvm.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bd97c9852142..6b875c1040eb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2314,7 +2314,11 @@ struct kvm_memory_attributes {
 	__u64 flags;
 };
 
+#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
+#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
+#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+#define KVM_MEMORY_ATTRIBUTE_NO_ACCESS         (1ULL << 4)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 26/33] KVM: x86: hyper-vsm: Allow setting per-VTL memory attributes
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (24 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 25/33] KVM: Introduce a set of new memory attributes Nicolas Saenz Julienne
@ 2023-11-08 11:17 ` Nicolas Saenz Julienne
  2023-11-28  7:44   ` Maxim Levitsky
  2023-11-08 11:18 ` [RFC 27/33] KVM: x86/mmu/hyper-v: Validate memory faults against per-VTL memprots Nicolas Saenz Julienne
                   ` (8 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:17 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Introduce KVM_SET_MEMORY_ATTRIBUTES ioctl support for VTL KVM devices.
The attributes are stored in an xarray private to the VTL device.

The following memory attributes are supported:
 - KVM_MEMORY_ATTRIBUTE_READ
 - KVM_MEMORY_ATTRIBUTE_WRITE
 - KVM_MEMORY_ATTRIBUTE_EXECUTE
 - KVM_MEMORY_ATTRIBUTE_NO_ACCESS
Only some combinations are valid; see the code comment below and the
usage sketch that follows.
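
A sketch of marking a range read-only/execute through a VTL device,
assuming 'vtl_fd' was returned by KVM_CREATE_DEVICE:

	struct kvm_memory_attributes attrs = {
		.address = gpa,			/* page-aligned */
		.size = npages * PAGE_SIZE,
		.attributes = KVM_MEMORY_ATTRIBUTE_READ |
			      KVM_MEMORY_ATTRIBUTE_EXECUTE,
	};

	ioctl(vtl_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);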

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c | 61 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 0d8402dba596..bcace0258af1 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -62,6 +62,10 @@
  */
 #define HV_EXT_CALL_MAX (HV_EXT_CALL_QUERY_CAPABILITIES + 64)
 
+#define KVM_HV_VTL_ATTRS						\
+	(KVM_MEMORY_ATTRIBUTE_READ | KVM_MEMORY_ATTRIBUTE_WRITE |	\
+	 KVM_MEMORY_ATTRIBUTE_EXECUTE | KVM_MEMORY_ATTRIBUTE_NO_ACCESS)
+
 static void stimer_mark_pending(struct kvm_vcpu_hv_stimer *stimer,
 				bool vcpu_kick);
 
@@ -3025,6 +3029,7 @@ int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *stat
 
 struct kvm_hv_vtl_dev {
 	int vtl;
+	struct xarray mem_attrs;
 };
 
 static int kvm_hv_vtl_get_attr(struct kvm_device *dev,
@@ -3047,16 +3052,71 @@ static void kvm_hv_vtl_release(struct kvm_device *dev)
 {
 	struct kvm_hv_vtl_dev *vtl_dev = dev->private;
 
+	xa_destroy(&vtl_dev->mem_attrs);
 	kfree(vtl_dev);
 	kfree(dev); /* alloc by kvm_ioctl_create_device, free by .release */
 }
 
+/*
+ * The TLFS lists the valid memory protection combinations (15.9.3):
+ *  - No access
+ *  - Read-only, no execute
+ *  - Read-only, execute
+ *  - Read/write, no execute
+ *  - Read/write, execute
+ */
+static bool kvm_hv_validate_vtl_mem_attributes(struct kvm_memory_attributes *attrs)
+{
+	u64 attr = attrs->attributes;
+
+	if (attr & ~KVM_HV_VTL_ATTRS)
+		return false;
+
+	if (attr == KVM_MEMORY_ATTRIBUTE_NO_ACCESS)
+		return true;
+
+	if (!(attr & KVM_MEMORY_ATTRIBUTE_READ))
+		return false;
+
+	return true;
+}
+
+static long kvm_hv_vtl_ioctl(struct kvm_device *dev, unsigned int ioctl,
+			     unsigned long arg)
+{
+	switch (ioctl) {
+	case KVM_SET_MEMORY_ATTRIBUTES: {
+		struct kvm_hv_vtl_dev *vtl_dev = dev->private;
+		struct kvm_memory_attributes attrs;
+		int r;
+
+		if (copy_from_user(&attrs, (void __user *)arg, sizeof(attrs)))
+			return -EFAULT;
+
+		r = -EINVAL;
+		if (!kvm_hv_validate_vtl_mem_attributes(&attrs))
+			return r;
+
+		r = kvm_ioctl_set_mem_attributes(dev->kvm, &vtl_dev->mem_attrs,
+						 KVM_HV_VTL_ATTRS, &attrs);
+		if (r)
+			return r;
+		break;
+	}
+	default:
+		return -ENOTTY;
+	}
+
+	return 0;
+}
+
 static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type);
 
 static struct kvm_device_ops kvm_hv_vtl_ops = {
 	.name = "kvm-hv-vtl",
 	.create = kvm_hv_vtl_create,
 	.release = kvm_hv_vtl_release,
+	.ioctl = kvm_hv_vtl_ioctl,
 	.get_attr = kvm_hv_vtl_get_attr,
 };
 
@@ -3076,6 +3136,7 @@ static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type)
 			vtl++;
 
 	vtl_dev->vtl = vtl;
+	xa_init(&vtl_dev->mem_attrs);
 	dev->private = vtl_dev;
 
 	return 0;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 27/33] KVM: x86/mmu/hyper-v: Validate memory faults against per-VTL memprots
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (25 preceding siblings ...)
  2023-11-08 11:17 ` [RFC 26/33] KVM: x86: hyper-vsm: Allow setting per-VTL " Nicolas Saenz Julienne
@ 2023-11-08 11:18 ` Nicolas Saenz Julienne
  2023-11-28  7:46   ` Maxim Levitsky
  2023-11-08 11:18 ` [RFC 28/33] x86/hyper-v: Introduce memory intercept message structure Nicolas Saenz Julienne
                   ` (7 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:18 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Introduce a new step in __kvm_faultin_pfn() that validates the fault
against the vCPU's VTL protections and generates a user-space exit when
the access is invalid.

Note that kvm_hv_faultin_pfn() has to run after resolving the fault
against the memslots, since that operation overwrites
'fault->map_writable'.

Non-VSM users shouldn't see any behaviour change.
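
From user-space's perspective, a denied access is expected to surface
as a KVM_EXIT_MEMORY_FAULT carrying the R/W/X flags added earlier in
the series; a sketch (handle_secure_intercept() is a hypothetical VMM
helper that forwards the violation to the higher VTL):

	if (run->exit_reason == KVM_EXIT_MEMORY_FAULT)
		handle_secure_intercept(run->memory_fault.gpa,
					run->memory_fault.size,
					run->memory_fault.flags);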

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c  | 66 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/hyperv.h  |  1 +
 arch/x86/kvm/mmu/mmu.c |  9 +++++-
 3 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index bcace0258af1..eb6a4848e306 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -42,6 +42,8 @@
 #include "irq.h"
 #include "fpu.h"
 
+#include "mmu/mmu_internal.h"
+
 #define KVM_HV_MAX_SPARSE_VCPU_SET_BITS DIV_ROUND_UP(KVM_MAX_VCPUS, HV_VCPUS_PER_SPARSE_BANK)
 
 /*
@@ -3032,6 +3034,55 @@ struct kvm_hv_vtl_dev {
 	struct xarray mem_attrs;
 };
 
+static struct xarray *kvm_hv_vsm_get_memprots(struct kvm_vcpu *vcpu);
+
+bool kvm_hv_vsm_access_valid(struct kvm_page_fault *fault, unsigned long attrs)
+{
+	if (attrs == KVM_MEMORY_ATTRIBUTE_NO_ACCESS)
+		return false;
+
+	/* We should never get here without read permissions, force a fault. */
+	if (WARN_ON_ONCE(!(attrs & KVM_MEMORY_ATTRIBUTE_READ)))
+		return false;
+
+	if (fault->write && !(attrs & KVM_MEMORY_ATTRIBUTE_WRITE))
+		return false;
+
+	if (fault->exec && !(attrs & KVM_MEMORY_ATTRIBUTE_EXECUTE))
+		return false;
+
+	return true;
+}
+
+static unsigned long kvm_hv_vsm_get_memory_attributes(struct kvm_vcpu *vcpu,
+						      gfn_t gfn)
+{
+	struct xarray *prots = kvm_hv_vsm_get_memprots(vcpu);
+
+	if (!prots)
+		return 0;
+
+	return xa_to_value(xa_load(prots, gfn));
+}
+
+int kvm_hv_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+	unsigned long attrs;
+
+	attrs = kvm_hv_vsm_get_memory_attributes(vcpu, fault->gfn);
+	if (!attrs)
+		return RET_PF_CONTINUE;
+
+	if (kvm_hv_vsm_access_valid(fault, attrs)) {
+		fault->map_executable =
+			!!(attrs & KVM_MEMORY_ATTRIBUTE_EXECUTE);
+		fault->map_writable = !!(attrs & KVM_MEMORY_ATTRIBUTE_WRITE);
+		return RET_PF_CONTINUE;
+	}
+
+	return -EFAULT;
+}
+
 static int kvm_hv_vtl_get_attr(struct kvm_device *dev,
 			       struct kvm_device_attr *attr)
 {
@@ -3120,6 +3171,21 @@ static struct kvm_device_ops kvm_hv_vtl_ops = {
 	.get_attr = kvm_hv_vtl_get_attr,
 };
 
+static struct xarray *kvm_hv_vsm_get_memprots(struct kvm_vcpu *vcpu)
+{
+	struct kvm_hv_vtl_dev *vtl_dev;
+	struct kvm_device *tmp;
+
+	list_for_each_entry(tmp, &vcpu->kvm->devices, vm_node)
+		if (tmp->ops == &kvm_hv_vtl_ops) {
+			vtl_dev = tmp->private;
+			if (vtl_dev->vtl == kvm_hv_get_active_vtl(vcpu))
+				return &vtl_dev->mem_attrs;
+		}
+
+	return NULL;
+}
+
 static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type)
 {
 	struct kvm_hv_vtl_dev *vtl_dev;
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index 3cc664e144d8..ae781b4d4669 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -271,5 +271,6 @@ static inline void kvm_mmu_role_set_hv_bits(struct kvm_vcpu *vcpu,
 
 int kvm_hv_vtl_dev_register(void);
 void kvm_hv_vtl_dev_unregister(void);
+int kvm_hv_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a76028aa8fb3..ba454c7277dc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4374,7 +4374,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 					  fault->write, &fault->map_writable,
 					  &fault->hva);
 	if (!async)
-		return RET_PF_CONTINUE; /* *pfn has correct page already */
+		goto pf_continue; /* *pfn has correct page already */
 
 	if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(fault->addr, fault->gfn);
@@ -4395,6 +4395,13 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
 					  fault->write, &fault->map_writable,
 					  &fault->hva);
+pf_continue:
+	if (kvm_hv_vsm_enabled(vcpu->kvm)) {
+		if (kvm_hv_faultin_pfn(vcpu, fault)) {
+			kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+			return -EFAULT;
+		}
+	}
 	return RET_PF_CONTINUE;
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 28/33] x86/hyper-v: Introduce memory intercept message structure
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (26 preceding siblings ...)
  2023-11-08 11:18 ` [RFC 27/33] KVM: x86/mmu/hyper-v: Validate memory faults against per-VTL memprots Nicolas Saenz Julienne
@ 2023-11-08 11:18 ` Nicolas Saenz Julienne
  2023-11-28  7:53   ` Maxim Levitsky
  2023-11-08 11:18 ` [RFC 29/33] KVM: VMX: Save instruction length on EPT violation Nicolas Saenz Julienne
                   ` (6 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:18 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Introduce struct hv_memory_intercept_message, which is used when issuing
memory intercepts to a Hyper-V VSM guest.
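
Since the message is delivered through a SynIC message slot, it has to
fit in the fixed payload size. A compile-time check along these lines
(not part of this patch, just a sanity sketch) documents that:

	static_assert(sizeof(struct hv_memory_intercept_message) <=
		      HV_MESSAGE_PAYLOAD_BYTE_COUNT);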

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/include/asm/hyperv-tlfs.h | 76 ++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index af594aa65307..d3d74fde6da1 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -799,6 +799,82 @@ struct hv_get_vp_from_apic_id_in {
 	u32 apic_ids[];
 } __packed;
 
+
+/* struct hv_intercept_header::access_type_mask */
+#define HV_INTERCEPT_ACCESS_MASK_NONE    0
+#define HV_INTERCEPT_ACCESS_MASK_READ    1
+#define HV_INTERCEPT_ACCESS_MASK_WRITE   2
+#define HV_INTERCEPT_ACCESS_MASK_EXECUTE 4
+
+/* struct hv_intercept_exception::cache_type */
+#define HV_X64_CACHE_TYPE_UNCACHED       0
+#define HV_X64_CACHE_TYPE_WRITECOMBINING 1
+#define HV_X64_CACHE_TYPE_WRITETHROUGH   4
+#define HV_X64_CACHE_TYPE_WRITEPROTECTED 5
+#define HV_X64_CACHE_TYPE_WRITEBACK      6
+
+/* Intercept message header */
+struct hv_intercept_header {
+	__u32 vp_index;
+	__u8 instruction_length;
+#define HV_INTERCEPT_ACCESS_READ    0
+#define HV_INTERCEPT_ACCESS_WRITE   1
+#define HV_INTERCEPT_ACCESS_EXECUTE 2
+	__u8 access_type_mask;
+	union {
+		__u16 as_u16;
+		struct {
+			__u16 cpl:2;
+			__u16 cr0_pe:1;
+			__u16 cr0_am:1;
+			__u16 efer_lma:1;
+			__u16 debug_active:1;
+			__u16 interruption_pending:1;
+			__u16 reserved:9;
+		};
+	} exec_state;
+	struct hv_x64_segment_register cs;
+	__u64 rip;
+	__u64 rflags;
+} __packed;
+
+union hv_x64_memory_access_info {
+	__u8 as_u8;
+	struct {
+		__u8 gva_valid:1;
+		__u8 _reserved:7;
+	};
+};
+
+struct hv_memory_intercept_message {
+	struct hv_intercept_header header;
+	__u32 cache_type;
+	__u8 instruction_byte_count;
+	union hv_x64_memory_access_info memory_access_info;
+	__u16 _reserved;
+	__u64 gva;
+	__u64 gpa;
+	__u8 instruction_bytes[16];
+	struct hv_x64_segment_register ds;
+	struct hv_x64_segment_register ss;
+	__u64 rax;
+	__u64 rcx;
+	__u64 rdx;
+	__u64 rbx;
+	__u64 rsp;
+	__u64 rbp;
+	__u64 rsi;
+	__u64 rdi;
+	__u64 r8;
+	__u64 r9;
+	__u64 r10;
+	__u64 r11;
+	__u64 r12;
+	__u64 r13;
+	__u64 r14;
+	__u64 r15;
+} __packed;
+
 #include <asm-generic/hyperv-tlfs.h>
 
 #endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (27 preceding siblings ...)
  2023-11-08 11:18 ` [RFC 28/33] x86/hyper-v: Introduce memory intercept message structure Nicolas Saenz Julienne
@ 2023-11-08 11:18 ` Nicolas Saenz Julienne
  2023-11-08 12:40   ` Alexander Graf
  2023-11-08 17:20   ` Sean Christopherson
  2023-11-08 11:18 ` [RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request Nicolas Saenz Julienne
                   ` (5 subsequent siblings)
  34 siblings, 2 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:18 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Save the length of the instruction that triggered an EPT violation in
struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory
intercept messages.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/vmx/vmx.c          | 1 +
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1f5a85d461ce..1a854776d91e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -967,6 +967,8 @@ struct kvm_vcpu_arch {
 	/* set at EPT violation at this point */
 	unsigned long exit_qualification;
 
+	u32 exit_instruction_len;
+
 	/* pv related host specific info */
 	struct {
 		bool pv_unhalted;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6e502ba93141..9c83ee3a293d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5773,6 +5773,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
 
 	vcpu->arch.exit_qualification = exit_qualification;
+	vcpu->arch.exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
 
 	/*
 	 * Check that the GPA doesn't exceed physical memory limits, as that is
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (28 preceding siblings ...)
  2023-11-08 11:18 ` [RFC 29/33] KVM: VMX: Save instruction length on EPT violation Nicolas Saenz Julienne
@ 2023-11-08 11:18 ` Nicolas Saenz Julienne
  2023-11-08 12:45   ` Alexander Graf
  2023-11-08 11:18 ` [RFC 31/33] KVM: x86: hyper-v: Inject intercept on VTL memory protection fault Nicolas Saenz Julienne
                   ` (4 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:18 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT, which allows
injecting out-of-band Hyper-V secure intercepts. For now only memory
access intercepts are supported. These are triggered when accessing a
GPA protected by a higher VTL. The memory intercept metadata is filled
based on the GPA provided through struct kvm_vcpu_hv_intercept_info, and
injected into the guest through a SynIC message.
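
Condensed, the intended flow is roughly the following (an outline only;
the producer side is introduced in a later patch):

	/* On the faulting vCPU, targeting the next privileged VTL's vCPU: */
	struct kvm_vcpu_hv_intercept_info *info =
		&to_hv_vcpu(target_vcpu)->intercept_info;

	info->type = HVMSG_GPA_INTERCEPT;
	info->gpa = fault_gpa;
	info->access = HV_INTERCEPT_ACCESS_WRITE;	/* for example */
	info->vcpu = vcpu;
	kvm_make_request(KVM_REQ_HV_INJECT_INTERCEPT, target_vcpu);
	kvm_vcpu_kick(target_vcpu);

	/*
	 * target_vcpu later drains the request in vcpu_enter_guest() and
	 * calls kvm_hv_deliver_intercept(), which builds the SynIC message.
	 */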

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |  10 +++
 arch/x86/kvm/hyperv.c           | 114 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/hyperv.h           |   2 +
 arch/x86/kvm/x86.c              |   3 +
 4 files changed, 129 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a854776d91e..39671e075555 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -113,6 +113,7 @@
 	KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_HV_TLB_FLUSH \
 	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_HV_INJECT_INTERCEPT	KVM_ARCH_REQ(33)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -639,6 +640,13 @@ struct kvm_vcpu_hv_tlb_flush_fifo {
 	DECLARE_KFIFO(entries, u64, KVM_HV_TLB_FLUSH_FIFO_SIZE);
 };
 
+struct kvm_vcpu_hv_intercept_info {
+	struct kvm_vcpu *vcpu;
+	int type;
+	u64 gpa;
+	u8 access;
+};
+
 /* Hyper-V per vcpu emulation context */
 struct kvm_vcpu_hv {
 	struct kvm_vcpu *vcpu;
@@ -673,6 +681,8 @@ struct kvm_vcpu_hv {
 		u64 vm_id;
 		u32 vp_id;
 	} nested;
+
+	struct kvm_vcpu_hv_intercept_info intercept_info;
 };
 
 struct kvm_hypervisor_cpuid {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index eb6a4848e306..38ee3abdef9c 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2789,6 +2789,120 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static void store_kvm_segment(const struct kvm_segment *kvmseg,
+			      struct hv_x64_segment_register *reg)
+{
+	reg->base = kvmseg->base;
+	reg->limit = kvmseg->limit;
+	reg->selector = kvmseg->selector;
+	reg->segment_type = kvmseg->type;
+	reg->present = kvmseg->present;
+	reg->descriptor_privilege_level = kvmseg->dpl;
+	reg->_default = kvmseg->db;
+	reg->non_system_segment = kvmseg->s;
+	reg->_long = kvmseg->l;
+	reg->granularity = kvmseg->g;
+	reg->available = kvmseg->avl;
+}
+
+static void deliver_gpa_intercept(struct kvm_vcpu *target_vcpu,
+				  struct kvm_vcpu *intercepted_vcpu, u64 gpa,
+				  u64 gva, u8 access_type_mask)
+{
+	ulong cr0;
+	struct hv_message msg = { 0 };
+	struct hv_memory_intercept_message *intercept = (struct hv_memory_intercept_message *)msg.u.payload;
+	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(target_vcpu);
+	struct x86_exception e;
+	struct kvm_segment kvmseg;
+
+	msg.header.message_type = HVMSG_GPA_INTERCEPT;
+	msg.header.payload_size = sizeof(*intercept);
+
+	intercept->header.vp_index = to_hv_vcpu(intercepted_vcpu)->vp_index;
+	intercept->header.instruction_length = intercepted_vcpu->arch.exit_instruction_len;
+	intercept->header.access_type_mask = access_type_mask;
+	kvm_x86_ops.get_segment(intercepted_vcpu, &kvmseg, VCPU_SREG_CS);
+	store_kvm_segment(&kvmseg, &intercept->header.cs);
+
+	cr0 = kvm_read_cr0(intercepted_vcpu);
+	intercept->header.exec_state.cr0_pe = (cr0 & X86_CR0_PE);
+	intercept->header.exec_state.cr0_am = (cr0 & X86_CR0_AM);
+	intercept->header.exec_state.cpl = kvm_x86_ops.get_cpl(intercepted_vcpu);
+	intercept->header.exec_state.efer_lma = is_long_mode(intercepted_vcpu);
+	intercept->header.exec_state.debug_active = 0;
+	intercept->header.exec_state.interruption_pending = 0;
+	intercept->header.rip = kvm_rip_read(intercepted_vcpu);
+	intercept->header.rflags = kvm_get_rflags(intercepted_vcpu);
+
+	/*
+	 * For exec violations we don't have a way to decode the instruction
+	 * that issued the fetch to a non-X page, because the CPU points RIP
+	 * and GPA at the fetch destination in the faulted page. The
+	 * instruction length, though, is the length of the fetch source.
+	 * Hyper-V seems to be aware of that and doesn't try to access these
+	 * fields.
+	 */
+	if (access_type_mask == HV_INTERCEPT_ACCESS_EXECUTE) {
+		intercept->instruction_byte_count = 0;
+	} else {
+		intercept->instruction_byte_count = intercepted_vcpu->arch.exit_instruction_len;
+		if (intercept->instruction_byte_count > sizeof(intercept->instruction_bytes))
+			intercept->instruction_byte_count = sizeof(intercept->instruction_bytes);
+		if (kvm_read_guest_virt(intercepted_vcpu,
+					kvm_rip_read(intercepted_vcpu),
+					intercept->instruction_bytes,
+					intercept->instruction_byte_count, &e))
+			goto inject_ud;
+	}
+
+	intercept->memory_access_info.gva_valid = (gva != 0);
+	intercept->gva = gva;
+	intercept->gpa = gpa;
+	intercept->cache_type = HV_X64_CACHE_TYPE_WRITEBACK;
+	kvm_x86_ops.get_segment(intercepted_vcpu, &kvmseg, VCPU_SREG_DS);
+	store_kvm_segment(&kvmseg, &intercept->ds);
+	kvm_x86_ops.get_segment(intercepted_vcpu, &kvmseg, VCPU_SREG_SS);
+	store_kvm_segment(&kvmseg, &intercept->ss);
+	intercept->rax = kvm_rax_read(intercepted_vcpu);
+	intercept->rcx = kvm_rcx_read(intercepted_vcpu);
+	intercept->rdx = kvm_rdx_read(intercepted_vcpu);
+	intercept->rbx = kvm_rbx_read(intercepted_vcpu);
+	intercept->rsp = kvm_rsp_read(intercepted_vcpu);
+	intercept->rbp = kvm_rbp_read(intercepted_vcpu);
+	intercept->rsi = kvm_rsi_read(intercepted_vcpu);
+	intercept->rdi = kvm_rdi_read(intercepted_vcpu);
+	intercept->r8 = kvm_r8_read(intercepted_vcpu);
+	intercept->r9 = kvm_r9_read(intercepted_vcpu);
+	intercept->r10 = kvm_r10_read(intercepted_vcpu);
+	intercept->r11 = kvm_r11_read(intercepted_vcpu);
+	intercept->r12 = kvm_r12_read(intercepted_vcpu);
+	intercept->r13 = kvm_r13_read(intercepted_vcpu);
+	intercept->r14 = kvm_r14_read(intercepted_vcpu);
+	intercept->r15 = kvm_r15_read(intercepted_vcpu);
+
+	if (synic_deliver_msg(&hv_vcpu->synic, 0, &msg, true))
+		goto inject_ud;
+
+	return;
+
+inject_ud:
+	kvm_queue_exception(target_vcpu, UD_VECTOR);
+}
+
+void kvm_hv_deliver_intercept(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_hv_intercept_info *info = &to_hv_vcpu(vcpu)->intercept_info;
+
+	switch (info->type) {
+	case HVMSG_GPA_INTERCEPT:
+		deliver_gpa_intercept(vcpu, info->vcpu, info->gpa, 0,
+				      info->access);
+		break;
+	default:
+		pr_warn("Unknown exception\n");
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_hv_deliver_intercept);
+
 void kvm_hv_init_vm(struct kvm *kvm)
 {
 	struct kvm_hv *hv = to_kvm_hv(kvm);
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index ae781b4d4669..8efc4916e0cb 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -273,4 +273,6 @@ int kvm_hv_vtl_dev_register(void);
 void kvm_hv_vtl_dev_unregister(void);
 int kvm_hv_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
+void kvm_hv_deliver_intercept(struct kvm_vcpu *vcpu);
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 82d3b86d9c93..f2581eec39a9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10707,6 +10707,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
 			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+		if (kvm_check_request(KVM_REQ_HV_INJECT_INTERCEPT, vcpu))
+			kvm_hv_deliver_intercept(vcpu);
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 31/33] KVM: x86: hyper-v: Inject intercept on VTL memory protection fault
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (29 preceding siblings ...)
  2023-11-08 11:18 ` [RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request Nicolas Saenz Julienne
@ 2023-11-08 11:18 ` Nicolas Saenz Julienne
  2023-11-08 11:18 ` [RFC 32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS Nicolas Saenz Julienne
                   ` (3 subsequent siblings)
  34 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:18 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Inject a Hyper-V secure intercept when a VTL tries to access memory that
was protected by a more privileged VTL. The intercept is injected into
the next enabled privileged VTL (for now, this patch takes a shortcut
and assumes it's the one right after).

After injecting the request, the KVM vCPU that took the fault will exit
to user-space with a memory fault.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 38ee3abdef9c..983bf8af5f64 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -3150,6 +3150,32 @@ struct kvm_hv_vtl_dev {
 
 static struct xarray *kvm_hv_vsm_get_memprots(struct kvm_vcpu *vcpu);
 
+static void kvm_hv_inject_gpa_intercept(struct kvm_vcpu *vcpu,
+					struct kvm_page_fault *fault)
+{
+	struct kvm_vcpu *target_vcpu =
+		kvm_hv_get_vtl_vcpu(vcpu, kvm_hv_get_active_vtl(vcpu) + 1);
+	struct kvm_vcpu_hv_intercept_info *intercept;
+
+	/*
+	 * No target VTL available, log a warning and let user-space deal with
+	 * the fault.
+	 */
+	if (WARN_ON_ONCE(!target_vcpu))
+		return;
+
+	intercept = &target_vcpu->arch.hyperv->intercept_info;
+
+	intercept->type = HVMSG_GPA_INTERCEPT;
+	intercept->gpa = fault->addr;
+	intercept->access = (fault->user ? HV_INTERCEPT_ACCESS_READ : 0) |
+			    (fault->write ? HV_INTERCEPT_ACCESS_WRITE : 0) |
+			    (fault->exec ? HV_INTERCEPT_ACCESS_EXECUTE : 0);
+	intercept->vcpu = vcpu;
+
+	kvm_make_request(KVM_REQ_HV_INJECT_INTERCEPT, target_vcpu);
+	kvm_vcpu_kick(target_vcpu);
+}
+
 bool kvm_hv_vsm_access_valid(struct kvm_page_fault *fault, unsigned long attrs)
 {
 	if (attrs == KVM_MEMORY_ATTRIBUTE_NO_ACCESS)
@@ -3194,6 +3220,7 @@ int kvm_hv_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		return RET_PF_CONTINUE;
 	}
 
+	kvm_hv_inject_gpa_intercept(vcpu, fault);
 	return -EFAULT;
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (30 preceding siblings ...)
  2023-11-08 11:18 ` [RFC 31/33] KVM: x86: hyper-v: Inject intercept on VTL memory protection fault Nicolas Saenz Julienne
@ 2023-11-08 11:18 ` Nicolas Saenz Julienne
  2023-11-08 12:49   ` Alexander Graf
  2023-11-08 11:18 ` [RFC 33/33] Documentation: KVM: Introduce "Emulating Hyper-V VSM with KVM" Nicolas Saenz Julienne
                   ` (2 subsequent siblings)
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:18 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Introduce HVCALL_TRANSLATE_VIRTUAL_ADDRESS. The hypercall receives a
GVA, generally from a less privileged VTL, and returns the GPA backing
it. The GVA -> GPA conversion is done by walking the target VTL's vCPU
MMU.

NOTE: The hypercall implementation is incomplete and only shared for
completeness. Additionally, we'd like to move the VTL-aware parts to
user-space.
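
For reference, a guest issuing the fast variant packs its arguments as
sketched below (based on the TLFS layout handled here; the hypercall
helper name is hypothetical):

	union hv_input_vtl in_vtl = {
		.use_target_vtl = 1,
		.target_vtl = 0,
	};
	/* the top byte of control_flags carries the input VTL */
	u64 control = HV_XLATE_GVA_VAL_READ | ((u64)in_vtl.as_uint8 << 56);

	/* RDX = partition id, R8 = VP index, XMM0 = {control, gva pfn} */
	hv_do_fast_xlate(HVCALL_TRANSLATE_VIRTUAL_ADDRESS,
			 HV_PARTITION_ID_SELF, HV_VP_INDEX_SELF,
			 control, gva >> PAGE_SHIFT);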

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c             | 98 +++++++++++++++++++++++++++++++
 arch/x86/kvm/trace.h              | 23 ++++++++
 include/asm-generic/hyperv-tlfs.h | 28 +++++++++
 3 files changed, 149 insertions(+)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 983bf8af5f64..1cb53cd0708f 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2540,6 +2540,7 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc)
 	case HVCALL_GET_VP_REGISTERS:
 	case HVCALL_SET_VP_REGISTERS:
 	case HVCALL_MODIFY_VTL_PROTECTION_MASK:
+	case HVCALL_TRANSLATE_VIRTUAL_ADDRESS:
 		return true;
 	}
 
@@ -2556,6 +2557,96 @@ static void kvm_hv_hypercall_read_xmm(struct kvm_hv_hcall *hc)
 	kvm_fpu_put();
 }
 
+static bool kvm_hv_xlate_va_validate_input(struct kvm_vcpu *vcpu,
+					   struct hv_xlate_va_input *in,
+					   u8 *vtl, u8 *flags)
+{
+	union hv_input_vtl in_vtl;
+
+	if (in->partition_id != HV_PARTITION_ID_SELF)
+		return false;
+
+	if (in->vp_index != HV_VP_INDEX_SELF &&
+	    in->vp_index != kvm_hv_get_vpindex(vcpu))
+		return false;
+
+	in_vtl.as_uint8 = in->control_flags >> 56;
+	*flags = in->control_flags & HV_XLATE_GVA_FLAGS_MASK;
+	if (*flags > (HV_XLATE_GVA_VAL_READ |
+		      HV_XLATE_GVA_VAL_WRITE |
+		      HV_XLATE_GVA_VAL_EXECUTE))
+		pr_info_ratelimited("Translate VA control flags unsupported and will be ignored: 0x%llx\n",
+				    in->control_flags);
+
+	*vtl = in_vtl.use_target_vtl ? in_vtl.target_vtl :
+					     kvm_hv_get_active_vtl(vcpu);
+	if (*vtl > kvm_hv_get_active_vtl(vcpu))
+		return false;
+
+	return true;
+}
+
+static u64 kvm_hv_xlate_va_walk(struct kvm_vcpu *vcpu, u64 gva, u8 flags)
+{
+	struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
+	u32 access = 0;
+
+	if (flags & HV_XLATE_GVA_VAL_WRITE)
+		access |= PFERR_WRITE_MASK;
+	if (flags & HV_XLATE_GVA_VAL_EXECUTE)
+		access |= PFERR_FETCH_MASK;
+
+	return mmu->gva_to_gpa(vcpu, mmu, gva, access, NULL);
+}
+
+static u64 kvm_hv_translate_virtual_address(struct kvm_vcpu *vcpu,
+					    struct kvm_hv_hcall *hc)
+{
+	struct hv_xlate_va_output output = {};
+	struct hv_xlate_va_input input;
+	struct kvm_vcpu *target_vcpu;
+	u8 flags, target_vtl;
+
+	if (hc->fast) {
+		input.partition_id = hc->ingpa;
+		input.vp_index = hc->outgpa & 0xFFFFFFFF;
+		input.control_flags = sse128_lo(hc->xmm[0]);
+		input.gva = sse128_hi(hc->xmm[0]);
+	} else {
+		if (kvm_read_guest(vcpu->kvm, hc->ingpa, &input, sizeof(input)))
+			return HV_STATUS_INVALID_HYPERCALL_INPUT;
+	}
+
+	trace_kvm_hv_translate_virtual_address(input.partition_id,
+					       input.vp_index,
+					       input.control_flags, input.gva);
+
+	if (!kvm_hv_xlate_va_validate_input(vcpu, &input, &target_vtl, &flags))
+		return HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+	target_vcpu = kvm_hv_get_vtl_vcpu(vcpu, target_vtl);
+	output.gpa = kvm_hv_xlate_va_walk(target_vcpu, input.gva << PAGE_SHIFT,
+					  flags);
+	if (output.gpa == INVALID_GPA) {
+		output.result_code = HV_XLATE_GVA_UNMAPPED;
+	} else {
+		output.gpa >>= PAGE_SHIFT;
+		output.result_code = HV_XLATE_GVA_SUCCESS;
+		output.cache_type = HV_CACHE_TYPE_X64_WB;
+	}
+
+	if (hc->fast) {
+		memcpy(&hc->xmm[1], &output, sizeof(output));
+		hc->xmm_dirty = true;
+	} else {
+		if (kvm_write_guest(vcpu->kvm, hc->outgpa, &output,
+				    sizeof(output)))
+			return HV_STATUS_INVALID_HYPERCALL_INPUT;
+	}
+
+	return HV_STATUS_SUCCESS;
+}
+
 static bool hv_check_hypercall_access(struct kvm_vcpu_hv *hv_vcpu, u16 code)
 {
 	if (!hv_vcpu->enforce_cpuid)
@@ -2766,6 +2857,13 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 	case HVCALL_VTL_CALL:
 	case HVCALL_VTL_RETURN:
 		goto hypercall_userspace_exit;
+	case HVCALL_TRANSLATE_VIRTUAL_ADDRESS:
+		if (unlikely(hc.rep_cnt)) {
+			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+			break;
+		}
+		ret = kvm_hv_translate_virtual_address(vcpu, &hc);
+		break;
 	default:
 		ret = HV_STATUS_INVALID_HYPERCALL_CODE;
 		break;
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index ab8839c47bc7..6b908671a0cc 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1372,6 +1372,29 @@ TRACE_EVENT(kvm_hv_stimer_cleanup,
 		  __entry->vcpu_id, __entry->timer_index)
 );
 
+TRACE_EVENT(kvm_hv_translate_virtual_address,
+	TP_PROTO(u64 partition_id, u32 vp_index, u64 control_flags, u64 gva),
+	TP_ARGS(partition_id, vp_index, control_flags, gva),
+
+	TP_STRUCT__entry(
+		__field(u64, partition_id)
+		__field(u32, vp_index)
+		__field(u64, control_flags)
+		__field(u64, gva)
+	),
+
+	TP_fast_assign(
+		__entry->partition_id = partition_id;
+		__entry->vp_index = vp_index;
+		__entry->control_flags = control_flags;
+		__entry->gva = gva;
+	),
+
+	TP_printk("partition id 0x%llx, vp index 0x%x, control flags 0x%llx, gva 0x%llx",
+		  __entry->partition_id, __entry->vp_index,
+		  __entry->control_flags, __entry->gva)
+);
+
 TRACE_EVENT(kvm_apicv_inhibit_changed,
 	    TP_PROTO(int reason, bool set, unsigned long inhibits),
 	    TP_ARGS(reason, set, inhibits),
diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
index a8b5c8a84bbc..24f983222c96 100644
--- a/include/asm-generic/hyperv-tlfs.h
+++ b/include/asm-generic/hyperv-tlfs.h
@@ -163,6 +163,7 @@ union hv_reference_tsc_msr {
 #define HVCALL_CREATE_VP			0x004e
 #define HVCALL_GET_VP_REGISTERS			0x0050
 #define HVCALL_SET_VP_REGISTERS			0x0051
+#define HVCALL_TRANSLATE_VIRTUAL_ADDRESS	0x0052
 #define HVCALL_POST_MESSAGE			0x005c
 #define HVCALL_SIGNAL_EVENT			0x005d
 #define HVCALL_POST_DEBUG_DATA			0x0069
@@ -842,4 +843,31 @@ union hv_register_vsm_code_page_offsets {
 		u64 reserved:40;
 	} __packed;
 };
+
+#define HV_XLATE_GVA_SUCCESS 0
+#define HV_XLATE_GVA_UNMAPPED 1
+#define HV_XLATE_GPA_UNMAPPED 4
+#define HV_CACHE_TYPE_X64_WB 6
+
+#define HV_XLATE_GVA_VAL_READ 1
+#define HV_XLATE_GVA_VAL_WRITE 2
+#define HV_XLATE_GVA_VAL_EXECUTE 4
+#define HV_XLATE_GVA_FLAGS_MASK 0x3F
+
+struct hv_xlate_va_input {
+	u64 partition_id;
+	u32 vp_index;
+	u32 reserved;
+	u64 control_flags;
+	u64 gva;
+};
+
+struct hv_xlate_va_output {
+	u32 result_code;
+	u32 cache_type:8;
+	u32 overlay_page:1;
+	u32 reserved:23;
+	u64 gpa;
+};
+
 #endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC 33/33] Documentation: KVM: Introduce "Emulating Hyper-V VSM with KVM"
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (31 preceding siblings ...)
  2023-11-08 11:18 ` [RFC 32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS Nicolas Saenz Julienne
@ 2023-11-08 11:18 ` Nicolas Saenz Julienne
  2023-11-28  8:19   ` Maxim Levitsky
  2023-11-08 11:40 ` [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Alexander Graf
  2023-11-08 16:55 ` Sean Christopherson
  34 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 11:18 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Nicolas Saenz Julienne

Introduce "Emulating Hyper-V VSM with KVM", which describes the KVM APIs
made available to a VMM that wants to emulate Hyper-V's VSM.

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 .../virt/kvm/x86/emulating-hyperv-vsm.rst     | 136 ++++++++++++++++++
 1 file changed, 136 insertions(+)
 create mode 100644 Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst

diff --git a/Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst b/Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst
new file mode 100644
index 000000000000..8f76bf09c530
--- /dev/null
+++ b/Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst
@@ -0,0 +1,136 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Emulating Hyper-V VSM with KVM
+==============================
+
+Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature
+that leverages the hypervisor to create secure execution environments
+within a guest. VSM is documented as part of Microsoft's Hypervisor Top
+Level Functional Specification[1].
+
+Emulating Hyper-V's Virtual Secure Mode with KVM requires coordination
+between KVM and the VMM. Most of the VSM state and configuration is left
+to be handled by user-space, but some has made its way into KVM. This
+document describes the mechanisms through which a VMM can implement VSM
+support.
+
+Virtual Trust Levels
+--------------------
+
+The main concept VSM introduces is the Virtual Trust Level, or VTL. Each
+VTL is a CPU mode, with its own private CPU architectural state,
+interrupt subsystem (limited to a local APIC), and memory access
+permissions. VTLs are hierarchical, where VTL0 corresponds to normal
+guest execution and VTL > 0 to privileged execution contexts. In
+practice, when virtualising Windows on top of KVM, we only see VTL0 and
+VTL1, although the spec allows going all the way to VTL15. VTLs are
+orthogonal to ring levels, so each VTL is capable of running its own
+operating system and user-space[2].
+
+  ┌──────────────────────────────┐ ┌──────────────────────────────┐
+  │ Normal Mode (VTL0)           │ │ Secure Mode (VTL1)           │
+  │ ┌──────────────────────────┐ │ │ ┌──────────────────────────┐ │
+  │ │   User-mode Processes    │ │ │ │Secure User-mode Processes│ │
+  │ └──────────────────────────┘ │ │ └──────────────────────────┘ │
+  │ ┌──────────────────────────┐ │ │ ┌──────────────────────────┐ │
+  │ │         Kernel           │ │ │ │      Secure Kernel       │ │
+  │ └──────────────────────────┘ │ │ └──────────────────────────┘ │
+  └──────────────────────────────┘ └──────────────────────────────┘
+  ┌───────────────────────────────────────────────────────────────┐
+  │                         Hypervisor/KVM                        │
+  └───────────────────────────────────────────────────────────────┘
+  ┌───────────────────────────────────────────────────────────────┐
+  │                           Hardware                            │
+  └───────────────────────────────────────────────────────────────┘
+
+VTLs break the core assumption that a vCPU has a single architectural
+state, lAPIC state, SynIC state, etc. As such, each VTL is modeled as a
+distinct KVM vCPU, with the restriction that only one is allowed to run
+at any moment in time. Having multiple KVM vCPUs tracking a single guest
+CPU complicates vCPU numbering. VMs that enable VSM are expected to use
+CAP_APIC_ID_GROUPS to segregate vCPUs (and their lAPICs) into different
+groups. For example, a 4 CPU VSM VM will setup the APIC ID groups feature
+so only the first two bits of the APIC ID are exposed to the guest. The
+remaining bits represent the vCPU's VTL. The 'sibling' vCPU to VTL0's
+vCPU2 at VTL3 will have an APIC ID of 0xE. Using this approach a VMM and
+KVM are capable of querying a vCPU's VTL, or finding the vCPU associated
+to a specific VTL.
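+
+Illustratively, with the 4-vCPU example above::
+
+  apic_id = (vtl << 2) | vcpu_id     # two guest-visible APIC ID bits
+  (3 << 2) | 2 = 0xE                 # VTL3 sibling of vCPU2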
+
+KVM's lAPIC implementation is aware of groups, and takes note of the
+source vCPU's group when delivering IPIs. As such, it shouldn't be
+possible to target a different VTL through the APIC. Interrupts are
+delivered to the vCPU's lAPIC subsystem regardless of the VTL's runstate,
+this also includes timers. Ultimately, any interrupt incoming from an
+outside source (IOAPIC/MSIs) is routed to VTL0.
+
+Moving Between VTLs
+-------------------
+
+All VSM configuration and VTL handling hypercalls are passed through to
+user-space. Notably the two primitives that allow switching between VTLs.
+All shared state synchronization and KVM vCPU scheduling is left to the
+VMM to manage. For example, upon receiving a VTL call, the VMM stops the
+vCPU that issued the hypercall, and schedules the vCPU corresponding to
+the next privileged VTL. When that privileged vCPU is done executing, it
+issues a VTL return hypercall, so the opposite operation happens. All
+this is transparent to KVM, which limits itself to running vCPUs.
+
+An interrupt directed at a privileged VTL always has precedence over the
+execution of lower VTLs. To honor this, the VMM can monitor events
+targeted at privileged vCPUs with poll(), and should trigger an
+asynchronous VTL switch whenever events become available. Additionally,
+the target VTL's vCPU VP assist overlay page is used to notify the target
+VTL with the reason for the switch. The VMM can keep track of the VP
+assist page by installing an MSR filter for HV_X64_MSR_VP_ASSIST_PAGE.
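+
+An event loop built on this could be as simple as the following sketch
+(illustrative only; it assumes the vCPU polling support introduced
+earlier in this series)::
+
+  struct pollfd pfd = { .fd = vtl1_vcpu_fd, .events = POLLIN };
+
+  while (poll(&pfd, 1, -1) > 0) {
+          /* kick VTL0 out of KVM_RUN, write the switch reason into
+             VTL1's VP assist page, then resume the VTL1 vCPU */
+  }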
+
+Hyper-V VP registers
+--------------------
+
+VP register hypercalls are passed through to user-space. All requests
+either can be fulfilled using existing KVM state ioctls or relate to
+VSM's configuration, which is already handled by the VMM. Note
+that HV_REGISTER_VSM_CODE_PAGE_OFFSETS is the only VSM specific VP
+register the kernel controls, as such it is made available through the
+KVM_HV_GET_VSM_STATE ioctl.
+
+Per-VTL Memory Protections
+--------------------------
+
+A privileged VTL can change the memory access restrictions of lower VTLs.
+It does so to hide secrets from them, or to control what they are allowed
+to execute. The list of memory protections allowed is[3]:
+ - No access
+ - Read-only, no execute
+ - Read-only, execute
+ - Read/write, no execute
+ - Read/write, execute
+
+VTL memory protection hypercalls are passed through to user-space, but
+KVM provides an interface that allows changing memory protections on a
+per-VTL basis. This is made possible by the KVM VTL device. VMMs can
+create one per VTL, and each exposes an ioctl, KVM_SET_MEMORY_ATTRIBUTES,
+that controls the memory protections applied to that VTL. The KVM TDP MMU
+is VTL aware and page faults are resolved taking into account the
+corresponding VTL device's memory attributes.
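+
+A VMM might drive this roughly as follows (an illustrative sketch; the
+device type constant is hypothetical, and struct kvm_memory_attributes
+comes from the KVM memory attributes series this depends on)::
+
+  struct kvm_create_device cd = { .type = KVM_DEV_TYPE_HV_VTL };
+  struct kvm_memory_attributes attrs = {
+          .address = gpa,
+          .size = PAGE_SIZE,
+          /* read-only, no execute */
+          .attributes = KVM_MEMORY_ATTRIBUTE_READ,
+  };
+
+  ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);            /* one device per VTL */
+  ioctl(cd.fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);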
+
+When a memory access violates VTL memory protections, KVM issues a secure
+memory intercept, which is passed as a SynIC message into the next
+privileged VTL. This happens transparently for the VMM. Additionally, KVM
+exits with a user-space memory fault. This allows the VMM to stop the
+vCPU while the secure intercept is handled by the privileged VTL. In the
+good case, the instruction that triggered the fault is emulated and
+control is returned to the lower VTL; in the bad case, Windows crashes
+gracefully.
+
+Hyper-V's TLFS also states that DMA should follow VTL0's memory access
+restrictions. This is out of scope for this document, as IOMMU mappings
+are not handled by KVM.
+
+[1] https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
+
+[2] Conceptually this design is similar to arm's TrustZone: The
+hypervisor plays the role of EL3. Windows (VTL0) runs in Non-Secure
+(EL0/EL1) and the secure kernel (VTL1) in Secure World (EL1s/EL0s).
+
+[3] TLFS 15.9.3
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (32 preceding siblings ...)
  2023-11-08 11:18 ` [RFC 33/33] Documentation: KVM: Introduce "Emulating Hyper-V VSM with KVM" Nicolas Saenz Julienne
@ 2023-11-08 11:40 ` Alexander Graf
  2023-11-08 14:41   ` Nicolas Saenz Julienne
  2023-11-08 16:55 ` Sean Christopherson
  34 siblings, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 11:40 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc

Hey Nicolas,

On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature
> that leverages the hypervisor to create secure execution environments
> within a guest. VSM is documented as part of Microsoft's Hypervisor Top
> Level Functional Specification [1]. Security features that build upon
> VSM, like Windows Credential Guard, are enabled by default on Windows 11,
> and are becoming a prerequisite in some industries.
>
> This RFC series introduces the necessary infrastructure to emulate VSM
> enabled guests. It is a snapshot of the progress we made so far, and its
> main goal is to gather design feedback. Specifically on the KVM APIs we
> introduce. For a high level design overview, see the documentation in
> patch 33.
>
> Additionally, this topic will be discussed as part of the KVM
> Micro-conference, in this year's Linux Plumbers Conference [2].


Awesome, looking forward to the session! :)


> The series is accompanied by two repositories:
>   - A PoC QEMU implementation of VSM [3].
>   - VSM kvm-unit-tests [4].
>
> Note that this isn't a full VSM implementation. For now it only supports
> 2 VTLs, and only runs on uniprocessor guests. It is capable of booting
> Windows Sever 2016/2019, but is unstable during runtime.


How many of these limitations are inherent in the current set of
patches? What is missing to go beyond 2 VTLs and into SMP land? Anything
that will require API changes?


Alex







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support
  2023-11-08 11:17 ` [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support Nicolas Saenz Julienne
@ 2023-11-08 11:44   ` Alexander Graf
  2023-11-08 12:11     ` Vitaly Kuznetsov
  0 siblings, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 11:44 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc


On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> Prepare infrastructure to be able to return data through the XMM
> registers when Hyper-V hypercalls are issues in fast mode. The XMM
> registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and
> restored on successful hypercall completion.
>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>   arch/x86/include/asm/hyperv-tlfs.h |  2 +-
>   arch/x86/kvm/hyperv.c              | 33 +++++++++++++++++++++++++++++-
>   include/uapi/linux/kvm.h           |  6 ++++++
>   3 files changed, 39 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index 2ff26f53cd62..af594aa65307 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -49,7 +49,7 @@
>   /* Support for physical CPU dynamic partitioning events is available*/
>   #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE	BIT(3)
>   /*
> - * Support for passing hypercall input parameter block via XMM
> + * Support for passing hypercall input and output parameter block via XMM
>    * registers is available
>    */
>   #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE		BIT(4)
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 238afd7335e4..e1bc861ab3b0 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1815,6 +1815,7 @@ struct kvm_hv_hcall {
>   	u16 rep_idx;
>   	bool fast;
>   	bool rep;
> +	bool xmm_dirty;
>   	sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS];
>   
>   	/*
> @@ -2346,9 +2347,33 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
>   	return ret;
>   }
>   
> +static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
> +{
> +	int reg;
> +
> +	kvm_fpu_get();
> +	for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) {
> +		const sse128_t data = sse128(xmm[reg].low, xmm[reg].high);
> +		_kvm_write_sse_reg(reg, &data);
> +	}
> +	kvm_fpu_put();
> +}
> +
> +static bool kvm_hv_is_xmm_output_hcall(u16 code)
> +{
> +	return false;
> +}
> +
>   static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
>   {
> -	return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result);
> +	bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT);
> +	u16 code = vcpu->run->hyperv.u.hcall.input & 0xffff;
> +	u64 result = vcpu->run->hyperv.u.hcall.result;
> +
> +	if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast)
> +		kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
> +
> +	return kvm_hv_hypercall_complete(vcpu, result);
>   }
>   
>   static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
> @@ -2623,6 +2648,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>   		break;
>   	}
>   
> +	if ((ret & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS && hc.xmm_dirty)
> +		kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg*)hc.xmm);
> +
>   hypercall_complete:
>   	return kvm_hv_hypercall_complete(vcpu, ret);
>   
> @@ -2632,6 +2660,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>   	vcpu->run->hyperv.u.hcall.input = hc.param;
>   	vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa;
>   	vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa;
> +	if (hc.fast)
> +		memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm));
>   	vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace;
>   	return 0;
>   }
> @@ -2780,6 +2810,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>   			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
>   
>   			ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
> +			ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;


Shouldn't this be guarded by an ENABLE_CAP to make sure old user space 
that doesn't know about xmm outputs is still able to run with newer kernels?


>   			ent->edx |= HV_FEATURE_FREQUENCY_MSRS_AVAILABLE;
>   			ent->edx |= HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
>   
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index d7a01766bf21..5ce06a1eee2b 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -192,6 +192,11 @@ struct kvm_s390_cmma_log {
>   	__u64 values;
>   };
>   
> +struct kvm_hyperv_xmm_reg {
> +	__u64 low;
> +	__u64 high;
> +};
> +
>   struct kvm_hyperv_exit {
>   #define KVM_EXIT_HYPERV_SYNIC          1
>   #define KVM_EXIT_HYPERV_HCALL          2
> @@ -210,6 +215,7 @@ struct kvm_hyperv_exit {
>   			__u64 input;
>   			__u64 result;
>   			__u64 params[2];
> +			struct kvm_hyperv_xmm_reg xmm[6];


Would this change the size of struct kvm_hyperv_exit? And if so, 
wouldn't that potentially be a UABI breakage?


Alex







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-11-08 11:17 ` [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page Nicolas Saenz Julienne
@ 2023-11-08 11:53   ` Alexander Graf
  2023-11-08 14:10     ` Nicolas Saenz Julienne
  2023-11-28  7:08   ` Maxim Levitsky
  1 sibling, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 11:53 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc


On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> VTL call/return hypercalls have their own entry points in the hypercall
> page because they don't follow normal hyper-v hypercall conventions.
> Move the VTL call/return control input into ECX/RAX and set the
> hypercall code into EAX/RCX before calling the hypercall instruction in
> order to be able to use the Hyper-V hypercall entry function.
>
> Guests can read an emulated code page offsets register to know the
> offsets into the hypercall page for the VTL call/return entries.
>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
>
> ---
>
> My tree has the additional patch, we're still trying to understand under
> what conditions Windows expects the offset to be fixed.
>
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 54f7f36a89bf..9f2ea8c34447 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -294,6 +294,7 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
>
>          /* VTL call/return entries */
>          if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) {
> +               i = 22;
>   #ifdef CONFIG_X86_64
>                  if (is_64_bit_mode(vcpu)) {
>                          /*
> ---
>   arch/x86/include/asm/kvm_host.h   |  2 +
>   arch/x86/kvm/hyperv.c             | 78 ++++++++++++++++++++++++++++++-
>   include/asm-generic/hyperv-tlfs.h | 11 +++++
>   3 files changed, 90 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a2f224f95404..00cd21b09f8c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1105,6 +1105,8 @@ struct kvm_hv {
>   	u64 hv_tsc_emulation_status;
>   	u64 hv_invtsc_control;
>   
> +	union hv_register_vsm_code_page_offsets vsm_code_page_offsets;
> +
>   	/* How many vCPUs have VP index != vCPU index */
>   	atomic_t num_mismatched_vp_indexes;
>   
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 78d053042667..d4b1b53ea63d 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -259,7 +259,8 @@ static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr)
>   static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
>   {
>   	struct kvm *kvm = vcpu->kvm;
> -	u8 instructions[9];
> +	struct kvm_hv *hv = to_kvm_hv(kvm);
> +	u8 instructions[0x30];
>   	int i = 0;
>   	u64 addr;
>   
> @@ -285,6 +286,81 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
>   	/* ret */
>   	((unsigned char *)instructions)[i++] = 0xc3;
>   
> +	/* VTL call/return entries */
> +	if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) {


You don't introduce kvm_hv_vsm_enabled() before. Please do a quick test 
build of all individual commits of your patch set for v1 :).


> +#ifdef CONFIG_X86_64


Why do you need the ifdef here? is_long_mode() already has an ifdef that 
will always return false for is_64_bit_mode() on 32bit hosts.


> +		if (is_64_bit_mode(vcpu)) {
> +			/*
> +			 * VTL call 64-bit entry prologue:
> +			 * 	mov %rcx, %rax
> +			 * 	mov $0x11, %ecx
> +			 * 	jmp 0:
> +			 */
> +			hv->vsm_code_page_offsets.vtl_call_offset = i;
> +			instructions[i++] = 0x48;
> +			instructions[i++] = 0x89;
> +			instructions[i++] = 0xc8;
> +			instructions[i++] = 0xb9;
> +			instructions[i++] = 0x11;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0xeb;
> +			instructions[i++] = 0xe0;


I think it would be a lot easier to read (because it's denser) if you 
move the opcodes into a character array:

char vtl_entry[] = { 0x48, 0x89, 0xc8, 0xb9, 0x11, 0x00, 0x00, 0x00,
0xeb, 0xe0 };

and then just memcpy().
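
Something like this (untested sketch):

    memcpy(&instructions[i], vtl_entry, sizeof(vtl_entry));
    i += sizeof(vtl_entry);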


Alex








^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support
  2023-11-08 11:44   ` Alexander Graf
@ 2023-11-08 12:11     ` Vitaly Kuznetsov
  2023-11-08 12:16       ` Alexander Graf
  0 siblings, 1 reply; 108+ messages in thread
From: Vitaly Kuznetsov @ 2023-11-08 12:11 UTC (permalink / raw)
  To: Alexander Graf, Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, anelkz, dwmw,
	jgowans, corbert, kys, haiyangz, decui, x86, linux-doc

Alexander Graf <graf@amazon.com> writes:

> On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
>> Prepare infrastructure to be able to return data through the XMM
>> registers when Hyper-V hypercalls are issued in fast mode. The XMM
>> registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and
>> restored on successful hypercall completion.
>>
>> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
>> ---
>>   arch/x86/include/asm/hyperv-tlfs.h |  2 +-
>>   arch/x86/kvm/hyperv.c              | 33 +++++++++++++++++++++++++++++-
>>   include/uapi/linux/kvm.h           |  6 ++++++
>>   3 files changed, 39 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
>> index 2ff26f53cd62..af594aa65307 100644
>> --- a/arch/x86/include/asm/hyperv-tlfs.h
>> +++ b/arch/x86/include/asm/hyperv-tlfs.h
>> @@ -49,7 +49,7 @@
>>   /* Support for physical CPU dynamic partitioning events is available*/
>>   #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE	BIT(3)
>>   /*
>> - * Support for passing hypercall input parameter block via XMM
>> + * Support for passing hypercall input and output parameter block via XMM
>>    * registers is available
>>    */
>>   #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE		BIT(4)
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index 238afd7335e4..e1bc861ab3b0 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -1815,6 +1815,7 @@ struct kvm_hv_hcall {
>>   	u16 rep_idx;
>>   	bool fast;
>>   	bool rep;
>> +	bool xmm_dirty;
>>   	sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS];
>>   
>>   	/*
>> @@ -2346,9 +2347,33 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
>>   	return ret;
>>   }
>>   
>> +static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
>> +{
>> +	int reg;
>> +
>> +	kvm_fpu_get();
>> +	for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) {
>> +		const sse128_t data = sse128(xmm[reg].low, xmm[reg].high);
>> +		_kvm_write_sse_reg(reg, &data);
>> +	}
>> +	kvm_fpu_put();
>> +}
>> +
>> +static bool kvm_hv_is_xmm_output_hcall(u16 code)
>> +{
>> +	return false;
>> +}
>> +
>>   static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
>>   {
>> -	return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result);
>> +	bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT);
>> +	u16 code = vcpu->run->hyperv.u.hcall.input & 0xffff;
>> +	u64 result = vcpu->run->hyperv.u.hcall.result;
>> +
>> +	if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast)
>> +		kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
>> +
>> +	return kvm_hv_hypercall_complete(vcpu, result);
>>   }
>>   
>>   static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>> @@ -2623,6 +2648,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>>   		break;
>>   	}
>>   
>> +	if ((ret & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS && hc.xmm_dirty)
>> +		kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg*)hc.xmm);
>> +
>>   hypercall_complete:
>>   	return kvm_hv_hypercall_complete(vcpu, ret);
>>   
>> @@ -2632,6 +2660,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>>   	vcpu->run->hyperv.u.hcall.input = hc.param;
>>   	vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa;
>>   	vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa;
>> +	if (hc.fast)
>> +		memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm));
>>   	vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace;
>>   	return 0;
>>   }
>> @@ -2780,6 +2810,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>>   			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
>>   
>>   			ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
>> +			ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
>
>
> Shouldn't this be guarded by an ENABLE_CAP to make sure old user space 
> that doesn't know about xmm outputs is still able to run with newer kernels?
>

No, we don't do CAPs for new Hyper-V features anymore since we have
KVM_GET_SUPPORTED_HV_CPUID. Userspace is not supposed to simply copy
its output into guest-visible CPUIDs; it must only enable features it
knows. Even the 'hv_passthrough' option in QEMU doesn't pass unknown
features through.

>
>>   			ent->edx |= HV_FEATURE_FREQUENCY_MSRS_AVAILABLE;
>>   			ent->edx |= HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
>>   
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index d7a01766bf21..5ce06a1eee2b 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -192,6 +192,11 @@ struct kvm_s390_cmma_log {
>>   	__u64 values;
>>   };
>>   
>> +struct kvm_hyperv_xmm_reg {
>> +	__u64 low;
>> +	__u64 high;
>> +};
>> +
>>   struct kvm_hyperv_exit {
>>   #define KVM_EXIT_HYPERV_SYNIC          1
>>   #define KVM_EXIT_HYPERV_HCALL          2
>> @@ -210,6 +215,7 @@ struct kvm_hyperv_exit {
>>   			__u64 input;
>>   			__u64 result;
>>   			__u64 params[2];
>> +			struct kvm_hyperv_xmm_reg xmm[6];
>
>
> Would this change the size of struct kvm_hyperv_exit? And if so, 
> wouldn't that potentially be a UABI breakage?
>

Yes. 'struct kvm_hyperv_exit' has a 'type' field which determines which
particular type of the union (synic/hcall/syndbg) is used. The easiest
would probably be to introduce a new type (hcall_with_xmm or something
like that).
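
Roughly (just a sketch, names included):

	#define KVM_EXIT_HYPERV_HCALL_XMM 4
	...
	struct {
		__u64 input;
		__u64 result;
		__u64 params[2];
		struct kvm_hyperv_xmm_reg xmm[6];
	} hcall_xmm;

so the existing 'hcall' layout stays untouched for old user space.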

>
> Alex
>
>
>
>
>
>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
  2023-11-08 11:17 ` [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS Nicolas Saenz Julienne
@ 2023-11-08 12:11   ` Alexander Graf
  2023-11-08 17:47   ` Sean Christopherson
  2023-11-28  6:56   ` Maxim Levitsky
  2 siblings, 0 replies; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 12:11 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc,
	Anel Orazgaliyeva


On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> From: Anel Orazgaliyeva <anelkz@amazon.de>
>
> Introduce KVM_CAP_APIC_ID_GROUPS, a capability that splits the VM's APIC
> ids in two. The lower bits, the physical APIC id, represent the part
> that's exposed to the guest. The higher bits, which are private to KVM,
> group APICs together. APICs in different groups are isolated from each
> other, and IPIs can only be directed at APICs that share the same group
> as their source. Furthermore, groups are only relevant to IPIs; anything
> incoming from outside the local APIC complex (the IOAPIC, MSIs, or
> PV-IPIs) is targeted at the default APIC group, group 0.
>
> When routing IPIs with physical destinations, KVM will OR the source's
> vCPU APIC group with the ICR's destination ID and use that to resolve
> the target lAPIC. The APIC physical map is also made group aware in
> order to speed up this process. For the sake of simplicity, the logical
> map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI
> routing to the slower per-vCPU scan method.
>
> This capability serves as a building block to implement
> virtualisation-based security features like Hyper-V's Virtual Secure
> Mode (VSM). VSM introduces a para-virtualised switch that allows guest
> CPUs to jump into a different execution context; this switches into a
> different CPU state, lAPIC state, and memory protections. We model this
> in KVM by using distinct kvm_vcpus for each context. Moreover, execution
> contexts are hierarchical and their APICs are meant to remain functional
> even when the context isn't 'scheduled in'. For example, we have to keep
> track of timers' expirations, and interrupt execution of lesser priority
> contexts when relevant. Hence the need to alias physical APIC ids, while
> keeping the ability to target specific execution contexts.
>
> Signed-off-by: Anel Orazgaliyeva <anelkz@amazon.de>
> Co-developed-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  3 ++
>   arch/x86/include/uapi/asm/kvm.h |  5 +++
>   arch/x86/kvm/lapic.c            | 59 ++++++++++++++++++++++++++++-----
>   arch/x86/kvm/lapic.h            | 33 ++++++++++++++++++
>   arch/x86/kvm/x86.c              | 15 +++++++++
>   include/uapi/linux/kvm.h        |  2 ++
>   6 files changed, 108 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index dff10051e9b6..a2f224f95404 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1298,6 +1298,9 @@ struct kvm_arch {
>   	struct rw_semaphore apicv_update_lock;
>   	unsigned long apicv_inhibit_reasons;
>   
> +	u32 apic_id_group_mask;
> +	u8 apic_id_group_shift;
> +
>   	gpa_t wall_clock;
>   
>   	bool mwait_in_guest;
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index a448d0964fc0..f73d137784d7 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -565,4 +565,9 @@ struct kvm_pmu_event_filter {
>   #define KVM_X86_DEFAULT_VM	0
>   #define KVM_X86_SW_PROTECTED_VM	1
>   
> +/* for KVM_SET_APIC_ID_GROUPS */
> +struct kvm_apic_id_groups {
> +	__u8 n_bits; /* nr of bits used to represent group in the APIC ID */
> +};
> +
>   #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 3e977dbbf993..f55d216cb2a0 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -141,7 +141,7 @@ static inline int apic_enabled(struct kvm_lapic *apic)
>   
>   static inline u32 kvm_x2apic_id(struct kvm_lapic *apic)
>   {
> -	return apic->vcpu->vcpu_id;
> +	return kvm_apic_id(apic->vcpu);
>   }
>   
>   static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu)
> @@ -219,8 +219,8 @@ static int kvm_recalculate_phys_map(struct kvm_apic_map *new,
>   				    bool *xapic_id_mismatch)
>   {
>   	struct kvm_lapic *apic = vcpu->arch.apic;
> -	u32 x2apic_id = kvm_x2apic_id(apic);
> -	u32 xapic_id = kvm_xapic_id(apic);
> +	u32 x2apic_id = kvm_apic_id_and_group(vcpu);
> +	u32 xapic_id = kvm_apic_id_and_group(vcpu);
>   	u32 physical_id;
>   
>   	/*
> @@ -299,6 +299,13 @@ static void kvm_recalculate_logical_map(struct kvm_apic_map *new,
>   	u16 mask;
>   	u32 ldr;
>   
> +	/*
> +	 * Using maps for logical destinations when KVM_CAP_APIC_ID_GROUPS is in
> +	 * use isn't supported.
> +	 */
> +	if (kvm_apic_group(vcpu))
> +		new->logical_mode = KVM_APIC_MODE_MAP_DISABLED;
> +
>   	if (new->logical_mode == KVM_APIC_MODE_MAP_DISABLED)
>   		return;
>   
> @@ -370,6 +377,25 @@ enum {
>   	DIRTY
>   };
>   
> +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
> +				    struct kvm_apic_id_groups *groups)
> +{
> +	u8 n_bits = groups->n_bits;
> +
> +	if (n_bits > 32)
> +		return -EINVAL;
> +
> +	kvm->arch.apic_id_group_mask = n_bits ? GENMASK(31, 32 - n_bits) : 0;
> +	/*
> +	 * Bit shifts >= the width of the type are undefined behavior, so
> +	 * set the APIC group shift to 0 when n_bits == 0. The group mask
> +	 * above will clear the APIC ID, so group querying functions will
> +	 * return the correct value.
> +	 */
> +	kvm->arch.apic_id_group_shift = n_bits ? 32 - n_bits : 0;
> +	return 0;
> +}
> +
>   void kvm_recalculate_apic_map(struct kvm *kvm)
>   {
>   	struct kvm_apic_map *new, *old = NULL;
> @@ -414,7 +440,7 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
>   
>   	kvm_for_each_vcpu(i, vcpu, kvm)
>   		if (kvm_apic_present(vcpu))
> -			max_id = max(max_id, kvm_x2apic_id(vcpu->arch.apic));
> +			max_id = max(max_id, kvm_apic_id_and_group(vcpu));
>   
>   	new = kvzalloc(sizeof(struct kvm_apic_map) +
>   	                   sizeof(struct kvm_lapic *) * ((u64)max_id + 1),
> @@ -525,7 +551,7 @@ static inline void kvm_apic_set_x2apic_id(struct kvm_lapic *apic, u32 id)
>   {
>   	u32 ldr = kvm_apic_calc_x2apic_ldr(id);
>   
> -	WARN_ON_ONCE(id != apic->vcpu->vcpu_id);
> +	WARN_ON_ONCE(id != kvm_apic_id(apic->vcpu));
>   
>   	kvm_lapic_set_reg(apic, APIC_ID, id);
>   	kvm_lapic_set_reg(apic, APIC_LDR, ldr);
> @@ -1067,6 +1093,17 @@ bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
>   	struct kvm_lapic *target = vcpu->arch.apic;
>   	u32 mda = kvm_apic_mda(vcpu, dest, source, target);
>   
> +	/*
> +	 * Make sure vCPUs belong to the same APIC group; it's not possible
> +	 * to send interrupts across groups.
> +	 *
> +	 * Non-IPIs and PV-IPIs can only be injected into the default APIC
> +	 * group (group 0).
> +	 */
> +	if ((source && !kvm_match_apic_group(source->vcpu, vcpu)) ||
> +	    kvm_apic_group(vcpu))
> +		return false;
> +
>   	ASSERT(target);
>   	switch (shorthand) {
>   	case APIC_DEST_NOSHORT:
> @@ -1518,6 +1555,10 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
>   	else
>   		irq.dest_id = GET_XAPIC_DEST_FIELD(icr_high);
>   
> +	if (irq.dest_mode == APIC_DEST_PHYSICAL)
> +		kvm_apic_id_set_group(apic->vcpu->kvm,
> +				      kvm_apic_group(apic->vcpu), &irq.dest_id);
> +
>   	trace_kvm_apic_ipi(icr_low, irq.dest_id);
>   
>   	kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
> @@ -2541,7 +2582,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>   	/* update jump label if enable bit changes */
>   	if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE) {
>   		if (value & MSR_IA32_APICBASE_ENABLE) {
> -			kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
> +			kvm_apic_set_xapic_id(apic, kvm_apic_id(vcpu));
>   			static_branch_slow_dec_deferred(&apic_hw_disabled);
>   			/* Check if there are APF page ready requests pending */
>   			kvm_make_request(KVM_REQ_APF_READY, vcpu);
> @@ -2553,9 +2594,9 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>   
>   	if ((old_value ^ value) & X2APIC_ENABLE) {
>   		if (value & X2APIC_ENABLE)
> -			kvm_apic_set_x2apic_id(apic, vcpu->vcpu_id);
> +			kvm_apic_set_x2apic_id(apic, kvm_apic_id(vcpu));
>   		else if (value & MSR_IA32_APICBASE_ENABLE)
> -			kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
> +			kvm_apic_set_xapic_id(apic, kvm_apic_id(vcpu));
>   	}
>   
>   	if ((old_value ^ value) & (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE)) {
> @@ -2685,7 +2726,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
>   
>   	/* The xAPIC ID is set at RESET even if the APIC was already enabled. */
>   	if (!init_event)
> -		kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
> +		kvm_apic_set_xapic_id(apic, kvm_apic_id(vcpu));
>   	kvm_apic_set_version(apic->vcpu);
>   
>   	for (i = 0; i < apic->nr_lvt_entries; i++)
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index e1021517cf04..542bd208e52b 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -97,6 +97,8 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
>   void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
>   void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
>   u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
> +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
> +				    struct kvm_apic_id_groups *groups);
>   void kvm_recalculate_apic_map(struct kvm *kvm);
>   void kvm_apic_set_version(struct kvm_vcpu *vcpu);
>   void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu);
> @@ -277,4 +279,35 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
>   	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
>   }
>   
> +static inline u32 kvm_apic_id(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->vcpu_id & ~vcpu->kvm->arch.apic_id_group_mask;
> +}
> +
> +static inline u32 kvm_apic_id_and_group(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->vcpu_id;
> +}
> +
> +static inline u32 kvm_apic_group(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +
> +	return (vcpu->vcpu_id & kvm->arch.apic_id_group_mask) >>
> +	       kvm->arch.apic_id_group_shift;
> +}
> +
> +static inline void kvm_apic_id_set_group(struct kvm *kvm, u32 group,
> +					 u32 *apic_id)
> +{
> +	*apic_id |= ((group << kvm->arch.apic_id_group_shift) &
> +		     kvm->arch.apic_id_group_mask);
> +}
> +
> +static inline bool kvm_match_apic_group(struct kvm_vcpu *src,
> +					struct kvm_vcpu *dst)
> +{
> +	return kvm_apic_group(src) == kvm_apic_group(dst);
> +}
> +
>   #endif
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e3eb608b6692..4cd3f00475c1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4526,6 +4526,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
>   	case KVM_CAP_IRQFD_RESAMPLE:
>   	case KVM_CAP_MEMORY_FAULT_INFO:
> +	case KVM_CAP_APIC_ID_GROUPS:
>   		r = 1;
>   		break;
>   	case KVM_CAP_EXIT_HYPERCALL:
> @@ -7112,6 +7113,20 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>   		r = kvm_vm_ioctl_set_msr_filter(kvm, &filter);
>   		break;
>   	}
> +	case KVM_SET_APIC_ID_GROUPS: {


Instead of the separate ioctl, could this just be part of the ENABLE_CAP 
arguments? See kvm_vm_ioctl_enable_cap() for reference.
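
E.g., roughly, from userspace (the argument layout is illustrative):

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_APIC_ID_GROUPS,
		/* nr of APIC ID bits reserved for the group */
		.args[0] = n_bits,
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);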

Also, for v1 please document the CAP in Documentation/virt/kvm/api.rst.

And finally, isn't there some guest exposure of the number of bits an 
APIC ID can have?


Alex







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 11/33] KVM: x86: hyper-v: Handle GET/SET_VP_REGISTER hcall in user-space
  2023-11-08 11:17 ` [RFC 11/33] KVM: x86: hyper-v: Handle GET/SET_VP_REGISTER hcall in user-space Nicolas Saenz Julienne
@ 2023-11-08 12:14   ` Alexander Graf
  2023-11-28  7:26     ` Maxim Levitsky
  0 siblings, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 12:14 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc


On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> Let user-space handle HVCALL_GET_VP_REGISTERS and
> HVCALL_SET_VP_REGISTERS through the KVM_EXIT_HYPERV_HCALL exit reason.
> Additionally, expose the CPUID bit.
>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>   arch/x86/kvm/hyperv.c             | 9 +++++++++
>   include/asm-generic/hyperv-tlfs.h | 1 +
>   2 files changed, 10 insertions(+)
>
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index caaa859932c5..a3970d52eef1 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -2456,6 +2456,9 @@ static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
>   
>   static bool kvm_hv_is_xmm_output_hcall(u16 code)
>   {
> +	if (code == HVCALL_GET_VP_REGISTERS)
> +		return true;
> +
>   	return false;
>   }
>   
> @@ -2520,6 +2523,8 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc)
>   	case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
>   	case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
>   	case HVCALL_SEND_IPI_EX:
> +	case HVCALL_GET_VP_REGISTERS:
> +	case HVCALL_SET_VP_REGISTERS:
>   		return true;
>   	}
>   
> @@ -2738,6 +2743,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>   			break;
>   		}
>   		goto hypercall_userspace_exit;
> +	case HVCALL_GET_VP_REGISTERS:
> +	case HVCALL_SET_VP_REGISTERS:
> +		goto hypercall_userspace_exit;
>   	default:
>   		ret = HV_STATUS_INVALID_HYPERCALL_CODE;
>   		break;
> @@ -2903,6 +2911,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>   			ent->ebx |= HV_POST_MESSAGES;
>   			ent->ebx |= HV_SIGNAL_EVENTS;
>   			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
> +			ent->ebx |= HV_ACCESS_VP_REGISTERS;


Do we need to guard this?


Alex







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support
  2023-11-08 12:11     ` Vitaly Kuznetsov
@ 2023-11-08 12:16       ` Alexander Graf
  2023-11-28  6:57         ` Maxim Levitsky
  0 siblings, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 12:16 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, anelkz, dwmw,
	jgowans, corbert, kys, haiyangz, decui, x86, linux-doc


On 08.11.23 13:11, Vitaly Kuznetsov wrote:
> Alexander Graf <graf@amazon.com> writes:
>
>> On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
>>> Prepare infrastructure to be able to return data through the XMM
>>> registers when Hyper-V hypercalls are issued in fast mode. The XMM
>>> registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and
>>> restored on successful hypercall completion.
>>>
>>> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
>>> ---
>>>    arch/x86/include/asm/hyperv-tlfs.h |  2 +-
>>>    arch/x86/kvm/hyperv.c              | 33 +++++++++++++++++++++++++++++-
>>>    include/uapi/linux/kvm.h           |  6 ++++++
>>>    3 files changed, 39 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
>>> index 2ff26f53cd62..af594aa65307 100644
>>> --- a/arch/x86/include/asm/hyperv-tlfs.h
>>> +++ b/arch/x86/include/asm/hyperv-tlfs.h
>>> @@ -49,7 +49,7 @@
>>>    /* Support for physical CPU dynamic partitioning events is available*/
>>>    #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE  BIT(3)
>>>    /*
>>> - * Support for passing hypercall input parameter block via XMM
>>> + * Support for passing hypercall input and output parameter block via XMM
>>>     * registers is available
>>>     */
>>>    #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE               BIT(4)
>>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>>> index 238afd7335e4..e1bc861ab3b0 100644
>>> --- a/arch/x86/kvm/hyperv.c
>>> +++ b/arch/x86/kvm/hyperv.c
>>> @@ -1815,6 +1815,7 @@ struct kvm_hv_hcall {
>>>       u16 rep_idx;
>>>       bool fast;
>>>       bool rep;
>>> +    bool xmm_dirty;
>>>       sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS];
>>>
>>>       /*
>>> @@ -2346,9 +2347,33 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
>>>       return ret;
>>>    }
>>>
>>> +static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
>>> +{
>>> +    int reg;
>>> +
>>> +    kvm_fpu_get();
>>> +    for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) {
>>> +            const sse128_t data = sse128(xmm[reg].low, xmm[reg].high);
>>> +            _kvm_write_sse_reg(reg, &data);
>>> +    }
>>> +    kvm_fpu_put();
>>> +}
>>> +
>>> +static bool kvm_hv_is_xmm_output_hcall(u16 code)
>>> +{
>>> +    return false;
>>> +}
>>> +
>>>    static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
>>>    {
>>> -    return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result);
>>> +    bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT);
>>> +    u16 code = vcpu->run->hyperv.u.hcall.input & 0xffff;
>>> +    u64 result = vcpu->run->hyperv.u.hcall.result;
>>> +
>>> +    if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast)
>>> +            kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
>>> +
>>> +    return kvm_hv_hypercall_complete(vcpu, result);
>>>    }
>>>
>>>    static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>>> @@ -2623,6 +2648,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>>>               break;
>>>       }
>>>
>>> +    if ((ret & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS && hc.xmm_dirty)
>>> +            kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg*)hc.xmm);
>>> +
>>>    hypercall_complete:
>>>       return kvm_hv_hypercall_complete(vcpu, ret);
>>>
>>> @@ -2632,6 +2660,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>>>       vcpu->run->hyperv.u.hcall.input = hc.param;
>>>       vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa;
>>>       vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa;
>>> +    if (hc.fast)
>>> +            memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm));
>>>       vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace;
>>>       return 0;
>>>    }
>>> @@ -2780,6 +2810,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>>>                       ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
>>>
>>>                       ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
>>> +                    ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
>>
>> Shouldn't this be guarded by an ENABLE_CAP to make sure old user space
>> that doesn't know about xmm outputs is still able to run with newer kernels?
>>
> No, we don't do CAPs for new Hyper-V features anymore since we have
> KVM_GET_SUPPORTED_HV_CPUID. Userspace is not supposed to simply copy
> its output into guest-visible CPUIDs; it must only enable features it
> knows. Even the 'hv_passthrough' option in QEMU doesn't pass unknown
> features through.


Ah, nice :). That simplifies things.


Alex







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers
  2023-11-08 11:17 ` [RFC 09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers Nicolas Saenz Julienne
@ 2023-11-08 12:21   ` Alexander Graf
  2023-11-08 14:04     ` Nicolas Saenz Julienne
  2023-11-28  7:25   ` Maxim Levitsky
  1 sibling, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 12:21 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc


On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> Introduce two helper functions. The first one queries a vCPU's VTL
> level; the second, given a struct kvm_vcpu and VTL pair, returns the
> corresponding 'sibling' struct kvm_vcpu at the right VTL.
>
> We keep track of each VTL's state by having a distinct struct kvm_vcpu
> for each level. VTL-vCPUs that belong to the same guest CPU share the
> same physical APIC id, but belong to different APIC groups where the
> apic group represents the vCPU's VTL.
>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>   arch/x86/kvm/hyperv.h | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
> index 2bfed69ba0db..5433107e7cc8 100644
> --- a/arch/x86/kvm/hyperv.h
> +++ b/arch/x86/kvm/hyperv.h
> @@ -23,6 +23,7 @@
>   
>   #include <linux/kvm_host.h>
>   #include "x86.h"
> +#include "lapic.h"
>   
>   /* "Hv#1" signature */
>   #define HYPERV_CPUID_SIGNATURE_EAX 0x31237648
> @@ -83,6 +84,23 @@ static inline struct kvm_hv_syndbg *to_hv_syndbg(struct kvm_vcpu *vcpu)
>   	return &vcpu->kvm->arch.hyperv.hv_syndbg;
>   }
>   
> +static inline struct kvm_vcpu *kvm_hv_get_vtl_vcpu(struct kvm_vcpu *vcpu, int vtl)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u32 target_id = kvm_apic_id(vcpu);
> +
> +	kvm_apic_id_set_group(kvm, vtl, &target_id);
> +	if (vcpu->vcpu_id == target_id)
> +		return vcpu;
> +
> +	return kvm_get_vcpu_by_id(kvm, target_id);
> +}
> +
> +static inline u8 kvm_hv_get_active_vtl(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_apic_group(vcpu);


Shouldn't this check whether VTL is active? If someone wants to use APIC 
groups for a different purpose in the future, they'd suddenly find 
themselves in VTL code paths in other code (such as memory protections), no?

Alex








^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 25/33] KVM: Introduce a set of new memory attributes
  2023-11-08 11:17 ` [RFC 25/33] KVM: Introduce a set of new memory attributes Nicolas Saenz Julienne
@ 2023-11-08 12:30   ` Alexander Graf
  2023-11-08 16:43     ` Sean Christopherson
  0 siblings, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 12:30 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc


On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> Introduce the following memory attributes:
>   - KVM_MEMORY_ATTRIBUTE_READ
>   - KVM_MEMORY_ATTRIBUTE_WRITE
>   - KVM_MEMORY_ATTRIBUTE_EXECUTE
>   - KVM_MEMORY_ATTRIBUTE_NO_ACCESS
>
> Note that NO_ACCESS is necessary in order to distinguish between the
> lack of attributes for a gfn, which defaults to the memory protections
> of the backing memory, and explicitly prohibiting any access to that
> gfn.


If we negate the attributes (no read, no write, no execute), we can keep 
0 == default and 0b111 becomes "no access".
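
I.e., sketching the negated encoding (bit positions illustrative):

	#define KVM_MEMORY_ATTRIBUTE_NO_READ	(1ULL << 0)
	#define KVM_MEMORY_ATTRIBUTE_NO_WRITE	(1ULL << 1)
	#define KVM_MEMORY_ATTRIBUTE_NO_EXECUTE	(1ULL << 2)

	/*
	 * 0 == nothing set == inherit the backing memory's protections;
	 * all three "no" bits set == no access, so a separate NO_ACCESS
	 * flag becomes unnecessary.
	 */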


Alex







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
  2023-11-08 11:18 ` [RFC 29/33] KVM: VMX: Save instruction length on EPT violation Nicolas Saenz Julienne
@ 2023-11-08 12:40   ` Alexander Graf
  2023-11-08 16:15     ` Sean Christopherson
  2023-11-08 17:20   ` Sean Christopherson
  1 sibling, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 12:40 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc


On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> Save the length of the instruction that triggered an EPT violation in
> struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory
> intercept messages.
>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>


In v1, please do this for SVM as well :)


Alex








^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request
  2023-11-08 11:18 ` [RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request Nicolas Saenz Julienne
@ 2023-11-08 12:45   ` Alexander Graf
  2023-11-08 13:38     ` Nicolas Saenz Julienne
  0 siblings, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 12:45 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc


On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT, which allows
> injecting out-of-band Hyper-V secure intercepts. For now only memory
> access intercepts are supported. These are triggered when accessing a GPA
> protected by a higher VTL. The memory intercept metadata is filled based
> on the GPA provided through struct kvm_vcpu_hv_intercept_info, and
> injected into the guest through SynIC message.
>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>


IMHO memory protection violations should result in a user space exit. 
User space can then validate what to do with the violation and if 
necessary inject an intercept.

That means from an API point of view, you want a new exit reason 
(violation) and an ioctl that allows you to transmit the violating CPU 
state into the target vCPU. I don't think the injection should even know 
that the source of data for the violation was a vCPU.



Alex







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS
  2023-11-08 11:18 ` [RFC 32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS Nicolas Saenz Julienne
@ 2023-11-08 12:49   ` Alexander Graf
  2023-11-08 13:44     ` Nicolas Saenz Julienne
  0 siblings, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 12:49 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc


On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> Introduce HVCALL_TRANSLATE_VIRTUAL_ADDRESS. The hypercall receives a
> GVA, generally from a less privileged VTL, and returns the GPA backing
> it. The GVA -> GPA conversion is done by walking the target VTL's vCPU
> MMU.
>
> NOTE: The hypercall implementation is incomplete and only shared for
> completeness. Additionally, we'd like to move the VTL-aware parts to
> user-space.


Yes, please :). We should handle the complete hypercall in user space if 
possible. If you're afraid that gva -> gpa conversion may run out of 
sync between the user-space and KVM implementations, let's introduce 
an ioctl that allows you to perform that conversion.


Alex







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request
  2023-11-08 12:45   ` Alexander Graf
@ 2023-11-08 13:38     ` Nicolas Saenz Julienne
  2023-11-28  8:19       ` Maxim Levitsky
  0 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 13:38 UTC (permalink / raw)
  To: Alexander Graf, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc

On Wed Nov 8, 2023 at 12:45 PM UTC, Alexander Graf wrote:
>
> On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> > Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT, which allows
> > injecting out-of-band Hyper-V secure intercepts. For now only memory
> > access intercepts are supported. These are triggered when accessing a GPA
> > protected by a higher VTL. The memory intercept metadata is filled based
> > on the GPA provided through struct kvm_vcpu_hv_intercept_info, and
> > injected into the guest through SynIC message.
> >
> > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
>
>
> IMHO memory protection violations should result in a user space exit. 

It already does; it's just not very explicit from the patch itself, since
the functionality was introduced through the "KVM: guest_memfd() and
per-page attributes" series [1].

See this snippet in patch #27:

+	if (kvm_hv_vsm_enabled(vcpu->kvm)) {
+		if (kvm_hv_faultin_pfn(vcpu, fault)) {
+			kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+			return -EFAULT;
+		}
+	}

Otherwise the doc in patch #33 also mentions this. :)

> User space can then validate what to do with the violation and if 
> necessary inject an intercept.

I do agree that secure intercept injection should be moved into
user-space, and happen as a reaction to a user-space memory fault exit.
I haven't been able to do so yet, since the intercepts require a level of
introspection that is not yet available to QEMU. For example, providing
the length of the instruction that caused the fault. I'll work on
exposing the necessary information to user-space and move the whole
intercept concept there.

Nicolas

[1] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonzini@redhat.com/.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS
  2023-11-08 12:49   ` Alexander Graf
@ 2023-11-08 13:44     ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 13:44 UTC (permalink / raw)
  To: Alexander Graf, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc

On Wed Nov 8, 2023 at 12:49 PM UTC, Alexander Graf wrote:
>
> On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> > Introduce HVCALL_TRANSLATE_VIRTUAL_ADDRESS. The hypercall receives a
> > GVA, generally from a less privileged VTL, and returns the GPA backing
> > it. The GVA -> GPA conversion is done by walking the target VTL's vCPU
> > MMU.
> >
> > NOTE: The hypercall implementation is incomplete and only shared for
> > completeness. Additionally, we'd like to move the VTL-aware parts to
> > user-space.
>
>
> Yes, please :). We should handle the complete hypercall in user space if 
> possible. If you're afraid that gva -> gpa conversion may run out of 
> sync between the user-space and KVM implementations, let's introduce 
> an ioctl that allows you to perform that conversion.

I'll look into introducing a generic API that performs MMU walks. The
devil is in the details though; the hypercall introduces flags like:

• HV_TRANSLATE_GVA_TLB_FLUSH_INHIBIT: Indicates that the TlbFlushInhibit
  flag in the virtual processor’s HvRegisterInterceptSuspend register
  should be set as a consequence of a successful return. This prevents
  other virtual processors associated with the target partition from
  flushing the stage 1 TLB of the specified virtual processor until
  after the TlbFlushInhibit flag is cleared.

Which make things trickier.

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers
  2023-11-08 12:21   ` Alexander Graf
@ 2023-11-08 14:04     ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 14:04 UTC (permalink / raw)
  To: Alexander Graf, kvm, anelkz
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, kys, haiyangz, decui, x86, linux-doc

On Wed Nov 8, 2023 at 12:21 PM UTC, Alexander Graf wrote:
>
> On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> > Introduce two helper functions. The first one queries a vCPU's VTL
> > level; the second, given a struct kvm_vcpu and VTL pair, returns the
> > corresponding 'sibling' struct kvm_vcpu at the right VTL.
> >
> > We keep track of each VTL's state by having a distinct struct kvm_vcpu
> > for each level. VTL-vCPUs that belong to the same guest CPU share the
> > same physical APIC id, but belong to different APIC groups where the
> > apic group represents the vCPU's VTL.
> >
> > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> > ---
> >   arch/x86/kvm/hyperv.h | 18 ++++++++++++++++++
> >   1 file changed, 18 insertions(+)
> >
> > diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
> > index 2bfed69ba0db..5433107e7cc8 100644
> > --- a/arch/x86/kvm/hyperv.h
> > +++ b/arch/x86/kvm/hyperv.h
> > @@ -23,6 +23,7 @@
> >   
> >   #include <linux/kvm_host.h>
> >   #include "x86.h"
> > +#include "lapic.h"
> >   
> >   /* "Hv#1" signature */
> >   #define HYPERV_CPUID_SIGNATURE_EAX 0x31237648
> > @@ -83,6 +84,23 @@ static inline struct kvm_hv_syndbg *to_hv_syndbg(struct kvm_vcpu *vcpu)
> >   	return &vcpu->kvm->arch.hyperv.hv_syndbg;
> >   }
> >   
> > +static inline struct kvm_vcpu *kvm_hv_get_vtl_vcpu(struct kvm_vcpu *vcpu, int vtl)
> > +{
> > +	struct kvm *kvm = vcpu->kvm;
> > +	u32 target_id = kvm_apic_id(vcpu);
> > +
> > +	kvm_apic_id_set_group(kvm, vtl, &target_id);
> > +	if (vcpu->vcpu_id == target_id)
> > +		return vcpu;
> > +
> > +	return kvm_get_vcpu_by_id(kvm, target_id);
> > +}
> > +
> > +static inline u8 kvm_hv_get_active_vtl(struct kvm_vcpu *vcpu)
> > +{
> > +	return kvm_apic_group(vcpu);
>
> Shouldn't this check whether VTL is active? If someone wants to use APIC 
> groups for a different purpose in the future, they'd suddenly find 
> themselves in VTL code paths in other code (such as memory protections), no?

Yes, indeed. This is solved by adding a couple of checks against
kvm_hv_vsm_enabled().

I don't have another use-case in mind for APIC ID groups, so it's hard
to tell if I'm just over-engineering things, but I wonder if we need to
introduce some sort of protection against concurrent usages.

For example we could introduce masks within the group bits and have
consumers explicitly request what they want. Something like:

	vtl = kvm_apic_group(vcpu, HV_VTL);

If user-space didn't reserve bits within the APIC ID group area and
mark them with HV_VTL, you'd get an error as opposed to 0, which is
otherwise a valid group.
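
A rough sketch of that lookup (the apic_id_group_tag field and the
error convention are hypothetical):

	static inline int kvm_apic_group(struct kvm_vcpu *vcpu, u32 tag)
	{
		struct kvm *kvm = vcpu->kvm;

		/* Fail unless userspace reserved the group bits for 'tag'. */
		if (kvm->arch.apic_id_group_tag != tag)
			return -EINVAL;

		return (vcpu->vcpu_id & kvm->arch.apic_id_group_mask) >>
		       kvm->arch.apic_id_group_shift;
	}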

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-11-08 11:53   ` Alexander Graf
@ 2023-11-08 14:10     ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 14:10 UTC (permalink / raw)
  To: Alexander Graf, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc

On Wed Nov 8, 2023 at 11:53 AM UTC, Alexander Graf wrote:

[...]

> > @@ -285,6 +286,81 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
> >   	/* ret */
> >   	((unsigned char *)instructions)[i++] = 0xc3;
> >   
> > +	/* VTL call/return entries */
> > +	if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) {
>
>
> You don't introduce kvm_hv_vsm_enabled() before. Please do a quick test 
> build of all individual commits of your patch set for v1 :).

Yes, sorry for that. This happens for a couple of helpers, I'll fix it.

> Why do you need the ifdef here? is_long_mode() already has an ifdef that 
> will always return false for is_64_bit_mode() on 32bit hosts.

Noted, will remove.

> > +		if (is_64_bit_mode(vcpu)) {
> > +			/*
> > +			 * VTL call 64-bit entry prologue:
> > +			 * 	mov %rcx, %rax
> > +			 * 	mov $0x11, %ecx
> > +			 * 	jmp 0:
> > +			 */
> > +			hv->vsm_code_page_offsets.vtl_call_offset = i;
> > +			instructions[i++] = 0x48;
> > +			instructions[i++] = 0x89;
> > +			instructions[i++] = 0xc8;
> > +			instructions[i++] = 0xb9;
> > +			instructions[i++] = 0x11;
> > +			instructions[i++] = 0x00;
> > +			instructions[i++] = 0x00;
> > +			instructions[i++] = 0x00;
> > +			instructions[i++] = 0xeb;
> > +			instructions[i++] = 0xe0;
>
>
> I think it would be a lot easier to read (because it's denser) if you 
> move the opcodes into a character array:
>
> char vtl_entry[] = { 0x48, 0x89, 0xc8, 0xb9, 0x11, 0x00, 0x00, 0x00, 
> 0xeb, 0xe0 };
>
> and then just memcpy().

Works for me, I'll rework it.

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
  2023-11-08 11:40 ` [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Alexander Graf
@ 2023-11-08 14:41   ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-08 14:41 UTC (permalink / raw)
  To: Alexander Graf, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, kys, haiyangz, decui, x86, linux-doc

On Wed Nov 8, 2023 at 11:40 AM UTC, Alexander Graf wrote:
> Hey Nicolas,

[...]

> > The series is accompanied by two repositories:
> >   - A PoC QEMU implementation of VSM [3].
> >   - VSM kvm-unit-tests [4].
> >
> > Note that this isn't a full VSM implementation. For now it only supports
> > 2 VTLs, and only runs on uniprocessor guests. It is capable of booting
> > Windows Server 2016/2019, but is unstable during runtime.
>
> How many of these limitations are inherent in the current set of 
> patches? What is missing to go beyond 2 VTLs and into SMP land? Anything 
> that will require API changes?

The main KVM concepts introduced by this series are ready to deal with
any number of VTLs (APIC ID groups, VTL KVM device).

KVM_HV_GET_VSM_STATE should provide a copy of 'vsm_code_page_offsets'
per-VTL, since the hypercall page is partition-wide but its contents are
per-VTL. Attaching that information as a VTL KVM device attribute fits
that requirement nicely. I'd prefer going that way, especially if the
VTL KVM device has a decent reception. Also, the secure memory
intercepts and HVCALL_TRANSLATE_VIRTUAL_ADDRESS take some VTL-related
shortcuts, but those are going away. Otherwise, I don't see any
necessary in-kernel changes.

When virtualizing Windows with VSM, I've never seen usages that go
beyond VTL1, so enabling VTL > 1 will be mostly a kvm-unit-tests effort.
As for SMP, it's just a matter of work. Notably, HvStartVirtualProcessor
and HvGetVpIndexFromApicId need to be implemented, and we need to make
sure the QEMU VTL scheduling code holds up.

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 01/33] KVM: x86: Decouple lapic.h from hyperv.h
  2023-11-08 11:17 ` [RFC 01/33] KVM: x86: Decouple lapic.h from hyperv.h Nicolas Saenz Julienne
@ 2023-11-08 16:11   ` Sean Christopherson
  0 siblings, 0 replies; 108+ messages in thread
From: Sean Christopherson @ 2023-11-08 16:11 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> lapic.h has no dependencies on hyperv.h, so don't include it there.
> 
> Additionally, cpuid.c implicitly relied on hyperv.h's inclusion through
> lapic.h, so include it explicitly there.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---

FWIW, feel free to post patches like this without the full context, I'm more than
happy to take patches that resolve header inclusion issues even if the issue(s)
only become visible with additional changes.

I'll earmark this one for 6.8.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
  2023-11-08 12:40   ` Alexander Graf
@ 2023-11-08 16:15     ` Sean Christopherson
  2023-11-08 17:11       ` Alexander Graf
  0 siblings, 1 reply; 108+ messages in thread
From: Sean Christopherson @ 2023-11-08 16:15 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Nicolas Saenz Julienne, kvm, linux-kernel, linux-hyperv,
	pbonzini, vkuznets, anelkz, dwmw, jgowans, corbert, kys,
	haiyangz, decui, x86, linux-doc

On Wed, Nov 08, 2023, Alexander Graf wrote:
> 
> On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> > Save the length of the instruction that triggered an EPT violation in
> > struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory
> > intercept messages.
> > 
> > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> 
> 
> In v1, please do this for SVM as well :)

Why?  KVM caches values on VMX because VMREAD is measurably slower than memory
accesses, especially when running nested.  SVM has no such problems.  I wouldn't
be surprised if adding a "cache" is actually less performant due to increased
pressure and misses on the hardware cache.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 25/33] KVM: Introduce a set of new memory attributes
  2023-11-08 12:30   ` Alexander Graf
@ 2023-11-08 16:43     ` Sean Christopherson
  0 siblings, 0 replies; 108+ messages in thread
From: Sean Christopherson @ 2023-11-08 16:43 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Nicolas Saenz Julienne, kvm, linux-kernel, linux-hyperv,
	pbonzini, vkuznets, anelkz, dwmw, jgowans, corbert, kys,
	haiyangz, decui, x86, linux-doc

On Wed, Nov 08, 2023, Alexander Graf wrote:
> 
> On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> > Introduce the following memory attributes:
> >   - KVM_MEMORY_ATTRIBUTE_READ
> >   - KVM_MEMORY_ATTRIBUTE_WRITE
> >   - KVM_MEMORY_ATTRIBUTE_EXECUTE
> >   - KVM_MEMORY_ATTRIBUTE_NO_ACCESS
> > 
> > Note that NO_ACCESS is necessary in order to distinguish between the
> > lack of attributes for a gfn, which defaults to the memory protections
> > of the backing memory, and explicitly prohibiting any access to that
> > gfn.
> 
> 
> If we negate the attributes (no read, no write, no execute), we can keep 0
> == default and 0b111 becomes "no access".

Yes, I suggested this in the initial discussion[*].  I think it makes sense to
have the uAPI flags have positive polarity, i.e. as above, but internally we can
invert things so that the default 000b gives full RWX protections.  Or we could
make the push for a range-based xarray implementation so that storing 111b for
all gfns is super cheap.
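
As a sketch of that inversion (helper name made up): keep the positive
RWX polarity in the uAPI, but store "denied" bits in the xarray so that
a missing entry (0) naturally means full RWX:

	static unsigned long kvm_attrs_to_denied(unsigned long attrs)
	{
		/* Flip the allowed RWX bits into denied bits before storing. */
		return attrs ^ (KVM_MEMORY_ATTRIBUTE_READ |
				KVM_MEMORY_ATTRIBUTE_WRITE |
				KVM_MEMORY_ATTRIBUTE_EXECUTE);
	}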

Regardless of how KVM stores the information internally, there's no need for a
NO_ACCESS flag in the uAPI.

[*] https://lore.kernel.org/all/ZGfUqBLaO+cI9ypv@google.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
  2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
                   ` (33 preceding siblings ...)
  2023-11-08 11:40 ` [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Alexander Graf
@ 2023-11-08 16:55 ` Sean Christopherson
  2023-11-08 18:33   ` Sean Christopherson
  2023-11-10 19:04   ` Nicolas Saenz Julienne
  34 siblings, 2 replies; 108+ messages in thread
From: Sean Christopherson @ 2023-11-08 16:55 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> This RFC series introduces the necessary infrastructure to emulate VSM
> enabled guests. It is a snapshot of the progress we made so far, and its
> main goal is to gather design feedback.

Heh, then please provide an overview of the design, and ideally context and/or
justification for various design decisions.  It doesn't need to be a proper design
doc, and you can certainly point at other documentation for explaining VSM/VTLs,
but a few paragraphs and/or verbose bullet points would go a long way.

The documentation in patch 33 provides an explanation of VSM itself, and a little
insight into how userspace can utilize the KVM implementation.  But the documentation
provides no explanation of the mechanics that KVM *developers* care about, e.g.
the use of memory attributes, how memory attributes are enforced, whether or not
an in-kernel local APIC is required, etc.

Nor does the documentation explain *why*, e.g. why store a separate set of memory
attributes per VTL "device", which by the by is broken and unnecessary.

> Specifically on the KVM APIs we introduce. For a high level design overview,
> see the documentation in patch 33.
> 
> Additionally, this topic will be discussed as part of the KVM
> Micro-conference, in this year's Linux Plumbers Conference [2].
> 
> The series is accompanied by two repositories:
>  - A PoC QEMU implementation of VSM [3].
>  - VSM kvm-unit-tests [4].
> 
> Note that this isn't a full VSM implementation. For now it only supports
> 2 VTLs, and only runs on uniprocessor guests. It is capable of booting
> Windows Server 2016/2019, but is unstable during runtime.
> 
> The series is based on the v6.6 kernel release, and depends on the
> introduction of KVM memory attributes, which is being worked on
> independently in "KVM: guest_memfd() and per-page attributes" [5].

This doesn't actually apply on 6.6 with v14 of guest_memfd, because v14 of
guest_memfd is based on kvm-6.7-1.  Ah, and looking at your github repo, this
isn't based on v14 at all, it's based on v12.

That's totally fine, but the cover letter needs to explicitly, clearly, and
*accurately* state the dependencies.  I can obviously grab the full branch from
github, but that's not foolproof, e.g. if you accidentally delete or force push
to that branch.  And I also prefer to know that what I'm replying to on list is
the exact same code that I am looking at.

> A full Linux tree is also made available [6].
> 
> Series rundown:
>  - Patch 2 introduces the concept of APIC ID groups.
>  - Patches 3-12 introduce the VSM capability and basic VTL awareness into
>    Hyper-V emulation.
>  - Patch 13 introduces vCPU polling support.
>  - Patches 14-31 use KVM's memory attributes to implement VTL memory
>    protections. Introduces the VTL KMV device and secure memory
>    intercepts.
>  - Patch 32 is a temporary implementation of
>    HVCALL_TRANSLATE_VIRTUAL_ADDRESS necessary to boot Windows 2019.
>  - Patch 33 introduces documentation.
> 
> Our intention is to integrate feedback gathered in the RFC and LPC while
> we finish the VSM implementation. In the future, we will split the series
> into distinct feature patch sets and upstream these independently.
> 
> Thanks,
> Nicolas
> 
> [1] https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
> [2] https://lpc.events/event/17/sessions/166/#20231114
> [3] https://github.com/vianpl/qemu/tree/vsm-rfc-v1
> [4] https://github.com/vianpl/kvm-unit-tests/tree/vsm-rfc-v1
> [5] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonzini@redhat.com/.
> [6] Full tree: https://github.com/vianpl/linux/tree/vsm-rfc-v1. 

When providing github links, my preference is to format the pointers as:

  <repo> <branch>

or
  <repo> tags/<tag>

e.g.

  https://github.com/vianpl/linux vsm-rfc-v1

so that readers can copy+paste the full thing directly into `git fetch`.  It's a
minor thing, but AFAIK no one actually does review by clicking through github's
webview.

>     There are also two small dependencies with
>     https://marc.info/?l=kvm&m=167887543028109&w=2 and
>     https://lkml.org/lkml/2023/10/17/972

Please use lore links, there's zero reason to use anything else these days.  For
those of us that use b4, lore links make life much easier.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array
  2023-11-08 11:17 ` [RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array Nicolas Saenz Julienne
@ 2023-11-08 16:59   ` Sean Christopherson
  2023-11-28  7:41   ` Maxim Levitsky
  1 sibling, 0 replies; 108+ messages in thread
From: Sean Christopherson @ 2023-11-08 16:59 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 631fd532c97a..4242588e3dfb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2385,9 +2385,10 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>  }
>  
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> -static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +static inline unsigned long
> +kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn)

Do not wrap before the function name.  Linus has a nice explanation/rant on this[*].

[*] https://lore.kernel.org/all/CAHk-=wjoLAYG446ZNHfg=GhjSY6nFmuB_wA8fYd5iLBNXjo9Bw@mail.gmail.com
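
I.e. keep the name attached to the return type and wrap the parameter
list instead:

	static inline unsigned long kvm_get_memory_attributes(struct xarray *mem_attr_array,
							      gfn_t gfn)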

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 21/33] KVM: Pass memory attribute array as a MMU notifier argument
  2023-11-08 11:17 ` [RFC 21/33] KVM: Pass memory attribute array as a MMU notifier argument Nicolas Saenz Julienne
@ 2023-11-08 17:08   ` Sean Christopherson
  0 siblings, 0 replies; 108+ messages in thread
From: Sean Christopherson @ 2023-11-08 17:08 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> Pass the memory attribute array through struct kvm_mmu_notifier_arg and
> use it in kvm_arch_post_set_memory_attributes() instead of defaulting on
> kvm->mem_attr_array.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 8 ++++----
>  include/linux/kvm_host.h | 5 ++++-
>  virt/kvm/kvm_main.c      | 1 +
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c0fd3afd6be5..c2bec2be2ba9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7311,6 +7311,7 @@ static bool hugepage_has_attrs(struct xarray *mem_attr_array,
>  bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>  					 struct kvm_gfn_range *range)
>  {
> +	struct xarray *mem_attr_array = range->arg.mem_attr_array;
>  	unsigned long attrs = range->arg.attributes;
>  	struct kvm_memory_slot *slot = range->slot;
>  	int level;
> @@ -7344,8 +7345,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>  			 * misaligned address regardless of memory attributes.
>  			 */
>  			if (gfn >= slot->base_gfn) {
> -				if (hugepage_has_attrs(&kvm->mem_attr_array,
> -						       slot, gfn, level, attrs))
> +				if (hugepage_has_attrs(mem_attr_array, slot,
> +						       gfn, level, attrs))

This is wildly broken.  The hugepage tracking is per VM, whereas the attributes
here are per-VTL.  I.e. KVM will (dis)allow hugepages based on whatever VTL last
changed its protections.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
  2023-11-08 16:15     ` Sean Christopherson
@ 2023-11-08 17:11       ` Alexander Graf
  0 siblings, 0 replies; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 17:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Nicolas Saenz Julienne, kvm, linux-kernel, linux-hyperv,
	pbonzini, vkuznets, anelkz, dwmw, jgowans, corbert, kys,
	haiyangz, decui, x86, linux-doc


On 08.11.23 17:15, Sean Christopherson wrote:
>
> On Wed, Nov 08, 2023, Alexander Graf wrote:
>> On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
>>> Save the length of the instruction that triggered an EPT violation in
>>> struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory
>>> intercept messages.
>>>
>>> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
>>
>> In v1, please do this for SVM as well :)
> Why?  KVM caches values on VMX because VMREAD is measurably slower than memory
> accesses, especially when running nested.  SVM has no such problems.  I wouldn't
> be surprised if adding a "cache" is actually less performant due to increased
> pressure and misses on the hardware cache.


My understanding was that this patch wasn't about caching it; it was
about storing it somewhere generically so we can use it for the fault
injection code path in the following patch. And if we don't set this 
variable for SVM, it just means Credential Guard fault injection would 
be broken there.


Alex







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
  2023-11-08 11:18 ` [RFC 29/33] KVM: VMX: Save instruction length on EPT violation Nicolas Saenz Julienne
  2023-11-08 12:40   ` Alexander Graf
@ 2023-11-08 17:20   ` Sean Christopherson
  2023-11-08 17:27     ` Alexander Graf
  1 sibling, 1 reply; 108+ messages in thread
From: Sean Christopherson @ 2023-11-08 17:20 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> Save the length of the instruction that triggered an EPT violation in
> struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory
> intercept messages.

This is silly and unnecessarily obfuscates *why* (as my response regarding SVM
shows), i.e. that this is "needed" because the value is consumed by a *different*
vCPU, not because of performance concerns.

It's also broken; AFAICT nothing prevents the intercepted vCPU from hitting a
different EPT violation before the target vCPU consumes exit_instruction_len.

Holy cow.  All of deliver_gpa_intercept() is wildly unsafe.  Aside from race
conditions, which in and of themselves are a non-starter, nothing guarantees that
the intercepted vCPU actually cached all of the information that is held in its VMCS.

The sane way to do this is to snapshot *all* information on the intercepted vCPU,
and then hand that off as a payload to the target vCPU.  That is, assuming the
cross-vCPU stuff is actually necessary.  At a glance, I don't see anything that
explains *why*.
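
E.g. a self-contained payload, captured in one go on the intercepted
vCPU (the struct name and fields here are purely illustrative):

	struct hv_mem_intercept_payload {
		__u64 gpa;
		__u64 gva;
		__u32 access_type;		/* read/write/execute */
		__u8  instruction_len;
		__u8  instruction_bytes[16];
		/* plus whatever register state the SynIC message needs */
	};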

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 14/33] KVM: x86: Add VTL to the MMU role
  2023-11-08 11:17 ` [RFC 14/33] KVM: x86: Add VTL to the MMU role Nicolas Saenz Julienne
@ 2023-11-08 17:26   ` Sean Christopherson
  2023-11-10 18:52     ` Nicolas Saenz Julienne
  0 siblings, 1 reply; 108+ messages in thread
From: Sean Christopherson @ 2023-11-08 17:26 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> With the upcoming introduction of per-VTL memory protections, make MMU
> roles VTL aware. This will avoid sharing PTEs between vCPUs that belong
> to different VTLs, and that have distinct memory access restrictions.
> 
> Four bits are allocated to store the VTL number in the MMU role, since
> the TLFS states there is a maximum of 16 levels.

How many does KVM actually allow/support?  Multiplying the number of possible
roots by 16x is a *major* change.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
  2023-11-08 17:20   ` Sean Christopherson
@ 2023-11-08 17:27     ` Alexander Graf
  2023-11-08 18:19       ` Jim Mattson
  0 siblings, 1 reply; 108+ messages in thread
From: Alexander Graf @ 2023-11-08 17:27 UTC (permalink / raw)
  To: Sean Christopherson, Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc


On 08.11.23 18:20, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
>> Save the length of the instruction that triggered an EPT violation in
>> struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory
>> intercept messages.
> This is silly and unnecessarily obfuscates *why* (as my response regarding SVM
> shows), i.e. that this is "needed" because the value is consumed by a *different*
> vCPU, not because of performance concerns.
>
> It's also broken, AFAICT nothing prevents the intercepted vCPU from hitting a
> different EPT violation before the target vCPU consumes exit_instruction_len.
>
> Holy cow.  All of deliver_gpa_intercept() is wildly unsafe.  Aside from race
> conditions, which in and of themselves are a non-starter, nothing guarantees that
> the intercepted vCPU actually cached all of the information that is held in its VMCS.
>
> The sane way to do this is to snapshot *all* information on the intercepted vCPU,
> and then hand that off as a payload to the target vCPU.  That is, assuming the
> cross-vCPU stuff is actually necessary.  At a glance, I don't see anything that
> explains *why*.


Yup, I believe you repeated the comment I had on the function - and 
Nicolas already agreed :). This should go through user space, which 
automatically means you need to bubble up all necessary trap data to 
user space on the faulting vCPU and then inject the full set of data 
into the receiving one.

My point with the comment on this patch was "Don't break AMD (or ancient 
VMX without instruction length decoding [Does that exist? I know SVM has 
old CPUs that don't do it]) please".


Alex







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
  2023-11-08 11:17 ` [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS Nicolas Saenz Julienne
  2023-11-08 12:11   ` Alexander Graf
@ 2023-11-08 17:47   ` Sean Christopherson
  2023-11-10 18:46     ` Nicolas Saenz Julienne
  2023-11-28  6:56   ` Maxim Levitsky
  2 siblings, 1 reply; 108+ messages in thread
From: Sean Christopherson @ 2023-11-08 17:47 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Anel Orazgaliyeva

On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> From: Anel Orazgaliyeva <anelkz@amazon.de>
> 
> Introduce KVM_CAP_APIC_ID_GROUPS; this capability segments the VM's APIC
> ids into two parts. The lower bits, the physical APIC id, represent the part
> that's exposed to the guest. The higher bits, which are private to KVM,
> group APICs together. APICs in different groups are isolated from each
> other, and IPIs can only be directed at APICs that share the same group
> as their source. Furthermore, groups are only relevant to IPIs; anything
> incoming from outside the local APIC complex (from the IOAPIC, MSIs, or
> PV-IPIs) is targeted at the default APIC group, group 0.
> 
> When routing IPIs with physical destinations, KVM will OR the source's
> vCPU APIC group with the ICR's destination ID and use that to resolve
> the target lAPIC.

Is all of the above arbitrary KVM behavior or defined by the TLFS?

> The APIC physical map is also made group aware in
> order to speed up this process. For the sake of simplicity, the logical
> map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI
> routing to the slower per-vCPU scan method.

Why?  I mean, I kinda sorta understand what it does for VSM, but it's not at all
obvious why this information needs to be shoved into the APIC IDs.  E.g. why not
have an explicit group_id and then maintain separate optimization maps for each?

> This capability serves as a building block to implement virtualisation
> based security features like Hyper-V's Virtual Secure Mode (VSM). VSM
> introduces a para-virtualised switch that allows for guest CPUs to jump
> into a different execution context, this switches into a different CPU
> state, lAPIC state, and memory protections. We model this in KVM by

Who is "we"?  As a general rule, avoid pronouns.  "we" and "us" in particular
should never show up in a changelog.  I genuinely don't know if "we" means
userspace or KVM, and the distinction matters because it clarifies whether or
not KVM is actively involved in the modeling versus KVM being little more than a
dumb pipe to provide the plumbing.

> using distinct kvm_vcpus for each context.
>
> Moreover, execution contexts are hierarchical and their APICs are meant to
> remain functional even when the context isn't 'scheduled in'.

Please explain the relationship and rules of execution contexts.  E.g. are
execution contexts the same thing as VTLs?  Do all "real" vCPUs belong to every
execution context?  If so, is that a requirement?

> For example, we have to keep track of
> timers' expirations, and interrupt execution of lesser priority contexts
> when relevant. Hence the need to alias physical APIC ids, while keeping
> the ability to target specific execution contexts.
> 
> Signed-off-by: Anel Orazgaliyeva <anelkz@amazon.de>
> Co-developed-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---


> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index e1021517cf04..542bd208e52b 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -97,6 +97,8 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
>  void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
>  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
> +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
> +				    struct kvm_apic_id_groups *groups);
>  void kvm_recalculate_apic_map(struct kvm *kvm);
>  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
>  void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu);
> @@ -277,4 +279,35 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
>  	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
>  }
>  
> +static inline u32 kvm_apic_id(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->vcpu_id & ~vcpu->kvm->arch.apic_id_group_mask;

This is *extremely* misleading.  KVM forces the x2APIC ID to match vcpu_id, but
in xAPIC mode the ID is fully writable.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
  2023-11-08 17:27     ` Alexander Graf
@ 2023-11-08 18:19       ` Jim Mattson
  0 siblings, 0 replies; 108+ messages in thread
From: Jim Mattson @ 2023-11-08 18:19 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Sean Christopherson, Nicolas Saenz Julienne, kvm, linux-kernel,
	linux-hyperv, pbonzini, vkuznets, anelkz, dwmw, jgowans, corbert,
	kys, haiyangz, decui, x86, linux-doc

On Wed, Nov 8, 2023 at 9:27 AM Alexander Graf <graf@amazon.com> wrote:

> My point with the comment on this patch was "Don't break AMD (or ancient
> VMX without instruction length decoding [Does that exist? I know SVM has
> old CPUs that don't do it]) please".

VM-exit instruction length is not defined for all VM-exit reasons (EPT
misconfiguration is one that is notably absent), but the field has
been there since Prescott.
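
I.e. anything consuming the field wants a guard along these lines
(sketch, not actual KVM code):

	/* Only trust the VM-exit instruction length where the SDM defines
	 * it; this sketch only distinguishes the two EPT exit reasons. */
	static u8 ept_exit_insn_len(u32 exit_reason)
	{
		switch (exit_reason) {
		case EXIT_REASON_EPT_VIOLATION:
			return vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
		case EXIT_REASON_EPT_MISCONFIG:
		default:
			return 0;	/* undefined for this exit */
		}
	}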

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
  2023-11-08 16:55 ` Sean Christopherson
@ 2023-11-08 18:33   ` Sean Christopherson
  2023-11-10 17:56     ` Nicolas Saenz Julienne
  2023-11-10 19:04   ` Nicolas Saenz Julienne
  1 sibling, 1 reply; 108+ messages in thread
From: Sean Christopherson @ 2023-11-08 18:33 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, Nov 08, 2023, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > This RFC series introduces the necessary infrastructure to emulate VSM
> > enabled guests. It is a snapshot of the progress we made so far, and its
> > main goal is to gather design feedback.
> 
> Heh, then please provide an overview of the design, and ideally context and/or
> justification for various design decisions.  It doesn't need to be a proper design
> doc, and you can certainly point at other documentation for explaining VSM/VTLs,
> but a few paragraphs and/or verbose bullet points would go a long way.
> 
> The documentation in patch 33 provides an explanation of VSM itself, and a little
> insight into how userspace can utilize the KVM implementation.  But the documentation
> provides no explanation of the mechanics that KVM *developers* care about, e.g.
> the use of memory attributes, how memory attributes are enforced, whether or not
> an in-kernel local APIC is required, etc.
> 
> Nor does the documentation explain *why*, e.g. why store a separate set of memory
> attributes per VTL "device", which by the by is broken and unnecessary.

After speed reading the series...  An overview of the design, why you made certain
choices, and the tradeoffs between various options is definitely needed.

A few questions off the top of my head:

 - What is the split between userspace and KVM?  How did you arrive at that split?

 - How much *needs* to be in KVM?  I.e. how much can be pushed to userspace while
   maintaining good performance?
   
 - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
   in a VTL device and APIC ID groups, why not modify "struct kvm" to support
   replicating state that needs to be tracked per-VTL?  Because of how memory
   attributes affect hugepages, duplicating *memslots* might actually be easier
   than teaching memslots to be VTL-aware.

 - Is "struct kvm_vcpu" the best representation of an execution context (if I'm
   getting the terminology right)?  E.g. if 90% of the state is guaranteed to be
   identical for a given vCPU across execution contexts, then modeling that with
   separate kvm_vcpu structures is very inefficient.  I highly doubt it's 90%,
   but it might be quite high depending on how much the TLFS restricts the state
   of the vCPU, e.g. if it's 64-bit only.

The more info you can provide before LPC, the better, e.g. so that we can spend
time discussing options instead of you getting peppered with questions about the
requirements and whatnot.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
  2023-11-08 18:33   ` Sean Christopherson
@ 2023-11-10 17:56     ` Nicolas Saenz Julienne
  2023-11-10 19:32       ` Sean Christopherson
  0 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-10 17:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

Hi Sean,
Thanks for taking the time to review the series. I took note of your
comments across the series, and will incorporate them into the LPC
discussion.

On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Sean Christopherson wrote:
> > On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > > This RFC series introduces the necessary infrastructure to emulate VSM
> > > enabled guests. It is a snapshot of the progress we made so far, and its
> > > main goal is to gather design feedback.
> >
> > Heh, then please provide an overview of the design, and ideally context and/or
> > justification for various design decisions.  It doesn't need to be a proper design
> > doc, and you can certainly point at other documentation for explaining VSM/VTLs,
> > but a few paragraphs and/or verbose bullet points would go a long way.
> >
> > The documentation in patch 33 provides an explanation of VSM itself, and a little
> > insight into how userspace can utilize the KVM implementation.  But the documentation
> > provides no explanation of the mechanics that KVM *developers* care about, e.g.
> > the use of memory attributes, how memory attributes are enforced, whether or not
> > an in-kernel local APIC is required, etc.
> >
> > Nor does the documentation explain *why*, e.g. why store a separate set of memory
> > attributes per VTL "device", which by the by is broken and unnecessary.
>
> After speed reading the series...  An overview of the design, why you made certain
> choices, and the tradeoffs between various options is definitely needed.
>
> A few questions off the top of my head:
>
>  - What is the split between userspace and KVM?  How did you arrive at that split?

Our original design, which we discussed in the KVM forum 2023 [1] and is
public [2], implemented most of VSM in-kernel. Notably we introduced VTL
awareness in struct kvm_vcpu. This turned out to be way too complex: now
vcpus have multiple CPU architectural states, events, apics, mmu, etc.
First of all, the code turned out to be very intrusive, for example,
most apic APIs had to be reworked one way or another to accommodate
the fact there are multiple apics available. Also, we were forced to
introduce VSM-specific semantics in x86 emulation code. But more
importantly, the biggest pain has been dealing with state changes, they
may be triggered remotely through requests, and some are already fairly
delicate as-is. They involve a multitude of corner cases that almost
never apply for a VTL aware kvm_vcpu. Especially if you factor in
features like live migration. It's been a continuous source of
regressions.

Memory protections were implemented by using memory slot modifications.
We introduced a downstream API that allows updating memory slots
concurrently with vCPUs running. I think there was a similar proposal
upstream from Red Hat some time ago. The result is complex, hard to
generalize and slow.

So we decided to move all this complexity outside of struct kvm_vcpu
and, as much as possible, out of the kernel. We figured out the basic
kernel building blocks that are absolutely necessary, and let user-space
deal with the rest.

>  - How much *needs* to be in KVM?  I.e. how much can be pushed to userspace while
>    maintaining good performance?

As I said above, the aim of the current design is to keep it as light as
possible. The biggest move we made was moving VTL switching into
user-space. We don't see any indication that performance is affected in
a major way. But we will know for sure once we finish the implementation
and test it under real use-cases.

>  - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
>    in a VTL device and APIC ID groups, why not modify "struct kvm" to support
>    replicating state that needs to be tracked per-VTL?  Because of how memory
>    attributes affect hugepages, duplicating *memslots* might actually be easier
>    than teaching memslots to be VTL-aware.

I do agree that we need to introduce some level of VTL awareness into
memslots. There's the hugepages issue you pointed out. But it'll also
be necessary once we look at how to implement overlay pages that are
per-VTL. (A topic I didn't mention in the series as I thought I had
managed to solve memory protections while avoiding the need for multiple
slots.) What I have in mind is introducing a memory slot address space
per-VTL, similar to how we do things with SMM.
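
Roughly, assuming VTL0/1 only and kvm_hv_get_active_vtl() (or whatever
accessor the final series ends up with):

	/* Sketch: derive the memslot address space id from the vCPU's
	 * active VTL, mirroring how SMM uses address space 1 today. */
	static int kvm_arch_vcpu_memslots_id_sketch(struct kvm_vcpu *vcpu)
	{
		/* as_id 0 == VTL0, as_id 1 == VTL1 */
		return kvm_hv_get_active_vtl(vcpu);
	}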

It's important to note that requirements for overlay pages and memory
protections are very different. Overlay pages are scarce, and are setup
once and never change (AFAICT), so we think stopping all vCPUs, updating
slots, and resuming execution will provide good enough performance.
Memory protections happen very frequently, generally with page
granularity, and may be short-lived.

>  - Is "struct kvm_vcpu" the best representation of an execution context (if I'm
>    getting the terminology right)?

Let's forget I ever mentioned execution contexts. I used it in the hopes
of making the VTL concept a little more understandable for non-VSM aware
people. It's meant to be interchangeable with VTL. But I see how it
creates confusion.

>    E.g. if 90% of the state is guaranteed to be identical for a given
>    vCPU across execution contexts, then modeling that with separate
>    kvm_vcpu structures is very inefficient.  I highly doubt it's 90%,
> >    but it might be quite high depending on how much the TLFS restricts
>    the state of the vCPU, e.g. if it's 64-bit only.

For the record here's the private VTL state (TLFS 15.11.1):

"In general, each VTL has its own control registers, RIP register, RSP
 register, and MSRs:

 SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP, STAR, LSTAR, CSTAR, SFMASK,
 EFER, PAT, KERNEL_GSBASE, FS.BASE, GS.BASE, TSC_AUX
 HV_X64_MSR_HYPERCALL
 HV_X64_MSR_GUEST_OS_ID
 HV_X64_MSR_REFERENCE_TSC
 HV_X64_MSR_APIC_FREQUENCY
 HV_X64_MSR_EOI
 HV_X64_MSR_ICR
 HV_X64_MSR_TPR
 HV_X64_MSR_APIC_ASSIST_PAGE
 HV_X64_MSR_NPIEP_CONFIG
 HV_X64_MSR_SIRBP
 HV_X64_MSR_SCONTROL
 HV_X64_MSR_SVERSION
 HV_X64_MSR_SIEFP
 HV_X64_MSR_SIMP
 HV_X64_MSR_EOM
 HV_X64_MSR_SINT0 – HV_X64_MSR_SINT15
 HV_X64_MSR_STIMER0_COUNT – HV_X64_MSR_STIMER3_COUNT
 Local APIC registers (including CR8/TPR)
"

The rest is shared.
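
Condensed into a struct, the private portion would look roughly like the
below (purely illustrative, field list abridged):

	struct hv_vtl_private_state {
		u64 rip, rsp;
		u64 cr0, cr3, cr4;			/* control registers */
		u64 sysenter_cs, sysenter_esp, sysenter_eip;
		u64 star, lstar, cstar, sfmask, efer, pat;
		u64 kernel_gs_base, fs_base, gs_base, tsc_aux;
		u64 hv_hypercall, hv_guest_os_id, hv_reference_tsc;
		u64 hv_sint[16], hv_stimer_count[4];
		struct kvm_lapic_state lapic;		/* incl. CR8/TPR */
	};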

Note that we've observed that, during normal operation, VTL switches
don't happen that often. The boot process is the most affected by any
performance impact VSM might introduce, since it issues hypercalls by
the 100000s (mostly memory protections).

Nicolas

[1] https://kvm-forum.qemu.org/2023/talk/TK7YGD/
[2] Partial rebase of our original implementation:
    https://github.com/vianpl/linux vsm

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
  2023-11-08 17:47   ` Sean Christopherson
@ 2023-11-10 18:46     ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-10 18:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Anel Orazgaliyeva

On Wed Nov 8, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > From: Anel Orazgaliyeva <anelkz@amazon.de>
> >
> > Introduce KVM_CAP_APIC_ID_GROUPS; this capability segments the VM's APIC
> > ids into two parts. The lower bits, the physical APIC id, represent the part
> > that's exposed to the guest. The higher bits, which are private to KVM,
> > group APICs together. APICs in different groups are isolated from each
> > other, and IPIs can only be directed at APICs that share the same group
> > as their source. Furthermore, groups are only relevant to IPIs; anything
> > incoming from outside the local APIC complex (from the IOAPIC, MSIs, or
> > PV-IPIs) is targeted at the default APIC group, group 0.
> >
> > When routing IPIs with physical destinations, KVM will OR the source's
> > vCPU APIC group with the ICR's destination ID and use that to resolve
> > the target lAPIC.
>
> Is all of the above arbitrary KVM behavior or defined by the TLFS?

All of this matches VSM's expectations of how interrupts are to be
handled. But APIC groups are a concept we created with the aim of
generalizing the behaviour as much as possible.

> > The APIC physical map is also made group aware in
> > order to speed up this process. For the sake of simplicity, the logical
> > map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI
> > routing to the slower per-vCPU scan method.
>
> Why?  I mean, I kinda sorta understand what it does for VSM, but it's not at all
> obvious why this information needs to be shoved into the APIC IDs.  E.g. why not
> have an explicit group_id and then maintain separate optimization maps for each?

There are three tricks to APIC groups. One is IPI routing: for example,
the ICR physical destination is mixed with the source vCPU's APIC group
to find the destination vCPU. Another is presenting a coherent APIC ID
across VTLs; switching VTLs shouldn't change the guest's view of the
APIC ID. And ultimately the group keeps track of the vCPU's VTL. I don't
see why we couldn't decouple them, as long as we keep filtering the APIC
ID before it reaches the guest.
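
To make the first trick concrete, here's a worked example using the
series' helpers, assuming KVM_SET_APIC_ID_GROUPS was called with
n_bits = 4 (so apic_id_group_shift == 28 and the mask is GENMASK(31, 28)):

	u32 dest_id = 2;	/* ICR physical destination written by the guest */

	/* The source vCPU sits in group 1 (say, VTL1): */
	kvm_apic_id_set_group(kvm, 1, &dest_id);
	/* dest_id is now 0x10000002, so the IPI can only resolve to a vCPU
	 * whose vcpu_id carries the same group bits, i.e. another VTL1 vCPU. */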

> > This capability serves as a building block to implement virtualisation
> > based security features like Hyper-V's Virtual Secure Mode (VSM). VSM
> > introduces a para-virtualised switch that allows for guest CPUs to jump
> > into a different execution context, this switches into a different CPU
> > state, lAPIC state, and memory protections. We model this in KVM by
>
> Who is "we"?  As a general rule, avoid pronouns.  "we" and "us" in particular
> should never show up in a changelog.  I genuinely don't know if "we" means
> userspace or KVM, and the distinction matters because it clarifies whether or
> not KVM is actively involved in the modeling versus KVM being little more than a
> dumb pipe to provide the plumbing.

Sorry, I've been actively trying to avoid pronouns since you already
mentioned it on a previous review. This one slipped through the cracks.

> > using distinct kvm_vcpus for each context.
> >
> > Moreover, execution contexts are hierarchical and their APICs are meant to
> > remain functional even when the context isn't 'scheduled in'.
>
> Please explain the relationship and rules of execution contexts.  E.g. are
> execution contexts the same thing as VTLs?  Do all "real" vCPUs belong to every
> execution context?  If so, is that a requirement?

I left a note about this in my reply to your questions in the cover
letter.

> > For example, we have to keep track of
> > timers' expirations, and interrupt execution of lesser priority contexts
> > when relevant. Hence the need to alias physical APIC ids, while keeping
> > the ability to target specific execution contexts.
> >
> > Signed-off-by: Anel Orazgaliyeva <anelkz@amazon.de>
> > Co-developed-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> > ---
>
>
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index e1021517cf04..542bd208e52b 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -97,6 +97,8 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
> >  void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
> >  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
> >  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
> > +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
> > +                                 struct kvm_apic_id_groups *groups);
> >  void kvm_recalculate_apic_map(struct kvm *kvm);
> >  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
> >  void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu);
> > @@ -277,4 +279,35 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
> >       return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
> >  }
> >
> > +static inline u32 kvm_apic_id(struct kvm_vcpu *vcpu)
> > +{
> > +     return vcpu->vcpu_id & ~vcpu->kvm->arch.apic_id_group_mask;
>
> This is *extremely* misleading.  KVM forces the x2APIC ID to match vcpu_id, but
> in xAPIC mode the ID is fully writable.

Yes, although I'm under the impression that no sane OS will do so. We
can decouple the group from the APIC ID, but it still needs to be masked
before presenting it to the guest. So I guess we'll have to deal with
the eventuality of APIC ID writes one way or another (a warn only if VSM
is enabled?).

If we decide the APIC group uAPI is not worth it, we can always create
an ad-hoc VSM one that explicitly sets the kvm_vcpu's VTL. Then route
the VTL internally into the APIC, which can still use groups (or a
similar concept).

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 14/33] KVM: x86: Add VTL to the MMU role
  2023-11-08 17:26   ` Sean Christopherson
@ 2023-11-10 18:52     ` Nicolas Saenz Julienne
  2023-11-28  7:34       ` Maxim Levitsky
  0 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-10 18:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, kys, haiyangz, decui, x86, linux-doc

On Wed Nov 8, 2023 at 5:26 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > With the upcoming introduction of per-VTL memory protections, make MMU
> > roles VTL aware. This will avoid sharing PTEs between vCPUs that belong
> > to different VTLs, and that have distinct memory access restrictions.
> >
> > Four bits are allocated to store the VTL number in the MMU role, since
> > the TLFS states there is a maximum of 16 levels.
>
> How many does KVM actually allow/support?  Multiplying the number of possible
> roots by 16x is a *major* change.

AFAIK in practice only VTL0/1 are used. Don't know if Microsoft will
come up with more in the future. We could introduce a CAP that exposes
the number of supported VTLs to user-space, and leave it as a compile
option.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
  2023-11-08 16:55 ` Sean Christopherson
  2023-11-08 18:33   ` Sean Christopherson
@ 2023-11-10 19:04   ` Nicolas Saenz Julienne
  1 sibling, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-10 19:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed Nov 8, 2023 at 4:55 PM UTC, Sean Christopherson wrote:
> > This RFC series introduces the necessary infrastructure to emulate VSM
> > enabled guests. It is a snapshot of the progress we made so far, and its
> > main goal is to gather design feedback.
>
> Heh, then please provide an overview of the design, and ideally context and/or
> justification for various design decisions.  It doesn't need to be a proper design
> doc, and you can certainly point at other documentation for explaining VSM/VTLs,
> but a few paragraphs and/or verbose bullet points would go a long way.
>
> The documentation in patch 33 provides an explanation of VSM itself, and a little
> insight into how userspace can utilize the KVM implementation.  But the documentation
> provides no explanation of the mechanics that KVM *developers* care about, e.g.
> the use of memory attributes, how memory attributes are enforced, whether or not
> an in-kernel local APIC is required, etc.

Noted, I'll provide a design review on the next submission.

> Nor does the documentation explain *why*, e.g. why store a separate set of memory
> attributes per VTL "device", which by the by is broken and unnecessary.

It's clear to me how the current implementation of VTL devices is
broken. But unnecessary? That made me think: we could inject the VTL in
the memory attribute key, for example with 'gfn | vtl << 58', and then
use the generic API and a single xarray.
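
Something along these lines, purely as a sketch of the key encoding (bit
58 as in the example above, helper names made up):

	#define VSM_ATTR_VTL_SHIFT	58

	static inline u64 vsm_attr_key(u64 gfn, u8 vtl)
	{
		return gfn | ((u64)vtl << VSM_ATTR_VTL_SHIFT);
	}

	static inline u64 vsm_attr_key_to_gfn(u64 key)
	{
		return key & GENMASK_ULL(VSM_ATTR_VTL_SHIFT - 1, 0);
	}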

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
  2023-11-10 17:56     ` Nicolas Saenz Julienne
@ 2023-11-10 19:32       ` Sean Christopherson
  2023-11-11 11:55         ` Nicolas Saenz Julienne
  0 siblings, 1 reply; 108+ messages in thread
From: Sean Christopherson @ 2023-11-10 19:32 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Fri, Nov 10, 2023, Nicolas Saenz Julienne wrote:
> On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> >  - What is the split between userspace and KVM?  How did you arrive at that split?
> 
> Our original design, which we discussed in the KVM forum 2023 [1] and is
> public [2], implemented most of VSM in-kernel. Notably we introduced VTL
> awareness in struct kvm_vcpu.

...

> So we decided to move all this complexity outside of struct kvm_vcpu
> and, as much as possible, out of the kernel. We figured out the basic
> kernel building blocks that are absolutely necessary, and let user-space
> deal with the rest.

Sorry, I should have been more clear.  What's the split in terms of responsibilities,
i.e. what will KVM's ABI look like?  E.g. if the vCPU=>VTLs setup is nonsensical,
does KVM care?

My general preference is for KVM to be as permissive as possible, i.e. let
userspace do whatever it wants so long as it doesn't place undue burden on KVM.
But at the same time I don't want to end up in a similar boat as many of the paravirt
features, where things just stop working if userspace or the guest makes a goof.

> >  - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
> >    in a VTL device and APIC ID groups, why not modify "struct kvm" to support
> >    replicating state that needs to be tracked per-VTL?  Because of how memory
> >    attributes affect hugepages, duplicating *memslots* might actually be easier
> >    than teaching memslots to be VTL-aware.
> 
> I do agree that we need to introduce some level of VTL awareness into
> memslots. There's the hugepages issue you pointed out. But it'll also
> be necessary once we look at how to implement overlay pages that are
> per-VTL. (A topic I didn't mention in the series as I thought I had
> managed to solve memory protections while avoiding the need for multiple
> slots.) What I have in mind is introducing a memory slot address space
> per-VTL, similar to how we do things with SMM.

Noooooooo (I hate memslot address spaces :-) )

Why not represent each VTL with a separate "struct kvm" instance?  That would
naturally provide per-VTL behavior for:

  - APIC groups
  - memslot overlays
  - memory attributes (and their impact on hugepages)
  - MMU pages
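
From userspace that could look as simple as the below sketch (existing
ioctls only, error handling elided); the key point is that one host
mapping backs the guest memory of every per-VTL VM:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* mem/mem_size: one host mapping backing the guest for all VTLs. */
	static void create_vtl_vms(void *mem, __u64 mem_size,
				   int *vtl0_vm, int *vtl1_vm)
	{
		int kvm = open("/dev/kvm", O_RDWR);
		struct kvm_userspace_memory_region region = {
			.slot = 0,
			.guest_phys_addr = 0,
			.memory_size = mem_size,
			.userspace_addr = (unsigned long)mem,
		};

		*vtl0_vm = ioctl(kvm, KVM_CREATE_VM, 0);
		*vtl1_vm = ioctl(kvm, KVM_CREATE_VM, 0);

		/* Same memory in both VMs; memory attributes, overlays and
		 * hugepage decisions then become per-VTL for free. */
		ioctl(*vtl0_vm, KVM_SET_USER_MEMORY_REGION, &region);
		ioctl(*vtl1_vm, KVM_SET_USER_MEMORY_REGION, &region);
	}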

The only (obvious) issue with that approach would be cross-VTL operations.  IIUC,
sending IPIs across VTLs isn't allowed, but even if it were, that should be easy
enough to solve, e.g. KVM already supports posting interrupts from non-KVM sources.

GVA=>GPA translation would be trickier, but that patch suggests you want to handle
that in userspace anyways.  And if translation is a rare/slow path, maybe it could
simply be punted to userspace?

  NOTE: The hypercall implementation is incomplete and only shared for
  completeness. Additionally, we'd like to move the VTL-aware parts to
  user-space.

Ewww, and looking at what it would take to support cross-VM translations shows
another problem with using vCPUs to model VTLs.  Nothing prevents userspace from
running a virtual CPU at multiple VTLs concurrently, which means that anything
that uses kvm_hv_get_vtl_vcpu() is unsafe, e.g. walk_mmu->gva_to_gpa() could be
modified while kvm_hv_xlate_va_walk() is running.

I suppose that's not too hard to solve, e.g. mutex_trylock() and bail if something
holds the other kvm_vcpu/VTL's mutex.  Though ideally, KVM would punt all cross-VTL
operations to userspace.  :-)

If punting to userspace isn't feasible, using a struct kvm per VTL probably wouldn't
make the locking and concurrency problems meaningfully easier or harder to solve.
E.g. KVM could require VTLs, i.e. "struct kvm" instances that are part of a single
virtual machine, to belong to the same process.  That'd avoid headaches with
mm_struct, at which point I don't _think_ getting and using a kvm_vcpu from a
different kvm would need special handling?

Heh, another fun one, the VTL handling in kvm_hv_send_ipi() is wildly broken, the
in_vtl field is consumed before send_ipi is read from userspace.

	union hv_input_vtl *in_vtl;
	u64 valid_bank_mask;
	u32 vector;
	bool all_cpus;
	u8 vtl;

	/* VTL is at the same offset on both IPI types */
	in_vtl = &send_ipi.in_vtl;
	vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl : kvm_hv_get_active_vtl(vcpu);
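
The fix is presumably just ordering, i.e. something like this sketch
(types and helpers as in the RFC's snippet; send_ipi is the union the
parameters land in):

	/* Read the hypercall parameters from the guest *first*... */
	if (kvm_read_guest(vcpu->kvm, hc->ingpa, &send_ipi, sizeof(send_ipi)))
		return HV_STATUS_INVALID_HYPERCALL_INPUT;

	/* ...and only then consume the VTL field they carry. */
	in_vtl = &send_ipi.in_vtl;
	vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl :
				       kvm_hv_get_active_vtl(vcpu);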

> >    E.g. if 90% of the state is guaranteed to be identical for a given
> >    vCPU across execution contexts, then modeling that with separate
> >    kvm_vcpu structures is very inefficient.  I highly doubt it's 90%,
> >    but it might be quite high depending on how much the TLFS restricts
> >    the state of the vCPU, e.g. if it's 64-bit only.
> 
> For the record here's the private VTL state (TLFS 15.11.1):
> 
> "In general, each VTL has its own control registers, RIP register, RSP
>  register, and MSRs:
> 
>  SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP, STAR, LSTAR, CSTAR, SFMASK,
>  EFER, PAT, KERNEL_GSBASE, FS.BASE, GS.BASE, TSC_AUX
>  HV_X64_MSR_HYPERCALL
>  HV_X64_MSR_GUEST_OS_ID
>  HV_X64_MSR_REFERENCE_TSC
>  HV_X64_MSR_APIC_FREQUENCY
>  HV_X64_MSR_EOI
>  HV_X64_MSR_ICR
>  HV_X64_MSR_TPR
>  HV_X64_MSR_APIC_ASSIST_PAGE
>  HV_X64_MSR_NPIEP_CONFIG
>  HV_X64_MSR_SIRBP
>  HV_X64_MSR_SCONTROL
>  HV_X64_MSR_SVERSION
>  HV_X64_MSR_SIEFP
>  HV_X64_MSR_SIMP
>  HV_X64_MSR_EOM
>  HV_X64_MSR_SINT0 – HV_X64_MSR_SINT15
>  HV_X64_MSR_STIMER0_COUNT – HV_X64_MSR_STIMER3_COUNT
>  Local APIC registers (including CR8/TPR)

Ugh, the APIC state is quite the killer.  And I gotta image things like CET and
FRED are only going to increase that list.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
  2023-11-10 19:32       ` Sean Christopherson
@ 2023-11-11 11:55         ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-11-11 11:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Fri Nov 10, 2023 at 7:32 PM UTC, Sean Christopherson wrote:
> On Fri, Nov 10, 2023, Nicolas Saenz Julienne wrote:
> > On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> > >  - What is the split between userspace and KVM?  How did you arrive at that split?
> >
> > Our original design, which we discussed in the KVM forum 2023 [1] and is
> > public [2], implemented most of VSM in-kernel. Notably we introduced VTL
> > awareness in struct kvm_vcpu.
>
> ...
>
> > So we decided to move all this complexity outside of struct kvm_vcpu
> > and, as much as possible, out of the kernel. We figured out the basic
> > kernel building blocks that are absolutely necessary, and let user-space
> > deal with the rest.
>
> Sorry, I should have been more clear.  What's the split in terms of responsibilities,
> i.e. what will KVM's ABI look like?  E.g. if the vCPU=>VTLs setup is nonsensical,
> does KVM care?
>
> My general preference is for KVM to be as permissive as possible, i.e. let
> userspace do whatever it wants so long as it doesn't place undue burden on KVM.
> But at the same time I don't want to end up in a similar boat as many of the paravirt
> features, where things just stop working if userspace or the guest makes a goof.

I'll make sure to formalize this for whenever I post a full series; I
need to go over every hcall and think from that perspective.

There are some rules it might make sense to enforce. But it really
depends on the abstractions we settle on. KVM might not have the
necessary introspection to enforce them. IMO ideally it wouldn't, VTLs
should remain a user-space concept. My approach so far has been trusting
QEMU is doing the right thing.

Some high level examples come to mind:
 - Only one VTL's vCPU may run at any given time.
 - Privileged VTL interrupts have precedence over lower VTL execution.
 - lAPICs can only access their VTL. (Cross VTL IPIs happen through the
   PV interface).
 - Lower VTL state should be up to date when accessed from privileged
   VTLs (through the GET/SET_VP_REGISTER hcall).

> > >  - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
> > >    in a VTL device and APIC ID groups, why not modify "struct kvm" to support
> > >    replicating state that needs to be tracked per-VTL?  Because of how memory
> > >    attributes affect hugepages, duplicating *memslots* might actually be easier
> > >    than teaching memslots to be VTL-aware.
> >
> > I do agree that we need to introduce some level of VTL awareness into
> > memslots. There's the hugepages issue you pointed out. But it'll also
> > be necessary once we look at how to implement overlay pages that are
> > per-VTL. (A topic I didn't mention in the series as I thought I had
> > managed to solve memory protections while avoiding the need for multiple
> > slots.) What I have in mind is introducing a memory slot address space
> > per-VTL, similar to how we do things with SMM.
>
> Noooooooo (I hate memslot address spaces :-) )
>
> Why not represent each VTL with a separate "struct kvm" instance?  That would
> naturally provide per-VTL behavior for:
>
>   - APIC groups
>   - memslot overlays
>   - memory attributes (and their impact on hugepages)
>   - MMU pages

Very interesting idea! I'll spend some time researching it; it sure
solves a lot of issues.

> The only (obvious) issue with that approach would be cross-VTL operations.  IIUC,
> sending IPIs across VTLs isn't allowed, but even if it were, that should be easy
> enough to solve, e.g. KVM already supports posting interrupts from non-KVM sources.

Correct. Only through kvm_hv_send_ipi(), but from experience it happens
very rarely, so performance shouldn't be critical.

> GVA=>GPA translation would be trickier, but that patch suggests you want to handle
> that in userspace anyways.  And if translation is a rare/slow path, maybe it could
> simply be punted to userspace?
>
>   NOTE: The hypercall implementation is incomplete and only shared for
>   completion. Additionally we'd like to move the VTL aware parts to
>   user-space.
>
> Ewww, and looking at what it would take to support cross-VM translations shows
> another problem with using vCPUs to model VTLs.  Nothing prevents userspace from
> running a virtual CPU at multiple VTLs concurrently, which means that anything
> that uses kvm_hv_get_vtl_vcpu() is unsafe, e.g. walk_mmu->gva_to_gpa() could be
> modified while kvm_hv_xlate_va_walk() is running.
>
> I suppose that's not too hard to solve, e.g. mutex_trylock() and bail if something
> holds the other kvm_vcpu/VTL's mutex.  Though ideally, KVM would punt all cross-VTL
> operations to userspace.  :-)
>
> If punting to userspace isn't feasible, using a struct kvm per VTL probably wouldn't
> make the locking and concurrency problems meaningfully easier or harder to solve.
> E.g. KVM could require VTLs, i.e. "struct kvm" instances that are part of a single
> virtual machine, to belong to the same process.  That'd avoid headaches with
> mm_struct, at which point I don't _think_ getting and using a kvm_vcpu from a
> different kvm would need special handling?

I'll look into it.

> Heh, another fun one, the VTL handling in kvm_hv_send_ipi() is wildly broken, the
> in_vtl field is consumed before send_ipi is read from userspace.

ugh, that's a tired last minute "cleanup" that went south... It's been
working as intended for a while otherwise. I'll implement a
kvm-unit-test to redeem myself. :)

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
  2023-11-08 11:17 ` [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS Nicolas Saenz Julienne
  2023-11-08 12:11   ` Alexander Graf
  2023-11-08 17:47   ` Sean Christopherson
@ 2023-11-28  6:56   ` Maxim Levitsky
  2023-12-01 15:25     ` Nicolas Saenz Julienne
  2 siblings, 1 reply; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  6:56 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Anel Orazgaliyeva

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> From: Anel Orazgaliyeva <anelkz@amazon.de>
> 
> Introduce KVM_CAP_APIC_ID_GROUPS; this capability segments the VM's APIC
> ids into two parts. The lower bits, the physical APIC id, represent the part
> that's exposed to the guest. The higher bits, which are private to KVM,
> group APICs together. APICs in different groups are isolated from each
> other, and IPIs can only be directed at APICs that share the same group
> as their source. Furthermore, groups are only relevant to IPIs; anything
> incoming from outside the local APIC complex (from the IOAPIC, MSIs, or
> PV-IPIs) is targeted at the default APIC group, group 0.
> 
> When routing IPIs with physical destinations, KVM will OR the source's
> vCPU APIC group with the ICR's destination ID and use that to resolve
> the target lAPIC. The APIC physical map is also made group aware in
> order to speed up this process. For the sake of simplicity, the logical
> map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI
> routing to the slower per-vCPU scan method.
> 
> This capability serves as a building block to implement virtualisation
> based security features like Hyper-V's Virtual Secure Mode (VSM). VSM
> introduces a para-virtualised switch that allows for guest CPUs to jump
> into a different execution context, this switches into a different CPU
> state, lAPIC state, and memory protections. We model this in KVM by
> using distinct kvm_vcpus for each context. Moreover, execution contexts
> are hierarchical and their APICs are meant to remain functional even when
> the context isn't 'scheduled in'. For example, we have to keep track of
> timers' expirations, and interrupt execution of lesser priority contexts
> when relevant. Hence the need to alias physical APIC ids, while keeping
> the ability to target specific execution contexts.


A few general remarks on this patch (assuming that we don't go with
the approach of a VM per VTL, in which case this patch is not needed)

-> This feature has to be done in the kernel because vCPUs sharing the same
   VTL will have the same APIC ID.
   (In addition to that, APIC state is private to a VTL, so each VTL
   can even change its APIC ID.)

   Because of this, KVM has to have at least some awareness of it.

-> APICv/AVIC should eventually be supported with VTLs:
   This is thankfully possible by having separate physid/pid tables per VTL,
   and will mostly just work, but needs KVM awareness.

-> I am somewhat against reserving bits in the APIC ID, because that will limit
   the number of APIC ID bits available to userspace. Currently this is not
   a problem, but it might be in the future if for some reason userspace
   wants an APIC ID with high bits set.

   Still, things change, and with this being part of KVM's ABI, it might backfire.
   A better idea IMHO is just to have 'APIC namespaces' which, much like PID
   namespaces, keep each namespace isolated IPI-wise, and to let each vCPU
   belong to one namespace.

   In fact, Intel's PRM has a brief mention of a 'hierarchical cluster' mode which
   roughly describes this situation: there are multiple APIC buses that are not
   interconnected, and communication between them needs a 'cluster manager device'.

   However, I don't think that we need explicit pairing of vCPUs and VTL awareness
   in the kernel; all of this, I think, can be done in userspace.

   TL;DR: Let's have APIC namespaces. A vCPU can belong to a single namespace, and
   all vCPUs in a namespace send IPIs to each other and know nothing about vCPUs
   from other namespaces.

   A vCPU sending an IPI to a different VTL can thankfully only do so using a
   hypercall, and thus it can be handled in userspace.
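
   A minimal sketch of what I mean (everything here is hypothetical):

	/* Each vCPU carries an opaque namespace id (e.g. set once by
	 * userspace, say to the VTL); IPI delivery requires matching ids
	 * and steals no bits from the APIC ID itself. */
	struct kvm_apic_ns {
		u32 ns_id;
	};

	static inline bool kvm_apic_ns_match(const struct kvm_apic_ns *src,
					     const struct kvm_apic_ns *dst)
	{
		return src->ns_id == dst->ns_id;
	}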


Overall though, IMHO the approach of a VM per VTL is better unless some show-stoppers show up.
If we go with a VM per VTL, we gain APIC namespaces for free, together with AVIC support and
such.

Best regards,
	Maxim Levitsky


> 
> Signed-off-by: Anel Orazgaliyeva <anelkz@amazon.de>
> Co-developed-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 ++
>  arch/x86/include/uapi/asm/kvm.h |  5 +++
>  arch/x86/kvm/lapic.c            | 59 ++++++++++++++++++++++++++++-----
>  arch/x86/kvm/lapic.h            | 33 ++++++++++++++++++
>  arch/x86/kvm/x86.c              | 15 +++++++++
>  include/uapi/linux/kvm.h        |  2 ++
>  6 files changed, 108 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index dff10051e9b6..a2f224f95404 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1298,6 +1298,9 @@ struct kvm_arch {
>  	struct rw_semaphore apicv_update_lock;
>  	unsigned long apicv_inhibit_reasons;
>  
> +	u32 apic_id_group_mask;
> +	u8 apic_id_group_shift;
> +
>  	gpa_t wall_clock;
>  
>  	bool mwait_in_guest;
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index a448d0964fc0..f73d137784d7 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -565,4 +565,9 @@ struct kvm_pmu_event_filter {
>  #define KVM_X86_DEFAULT_VM	0
>  #define KVM_X86_SW_PROTECTED_VM	1
>  
> +/* for KVM_SET_APIC_ID_GROUPS */
> +struct kvm_apic_id_groups {
> +	__u8 n_bits; /* nr of bits used to represent group in the APIC ID */
> +};
> +
>  #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 3e977dbbf993..f55d216cb2a0 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -141,7 +141,7 @@ static inline int apic_enabled(struct kvm_lapic *apic)
>  
>  static inline u32 kvm_x2apic_id(struct kvm_lapic *apic)
>  {
> -	return apic->vcpu->vcpu_id;
> +	return kvm_apic_id(apic->vcpu);
>  }
>  
>  static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu)
> @@ -219,8 +219,8 @@ static int kvm_recalculate_phys_map(struct kvm_apic_map *new,
>  				    bool *xapic_id_mismatch)
>  {
>  	struct kvm_lapic *apic = vcpu->arch.apic;
> -	u32 x2apic_id = kvm_x2apic_id(apic);
> -	u32 xapic_id = kvm_xapic_id(apic);
> +	u32 x2apic_id = kvm_apic_id_and_group(vcpu);
> +	u32 xapic_id = kvm_apic_id_and_group(vcpu);
>  	u32 physical_id;
>  
>  	/*
> @@ -299,6 +299,13 @@ static void kvm_recalculate_logical_map(struct kvm_apic_map *new,
>  	u16 mask;
>  	u32 ldr;
>  
> +	/*
> +	 * Using maps for logical destinations when KVM_CAP_APIC_ID_GROUPS is in
> +	 * use isn't supported.
> +	 */
> +	if (kvm_apic_group(vcpu))
> +		new->logical_mode = KVM_APIC_MODE_MAP_DISABLED;
> +
>  	if (new->logical_mode == KVM_APIC_MODE_MAP_DISABLED)
>  		return;
>  
> @@ -370,6 +377,25 @@ enum {
>  	DIRTY
>  };
>  
> +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
> +				    struct kvm_apic_id_groups *groups)
> +{
> +	u8 n_bits = groups->n_bits;
> +
> +	if (n_bits > 32)
> +		return -EINVAL;
> +
> +	kvm->arch.apic_id_group_mask = n_bits ? GENMASK(31, 32 - n_bits): 0;
> +	/*
> +	 * Bitshifts >= than the width of the type are UD, so set the
> +	 * apic group shift to 0 when n_bits == 0. The group mask above will
> +	 * clear the APIC ID, so group querying functions will return the
> +	 * correct value.
> +	 */
> +	kvm->arch.apic_id_group_shift = n_bits ? 32 - n_bits : 0;
> +	return 0;
> +}
> +
>  void kvm_recalculate_apic_map(struct kvm *kvm)
>  {
>  	struct kvm_apic_map *new, *old = NULL;
> @@ -414,7 +440,7 @@ void kvm_recalculate_apic_map(struct kvm *kvm)
>  
>  	kvm_for_each_vcpu(i, vcpu, kvm)
>  		if (kvm_apic_present(vcpu))
> -			max_id = max(max_id, kvm_x2apic_id(vcpu->arch.apic));
> +			max_id = max(max_id, kvm_apic_id_and_group(vcpu));
>  
>  	new = kvzalloc(sizeof(struct kvm_apic_map) +
>  	                   sizeof(struct kvm_lapic *) * ((u64)max_id + 1),
> @@ -525,7 +551,7 @@ static inline void kvm_apic_set_x2apic_id(struct kvm_lapic *apic, u32 id)
>  {
>  	u32 ldr = kvm_apic_calc_x2apic_ldr(id);
>  
> -	WARN_ON_ONCE(id != apic->vcpu->vcpu_id);
> +	WARN_ON_ONCE(id != kvm_apic_id(apic->vcpu));
>  
>  	kvm_lapic_set_reg(apic, APIC_ID, id);
>  	kvm_lapic_set_reg(apic, APIC_LDR, ldr);
> @@ -1067,6 +1093,17 @@ bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
>  	struct kvm_lapic *target = vcpu->arch.apic;
>  	u32 mda = kvm_apic_mda(vcpu, dest, source, target);
>  
> +	/*
> +	 * Make sure vCPUs belong to the same APIC group, it's not possible
> +	 * to send interrupts across groups.
> +	 *
> +	 * Non-IPIs and PV-IPIs can only be injected into the default APIC
> +	 * group (group 0).
> +	 */
> +	if ((source && !kvm_match_apic_group(source->vcpu, vcpu)) ||
> +	    kvm_apic_group(vcpu))
> +		return false;
> +
>  	ASSERT(target);
>  	switch (shorthand) {
>  	case APIC_DEST_NOSHORT:
> @@ -1518,6 +1555,10 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
>  	else
>  		irq.dest_id = GET_XAPIC_DEST_FIELD(icr_high);
>  
> +	if (irq.dest_mode == APIC_DEST_PHYSICAL)
> +		kvm_apic_id_set_group(apic->vcpu->kvm,
> +				      kvm_apic_group(apic->vcpu), &irq.dest_id);
> +
>  	trace_kvm_apic_ipi(icr_low, irq.dest_id);
>  
>  	kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
> @@ -2541,7 +2582,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>  	/* update jump label if enable bit changes */
>  	if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE) {
>  		if (value & MSR_IA32_APICBASE_ENABLE) {
> -			kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
> +			kvm_apic_set_xapic_id(apic, kvm_apic_id(vcpu));
>  			static_branch_slow_dec_deferred(&apic_hw_disabled);
>  			/* Check if there are APF page ready requests pending */
>  			kvm_make_request(KVM_REQ_APF_READY, vcpu);
> @@ -2553,9 +2594,9 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>  
>  	if ((old_value ^ value) & X2APIC_ENABLE) {
>  		if (value & X2APIC_ENABLE)
> -			kvm_apic_set_x2apic_id(apic, vcpu->vcpu_id);
> +			kvm_apic_set_x2apic_id(apic, kvm_apic_id(vcpu));
>  		else if (value & MSR_IA32_APICBASE_ENABLE)
> -			kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
> +			kvm_apic_set_xapic_id(apic, kvm_apic_id(vcpu));
>  	}
>  
>  	if ((old_value ^ value) & (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE)) {
> @@ -2685,7 +2726,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
>  
>  	/* The xAPIC ID is set at RESET even if the APIC was already enabled. */
>  	if (!init_event)
> -		kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
> +		kvm_apic_set_xapic_id(apic, kvm_apic_id(vcpu));
>  	kvm_apic_set_version(apic->vcpu);
>  
>  	for (i = 0; i < apic->nr_lvt_entries; i++)
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index e1021517cf04..542bd208e52b 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -97,6 +97,8 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
>  void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
>  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
> +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
> +				    struct kvm_apic_id_groups *groups);
>  void kvm_recalculate_apic_map(struct kvm *kvm);
>  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
>  void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu);
> @@ -277,4 +279,35 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
>  	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
>  }
>  
> +static inline u32 kvm_apic_id(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->vcpu_id & ~vcpu->kvm->arch.apic_id_group_mask;
> +}
> +
> +static inline u32 kvm_apic_id_and_group(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->vcpu_id;
> +}
> +
> +static inline u32 kvm_apic_group(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +
> +	return (vcpu->vcpu_id & kvm->arch.apic_id_group_mask) >>
> +	       kvm->arch.apic_id_group_shift;
> +}
> +
> +static inline void kvm_apic_id_set_group(struct kvm *kvm, u32 group,
> +					 u32 *apic_id)
> +{
> +	*apic_id |= ((group << kvm->arch.apic_id_group_shift) &
> +		     kvm->arch.apic_id_group_mask);
> +}
> +
> +static inline bool kvm_match_apic_group(struct kvm_vcpu *src,
> +					struct kvm_vcpu *dst)
> +{
> +	return kvm_apic_group(src) == kvm_apic_group(dst);
> +}
> +
>  #endif
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e3eb608b6692..4cd3f00475c1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4526,6 +4526,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
>  	case KVM_CAP_IRQFD_RESAMPLE:
>  	case KVM_CAP_MEMORY_FAULT_INFO:
> +	case KVM_CAP_APIC_ID_GROUPS:
>  		r = 1;
>  		break;
>  	case KVM_CAP_EXIT_HYPERCALL:
> @@ -7112,6 +7113,20 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  		r = kvm_vm_ioctl_set_msr_filter(kvm, &filter);
>  		break;
>  	}
> +	case KVM_SET_APIC_ID_GROUPS: {
> +		struct kvm_apic_id_groups groups;
> +
> +		r = -EINVAL;
> +		if (kvm->created_vcpus)
> +			goto out;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&groups, argp, sizeof(groups)))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_set_apic_id_groups(kvm, &groups);
> +		break;
> +	}
>  	default:
>  		r = -ENOTTY;
>  	}
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 5b5820d19e71..d7a01766bf21 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1219,6 +1219,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_MEMORY_ATTRIBUTES 232
>  #define KVM_CAP_GUEST_MEMFD 233
>  #define KVM_CAP_VM_TYPES 234
> +#define KVM_CAP_APIC_ID_GROUPS 235
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -2307,4 +2308,5 @@ struct kvm_create_guest_memfd {
>  
>  #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
>  
> +#define KVM_SET_APIC_ID_GROUPS _IOW(KVMIO, 0xd7, struct kvm_apic_id_groups)
>  #endif /* __LINUX_KVM_H */





^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support
  2023-11-08 12:16       ` Alexander Graf
@ 2023-11-28  6:57         ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  6:57 UTC (permalink / raw)
  To: Alexander Graf, Vitaly Kuznetsov, Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, anelkz, dwmw,
	jgowans, corbert, kys, haiyangz, decui, x86, linux-doc

On Wed, 2023-11-08 at 13:16 +0100, Alexander Graf wrote:
> On 08.11.23 13:11, Vitaly Kuznetsov wrote:
> > Alexander Graf <graf@amazon.com> writes:
> > 
> > > On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> > > > Prepare infrastructure to be able to return data through the XMM
> > > > registers when Hyper-V hypercalls are issues in fast mode. The XMM
> > > > registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and
> > > > restored on successful hypercall completion.
> > > > 
> > > > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> > > > ---
> > > >    arch/x86/include/asm/hyperv-tlfs.h |  2 +-
> > > >    arch/x86/kvm/hyperv.c              | 33 +++++++++++++++++++++++++++++-
> > > >    include/uapi/linux/kvm.h           |  6 ++++++
> > > >    3 files changed, 39 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> > > > index 2ff26f53cd62..af594aa65307 100644
> > > > --- a/arch/x86/include/asm/hyperv-tlfs.h
> > > > +++ b/arch/x86/include/asm/hyperv-tlfs.h
> > > > @@ -49,7 +49,7 @@
> > > >    /* Support for physical CPU dynamic partitioning events is available*/
> > > >    #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE  BIT(3)
> > > >    /*
> > > > - * Support for passing hypercall input parameter block via XMM
> > > > + * Support for passing hypercall input and output parameter block via XMM
> > > >     * registers is available
> > > >     */
> > > >    #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE               BIT(4)
> > > > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > > > index 238afd7335e4..e1bc861ab3b0 100644
> > > > --- a/arch/x86/kvm/hyperv.c
> > > > +++ b/arch/x86/kvm/hyperv.c
> > > > @@ -1815,6 +1815,7 @@ struct kvm_hv_hcall {
> > > >       u16 rep_idx;
> > > >       bool fast;
> > > >       bool rep;
> > > > +    bool xmm_dirty;
> > > >       sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS];
> > > > 
> > > >       /*
> > > > @@ -2346,9 +2347,33 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
> > > >       return ret;
> > > >    }
> > > > 
> > > > +static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
> > > > +{
> > > > +    int reg;
> > > > +
> > > > +    kvm_fpu_get();
> > > > +    for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) {
> > > > +            const sse128_t data = sse128(xmm[reg].low, xmm[reg].high);
> > > > +            _kvm_write_sse_reg(reg, &data);
> > > > +    }
> > > > +    kvm_fpu_put();
> > > > +}
> > > > +
> > > > +static bool kvm_hv_is_xmm_output_hcall(u16 code)
> > > > +{
> > > > +    return false;
> > > > +}
> > > > +
> > > >    static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
> > > >    {
> > > > -    return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result);
> > > > +    bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT);
> > > > +    u16 code = vcpu->run->hyperv.u.hcall.input & 0xffff;
> > > > +    u64 result = vcpu->run->hyperv.u.hcall.result;
> > > > +
> > > > +    if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast)
> > > > +            kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
> > > > +
> > > > +    return kvm_hv_hypercall_complete(vcpu, result);
> > > >    }
> > > > 
> > > >    static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
> > > > @@ -2623,6 +2648,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
> > > >               break;
> > > >       }
> > > > 
> > > > +    if ((ret & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS && hc.xmm_dirty)
> > > > +            kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg*)hc.xmm);
> > > > +
> > > >    hypercall_complete:
> > > >       return kvm_hv_hypercall_complete(vcpu, ret);
> > > > 
> > > > @@ -2632,6 +2660,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
> > > >       vcpu->run->hyperv.u.hcall.input = hc.param;
> > > >       vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa;
> > > >       vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa;
> > > > +    if (hc.fast)
> > > > +            memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm));
> > > >       vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace;
> > > >       return 0;
> > > >    }
> > > > @@ -2780,6 +2810,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
> > > >                       ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
> > > > 
> > > >                       ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
> > > > +                    ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
> > > 
> > > Shouldn't this be guarded by an ENABLE_CAP to make sure old user space
> > > that doesn't know about xmm outputs is still able to run with newer kernels?
> > > 
> > No, we don't do CAPs for new Hyper-V features anymore since we have
> > KVM_GET_SUPPORTED_HV_CPUID. Userspace is not supposed to simply copy
> > its output into guest visible CPUIDs, it must only enable features it
> > knows. Even 'hv_passthrough' option in QEMU doesn't pass unknown
> > features through.
> 
> Ah, nice :). That simplifies things.
> 
> 
> Alex


Besides other remarks I think that this patch is reasonable,
and maybe it can be queued before the main VSM series,
assuming that it comes with a unit test to avoid having
dead code in the kernel.

Best regards,
	Maxim Levitsky

> 
> 
> 
> 
> 
> 





^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 04/33] KVM: x86: hyper-v: Move hypercall page handling into separate function
  2023-11-08 11:17 ` [RFC 04/33] KVM: x86: hyper-v: Move hypercall page handling into separate function Nicolas Saenz Julienne
@ 2023-11-28  7:01   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:01 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> The hypercall page patching is about to grow considerably, move it into
> its own function.
> 
> No functional change intended.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/hyperv.c | 69 ++++++++++++++++++++++++-------------------
>  1 file changed, 39 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index e1bc861ab3b0..78d053042667 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -256,6 +256,42 @@ static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr)
>  	kvm_make_request(KVM_REQ_HV_EXIT, vcpu);
>  }
>  
> +static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u8 instructions[9];
> +	int i = 0;
> +	u64 addr;
> +
> +	/*
> +	 * If Xen and Hyper-V hypercalls are both enabled, disambiguate
> +	 * the same way Xen itself does, by setting the bit 31 of EAX
> +	 * which is RsvdZ in the 32-bit Hyper-V hypercall ABI and just
> +	 * going to be clobbered on 64-bit.
> +	 */
> +	if (kvm_xen_hypercall_enabled(kvm)) {
> +		/* orl $0x80000000, %eax */
> +		instructions[i++] = 0x0d;
> +		instructions[i++] = 0x00;
> +		instructions[i++] = 0x00;
> +		instructions[i++] = 0x00;
> +		instructions[i++] = 0x80;
> +	}
> +
> +	/* vmcall/vmmcall */
> +	static_call(kvm_x86_patch_hypercall)(vcpu, instructions + i);
> +	i += 3;
> +
> +	/* ret */
> +	((unsigned char *)instructions)[i++] = 0xc3;
> +
> +	addr = data & HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK;
> +	if (kvm_vcpu_write_guest(vcpu, addr, instructions, i))
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static int synic_set_msr(struct kvm_vcpu_hv_synic *synic,
>  			 u32 msr, u64 data, bool host)
>  {
> @@ -1338,11 +1374,7 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
>  		if (!hv->hv_guest_os_id)
>  			hv->hv_hypercall &= ~HV_X64_MSR_HYPERCALL_ENABLE;
>  		break;
> -	case HV_X64_MSR_HYPERCALL: {
> -		u8 instructions[9];
> -		int i = 0;
> -		u64 addr;
> -
> +	case HV_X64_MSR_HYPERCALL:
>  		/* if guest os id is not set hypercall should remain disabled */
>  		if (!hv->hv_guest_os_id)
>  			break;
> @@ -1351,34 +1383,11 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
>  			break;
>  		}
>  
> -		/*
> -		 * If Xen and Hyper-V hypercalls are both enabled, disambiguate
> -		 * the same way Xen itself does, by setting the bit 31 of EAX
> -		 * which is RsvdZ in the 32-bit Hyper-V hypercall ABI and just
> -		 * going to be clobbered on 64-bit.
> -		 */
> -		if (kvm_xen_hypercall_enabled(kvm)) {
> -			/* orl $0x80000000, %eax */
> -			instructions[i++] = 0x0d;
> -			instructions[i++] = 0x00;
> -			instructions[i++] = 0x00;
> -			instructions[i++] = 0x00;
> -			instructions[i++] = 0x80;
> -		}
> -
> -		/* vmcall/vmmcall */
> -		static_call(kvm_x86_patch_hypercall)(vcpu, instructions + i);
> -		i += 3;
> -
> -		/* ret */
> -		((unsigned char *)instructions)[i++] = 0xc3;
> -
> -		addr = data & HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK;
> -		if (kvm_vcpu_write_guest(vcpu, addr, instructions, i))
> +		if (patch_hypercall_page(vcpu, data))
>  			return 1;
> +
>  		hv->hv_hypercall = data;
>  		break;
> -	}
>  	case HV_X64_MSR_REFERENCE_TSC:
>  		hv->hv_tsc_page = data;
>  		if (hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE) {


This looks like another nice cleanup that can be accepted to the kvm,
before the main VTL patch series.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-11-08 11:17 ` [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page Nicolas Saenz Julienne
  2023-11-08 11:53   ` Alexander Graf
@ 2023-11-28  7:08   ` Maxim Levitsky
  2023-11-28 16:33     ` Sean Christopherson
  2023-12-01 16:19     ` Nicolas Saenz Julienne
  1 sibling, 2 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:08 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> VTL call/return hypercalls have their own entry points in the hypercall
> page because they don't follow normal hyper-v hypercall conventions.
> Move the VTL call/return control input into ECX/RAX and set the
> hypercall code into EAX/RCX before calling the hypercall instruction in
> order to be able to use the Hyper-V hypercall entry function.
> 
> Guests can read an emulated code page offsets register to know the
> offsets into the hypercall page for the VTL call/return entries.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> 
> ---
> 
> My tree has the additional patch; we're still trying to understand under
> what conditions Windows expects the offset to be fixed.
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 54f7f36a89bf..9f2ea8c34447 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -294,6 +294,7 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
> 
>         /* VTL call/return entries */
>         if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) {
> +               i = 22;
>  #ifdef CONFIG_X86_64
>                 if (is_64_bit_mode(vcpu)) {
>                         /*
> ---
>  arch/x86/include/asm/kvm_host.h   |  2 +
>  arch/x86/kvm/hyperv.c             | 78 ++++++++++++++++++++++++++++++-
>  include/asm-generic/hyperv-tlfs.h | 11 +++++
>  3 files changed, 90 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a2f224f95404..00cd21b09f8c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1105,6 +1105,8 @@ struct kvm_hv {
>  	u64 hv_tsc_emulation_status;
>  	u64 hv_invtsc_control;
>  
> +	union hv_register_vsm_code_page_offsets vsm_code_page_offsets;
> +
>  	/* How many vCPUs have VP index != vCPU index */
>  	atomic_t num_mismatched_vp_indexes;
>  
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 78d053042667..d4b1b53ea63d 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -259,7 +259,8 @@ static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr)
>  static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
>  {
>  	struct kvm *kvm = vcpu->kvm;
> -	u8 instructions[9];
> +	struct kvm_hv *hv = to_kvm_hv(kvm);
> +	u8 instructions[0x30];
>  	int i = 0;
>  	u64 addr;
>  
> @@ -285,6 +286,81 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
>  	/* ret */
>  	((unsigned char *)instructions)[i++] = 0xc3;
>  
> +	/* VTL call/return entries */
> +	if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) {
> +#ifdef CONFIG_X86_64
> +		if (is_64_bit_mode(vcpu)) {
> +			/*
> +			 * VTL call 64-bit entry prologue:
> +			 * 	mov %rcx, %rax
> +			 * 	mov $0x11, %ecx
> +			 * 	jmp 0:

This isn't really 'jmp 0' as I first wondered, but a backward jump of 32 bytes (if I did
the calculation correctly). This is fragile, because the code that precedes the jump can
change; in fact I don't think this offset is even correct now, and on top of that it
depends on whether Xen hypercall support is enabled.

This can be fixed by calculating the jump offset at runtime instead of hardcoding it.
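
For illustration, a minimal sketch of the runtime calculation, assuming the main
hypercall entry stays at offset 0 of the page and the distance always fits in a rel8:

	instructions[i] = 0xeb;				/* jmp rel8 */
	instructions[i + 1] = (u8)(0 - (i + 2));	/* 0 = page start, i + 2 = next insn */
	i += 2;

With i == 30 (22 plus the two mov instructions) this yields the hardcoded 0xe0 (-32)
byte from the patch, but it stays correct when the preceding code changes.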


That said, since userspace will have to be aware of the offsets in this page, and since
pretty much everything else is done in userspace, it might make more sense to create
the hypercall page in userspace in the first place.

In fact, KVM overwriting the guest page as it does today is itself a violation of
the HV spec.

Regardless of VTL, it's more correct to do a userspace VM exit and let userspace put a
memslot ("overlay") over the address, containing whatever userspace wants, including the
above code.
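
Installing the overlay is just a regular memslot update on the GPA taken from the MSR
write; a sketch (the slot number and the prepared code page are the VMM's own):

	struct kvm_userspace_memory_region region = {
		.slot = HYPERCALL_OVERLAY_SLOT,		/* VMM-chosen slot id */
		.guest_phys_addr = gpa & ~0xfffull,	/* from the MSR write */
		.memory_size = 0x1000,
		.userspace_addr = (__u64)code_page,	/* page built by the VMM */
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);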

Then we won't need the new ioctl either.

To support this I think we can add a userspace MSR filter on HV_X64_MSR_HYPERCALL,
although I am not 100% sure whether a userspace MSR filter overrides the in-kernel
MSR handling.
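
Roughly what I have in mind on the VMM side; a sketch, assuming the filter does take
precedence over KVM's in-kernel Hyper-V MSR handling (which is exactly the open
question above):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	#define HV_X64_MSR_HYPERCALL 0x40000001

	static int trap_hv_hypercall_msr(int vm_fd)
	{
		/* Route filtered MSR accesses to userspace instead of raising #GP. */
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_X86_USER_SPACE_MSR,
			.args = { KVM_MSR_EXIT_REASON_FILTER },
		};
		__u8 bitmap = 0;	/* cleared bit == deny == exit to userspace */
		struct kvm_msr_filter filter = {
			.flags = KVM_MSR_FILTER_DEFAULT_ALLOW,
			.ranges[0] = {
				.flags = KVM_MSR_FILTER_WRITE,
				.base = HV_X64_MSR_HYPERCALL,
				.nmsrs = 1,
				.bitmap = &bitmap,
			},
		};

		if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
			return -1;
		return ioctl(vm_fd, KVM_X86_SET_MSR_FILTER, &filter);
	}

The VMM would then patch the page (or install the overlay) from its
KVM_EXIT_X86_WRMSR handler.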

Best regards,
	Maxim Levitsky


> +			 */
> +			hv->vsm_code_page_offsets.vtl_call_offset = i;
> +			instructions[i++] = 0x48;
> +			instructions[i++] = 0x89;
> +			instructions[i++] = 0xc8;
> +			instructions[i++] = 0xb9;
> +			instructions[i++] = 0x11;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0xeb;
> +			instructions[i++] = 0xe0;
> +			/*
> +			 * VTL return 64-bit entry prologue:
> +			 * 	mov %rcx, %rax
> +			 * 	mov $0x12, %ecx
> +			 * 	jmp 0:
> +			 */
> +			hv->vsm_code_page_offsets.vtl_return_offset = i;
> +			instructions[i++] = 0x48;
> +			instructions[i++] = 0x89;
> +			instructions[i++] = 0xc8;
> +			instructions[i++] = 0xb9;
> +			instructions[i++] = 0x12;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0xeb;
> +			instructions[i++] = 0xd6;
> +		} else
> +#endif
> +		{
> +			/*
> +			 * VTL call 32-bit entry prologue:
> +			 * 	mov %eax, %ecx
> +			 * 	mov $0x11, %eax
> +			 * 	jmp 0:
> +			 */
> +			hv->vsm_code_page_offsets.vtl_call_offset = i;
> +			instructions[i++] = 0x89;
> +			instructions[i++] = 0xc1;
> +			instructions[i++] = 0xb8;
> +			instructions[i++] = 0x11;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0xeb;
> +			instructions[i++] = 0xf3;
> +			/*
> +			 * VTL return 32-bit entry prologue:
> +			 * 	mov %eax, %ecx
> +			 * 	mov $0x12, %eax
> +			 * 	jmp 0:
> +			 */
> +			hv->vsm_code_page_offsets.vtl_return_offset = i;
> +			instructions[i++] = 0x89;
> +			instructions[i++] = 0xc1;
> +			instructions[i++] = 0xb8;
> +			instructions[i++] = 0x12;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0x00;
> +			instructions[i++] = 0xeb;
> +			instructions[i++] = 0xea;
> +		}
> +	}
>  	addr = data & HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK;
>  	if (kvm_vcpu_write_guest(vcpu, addr, instructions, i))
>  		return 1;
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index fdac4a1714ec..0e7643c1ef01 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -823,4 +823,15 @@ struct hv_mmio_write_input {
>  	u8 data[HV_HYPERCALL_MMIO_MAX_DATA_LENGTH];
>  } __packed;
>  
> +/*
> + * VTL call/return hypercall page offsets register
> + */
> +union hv_register_vsm_code_page_offsets {
> +	u64 as_u64;
> +	struct {
> +		u64 vtl_call_offset:12;
> +		u64 vtl_return_offset:12;
> +		u64 reserved:40;
> +	} __packed;
> +};
>  #endif






^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 06/33] KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs
  2023-11-08 11:17 ` [RFC 06/33] KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs Nicolas Saenz Julienne
@ 2023-11-28  7:14   ` Maxim Levitsky
  2023-12-01 16:31     ` Nicolas Saenz Julienne
  0 siblings, 1 reply; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:14 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> HVCALL_SEND_IPI and HVCALL_SEND_IPI_EX allow targeting a specific VTL.
> Honour the requests.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/hyperv.c             | 24 +++++++++++++++++-------
>  arch/x86/kvm/trace.h              | 20 ++++++++++++--------
>  include/asm-generic/hyperv-tlfs.h |  6 ++++--
>  3 files changed, 33 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index d4b1b53ea63d..2cf430f6ddd8 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -2230,7 +2230,7 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>  }
>  
>  static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
> -				    u64 *sparse_banks, u64 valid_bank_mask)
> +				    u64 *sparse_banks, u64 valid_bank_mask, int vtl)
>  {
>  	struct kvm_lapic_irq irq = {
>  		.delivery_mode = APIC_DM_FIXED,
> @@ -2245,6 +2245,9 @@ static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
>  					    valid_bank_mask, sparse_banks))
>  			continue;
>  
> +		if (kvm_hv_get_active_vtl(vcpu) != vtl)
> +			continue;

Do I understand correctly that this is a temporary limitation?
In other words, can a vCPU running in VTL1 send an IPI to a vCPU running VTL0,
forcing the target vCPU to asynchronously switch to VTL1?
I think that this is possible.


If we go with my suggestion to use APIC namespaces and/or one VM per VTL,
then I imagine it like this:

In-kernel Hyper-V IPI emulation works as it does currently as long as the IPI's target
VTL matches the VTL assigned to the vCPU; if it doesn't, it should result in a userspace
VM exit.
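
Roughly, in the HVCALL_SEND_IPI{,_EX} path (a sketch; kvm_hv_get_assigned_vtl() is a
made-up name for whatever mechanism ends up recording the userspace-assigned VTL):

	if (target_vtl != kvm_hv_get_assigned_vtl(vcpu))
		/* Cross-VTL IPI: complete the hypercall in userspace. */
		goto hypercall_userspace_exit;

	/* Same-VTL IPI: keep the existing in-kernel fast path. */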

I do think that we will need KVM to know a vCPU's VTL anyway; however, we might get away
with (or, with the multiple-VMs approach, have to rely on) an explicit mapping between
all the KVM vCPUs that emulate a single guest CPU.

Best regards,
	Maxim Levitsky


> +
>  		/* We fail only when APIC is disabled */
>  		kvm_apic_set_irq(vcpu, &irq, NULL);
>  	}
> @@ -2257,13 +2260,19 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>  	struct kvm *kvm = vcpu->kvm;
>  	struct hv_send_ipi_ex send_ipi_ex;
>  	struct hv_send_ipi send_ipi;
> +	union hv_input_vtl *in_vtl;
>  	u64 valid_bank_mask;
>  	u32 vector;
>  	bool all_cpus;
> +	u8 vtl;
> +
> +	/* VTL is at the same offset on both IPI types */
> +	in_vtl = &send_ipi.in_vtl;
> +	vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl : kvm_hv_get_active_vtl(vcpu);
>  
>  	if (hc->code == HVCALL_SEND_IPI) {
>  		if (!hc->fast) {
> -			if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi,
> +			if (unlikely(kvm_vcpu_read_guest(vcpu, hc->ingpa, &send_ipi,
>  						    sizeof(send_ipi))))
>  				return HV_STATUS_INVALID_HYPERCALL_INPUT;
>  			sparse_banks[0] = send_ipi.cpu_mask;
> @@ -2278,10 +2287,10 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>  		all_cpus = false;
>  		valid_bank_mask = BIT_ULL(0);
>  
> -		trace_kvm_hv_send_ipi(vector, sparse_banks[0]);
> +		trace_kvm_hv_send_ipi(vector, sparse_banks[0], vtl);
>  	} else {
>  		if (!hc->fast) {
> -			if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi_ex,
> +			if (unlikely(kvm_vcpu_read_guest(vcpu, hc->ingpa, &send_ipi_ex,
>  						    sizeof(send_ipi_ex))))
>  				return HV_STATUS_INVALID_HYPERCALL_INPUT;
>  		} else {
> @@ -2292,7 +2301,8 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>  
>  		trace_kvm_hv_send_ipi_ex(send_ipi_ex.vector,
>  					 send_ipi_ex.vp_set.format,
> -					 send_ipi_ex.vp_set.valid_bank_mask);
> +					 send_ipi_ex.vp_set.valid_bank_mask,
> +					 vtl);
>  
>  		vector = send_ipi_ex.vector;
>  		valid_bank_mask = send_ipi_ex.vp_set.valid_bank_mask;
> @@ -2322,9 +2332,9 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
>  		return HV_STATUS_INVALID_HYPERCALL_INPUT;
>  
>  	if (all_cpus)
> -		kvm_hv_send_ipi_to_many(kvm, vector, NULL, 0);
> +		kvm_hv_send_ipi_to_many(kvm, vector, NULL, 0, vtl);
>  	else
> -		kvm_hv_send_ipi_to_many(kvm, vector, sparse_banks, valid_bank_mask);
> +		kvm_hv_send_ipi_to_many(kvm, vector, sparse_banks, valid_bank_mask, vtl);
>  
>  ret_success:
>  	return HV_STATUS_SUCCESS;
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 83843379813e..ab8839c47bc7 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -1606,42 +1606,46 @@ TRACE_EVENT(kvm_hv_flush_tlb_ex,
>   * Tracepoints for kvm_hv_send_ipi.
>   */
>  TRACE_EVENT(kvm_hv_send_ipi,
> -	TP_PROTO(u32 vector, u64 processor_mask),
> -	TP_ARGS(vector, processor_mask),
> +	TP_PROTO(u32 vector, u64 processor_mask, u8 vtl),
> +	TP_ARGS(vector, processor_mask, vtl),
>  
>  	TP_STRUCT__entry(
>  		__field(u32, vector)
>  		__field(u64, processor_mask)
> +		__field(u8, vtl)
>  	),
>  
>  	TP_fast_assign(
>  		__entry->vector = vector;
>  		__entry->processor_mask = processor_mask;
> +		__entry->vtl = vtl;
>  	),
>  
> -	TP_printk("vector %x processor_mask 0x%llx",
> -		  __entry->vector, __entry->processor_mask)
> +	TP_printk("vector %x processor_mask 0x%llx vtl %d",
> +		  __entry->vector, __entry->processor_mask, __entry->vtl)
>  );
>  
>  TRACE_EVENT(kvm_hv_send_ipi_ex,
> -	TP_PROTO(u32 vector, u64 format, u64 valid_bank_mask),
> -	TP_ARGS(vector, format, valid_bank_mask),
> +	TP_PROTO(u32 vector, u64 format, u64 valid_bank_mask, u8 vtl),
> +	TP_ARGS(vector, format, valid_bank_mask, vtl),
>  
>  	TP_STRUCT__entry(
>  		__field(u32, vector)
>  		__field(u64, format)
>  		__field(u64, valid_bank_mask)
> +		__field(u8, vtl)
>  	),
>  
>  	TP_fast_assign(
>  		__entry->vector = vector;
>  		__entry->format = format;
>  		__entry->valid_bank_mask = valid_bank_mask;
> +		__entry->vtl = vtl;
>  	),
>  
> -	TP_printk("vector %x format %llx valid_bank_mask 0x%llx",
> +	TP_printk("vector %x format %llx valid_bank_mask 0x%llx vtl %d",
>  		  __entry->vector, __entry->format,
> -		  __entry->valid_bank_mask)
> +		  __entry->valid_bank_mask, __entry->vtl)
>  );
>  
>  TRACE_EVENT(kvm_pv_tlb_flush,
> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 0e7643c1ef01..40d7dc793c03 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -424,14 +424,16 @@ struct hv_vpset {
>  /* HvCallSendSyntheticClusterIpi hypercall */
>  struct hv_send_ipi {
>  	u32 vector;
> -	u32 reserved;
> +	union hv_input_vtl in_vtl;
> +	u8 reserved[3];
>  	u64 cpu_mask;
>  } __packed;
>  
>  /* HvCallSendSyntheticClusterIpiEx hypercall */
>  struct hv_send_ipi_ex {
>  	u32 vector;
> -	u32 reserved;
> +	union hv_input_vtl in_vtl;
> +	u8 reserved[3];
>  	struct hv_vpset vp_set;
>  } __packed;
>  






^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 07/33] KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_VSM
  2023-11-08 11:17 ` [RFC 07/33] KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_VSM Nicolas Saenz Julienne
@ 2023-11-28  7:16   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:16 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> Introduce a new capability to enable Hyper-V Virtual Secure Mode (VSM)
> emulation support.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 2 ++
>  arch/x86/kvm/hyperv.h           | 5 +++++
>  arch/x86/kvm/x86.c              | 5 +++++
>  include/uapi/linux/kvm.h        | 1 +
>  4 files changed, 13 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 00cd21b09f8c..7712e31b7537 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1118,6 +1118,8 @@ struct kvm_hv {
>  
>  	struct hv_partition_assist_pg *hv_pa_pg;
>  	struct kvm_hv_syndbg hv_syndbg;
> +
> +	bool hv_enable_vsm;
>  };
>  
>  struct msr_bitmap_range {
> diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
> index f83b8db72b11..2bfed69ba0db 100644
> --- a/arch/x86/kvm/hyperv.h
> +++ b/arch/x86/kvm/hyperv.h
> @@ -238,4 +238,9 @@ static inline int kvm_hv_verify_vp_assist(struct kvm_vcpu *vcpu)
>  
>  int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
>  
> +static inline bool kvm_hv_vsm_enabled(struct kvm *kvm)
> +{
> +       return kvm->arch.hyperv.hv_enable_vsm;
> +}
> +
>  #endif
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4cd3f00475c1..b0512e433032 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4485,6 +4485,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_HYPERV_CPUID:
>  	case KVM_CAP_HYPERV_ENFORCE_CPUID:
>  	case KVM_CAP_SYS_HYPERV_CPUID:
> +	case KVM_CAP_HYPERV_VSM:
>  	case KVM_CAP_PCI_SEGMENT:
>  	case KVM_CAP_DEBUGREGS:
>  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> @@ -6519,6 +6520,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		}
>  		mutex_unlock(&kvm->lock);
>  		break;
> +	case KVM_CAP_HYPERV_VSM:
> +		kvm->arch.hyperv.hv_enable_vsm = true;
> +		r = 0;
> +		break;
>  	default:
>  		r = -EINVAL;
>  		break;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 5ce06a1eee2b..168b6ac6ebe5 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1226,6 +1226,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_GUEST_MEMFD 233
>  #define KVM_CAP_VM_TYPES 234
>  #define KVM_CAP_APIC_ID_GROUPS 235
> +#define KVM_CAP_HYPERV_VSM 237
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  

Do we actually need this? Can we detect if the userspace wants VSM using
guest CPUID?

Of course if we need to add a new ioctl or something it will have to be
done together with a new capability, and since we will need at least to
know a vCPU's VTL, we will probably need this capability.
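
For the CPUID route, something along these lines could work; a sketch, relying on the
HV_ACCESS_VSM privilege bit from later in the series being present in the vCPU's
Hyper-V CPUID cache (note this turns it into a per-vCPU rather than per-VM check):

	static inline bool kvm_hv_vsm_enabled(struct kvm_vcpu *vcpu)
	{
		struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);

		return hv_vcpu &&
		       (hv_vcpu->cpuid_cache.features_ebx & HV_ACCESS_VSM);
	}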

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 08/33] KVM: x86: Don't use hv_timer if CAP_HYPERV_VSM enabled
  2023-11-08 11:17 ` [RFC 08/33] KVM: x86: Don't use hv_timer if CAP_HYPERV_VSM enabled Nicolas Saenz Julienne
@ 2023-11-28  7:21   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:21 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> VSM's VTLs are modeled by using a distinct vCPU per VTL. While one VTL
> is running the rest of vCPUs are left idle. This doesn't play well with
> the approach of tracking emulated timer expiration by using the VMX
> preemption timer. Inactive VTL's timers are still meant to run and
> inject interrupts regardless of their runstate.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/lapic.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index f55d216cb2a0..8cc75b24381b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -152,9 +152,10 @@ static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu)
>  
>  bool kvm_can_use_hv_timer(struct kvm_vcpu *vcpu)
>  {
> -	return kvm_x86_ops.set_hv_timer
> -	       && !(kvm_mwait_in_guest(vcpu->kvm) ||
> -		    kvm_can_post_timer_interrupt(vcpu));
> +	return kvm_x86_ops.set_hv_timer &&
> +	       !(kvm_mwait_in_guest(vcpu->kvm) ||
> +		 kvm_can_post_timer_interrupt(vcpu)) &&
> +	       !(kvm_hv_vsm_enabled(vcpu->kvm));
>  }

This has to be fixed this way or another.

One idea is to introduce a new MP state (KVM_MP_STATE_HALTED_USERSPACE), which would be
set on vCPUs that belong to inactive VTLs. Userspace would then call KVM_RUN, which would
block as it does for the regular halted state, but as soon as the vCPU becomes unhalted
it would return to userspace instead of entering the guest again.

If we go with the approach of using polling on the inactive VTL's vcpus, then we can switch to a 
software timer just before we start polling.
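
Something along these lines next to where the poll starts; a sketch reusing the lapic
helpers that the kvm_vcpu_block() path already relies on:

	/* Before parking an inactive-VTL vCPU for polling: */
	if (kvm_lapic_hv_timer_in_use(vcpu))
		kvm_lapic_switch_to_sw_timer(vcpu);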

Also note that APICv/AVIC and their IOMMU counterparts have to be treated the same way.

They are disabled during vCPU blocking for the same reason: the vCPU is not
assigned a physical CPU.

Currently it happens to work because you disable APIC accelerated map, which in turn disables (inhibits)
the APICv/AVIC.

Once again if we go with the approach of polling, we should ensure that polling does more or less
the same things as kvm_vcpu_block does (we should try to share as much code as possible as well).

Best regards,
	Maxim Levitsky






>  
>  static bool kvm_use_posted_timer_interrupt(struct kvm_vcpu *vcpu)





^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers
  2023-11-08 11:17 ` [RFC 09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers Nicolas Saenz Julienne
  2023-11-08 12:21   ` Alexander Graf
@ 2023-11-28  7:25   ` Maxim Levitsky
  1 sibling, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:25 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> Introduce two helper functions. The first one queries a vCPU's VTL
> level, the second one, given a struct kvm_vcpu and VTL pair, returns the
> corresponding 'sibling' struct kvm_vcpu at the right VTL.
> 
> We keep track of each VTL's state by having a distinct struct kvm_vcpu
> for each level. VTL-vCPUs that belong to the same guest CPU share the
> same physical APIC id, but belong to different APIC groups where the
> apic group represents the vCPU's VTL.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/hyperv.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
> index 2bfed69ba0db..5433107e7cc8 100644
> --- a/arch/x86/kvm/hyperv.h
> +++ b/arch/x86/kvm/hyperv.h
> @@ -23,6 +23,7 @@
>  
>  #include <linux/kvm_host.h>
>  #include "x86.h"
> +#include "lapic.h"
>  
>  /* "Hv#1" signature */
>  #define HYPERV_CPUID_SIGNATURE_EAX 0x31237648
> @@ -83,6 +84,23 @@ static inline struct kvm_hv_syndbg *to_hv_syndbg(struct kvm_vcpu *vcpu)
>  	return &vcpu->kvm->arch.hyperv.hv_syndbg;
>  }
>  
> +static inline struct kvm_vcpu *kvm_hv_get_vtl_vcpu(struct kvm_vcpu *vcpu, int vtl)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u32 target_id = kvm_apic_id(vcpu);
> +
> +	kvm_apic_id_set_group(kvm, vtl, &target_id);
> +	if (vcpu->vcpu_id == target_id)
> +		return vcpu;
> +
> +	return kvm_get_vcpu_by_id(kvm, target_id);
> +}

> +
> +static inline u8 kvm_hv_get_active_vtl(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_apic_group(vcpu);
> +}
> +
>  static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);


Ideally I'd prefer the kernel not to know the VTL mapping at all; rather, each vCPU
would be assigned to an APIC group/namespace and carry its assigned VTL.

Then the kernel works in this way:

* Regular APIC IPI -> send it to the APIC namespace to which the sender belongs; or, if we go with the idea
  of using multiple VMs, this will work unmodified.

* Hardware interrupt -> send it to the vCPU/VM to which userspace routed it via the GSI mappings.

* HyperV IPI -> if the target VTL matches the vCPU's assigned VTL -> deal with it the same as with a regular IPI
             -> otherwise exit to the userspace.

* Page fault -> if caused by a violation of the current VTL's protections,
  exit to userspace; userspace can then queue the SynIC message and wake up the higher VTL.


Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 10/33] KVM: x86: hyper-v: Introduce KVM_HV_GET_VSM_STATE
  2023-11-08 11:17 ` [RFC 10/33] KVM: x86: hyper-v: Introduce KVM_HV_GET_VSM_STATE Nicolas Saenz Julienne
@ 2023-11-28  7:26   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:26 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> HVCALL_GET_VP_REGISTERS exposes the VTL call hypercall page entry
> offsets to the guest. This hypercall is implemented in user-space while
> the hypercall page patching happens in-kernel. So expose it as part of
> the partition wide VSM state.
> 
> NOTE: Alternatively there is the option of sharing this information
> through a VTL KVM device attribute (the device is introduced in
> subsequent patches).
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/include/uapi/asm/kvm.h |  5 +++++
>  arch/x86/kvm/hyperv.c           |  8 ++++++++
>  arch/x86/kvm/hyperv.h           |  2 ++
>  arch/x86/kvm/x86.c              | 18 ++++++++++++++++++
>  include/uapi/linux/kvm.h        |  4 ++++
>  5 files changed, 37 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index f73d137784d7..370483d5d5fd 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -570,4 +570,9 @@ struct kvm_apic_id_groups {
>  	__u8 n_bits; /* nr of bits used to represent group in the APIC ID */
>  };
>  
> +/* for KVM_HV_GET_VSM_STATE */
> +struct kvm_hv_vsm_state {
> +	__u64 vsm_code_page_offsets;
> +};
> +
>  #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 2cf430f6ddd8..caaa859932c5 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -2990,3 +2990,11 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>  
>  	return 0;
>  }
> +
> +int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *state)
> +{
> +	struct kvm_hv* hv = &kvm->arch.hyperv;
> +
> +	state->vsm_code_page_offsets = hv->vsm_code_page_offsets.as_u64;
> +	return 0;
> +}
> diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
> index 5433107e7cc8..b3d1113efe82 100644
> --- a/arch/x86/kvm/hyperv.h
> +++ b/arch/x86/kvm/hyperv.h
> @@ -261,4 +261,6 @@ static inline bool kvm_hv_vsm_enabled(struct kvm *kvm)
>         return kvm->arch.hyperv.hv_enable_vsm;
>  }
>  
> +int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *state);
> +
>  #endif
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b0512e433032..57f9c58e1e32 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7132,6 +7132,24 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>  		r = kvm_vm_ioctl_set_apic_id_groups(kvm, &groups);
>  		break;
>  	}
> +	case KVM_HV_GET_VSM_STATE: {
> +		struct kvm_hv_vsm_state vsm_state;
> +
> +		r = -EINVAL;
> +		if (!kvm_hv_vsm_enabled(kvm))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_get_hv_vsm_state(kvm, &vsm_state);
> +		if (r)
> +			goto out;
> +
> +		r = -EFAULT;
> +		if (copy_to_user(argp, &vsm_state, sizeof(vsm_state)))
> +			goto out;
> +
> +		r = 0;
> +		break;
> +	}
>  	default:
>  		r = -ENOTTY;
>  	}
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 168b6ac6ebe5..03f5c08fd7aa 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -2316,4 +2316,8 @@ struct kvm_create_guest_memfd {
>  #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
>  
>  #define KVM_SET_APIC_ID_GROUPS _IOW(KVMIO, 0xd7, struct kvm_apic_id_groups)
> +
> +/* Get/Set Hyper-V VSM state. Available with KVM_CAP_HYPERV_VSM */
> +#define KVM_HV_GET_VSM_STATE _IOR(KVMIO, 0xd5, struct kvm_hv_vsm_state)
> +
>  #endif /* __LINUX_KVM_H */

Looks reasonable but if we do hypercall patching in userspace as I suggested,
we might not need this.

Best regards,
	Maxim Levitsky






^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 11/33] KVM: x86: hyper-v: Handle GET/SET_VP_REGISTER hcall in user-space
  2023-11-08 12:14   ` Alexander Graf
@ 2023-11-28  7:26     ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:26 UTC (permalink / raw)
  To: Alexander Graf, Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc

On Wed, 2023-11-08 at 13:14 +0100, Alexander Graf wrote:
> On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> > Let user-space handle HVCALL_GET_VP_REGISTERS and
> > HVCALL_SET_VP_REGISTERS through the KVM_EXIT_HYPERV_HVCALL exit reason.
> > Additionally, expose the cpuid bit.
> > 
> > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> > ---
> >   arch/x86/kvm/hyperv.c             | 9 +++++++++
> >   include/asm-generic/hyperv-tlfs.h | 1 +
> >   2 files changed, 10 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > index caaa859932c5..a3970d52eef1 100644
> > --- a/arch/x86/kvm/hyperv.c
> > +++ b/arch/x86/kvm/hyperv.c
> > @@ -2456,6 +2456,9 @@ static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
> >   
> >   static bool kvm_hv_is_xmm_output_hcall(u16 code)
> >   {
> > +	if (code == HVCALL_GET_VP_REGISTERS)
> > +		return true;
> > +
> >   	return false;
> >   }
> >   
> > @@ -2520,6 +2523,8 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc)
> >   	case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
> >   	case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
> >   	case HVCALL_SEND_IPI_EX:
> > +	case HVCALL_GET_VP_REGISTERS:
> > +	case HVCALL_SET_VP_REGISTERS:
> >   		return true;
> >   	}
> >   
> > @@ -2738,6 +2743,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
> >   			break;
> >   		}
> >   		goto hypercall_userspace_exit;
> > +	case HVCALL_GET_VP_REGISTERS:
> > +	case HVCALL_SET_VP_REGISTERS:
> > +		goto hypercall_userspace_exit;
> >   	default:
> >   		ret = HV_STATUS_INVALID_HYPERCALL_CODE;
> >   		break;
> > @@ -2903,6 +2911,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
> >   			ent->ebx |= HV_POST_MESSAGES;
> >   			ent->ebx |= HV_SIGNAL_EVENTS;
> >   			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
> > +			ent->ebx |= HV_ACCESS_VP_REGISTERS;
> 
> Do we need to guard this?

I think so, a check should be added to hv_check_hypercall_access().
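
A sketch following the existing pattern in that function's switch statement:

	case HVCALL_GET_VP_REGISTERS:
	case HVCALL_SET_VP_REGISTERS:
		return hv_vcpu->cpuid_cache.features_ebx &
		       HV_ACCESS_VP_REGISTERS;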

I do wonder though why KVM can't just pass all unknown hypercalls to userspace
instead of having a whitelist.


Best regards,
	Maxim Levitsky

> 
> 
> Alex
> 
> 
> 
> 
> 
> 





^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 12/33] KVM: x86: hyper-v: Handle VSM hcalls in user-space
  2023-11-08 11:17 ` [RFC 12/33] KVM: x86: hyper-v: Handle VSM hcalls " Nicolas Saenz Julienne
@ 2023-11-28  7:28   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:28 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> Let user-space handle all hypercalls that fall under the AccessVsm
> partition privilege flag. That is:
>  - HVCALL_MODIFY_VTL_PROTECTION_MASK:
>  - HVCALL_ENABLE_PARTITION_VTL:
>  - HVCALL_ENABLE_VP_VTL:
>  - HVCALL_VTL_CALL:
>  - HVCALL_VTL_RETURN:
> The hypercalls are processed through the KVM_EXIT_HYPERV_HVCALL exit.
> Additionally, expose the cpuid bit.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/hyperv.c             | 15 +++++++++++++++
>  include/asm-generic/hyperv-tlfs.h |  7 ++++++-
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index a3970d52eef1..a266c5d393f5 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -2462,6 +2462,11 @@ static bool kvm_hv_is_xmm_output_hcall(u16 code)
>  	return false;
>  }
>  
> +static inline bool kvm_hv_is_vtl_call_return(u16 code)
> +{
> +	return code == HVCALL_VTL_CALL || code == HVCALL_VTL_RETURN;
> +}
> +
>  static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
>  {
>  	bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT);
> @@ -2471,6 +2476,9 @@ static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
>  	if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast)
>  		kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
>  
> +	if (kvm_hv_is_vtl_call_return(code))
> +		return kvm_skip_emulated_instruction(vcpu);

Can you add justification for this?
If it is justified, does it make sense to move this code to kvm_hv_hypercall_complete()
(which also calls kvm_skip_emulated_instruction())?



> +
>  	return kvm_hv_hypercall_complete(vcpu, result);
>  }
>  
> @@ -2525,6 +2533,7 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc)
>  	case HVCALL_SEND_IPI_EX:
>  	case HVCALL_GET_VP_REGISTERS:
>  	case HVCALL_SET_VP_REGISTERS:
> +	case HVCALL_MODIFY_VTL_PROTECTION_MASK:
>  		return true;
>  	}
>  
> @@ -2745,6 +2754,11 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>  		goto hypercall_userspace_exit;
>  	case HVCALL_GET_VP_REGISTERS:
>  	case HVCALL_SET_VP_REGISTERS:
> +	case HVCALL_MODIFY_VTL_PROTECTION_MASK:
> +	case HVCALL_ENABLE_PARTITION_VTL:
> +	case HVCALL_ENABLE_VP_VTL:
> +	case HVCALL_VTL_CALL:
> +	case HVCALL_VTL_RETURN:
>  		goto hypercall_userspace_exit;
>  	default:

These new hypercalls should also be added to hv_check_hypercall_access(), gated on HV_ACCESS_VSM.

>  		ret = HV_STATUS_INVALID_HYPERCALL_CODE;
> @@ -2912,6 +2926,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>  			ent->ebx |= HV_SIGNAL_EVENTS;
>  			ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
>  			ent->ebx |= HV_ACCESS_VP_REGISTERS;
> +			ent->ebx |= HV_ACCESS_VSM;
>  
>  			ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
>  			ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;

Best regards,
	Maxim Levitsky

> diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h
> index 24ea699a3d8e..a8b5c8a84bbc 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -89,6 +89,7 @@
>  #define HV_ACCESS_STATS				BIT(8)
>  #define HV_DEBUGGING				BIT(11)
>  #define HV_CPU_MANAGEMENT			BIT(12)
> +#define HV_ACCESS_VSM				BIT(16)
>  #define HV_ACCESS_VP_REGISTERS			BIT(17)
>  #define HV_ENABLE_EXTENDED_HYPERCALLS		BIT(20)
>  #define HV_ISOLATION				BIT(22)
> @@ -147,9 +148,13 @@ union hv_reference_tsc_msr {
>  /* Declare the various hypercall operations. */
>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE	0x0002
>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST	0x0003
> -#define HVCALL_ENABLE_VP_VTL			0x000f
>  #define HVCALL_NOTIFY_LONG_SPIN_WAIT		0x0008
>  #define HVCALL_SEND_IPI				0x000b
> +#define HVCALL_MODIFY_VTL_PROTECTION_MASK	0x000c
> +#define HVCALL_ENABLE_PARTITION_VTL		0x000d
> +#define HVCALL_ENABLE_VP_VTL			0x000f
> +#define HVCALL_VTL_CALL				0x0011
> +#define HVCALL_VTL_RETURN			0x0012
>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX	0x0013
>  #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX	0x0014
>  #define HVCALL_SEND_IPI_EX			0x0015





^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 13/33] KVM: Allow polling vCPUs for events
  2023-11-08 11:17 ` [RFC 13/33] KVM: Allow polling vCPUs for events Nicolas Saenz Julienne
@ 2023-11-28  7:30   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:30 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> A number of use cases have surfaced where it'd be beneficial to have a
> vCPU stop its execution in user-space, as opposed to having it sleep
> in-kernel. Be it in order to make better use of the pCPU's time while
> the vCPU is halted, or to implement security features like Hyper-V's
> VSM.


> 
> A problem with this approach is that user-space has no way of knowing
> whether the vCPU has pending events (interrupts, timers, etc...), so we
> need a new interface to query whether any are pending. poll() turned out to be a
> very good fit.
> 
> So enable polling vCPUs. The poll() interface considers a vCPU has a
> pending event if it didn't enter the guest since being kicked by an
> event source (being kicked forces a guest exit). Kicking a vCPU that has
> pollers wakes up the polling threads.
> 
> NOTES:
>  - There is a race between the 'vcpu->kicked' check in the polling
>    thread and the vCPU thread re-entering the guest. This hardly affects
>    the use-cases stated above, but needs to be fixed.
> 
>  - This was tested alongside a WIP Hyper-V Virtual Trust Level
>    implementation which makes ample use of the poll() interface.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/x86.c       |  2 ++
>  include/linux/kvm_host.h |  2 ++
>  virt/kvm/kvm_main.c      | 30 ++++++++++++++++++++++++++++++
>  3 files changed, 34 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 57f9c58e1e32..bf4891bc044e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10788,6 +10788,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		goto cancel_injection;
>  	}
>  
> +	WRITE_ONCE(vcpu->kicked, false);
> +
>  	if (req_immediate_exit) {
>  		kvm_make_request(KVM_REQ_EVENT, vcpu);
>  		static_call(kvm_x86_request_immediate_exit)(vcpu);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 687589ce9f63..71e1e8cf8936 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -336,6 +336,7 @@ struct kvm_vcpu {
>  #endif
>  	int mode;
>  	u64 requests;
> +	bool kicked;
>  	unsigned long guest_debug;
>  
>  	struct mutex mutex;
> @@ -395,6 +396,7 @@ struct kvm_vcpu {
>  	 */
>  	struct kvm_memory_slot *last_used_slot;
>  	u64 last_used_slot_gen;
> +	wait_queue_head_t wqh;
>  };
>  
>  /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ad9aab898a0c..fde004a0ac46 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -497,12 +497,14 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	kvm_vcpu_set_dy_eligible(vcpu, false);
>  	vcpu->preempted = false;
>  	vcpu->ready = false;
> +	vcpu->kicked = false;
>  	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
>  	vcpu->last_used_slot = NULL;
>  
>  	/* Fill the stats id string for the vcpu */
>  	snprintf(vcpu->stats_id, sizeof(vcpu->stats_id), "kvm-%d/vcpu-%d",
>  		 task_pid_nr(current), id);
> +	init_waitqueue_head(&vcpu->wqh);
>  }
>  
>  static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
> @@ -3970,6 +3972,10 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
>  		if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
>  			smp_send_reschedule(cpu);
>  	}
> +
> +	if (!cmpxchg(&vcpu->kicked, false, true))
> +		wake_up_interruptible(&vcpu->wqh);
> +
>  out:
>  	put_cpu();
>  }
> @@ -4174,6 +4180,29 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>  	return 0;
>  }
>  
> +static __poll_t kvm_vcpu_poll(struct file *file, poll_table *wait)
> +{
> +	struct kvm_vcpu *vcpu = file->private_data;
> +
> +	poll_wait(file, &vcpu->wqh, wait);
> +
> +	/*
> +	 * Make sure we read vcpu->kicked after adding the vcpu into
> +	 * the waitqueue list. Otherwise we might have the following race:
> +	 *
> +	 *   READ_ONCE(vcpu->kicked)
> +	 *					cmpxchg(&vcpu->kicked, false, true))
> +	 *					wake_up_interruptible(&vcpu->wqh)
> +	 *   list_add_tail(wait, &vcpu->wqh)
> +	 */
> +	smp_mb();
> +	if (READ_ONCE(vcpu->kicked)) {
> +		return EPOLLIN;
> +	}
> +
> +	return 0;
> +}
> +
>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>  {
>  	struct kvm_vcpu *vcpu = filp->private_data;
> @@ -4186,6 +4215,7 @@ static const struct file_operations kvm_vcpu_fops = {
>  	.release        = kvm_vcpu_release,
>  	.unlocked_ioctl = kvm_vcpu_ioctl,
>  	.mmap           = kvm_vcpu_mmap,
> +	.poll		= kvm_vcpu_poll,
>  	.llseek		= noop_llseek,
>  	KVM_COMPAT(kvm_vcpu_compat_ioctl),
>  };



A few ideas on the design:

I think that we can do this in a simpler way.


I am thinking about the following API:

-> vCPU does vtlcall and KVM exits to the userspace.

-> The userspace sets the vCPU to a new MP state (KVM_MP_STATE_HALTED_USERSPACE), which behaves like a
regular halt except that once the vCPU is ready to run, KVM exits to userspace instead.


-> The userspace does another KVM_RUN, which blocks until an event arrives and then exits back to userspace.

-> The userspace can now decide what to do; it might, for example, send a signal to the vCPU thread that
runs VTL0 to kick it out of guest mode, and then resume the VTL1 vCPU (see the sketch below).
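
In VMM pseudo-C, the VTL1 thread would look roughly like this. This is only a sketch:
KVM_MP_STATE_HALTED_USERSPACE is the hypothetical new state proposed above, and
SIG_VCPU_EXIT stands for whatever signal the VMM uses to force an immediate exit of
the VTL0 thread:

	struct kvm_mp_state mp = {
		.mp_state = KVM_MP_STATE_HALTED_USERSPACE,	/* hypothetical */
	};

	ioctl(vtl1_vcpu_fd, KVM_SET_MP_STATE, &mp);
	/* Blocks like a halted vCPU, but returns once a wake event is pending. */
	ioctl(vtl1_vcpu_fd, KVM_RUN, 0);

	/* Kick VTL0 out of guest mode, then run VTL1 for real. */
	pthread_kill(vtl0_thread, SIG_VCPU_EXIT);
	mp.mp_state = KVM_MP_STATE_RUNNABLE;
	ioctl(vtl1_vcpu_fd, KVM_SET_MP_STATE, &mp);
	ioctl(vtl1_vcpu_fd, KVM_RUN, 0);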


Best regards,
	Maxim Levitsky 







^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 14/33] KVM: x86: Add VTL to the MMU role
  2023-11-10 18:52     ` Nicolas Saenz Julienne
@ 2023-11-28  7:34       ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:34 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, Sean Christopherson
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, kys, haiyangz, decui, x86, linux-doc

On Fri, 2023-11-10 at 18:52 +0000, Nicolas Saenz Julienne wrote:
> On Wed Nov 8, 2023 at 5:26 PM UTC, Sean Christopherson wrote:
> > On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > > With the upcoming introduction of per-VTL memory protections, make MMU
> > > roles VTL aware. This will avoid sharing PTEs between vCPUs that belong
> > > to different VTLs, and that have distinct memory access restrictions.
> > > 
> > > Four bits are allocated to store the VTL number in the MMU role, since
> > > the TLFS states there is a maximum of 16 levels.
> > 
> > How many does KVM actually allow/support?  Multiplying the number of possible
> > roots by 16x is a *major* change.
> 
> AFAIK in practice only VTL0/1 are used. Don't know if Microsoft will
> come up with more in the future. We could introduce a CAP that exposes
> the number of supported VTLs to user-space, and leave it as a compile
> option.
> 

Actually the Hyper-V spec says that currently only two VTLs are implemented:

https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/vsm

"Architecturally, up to 16 levels of VTLs are supported; however a hypervisor may choose to implement fewer than 16 VTL’s. Currently, only two VTLs are implemented."

We shouldn't completely hardcode two VTLs but I think that it is safe to make optimizations aiming at two VTLs,
and also have a compile time switch for the number of supported VTLs.

In terms of adding the VTL to the MMU role, as long as there are only 2 VTLs I don't think that this is a terrible idea.

This does bring up a question: what are we going to do about SMM? Windows will need it due to secure boot,
so we can't just say that VSM is only supported without SMM.


However if we take the approach of having a VM per VTL, then all of this is free, except that every time userspace changes memslots,
it will have to do so for both VMs at the same time (and that might introduce races).

Also TLB flushes might be tricky to synchronize between these two VMs and so on.

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 15/33] KVM: x86/mmu: Introduce infrastructure to handle non-executable faults
  2023-11-08 11:17 ` [RFC 15/33] KVM: x86/mmu: Introduce infrastructure to handle non-executable faults Nicolas Saenz Julienne
@ 2023-11-28  7:34   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:34 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> The upcoming per-VTL memory protections support needs to fault in
> non-executable memory. Introduce a new attribute in struct
> kvm_page_fault, map_executable, to control whether the gfn range should
> be mapped as executable.
> 
> No functional change intended.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 6 +++++-
>  arch/x86/kvm/mmu/mmu_internal.h | 2 ++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 8 ++++++--
>  3 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2afef86863fb..4e02d506cc25 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3245,6 +3245,7 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	struct kvm_mmu_page *sp;
>  	int ret;
>  	gfn_t base_gfn = fault->gfn;
> +	unsigned access = ACC_ALL;
>  
>  	kvm_mmu_hugepage_adjust(vcpu, fault);
>  
> @@ -3274,7 +3275,10 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	if (WARN_ON_ONCE(it.level != fault->goal_level))
>  		return -EFAULT;
>  
> -	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
> +	if (!fault->map_executable)
> +		access &= ~ACC_EXEC_MASK;
> +
> +	ret = mmu_set_spte(vcpu, fault->slot, it.sptep, access,
>  			   base_gfn, fault->pfn, fault);
>  	if (ret == RET_PF_SPURIOUS)
>  		return ret;


> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index b66a7d47e0e4..bd62c4d5d5f1 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -239,6 +239,7 @@ struct kvm_page_fault {
>  	kvm_pfn_t pfn;
>  	hva_t hva;
>  	bool map_writable;
> +	bool map_executable;
>  
>  	/*
>  	 * Indicates the guest is trying to write a gfn that contains one or
> @@ -298,6 +299,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		.req_level = PG_LEVEL_4K,
>  		.goal_level = PG_LEVEL_4K,
>  		.is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
> +		.map_executable = true,
>  	};
>  	int r;
>  
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6cd4dd631a2f..46f3e72ab770 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -957,14 +957,18 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  	u64 new_spte;
>  	int ret = RET_PF_FIXED;
>  	bool wrprot = false;
> +	unsigned access = ACC_ALL;
>  
>  	if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
>  		return RET_PF_RETRY;
>  
> +	if (!fault->map_executable)
> +		access &= ~ACC_EXEC_MASK;
> +
>  	if (unlikely(!fault->slot))
> -		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> +		new_spte = make_mmio_spte(vcpu, iter->gfn, access);
>  	else
> -		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> +		wrprot = make_spte(vcpu, sp, fault->slot, access, iter->gfn,
>  					 fault->pfn, iter->old_spte, fault->prefetch, true,
>  					 fault->map_writable, &new_spte);

Overall this patch makes sense but I don't know the mmu well enough to be sure
that there are no corner cases which are not handled here.

>  


Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 16/33] KVM: x86/mmu: Expose R/W/X flags during memory fault exits
  2023-11-08 11:17 ` [RFC 16/33] KVM: x86/mmu: Expose R/W/X flags during memory fault exits Nicolas Saenz Julienne
@ 2023-11-28  7:36   ` Maxim Levitsky
  2023-11-28 16:31     ` Sean Christopherson
  0 siblings, 1 reply; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:36 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> Include the fault's read, write and execute status when exiting to
> user-space.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 4 ++--
>  include/linux/kvm_host.h | 9 +++++++--
>  include/uapi/linux/kvm.h | 6 ++++++
>  3 files changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4e02d506cc25..feca077c0210 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4300,8 +4300,8 @@ static inline u8 kvm_max_level_for_order(int order)
>  static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>  					      struct kvm_page_fault *fault)
>  {
> -	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> -				      PAGE_SIZE, fault->write, fault->exec,
> +	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT, PAGE_SIZE,
> +				      fault->write, fault->exec, fault->user,
>  				      fault->is_private);
>  }
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 71e1e8cf8936..631fd532c97a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2367,14 +2367,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>  						 gpa_t gpa, gpa_t size,
>  						 bool is_write, bool is_exec,
> -						 bool is_private)
> +						 bool is_read, bool is_private)

It almost feels like there is a need for a struct to hold all of those parameters.
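e.g. something like this (name and exact layout are just an illustration):

struct kvm_memory_fault_desc {
	gpa_t gpa;
	gpa_t size;
	bool is_read;
	bool is_write;
	bool is_exec;
	bool is_private;
};

so that kvm_prepare_memory_fault_exit() takes a single pointer, and adding
yet another flag doesn't have to touch every caller.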

>  {
>  	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
>  	vcpu->run->memory_fault.gpa = gpa;
>  	vcpu->run->memory_fault.size = size;
>  
> -	/* RWX flags are not (yet) defined or communicated to userspace. */
>  	vcpu->run->memory_fault.flags = 0;
> +	if (is_read)
> +		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_READ;
> +	if (is_write)
> +		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_WRITE;
> +	if (is_exec)
> +		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_EXECUTE;
>  	if (is_private)
>  		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
>  }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 03f5c08fd7aa..0ddffb8b0c99 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -533,7 +533,13 @@ struct kvm_run {
>  		} notify;
>  		/* KVM_EXIT_MEMORY_FAULT */
>  		struct {
> +#define KVM_MEMORY_EXIT_FLAG_READ	(1ULL << 0)
> +#define KVM_MEMORY_EXIT_FLAG_WRITE	(1ULL << 1)
> +#define KVM_MEMORY_EXIT_FLAG_EXECUTE	(1ULL << 2)
>  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
> +#define KVM_MEMORY_EXIT_NO_ACCESS                            \
> +	(KVM_MEMORY_EXIT_FLAG_NR | KVM_MEMORY_EXIT_FLAG_NW | \
> +	 KVM_MEMORY_EXIT_FLAG_NX)
>  			__u64 flags;
>  			__u64 gpa;
>  			__u64 size;


I don't think that KVM_MEMORY_EXIT_FLAG_NR, KVM_MEMORY_EXIT_FLAG_NW, KVM_MEMORY_EXIT_FLAG_NX are defined anywhere.
Also, why is KVM_MEMORY_EXIT_NO_ACCESS needed at all? Userspace can infer it from the lack of the other access flags.
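e.g. with the RWX flags this patch defines, userspace can do something like
(untested sketch):

static bool memory_fault_is_no_access(__u64 flags)
{
	return !(flags & (KVM_MEMORY_EXIT_FLAG_READ |
			  KVM_MEMORY_EXIT_FLAG_WRITE |
			  KVM_MEMORY_EXIT_FLAG_EXECUTE));
}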

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 17/33] KVM: x86/mmu: Allow setting memory attributes if VSM enabled
  2023-11-08 11:17 ` [RFC 17/33] KVM: x86/mmu: Allow setting memory attributes if VSM enabled Nicolas Saenz Julienne
@ 2023-11-28  7:39   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:39 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> VSM is also a user of memory attributes, so let it use
> kvm_set_mem_attributes().
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index feca077c0210..a1fbb905258b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7265,7 +7265,8 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>  	 * Zapping SPTEs in this case ensures KVM will reassess whether or not
>  	 * a hugepage can be used for affected ranges.
>  	 */
> -	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
> +	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm) &&
> +			 !kvm_hv_vsm_enabled(kvm)))
>  		return false;

IMHO, in the long term, memory attributes should either be always enabled,
or the above check should become more generic.
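e.g. something along these lines (the helper name is invented):

	if (WARN_ON_ONCE(!kvm_arch_has_memory_attributes(kvm)))
		return false;

where kvm_arch_has_memory_attributes() would cover private memory, VSM and
whatever else comes along.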

But otherwise this patch looks reasonable.

>  
>  	return kvm_unmap_gfn_range(kvm, range);
> @@ -7322,7 +7323,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>  	 * a range that has PRIVATE GFNs, and conversely converting a range to
>  	 * SHARED may now allow hugepages.
>  	 */
> -	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
> +	if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm) &&
> +			 !kvm_hv_vsm_enabled(kvm)))
>  		return false;
>  
>  	/*

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array
  2023-11-08 11:17 ` [RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array Nicolas Saenz Julienne
  2023-11-08 16:59   ` Sean Christopherson
@ 2023-11-28  7:41   ` Maxim Levitsky
  1 sibling, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:41 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array to
> allow other memory attribute sources to use the function.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 5 +++--
>  include/linux/kvm_host.h | 8 +++++---
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a1fbb905258b..96421234ca88 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7301,7 +7301,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
>  
>  	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
>  		if (hugepage_test_mixed(slot, gfn, level - 1) ||
> -		    attrs != kvm_get_memory_attributes(kvm, gfn))
> +		    attrs != kvm_get_memory_attributes(&kvm->mem_attr_array, gfn))
>  			return false;
>  	}
>  	return true;
> @@ -7401,7 +7401,8 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
>  		 * be manually checked as the attributes may already be mixed.
>  		 */
>  		for (gfn = start; gfn < end; gfn += nr_pages) {
> -			unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);
> +			unsigned long attrs =
> +				kvm_get_memory_attributes(&kvm->mem_attr_array, gfn);
>  
>  			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
>  				hugepage_clear_mixed(slot, gfn, level);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 631fd532c97a..4242588e3dfb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2385,9 +2385,10 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>  }
>  
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> -static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +static inline unsigned long
> +kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn)
>  {
> -	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> +	return xa_to_value(xa_load(mem_attr_array, gfn));
>  }

Can we wrap the 'struct xarray *' in a struct, even if it will have a single
member, to make it clearer what type 'kvm_get_memory_attributes' receives?
Also maybe rename it to something like 'kvm_get_memory_attributes_for_gfn'?
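e.g. (untested sketch):

struct kvm_mem_attr_array {
	struct xarray xa;
};

static inline unsigned long
kvm_get_memory_attributes_for_gfn(struct kvm_mem_attr_array *attrs, gfn_t gfn)
{
	return xa_to_value(xa_load(&attrs->xa, gfn));
}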

>  
>  bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> @@ -2400,7 +2401,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  {
>  	return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
> -	       kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +	       kvm_get_memory_attributes(&kvm->mem_attr_array, gfn) &
> +		       KVM_MEMORY_ATTRIBUTE_PRIVATE;
>  }
>  #else
>  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)


Also if we go with VM per VTL approach, we won't need this, each VM can already have its own memory attributes.

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 19/33] KVM: x86: Decouple kvm_range_has_memory_attributes() from struct kvm's mem_attr_array
  2023-11-08 11:17 ` [RFC 19/33] KVM: x86: Decouple kvm_range_has_memory_attributes() " Nicolas Saenz Julienne
@ 2023-11-28  7:42   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:42 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> Decouple kvm_range_has_memory_attributes() from struct kvm's
> mem_attr_array to allow other memory attribute sources to use the
> function.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 3 ++-
>  include/linux/kvm_host.h | 4 ++--
>  virt/kvm/kvm_main.c      | 9 +++++----
>  3 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 96421234ca88..4ace2f8660b0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7297,7 +7297,8 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
>  
>  	if (level == PG_LEVEL_2M)
> -		return kvm_range_has_memory_attributes(kvm, start, end, attrs);
> +		return kvm_range_has_memory_attributes(&kvm->mem_attr_array,
> +						       start, end, attrs);
>  
>  	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
>  		if (hugepage_test_mixed(slot, gfn, level - 1) ||
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4242588e3dfb..32cf05637647 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2391,8 +2391,8 @@ kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn)
>  	return xa_to_value(xa_load(mem_attr_array, gfn));
>  }
>  
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> -				     unsigned long attrs);
> +bool kvm_range_has_memory_attributes(struct xarray *mem_attr_array, gfn_t start,
> +				     gfn_t end, unsigned long attrs);
>  bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>  					struct kvm_gfn_range *range);
>  bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fde004a0ac46..6bb23eaf7aa6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2440,10 +2440,10 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>   * Returns true if _all_ gfns in the range [@start, @end) have attributes
>   * matching @attrs.
>   */
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> -				     unsigned long attrs)
> +bool kvm_range_has_memory_attributes(struct xarray *mem_attr_array, gfn_t start,
> +				     gfn_t end, unsigned long attrs)
>  {
> -	XA_STATE(xas, &kvm->mem_attr_array, start);
> +	XA_STATE(xas, mem_attr_array, start);
>  	unsigned long index;
>  	bool has_attrs;
>  	void *entry;
> @@ -2582,7 +2582,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
>  	mutex_lock(&kvm->slots_lock);
>  
>  	/* Nothing to do if the entire range as the desired attributes. */
> -	if (kvm_range_has_memory_attributes(kvm, start, end, attributes))
> +	if (kvm_range_has_memory_attributes(&kvm->mem_attr_array, start, end,
> +					    attributes))
>  		goto out_unlock;
>  
>  	/*


Same comments as for the previous patch apply + how about
'kvm_gfn_range_has_memory_attributes'?

(I didn't review the memfd patch series and it shows :( )

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 20/33] KVM: x86/mmu: Decouple hugepage_has_attrs() from struct kvm's mem_attr_array
  2023-11-08 11:17 ` [RFC 20/33] KVM: x86/mmu: Decouple hugepage_has_attrs() " Nicolas Saenz Julienne
@ 2023-11-28  7:43   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:43 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> Decouple hugepage_has_attrs() from struct kvm's mem_attr_array to
> allow other memory attribute sources to use the function.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4ace2f8660b0..c0fd3afd6be5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7290,19 +7290,19 @@ static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
>  	lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG;
>  }
>  
> -static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
> -			       gfn_t gfn, int level, unsigned long attrs)
> +static bool hugepage_has_attrs(struct xarray *mem_attr_array,
> +			       struct kvm_memory_slot *slot, gfn_t gfn,
> +			       int level, unsigned long attrs)
>  {
>  	const unsigned long start = gfn;
>  	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
>  
>  	if (level == PG_LEVEL_2M)
> -		return kvm_range_has_memory_attributes(&kvm->mem_attr_array,
> -						       start, end, attrs);
> +		return kvm_range_has_memory_attributes(mem_attr_array, start, end, attrs);
>  
>  	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
>  		if (hugepage_test_mixed(slot, gfn, level - 1) ||
> -		    attrs != kvm_get_memory_attributes(&kvm->mem_attr_array, gfn))
> +		    attrs != kvm_get_memory_attributes(mem_attr_array, gfn))
>  			return false;
>  	}
>  	return true;
> @@ -7344,7 +7344,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>  			 * misaligned address regardless of memory attributes.
>  			 */
>  			if (gfn >= slot->base_gfn) {
> -				if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
> +				if (hugepage_has_attrs(&kvm->mem_attr_array,
> +						       slot, gfn, level, attrs))
>  					hugepage_clear_mixed(slot, gfn, level);
>  				else
>  					hugepage_set_mixed(slot, gfn, level);
> @@ -7366,7 +7367,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>  		 */
>  		if (gfn < range->end &&
>  		    (gfn + nr_pages) <= (slot->base_gfn + slot->npages)) {
> -			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
> +			if (hugepage_has_attrs(&kvm->mem_attr_array, slot, gfn,
> +					       level, attrs))
>  				hugepage_clear_mixed(slot, gfn, level);
>  			else
>  				hugepage_set_mixed(slot, gfn, level);
> @@ -7405,7 +7407,7 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
>  			unsigned long attrs =
>  				kvm_get_memory_attributes(&kvm->mem_attr_array, gfn);
>  
> -			if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
> +			if (hugepage_has_attrs(&kvm->mem_attr_array, slot, gfn, level, attrs))
>  				hugepage_clear_mixed(slot, gfn, level);
>  			else
>  				hugepage_set_mixed(slot, gfn, level);

hugepage_has_attrs is also not a name that conveys what the function actually
does IMHO, but I don't know a better name, to be honest.

Same remarks as for other two patches apply here as well.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 26/33] KVM: x86: hyper-vsm: Allow setting per-VTL memory attributes
  2023-11-08 11:17 ` [RFC 26/33] KVM: x86: hyper-vsm: Allow setting per-VTL " Nicolas Saenz Julienne
@ 2023-11-28  7:44   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:44 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> Introduce KVM_SET_MEMORY_ATTRIBUTES ioctl support for VTL KVM devices.
> The attributes are stored in an xarray private to the VTL device.
> 
> The following memory attributes are supported:
>  - KVM_MEMORY_ATTRIBUTE_READ
>  - KVM_MEMORY_ATTRIBUTE_WRITE
>  - KVM_MEMORY_ATTRIBUTE_EXECUTE
>  - KVM_MEMORY_ATTRIBUTE_NO_ACCESS
> Although only some combinations are valid, see code comment below.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/hyperv.c | 61 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 0d8402dba596..bcace0258af1 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -62,6 +62,10 @@
>   */
>  #define HV_EXT_CALL_MAX (HV_EXT_CALL_QUERY_CAPABILITIES + 64)
>  
> +#define KVM_HV_VTL_ATTRS						\
> +	(KVM_MEMORY_ATTRIBUTE_READ | KVM_MEMORY_ATTRIBUTE_WRITE |	\
> +	 KVM_MEMORY_ATTRIBUTE_EXECUTE | KVM_MEMORY_ATTRIBUTE_NO_ACCESS)
> +
>  static void stimer_mark_pending(struct kvm_vcpu_hv_stimer *stimer,
>  				bool vcpu_kick);
>  
> @@ -3025,6 +3029,7 @@ int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *stat
>  
>  struct kvm_hv_vtl_dev {
>  	int vtl;
> +	struct xarray mem_attrs;
>  };
>  
>  static int kvm_hv_vtl_get_attr(struct kvm_device *dev,
> @@ -3047,16 +3052,71 @@ static void kvm_hv_vtl_release(struct kvm_device *dev)
>  {
>  	struct kvm_hv_vtl_dev *vtl_dev = dev->private;
>  
> +	xa_destroy(&vtl_dev->mem_attrs);
>  	kfree(vtl_dev);
>  	kfree(dev); /* alloc by kvm_ioctl_create_device, free by .release */
>  }
>  
> +/*
> + * The TLFS lists the valid memory protection combinations (15.9.3):
> + *  - No access
> + *  - Read-only, no execute
> + *  - Read-only, execute
> + *  - Read/write, no execute
> + *  - Read/write, execute
> + */
> +static bool kvm_hv_validate_vtl_mem_attributes(struct kvm_memory_attributes *attrs)
> +{
> +	u64 attr = attrs->attributes;
> +
> +	if (attr & ~KVM_HV_VTL_ATTRS)
> +		return false;
> +
> +	if (attr == KVM_MEMORY_ATTRIBUTE_NO_ACCESS)
> +		return true;
> +
> +	if (!(attr & KVM_MEMORY_ATTRIBUTE_READ))
> +		return false;
> +
> +	return true;
> +}
> +
> +static long kvm_hv_vtl_ioctl(struct kvm_device *dev, unsigned int ioctl,
> +			     unsigned long arg)
> +{
> +	switch (ioctl) {
> +	case KVM_SET_MEMORY_ATTRIBUTES: {
> +		struct kvm_hv_vtl_dev *vtl_dev = dev->private;
> +		struct kvm_memory_attributes attrs;
> +		int r;
> +
> +		if (copy_from_user(&attrs, (void __user *)arg, sizeof(attrs)))
> +			return -EFAULT;
> +
> +		r = -EINVAL;
> +		if (!kvm_hv_validate_vtl_mem_attributes(&attrs))
> +			return r;
> +
> +		r = kvm_ioctl_set_mem_attributes(dev->kvm, &vtl_dev->mem_attrs,
> +						 KVM_HV_VTL_ATTRS, &attrs);
> +		if (r)
> +			return r;
> +		break;
> +	}
> +	default:
> +		return -ENOTTY;
> +	}
> +
> +	return 0;
> +}
> +
>  static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type);
>  
>  static struct kvm_device_ops kvm_hv_vtl_ops = {
>  	.name = "kvm-hv-vtl",
>  	.create = kvm_hv_vtl_create,
>  	.release = kvm_hv_vtl_release,
> +	.ioctl = kvm_hv_vtl_ioctl,
>  	.get_attr = kvm_hv_vtl_get_attr,
>  };
>  
> @@ -3076,6 +3136,7 @@ static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type)
>  			vtl++;
>  
>  	vtl_dev->vtl = vtl;
> +	xa_init(&vtl_dev->mem_attrs);
>  	dev->private = vtl_dev;
>  
>  	return 0;

It makes sense, but hopefully we won't need it if we adopt the VM per VTL approach.

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 27/33] KVM: x86/mmu/hyper-v: Validate memory faults against per-VTL memprots
  2023-11-08 11:18 ` [RFC 27/33] KVM: x86/mmu/hyper-v: Validate memory faults against per-VTL memprots Nicolas Saenz Julienne
@ 2023-11-28  7:46   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:46 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:18 +0000, Nicolas Saenz Julienne wrote:
> Introduce a new step in __kvm_faultin_pfn() that'll validate the
> fault against the vCPU's VTL protections and generate a user space exit
> when invalid.
> 
> Note that kvm_hv_faultin_pfn() has to be run after resolving the fault
> against the memslots, since that operation steps over
> 'fault->map_writable'.
> 
> Non VSM users shouldn't see any behaviour change.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/kvm/hyperv.c  | 66 ++++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/hyperv.h  |  1 +
>  arch/x86/kvm/mmu/mmu.c |  9 +++++-
>  3 files changed, 75 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index bcace0258af1..eb6a4848e306 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -42,6 +42,8 @@
>  #include "irq.h"
>  #include "fpu.h"
>  
> +#include "mmu/mmu_internal.h"
> +
>  #define KVM_HV_MAX_SPARSE_VCPU_SET_BITS DIV_ROUND_UP(KVM_MAX_VCPUS, HV_VCPUS_PER_SPARSE_BANK)
>  
>  /*
> @@ -3032,6 +3034,55 @@ struct kvm_hv_vtl_dev {
>  	struct xarray mem_attrs;
>  };
>  
> +static struct xarray *kvm_hv_vsm_get_memprots(struct kvm_vcpu *vcpu);
> +
> +bool kvm_hv_vsm_access_valid(struct kvm_page_fault *fault, unsigned long attrs)
> +{
> +	if (attrs == KVM_MEMORY_ATTRIBUTE_NO_ACCESS)
> +		return false;
> +
> +	/* We should never get here without read permissions, force a fault. */
> +	if (WARN_ON_ONCE(!(attrs & KVM_MEMORY_ATTRIBUTE_READ)))
> +		return false;
> +
> +	if (fault->write && !(attrs & KVM_MEMORY_ATTRIBUTE_WRITE))
> +		return false;
> +
> +	if (fault->exec && !(attrs & KVM_MEMORY_ATTRIBUTE_EXECUTE))
> +		return false;
> +
> +	return true;
> +}
> +
> +static unsigned long kvm_hv_vsm_get_memory_attributes(struct kvm_vcpu *vcpu,
> +						      gfn_t gfn)
> +{
> +	struct xarray *prots = kvm_hv_vsm_get_memprots(vcpu);
> +
> +	if (!prots)
> +		return 0;
> +
> +	return xa_to_value(xa_load(prots, gfn));
> +}
> +
> +int kvm_hv_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> +{
> +	unsigned long attrs;
> +
> +	attrs = kvm_hv_vsm_get_memory_attributes(vcpu, fault->gfn);
> +	if (!attrs)
> +		return RET_PF_CONTINUE;
> +
> +	if (kvm_hv_vsm_access_valid(fault, attrs)) {
> +		fault->map_executable =
> +			!!(attrs & KVM_MEMORY_ATTRIBUTE_EXECUTE);
> +		fault->map_writable = !!(attrs & KVM_MEMORY_ATTRIBUTE_WRITE);
> +		return RET_PF_CONTINUE;
> +	}
> +
> +	return -EFAULT;
> +}
> +
>  static int kvm_hv_vtl_get_attr(struct kvm_device *dev,
>  			       struct kvm_device_attr *attr)
>  {
> @@ -3120,6 +3171,21 @@ static struct kvm_device_ops kvm_hv_vtl_ops = {
>  	.get_attr = kvm_hv_vtl_get_attr,
>  };
>  
> +static struct xarray *kvm_hv_vsm_get_memprots(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_hv_vtl_dev *vtl_dev;
> +	struct kvm_device *tmp;
> +
> +	list_for_each_entry(tmp, &vcpu->kvm->devices, vm_node)
> +		if (tmp->ops == &kvm_hv_vtl_ops) {
> +			vtl_dev = tmp->private;
> +			if (vtl_dev->vtl == kvm_hv_get_active_vtl(vcpu))
> +				return &vtl_dev->mem_attrs;
> +		}
> +
> +	return NULL;
> +}
> +
>  static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type)
>  {
>  	struct kvm_hv_vtl_dev *vtl_dev;
> diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
> index 3cc664e144d8..ae781b4d4669 100644
> --- a/arch/x86/kvm/hyperv.h
> +++ b/arch/x86/kvm/hyperv.h
> @@ -271,5 +271,6 @@ static inline void kvm_mmu_role_set_hv_bits(struct kvm_vcpu *vcpu,
>  
>  int kvm_hv_vtl_dev_register(void);
>  void kvm_hv_vtl_dev_unregister(void);
> +int kvm_hv_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>  
>  #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a76028aa8fb3..ba454c7277dc 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4374,7 +4374,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  					  fault->write, &fault->map_writable,
>  					  &fault->hva);
>  	if (!async)
> -		return RET_PF_CONTINUE; /* *pfn has correct page already */
> +		goto pf_continue; /* *pfn has correct page already */
>  
>  	if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
>  		trace_kvm_try_async_get_page(fault->addr, fault->gfn);
> @@ -4395,6 +4395,13 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
>  					  fault->write, &fault->map_writable,
>  					  &fault->hva);
> +pf_continue:
> +	if (kvm_hv_vsm_enabled(vcpu->kvm)) {
> +		if (kvm_hv_faultin_pfn(vcpu, fault)) {
> +			kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +			return -EFAULT;
> +		}
> +	}
>  	return RET_PF_CONTINUE;
>  }
>  

If we don't go with Sean's suggestion of having a VM per VTL, then this
feature should be VSM-agnostic IMHO, because it could be useful for other
security features that want to change the guest memory protections based on
some 'level', as VSM does.

Even SMM fits this description to some extent, although in theory I think that SMM can have "different" memory mapped to the same GPA.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 28/33] x86/hyper-v: Introduce memory intercept message structure
  2023-11-08 11:18 ` [RFC 28/33] x86/hyper-v: Introduce memory intercept message structure Nicolas Saenz Julienne
@ 2023-11-28  7:53   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  7:53 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:18 +0000, Nicolas Saenz Julienne wrote:
> Introduce struct hv_memory_intercept_message, which is used when issuing
> memory intercepts to a Hyper-V VSM guest.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 76 ++++++++++++++++++++++++++++++
>  1 file changed, 76 insertions(+)
> 
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index af594aa65307..d3d74fde6da1 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -799,6 +799,82 @@ struct hv_get_vp_from_apic_id_in {
>  	u32 apic_ids[];
>  } __packed;
>  
> +
> +/* struct hv_intercept_header::access_type_mask */
> +#define HV_INTERCEPT_ACCESS_MASK_NONE    0
> +#define HV_INTERCEPT_ACCESS_MASK_READ    1
> +#define HV_INTERCEPT_ACCESS_MASK_WRITE   2
> +#define HV_INTERCEPT_ACCESS_MASK_EXECUTE 4
> +
> +/* struct hv_intercept_exception::cache_type */
> +#define HV_X64_CACHE_TYPE_UNCACHED       0
> +#define HV_X64_CACHE_TYPE_WRITECOMBINING 1
> +#define HV_X64_CACHE_TYPE_WRITETHROUGH   4
> +#define HV_X64_CACHE_TYPE_WRITEPROTECTED 5
> +#define HV_X64_CACHE_TYPE_WRITEBACK      6
> +
> +/* Intercept message header */
> +struct hv_intercept_header {
> +	__u32 vp_index;
> +	__u8 instruction_length;
> +#define HV_INTERCEPT_ACCESS_READ    0
> +#define HV_INTERCEPT_ACCESS_WRITE   1
> +#define HV_INTERCEPT_ACCESS_EXECUTE 2
> +	__u8 access_type_mask;
> +	union {
> +		__u16 as_u16;
> +		struct {
> +			__u16 cpl:2;
> +			__u16 cr0_pe:1;
> +			__u16 cr0_am:1;
> +			__u16 efer_lma:1;
> +			__u16 debug_active:1;
> +			__u16 interruption_pending:1;
> +			__u16 reserved:9;
> +		};
> +	} exec_state;
> +	struct hv_x64_segment_register cs;
> +	__u64 rip;
> +	__u64 rflags;
> +} __packed;


Although the struct/field names in the TLFS spec are terrible for obvious reasons,
we should still try to stick to them as much as possible to make one's life
less miserable when trying to find them in the spec.

It is also a good idea to mention from which part of the spec these fields
come (hint: it's not from the VSM part).

Copying here the structs that I found in the spec:

typedef struct
{
	HV_VP_INDEX VpIndex;
	UINT8 InstructionLength;
	HV_INTERCEPT_ACCESS_TYPE_MASK InterceptAccessType;
	HV_X64_VP_EXECUTION_STATE ExecutionState;
	HV_X64_SEGMENT_REGISTER CsSegment;
	UINT64 Rip;
	UINT64 Rflags;
} HV_X64_INTERCEPT_MESSAGE_HEADER;


typedef struct
{
	UINT16 Cpl:2;
	UINT16 Cr0Pe:1;
	UINT16 Cr0Am:1;
	UINT16 EferLma:1;
	UINT16 DebugActive:1;
	UINT16 InterruptionPending:1;
	UINT16 Reserved:4;
	UINT16 Reserved:5;
} HV_X64_VP_EXECUTION_STATE;


For example 'access_type_mask' should be called intercept_access_type,
and so on.



> +
> +union hv_x64_memory_access_info {
> +	__u8 as_u8;
> +	struct {
> +		__u8 gva_valid:1;
> +		__u8 _reserved:7;
> +	};
> +};

typedef struct
{
	UINT8 GvaValid:1;
	UINT8 Reserved:7;

} HV_X64_MEMORY_ACCESS_INFO;

> +
> +struct hv_memory_intercept_message {
> +	struct hv_intercept_header header;
> +	__u32 cache_type;
> +	__u8 instruction_byte_count;

If I understand correctly, this is the number of valid bytes in the
following 'instruction_bytes' field?


> +	union hv_x64_memory_access_info memory_access_info;
> +	__u16 _reserved;
> +	__u64 gva;
> +	__u64 gpa;
> +	__u8 instruction_bytes[16];
> +	struct hv_x64_segment_register ds;
> +	struct hv_x64_segment_register ss;
> +	__u64 rax;
> +	__u64 rcx;
> +	__u64 rdx;
> +	__u64 rbx;
> +	__u64 rsp;
> +	__u64 rbp;
> +	__u64 rsi;
> +	__u64 rdi;
> +	__u64 r8;
> +	__u64 r9;
> +	__u64 r10;
> +	__u64 r11;
> +	__u64 r12;
> +	__u64 r13;
> +	__u64 r14;
> +	__u64 r15;
> +} __packed;

I can't seem to find this struct at all in the spec. If it was reverse-engineered,
then we must document everything that we know to help future readers of this code.
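e.g. a comment block above the struct, along the lines of (wording is only
an example):

/*
 * Not described in the public TLFS (as of v6.0b); the layout below was
 * reverse-engineered from observing Hyper-V's behaviour.
 */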


Best regards,
	Maxim Levitsky

> +
>  #include <asm-generic/hyperv-tlfs.h>
>  
>  #endif



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request
  2023-11-08 13:38     ` Nicolas Saenz Julienne
@ 2023-11-28  8:19       ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  8:19 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, Alexander Graf, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	dwmw, jgowans, corbert, kys, haiyangz, decui, x86, linux-doc

On Wed, 2023-11-08 at 13:38 +0000, Nicolas Saenz Julienne wrote:
> On Wed Nov 8, 2023 at 12:45 PM UTC, Alexander Graf wrote:
> > On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> > > Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT which allows
> > > injecting out-of-band Hyper-V secure intercepts. For now only memory
> > > access intercepts are supported. These are triggered when access a GPA
> > > protected by a higher VTL. The memory intercept metadata is filled based
> > > on the GPA provided through struct kvm_vcpu_hv_intercept_info, and
> > > injected into the guest through SynIC message.
> > > 
> > > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> > 
> > IMHO memory protection violations should result in a user space exit. 
> 
> It already does; it's not very explicit from the patch itself, since the
> functionality was introduced through the "KVM: guest_memfd() and
> per-page attributes" series [1].
> 
> See this snippet in patch #27:
> 
> +	if (kvm_hv_vsm_enabled(vcpu->kvm)) {
> +		if (kvm_hv_faultin_pfn(vcpu, fault)) {
> +			kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +			return -EFAULT;
> +		}
> +	}
> 
> Otherwise the doc in patch #33 also mentions this. :)
> 
> > User space can then validate what to do with the violation and if 
> > necessary inject an intercept.
> 
> I do agree that secure intercept injection should be moved into
> user-space, and happen as a reaction to a user-space memory fault exit.
> I was unable to do so yet, since the intercepts require a level of
> introspection that is not yet available to QEMU. For example, providing
> the length of the instruction that caused the fault. I'll work on
> exposing the necessary information to user-space and move the whole
> intercept concept there.

All the missing information should be included in the new userspace VM exit payload.
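For example (a sketch only; the new fields and their layout are made up and
would need the usual uAPI discussion):

		/* KVM_EXIT_MEMORY_FAULT */
		struct {
			__u64 flags;
			__u64 gpa;
			__u64 size;
			/* new: state needed to build a secure intercept */
			__u8 instruction_length;
			__u8 instruction_bytes[16];
		} memory_fault;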


Also, I would like to share my knowledge of SynIC and how to deal with it in
userspace, because you will have to send lots of SynIC messages from
userspace if we go with the suggested approach of doing it there.

- SynIC has one message slot per channel (SINT), so there is no way to queue
more than one message in the same channel. Usually only channel 0 is used
(but I haven't researched this much).
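For reference, this is the message page layout from
include/asm-generic/hyperv-tlfs.h, one slot per SINT:

struct hv_message_page {
	struct hv_message sint_message[HV_SYNIC_SINT_COUNT];
} __packed;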

- In-kernel stimer emulation queues SynIC messages, but it always does this
in the vCPU thread, by processing the 'KVM_REQ_HV_STIMER' request; when
userspace wants to queue something with SynIC it also does so on the vCPU
thread, which is how races are avoided.

kvm_hv_process_stimers -> stimer_expiration -> stimer_send_msg(stimer);

If the delivery fails (that is, if the SynIC slot already has a message
pending in it), then the timer remains pending and the next KVM_REQ_HV_STIMER
request will attempt to deliver it again.
 
After the guest processes a SynIC message, it erases it by overwriting its
message type with 0 (HVMSG_NONE), and then notifies the hypervisor about the
free slot, either by writing to a special MSR (HV_X64_MSR_EOM) or by EOI'ing
the APIC interrupt.
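Simplified, the guest-side release looks like what the Linux VMBus driver
does in vmbus_signal_eom() (a sketch; the real code uses cmpxchg for crash
safety):

	msg->header.message_type = HVMSG_NONE;
	mb(); /* the type write must land before checking msg_pending */
	if (msg->header.message_flags.msg_pending)
		wrmsrl(HV_X64_MSR_EOM, 0);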

According to my observations, Windows uses the second approach (EOI), which
thankfully works even on AVIC, because the SynIC interrupts happen to be
level-triggered, and AVIC does intercept EOIs of level-triggered interrupts.

Once intercepted, the EOI event triggers delivery of another stimer message
via the vCPU thread, by raising another KVM_REQ_HV_STIMER request on it:

kvm_hv_notify_acked_sint -> stimer_mark_pending -> kvm_make_request(KVM_REQ_HV_STIMER, vcpu);


Now, if userspace faces an already-full SynIC slot, it has to wait, and I
don't know if it can be notified of an EOI or whether it has to busy-wait
somehow.

Note that QEMU's VMBus/SynIC implementation was never tested in production
IMHO; it was implemented once and is currently only used by a few unit tests.

It might make sense to add userspace SynIC message queuing to the kernel, so
that userspace could queue as many messages as it wants, and let the kernel
copy the first message in the queue to the actual SynIC slot every time it
becomes free.
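A minimal sketch of what that could look like (completely untested; the
queue struct and flush helper are invented, synic_deliver_msg() is the
existing helper in arch/x86/kvm/hyperv.c):

struct hv_queued_msg {
	struct list_head node;
	u32 sint;
	struct hv_message msg;
};

struct hv_synic_msg_queue {
	spinlock_t lock;
	struct list_head msgs;	/* filled via a new vCPU ioctl */
};

/* called whenever the guest frees a slot (EOM write or EOI) */
static void synic_flush_msg_queue(struct kvm_vcpu_hv_synic *synic,
				  struct hv_synic_msg_queue *q)
{
	struct hv_queued_msg *qmsg;

	spin_lock(&q->lock);
	qmsg = list_first_entry_or_null(&q->msgs, struct hv_queued_msg, node);
	if (qmsg && !synic_deliver_msg(synic, qmsg->sint, &qmsg->msg, true)) {
		list_del(&qmsg->node);
		kfree(qmsg);
	}
	spin_unlock(&q->lock);
}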


A final note on SynIC: QEMU's SynIC code actually installs an overlay page
over the SynIC message page and writes to it when it queues a message, but
the in-kernel stimer code just writes to the GPA, regardless of whether there
is an overlay memslot or not.

Another benefit of a proper way of queuing SynIC messages from userspace is
that it might allow the kernel's stimer emulation to queue the SynIC message
directly from the timer interrupt routine, which would remove the roughly
1000 vmexits per second caused by KVM_REQ_HV_STIMER on vCPU0, even when
posted timer interrupts are used.

I can implement this if you think that this makes sense.

Those are my 0.2 cents.

Best regards,
	Maxim Levitsky


> 
> Nicolas
> 
> [1] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonzini@redhat.com/.
> 



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 33/33] Documentation: KVM: Introduce "Emulating Hyper-V VSM with KVM"
  2023-11-08 11:18 ` [RFC 33/33] Documentation: KVM: Introduce "Emulating Hyper-V VSM with KVM" Nicolas Saenz Julienne
@ 2023-11-28  8:19   ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-11-28  8:19 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Wed, 2023-11-08 at 11:18 +0000, Nicolas Saenz Julienne wrote:
> Introduce "Emulating Hyper-V VSM with KVM", which describes the KVM APIs
> made available to a VMM that wants to emulate Hyper-V's VSM.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> ---
>  .../virt/kvm/x86/emulating-hyperv-vsm.rst     | 136 ++++++++++++++++++
>  1 file changed, 136 insertions(+)
>  create mode 100644 Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst
> 
> diff --git a/Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst b/Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst
> new file mode 100644
> index 000000000000..8f76bf09c530
> --- /dev/null
> +++ b/Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst
> @@ -0,0 +1,136 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==============================
> +Emulating Hyper-V VSM with KVM
> +==============================
> +
> +Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature
> +that leverages the hypervisor to create secure execution environments
> +within a guest. VSM is documented as part of Microsoft's Hypervisor Top
> +Level Functional Specification[1].
> +
> +Emulating Hyper-V's Virtual Secure Mode with KVM requires coordination
> +between KVM and the VMM. Most of the VSM state and configuration is left
> +to be handled by user-space, but some has made its way into KVM. This
> +document describes the mechanisms through which a VMM can implement VSM
> +support.
> +
> +Virtual Trust Levels
> +--------------------
> +
> +The main concept VSM introduces is Virtual Trust Levels, or VTLs. Each
> +VTL is a CPU mode, with its own private CPU architectural state,
> +interrupt subsystem (limited to a local APIC), and memory access
> +permissions. VTLs are hierarchical, where VTL0 corresponds to normal
> +guest execution and VTL > 0 to privileged execution contexts. In
> +practice, when virtualising Windows on top of KVM, we only see VTL0 and
> +VTL1, although the spec allows going all the way to VTL15. VTLs are
> +orthogonal to ring levels, so each VTL is capable of running its own
> +operating system and user-space[2].
> +
> +  ┌──────────────────────────────┐ ┌──────────────────────────────┐
> +  │ Normal Mode (VTL0)           │ │ Secure Mode (VTL1)           │
> +  │ ┌──────────────────────────┐ │ │ ┌──────────────────────────┐ │
> +  │ │   User-mode Processes    │ │ │ │Secure User-mode Processes│ │
> +  │ └──────────────────────────┘ │ │ └──────────────────────────┘ │
> +  │ ┌──────────────────────────┐ │ │ ┌──────────────────────────┐ │
> +  │ │         Kernel           │ │ │ │      Secure Kernel       │ │
> +  │ └──────────────────────────┘ │ │ └──────────────────────────┘ │
> +  └──────────────────────────────┘ └──────────────────────────────┘
> +  ┌───────────────────────────────────────────────────────────────┐
> +  │                         Hypervisor/KVM                        │
> +  └───────────────────────────────────────────────────────────────┘
> +  ┌───────────────────────────────────────────────────────────────┐
> +  │                           Hardware                            │
> +  └───────────────────────────────────────────────────────────────┘
> +
> +VTLs break the core assumption that a vCPU has a single architectural
> +state, lAPIC state, SynIC state, etc. As such, each VTL is modeled as a
> +distinct KVM vCPU, with the restriction that only one is allowed to run
> +at any moment in time. Having multiple KVM vCPUs tracking a single guest
> +CPU complicates vCPU numbering. VMs that enable VSM are expected to use
> +CAP_APIC_ID_GROUPS to segregate vCPUs (and their lAPICs) into different
> +groups. For example, a 4 CPU VSM VM will setup the APIC ID groups feature
> +so only the first two bits of the APIC ID are exposed to the guest. The
> +remaining bits represent the vCPU's VTL. The 'sibling' vCPU to VTL0's
> +vCPU2 at VTL3 will have an APIC ID of 0xE. Using this approach a VMM and
> +KVM are capable of querying a vCPU's VTL, or finding the vCPU associated
> +to a specific VTL.
> +
> +KVM's lAPIC implementation is aware of groups, and takes note of the
> +source vCPU's group when delivering IPIs. As such, it shouldn't be
> +possible to target a different VTL through the APIC. Interrupts are
> +delivered to the vCPU's lAPIC subsystem regardless of the VTL's runstate;
> +this also includes timers. Ultimately, any interrupt incoming from an
> +outside source (IOAPIC/MSIs) is routed to VTL0.
> +
> +Moving Between VTLs
> +-------------------
> +
> +All VSM configuration and VTL handling hypercalls are passed through to
> +user-space. Notably the two primitives that allow switching between VTLs.
> +All shared state synchronization and KVM vCPU scheduling is left to the
> +VMM to manage. For example, upon receiving a VTL call, the VMM stops the
> +vCPU that issued the hypercall, and schedules the vCPU corresponding to
> +the next privileged VTL. When that privileged vCPU is done executing, it
> +issues a VTL return hypercall, so the opposite operation happens. All
> +this is transparent to KVM, which limits itself to running vCPUs.
> +
> +An interrupt directed at a privileged VTL always has precedence over the
> +execution of lower VTLs. To honor this, the VMM can monitor events
> +targeted at privileged vCPUs with poll(), and should trigger an
> +asynchronous VTL switch whenever events become available. Additionally,
> +the target VTL's vCPU VP assist overlay page is used to notify the target
> +VTL with the reason for the switch. The VMM can keep track of the VP
> +assist page by installing an MSR filter for HV_X64_MSR_VP_ASSIST_PAGE.
> +
> +Hyper-V VP registers
> +--------------------
> +
> +VP register hypercalls are passed through to user-space. All requests can
> +be fulfilled either by using already existing KVM state ioctls, or are
> +related to VSM's configuration, which is already handled by the VMM. Note
> +that HV_REGISTER_VSM_CODE_PAGE_OFFSETS is the only VSM specific VP
> +register the kernel controls, as such it is made available through the
> +KVM_HV_GET_VSM_STATE ioctl.
> +
> +Per-VTL Memory Protections
> +--------------------------
> +
> +A privileged VTL can change the memory access restrictions of lower VTLs.
> +It does so to hide secrets from them, or to control what they are allowed
> +to execute. The list of memory protections allowed is[3]:
> + - No access
> + - Read-only, no execute
> + - Read-only, execute
> + - Read/write, no execute
> + - Read/write, execute
> +
> +VTL memory protection hypercalls are passed through to user-space, but
> +KVM provides an interface that allows changing memory protections on a
> +per-VTL basis. This is made possible by the KVM VTL device. VMMs can
> +create one per VTL and it exposes a ioctl, KVM_SET_MEMORY_ATTRIBUTES,
> +that controls the memory protections applied to that VTL. The KVM TDP MMU
> +is VTL aware and page faults are resolved taking into account the
> +corresponding VTL device's memory attributes.
> +
> +When a memory access violates VTL memory protections, KVM issues a secure
> +memory intercept, which is passed as a SynIC message into the next
> +privileged VTL. This happens transparently for the VMM. Additionally, KVM
> +exits with a user-space memory fault. This allows the VMM to stop the
> +vCPU while the secure intercept is handled by the privileged VTL. In the
> +good case, the instruction that triggered the fault is emulated and
> +control is returned to the lower VTL; in the bad case, Windows crashes
> +gracefully.
> +
> +Hyper-V's TLFS also states that DMA should follow VTL0's memory access
> +restrictions. This is out of scope for this document, as IOMMU mappings
> +are not handled by KVM.
> +
> +[1] https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
> +
> +[2] Conceptually this design is similar to arm's TrustZone: The
> +hypervisor plays the role of EL3. Windows (VTL0) runs in Non-Secure
> +(EL0/EL1) and the secure kernel (VTL1) in Secure World (EL1s/EL0s).
> +
> +[3] TLFS 15.9.3

As Sean said, we should also document on how VTL support is implemented in the KVM (e.g things that matter to KVM developers),
but the above information is also very useful and should be kept.


Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 16/33] KVM: x86/mmu: Expose R/W/X flags during memory fault exits
  2023-11-28  7:36   ` Maxim Levitsky
@ 2023-11-28 16:31     ` Sean Christopherson
  0 siblings, 0 replies; 108+ messages in thread
From: Sean Christopherson @ 2023-11-28 16:31 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Nicolas Saenz Julienne, kvm, linux-kernel, linux-hyperv,
	pbonzini, vkuznets, anelkz, graf, dwmw, jgowans, corbert, kys,
	haiyangz, decui, x86, linux-doc

On Tue, Nov 28, 2023, Maxim Levitsky wrote:
> On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> > Include the fault's read, write and execute status when exiting to
> > user-space.
> > 
> > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c   | 4 ++--
> >  include/linux/kvm_host.h | 9 +++++++--
> >  include/uapi/linux/kvm.h | 6 ++++++
> >  3 files changed, 15 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 4e02d506cc25..feca077c0210 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4300,8 +4300,8 @@ static inline u8 kvm_max_level_for_order(int order)
> >  static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> >  					      struct kvm_page_fault *fault)
> >  {
> > -	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> > -				      PAGE_SIZE, fault->write, fault->exec,
> > +	kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT, PAGE_SIZE,
> > +				      fault->write, fault->exec, fault->user,
> >  				      fault->is_private);
> >  }
> >  
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 71e1e8cf8936..631fd532c97a 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2367,14 +2367,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> >  static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> >  						 gpa_t gpa, gpa_t size,
> >  						 bool is_write, bool is_exec,
> > -						 bool is_private)
> > +						 bool is_read, bool is_private)
> 
> It almost feels like there is a need for a struct to hold all of those parameters.

The most obvious solution would be to make "struct kvm_page_fault" common, e.g.
ARM's user_mem_abort() fills RWX booleans just like x86 fills kvm_page_fault.
But I think it's best to wait to do something like that until after Anish's series
lands[*].  That way the conversion can be more of a pure refactoring.

[*] https://lore.kernel.org/all/20231109210325.3806151-1-amoorthy@google.com


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-11-28  7:08   ` Maxim Levitsky
@ 2023-11-28 16:33     ` Sean Christopherson
  2023-12-01 16:19     ` Nicolas Saenz Julienne
  1 sibling, 0 replies; 108+ messages in thread
From: Sean Christopherson @ 2023-11-28 16:33 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Nicolas Saenz Julienne, kvm, linux-kernel, linux-hyperv,
	pbonzini, vkuznets, anelkz, graf, dwmw, jgowans, corbert, kys,
	haiyangz, decui, x86, linux-doc

On Tue, Nov 28, 2023, Maxim Levitsky wrote:
> On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > index 78d053042667..d4b1b53ea63d 100644
> > --- a/arch/x86/kvm/hyperv.c
> > +++ b/arch/x86/kvm/hyperv.c
> > @@ -259,7 +259,8 @@ static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr)
> >  static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
> >  {
> >  	struct kvm *kvm = vcpu->kvm;
> > -	u8 instructions[9];
> > +	struct kvm_hv *hv = to_kvm_hv(kvm);
> > +	u8 instructions[0x30];
> >  	int i = 0;
> >  	u64 addr;
> >  
> > @@ -285,6 +286,81 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
> >  	/* ret */
> >  	((unsigned char *)instructions)[i++] = 0xc3;
> >  
> > +	/* VTL call/return entries */
> > +	if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) {
> > +#ifdef CONFIG_X86_64
> > +		if (is_64_bit_mode(vcpu)) {
> > +			/*
> > +			 * VTL call 64-bit entry prologue:
> > +			 * 	mov %rcx, %rax
> > +			 * 	mov $0x11, %ecx
> > +			 * 	jmp 0:
> 
> This isn't really 'jmp 0' as I first wondered, but actually a backward
> jump of 32 bytes (if I did the calculation correctly).  This is very
> dangerous because the code that comes before it can change, and in fact I
> don't think that this offset is even correct now; on top of that, it
> depends on support for Xen hypercalls as well.
> 
> This can be fixed by calculating the offset at runtime; however, I am
> thinking:
> 
> 
> Since userspace will have to be aware of the offsets in this page, and since
> pretty much everything else is done in userspace, it might make sense to
> create the hypercall page in userspace.
> 
> In fact, the fact that KVM currently overwrites the guest page is a
> violation of the HV spec.
> 
> It's more correct, regardless of VTL, to do a userspace VM exit and let
> userspace put a memslot ("overlay") over the address, and put whatever
> userspace wants there, including the above code.
> 
> Then we won't need the new ioctl as well.
> 
> To support this, I think that we can add a userspace MSR filter on
> HV_X64_MSR_HYPERCALL, although I am not 100% sure whether a userspace MSR
> filter overrides the in-kernel MSR handling.

Yep, userspace MSR filters override in-kernel handling.
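
For reference, installing such a filter from userspace looks roughly like
this (sketch, error handling omitted; userspace also needs
KVM_CAP_X86_USER_SPACE_MSR enabled with KVM_MSR_EXIT_REASON_FILTER so that
denied accesses exit to userspace instead of injecting #GP):

	__u8 bitmap = 0;	/* bit clear == deny == exit to userspace */
	struct kvm_msr_filter filter = {
		.flags = KVM_MSR_FILTER_DEFAULT_ALLOW,
		.ranges[0] = {
			.flags = KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE,
			.nmsrs = 1,
			.base = HV_X64_MSR_HYPERCALL,
			.bitmap = &bitmap,
		},
	};

	ioctl(vm_fd, KVM_X86_SET_MSR_FILTER, &filter);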

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
  2023-11-28  6:56   ` Maxim Levitsky
@ 2023-12-01 15:25     ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-12-01 15:25 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc, Anel Orazgaliyeva

Hi Maxim,

On Tue Nov 28, 2023 at 6:56 AM UTC, Maxim Levitsky wrote:
> On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> > From: Anel Orazgaliyeva <anelkz@amazon.de>
> >
> > Introduce KVM_CAP_APIC_ID_GROUPS; this capability segments the VM's APIC
> > ids into two parts. The lower bits, the physical APIC id, represent the part
> > that's exposed to the guest. The higher bits, which are private to KVM,
> > groups APICs together. APICs in different groups are isolated from each
> > other, and IPIs can only be directed at APICs that share the same group
> > as its source. Furthermore, groups are only relevant to IPIs, anything
> > incoming from outside the local APIC complex: from the IOAPIC, MSIs, or
> > PV-IPIs is targeted at the default APIC group, group 0.
> >
> > When routing IPIs with physical destinations, KVM will OR the source's
> > vCPU APIC group with the ICR's destination ID and use that to resolve
> > the target lAPIC. The APIC physical map is also made group aware in
> > order to speed up this process. For the sake of simplicity, the logical
> > map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI
> > routing to the slower per-vCPU scan method.
> >
> > This capability serves as a building block to implement virtualisation
> > based security features like Hyper-V's Virtual Secure Mode (VSM). VSM
> > introduces a para-virtualised switch that allows for guest CPUs to jump
> > into a different execution context, this switches into a different CPU
> > state, lAPIC state, and memory protections. We model this in KVM by
> > using distinct kvm_vcpus for each context. Moreover, execution contexts
> > are hierarchical and its APICs are meant to remain functional even when
> > the context isn't 'scheduled in'. For example, we have to keep track of
> > timers' expirations, and interrupt execution of lesser priority contexts
> > when relevant. Hence the need to alias physical APIC ids, while keeping
> > the ability to target specific execution contexts.
>
>
> A few general remarks on this patch (assuming that we don't go with
> the approach of a VM per VTL, in which case this patch is not needed)
>
> -> This feature has to be done in the kernel because vCPUs sharing the
>    same VTL will have the same APIC ID.
>    (In addition to that, APIC state is private to a VTL so each VTL
>    can even change its apic id).
>
>    Because of this, KVM has to have at least some awareness of it.
>
> -> APICv/AVIC should be supported with VTL eventually:
>    This is thankfully possible by having separate physid/pid tables per VTL,
>    and will mostly just work but needs KVM awareness.
>
> -> I am somewhat against reserving bits in apic id, because that will limit
>    the number of apic id bits available to userspace. Currently this is not
>    a problem, but it might be in the future if for some reason userspace
>    wants an apic id with high bits set.
>
>    But still things change, and with this being part of KVM's ABI, it might backfire.
>    A better idea IMHO is just to have 'APIC namespaces', which like say PID namespaces,
>    such as each namespace is isolated IPI wise on its own, and let each vCPU belong to
>    a one namespace.
>
>    In fact, Intel's PRM has a brief mention of a 'hierarchical cluster' mode,
>    which roughly describes this situation, in which there are multiple
>    non-interconnected APIC buses, and communication between them needs a
>    'cluster manager device'.
>
>    However, I don't think that we need explicit vCPU pairing and VTL awareness
>    in the kernel; all of this, I think, can be done in userspace.
>
>    TL;DR: Let's have APIC namespaces. A vCPU can belong to a single namespace,
>    and all vCPUs in a namespace send IPIs to each other and know nothing about
>    vCPUs from other namespaces.
>
>    A vCPU sending an IPI to a different VTL thankfully can only do so using a
>    hypercall, and thus it can be handled in userspace.
>
>
> Overall though IMHO the approach of a VM per VTL is better unless some show stoppers show up.
> If we go with a VM per VTL, we gain APIC namespaces for free, together with AVIC support and
> such.


Thanks, for the thorough review! I took note of all your design comments
(here and in subsequent patches).

I agree that the way to go is the VM per VTL approach. I'll prepare a
PoC as soon as I'm back from the holidays and share my results.

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-11-28  7:08   ` Maxim Levitsky
  2023-11-28 16:33     ` Sean Christopherson
@ 2023-12-01 16:19     ` Nicolas Saenz Julienne
  2023-12-01 16:32       ` Sean Christopherson
  1 sibling, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-12-01 16:19 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, kys, haiyangz, decui, x86, linux-doc

On Tue Nov 28, 2023 at 7:08 AM UTC, Maxim Levitsky wrote:
> On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> > VTL call/return hypercalls have their own entry points in the hypercall
> > page because they don't follow normal hyper-v hypercall conventions.
> > Move the VTL call/return control input into ECX/RAX and set the
> > hypercall code into EAX/RCX before calling the hypercall instruction in
> > order to be able to use the Hyper-V hypercall entry function.
> >
> > Guests can read an emulated code page offsets register to know the
> > offsets into the hypercall page for the VTL call/return entries.
> >
> > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> >
> > ---
> >
> > My tree has the additional patch, we're still trying to understand under
> > what conditions Windows expects the offset to be fixed.
> >
> > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > index 54f7f36a89bf..9f2ea8c34447 100644
> > --- a/arch/x86/kvm/hyperv.c
> > +++ b/arch/x86/kvm/hyperv.c
> > @@ -294,6 +294,7 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
> >
> >         /* VTL call/return entries */
> >         if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) {
> > +               i = 22;
> >  #ifdef CONFIG_X86_64
> >                 if (is_64_bit_mode(vcpu)) {
> >                         /*
> > ---
> >  arch/x86/include/asm/kvm_host.h   |  2 +
> >  arch/x86/kvm/hyperv.c             | 78 ++++++++++++++++++++++++++++++-
> >  include/asm-generic/hyperv-tlfs.h | 11 +++++
> >  3 files changed, 90 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index a2f224f95404..00cd21b09f8c 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1105,6 +1105,8 @@ struct kvm_hv {
> >       u64 hv_tsc_emulation_status;
> >       u64 hv_invtsc_control;
> >
> > +     union hv_register_vsm_code_page_offsets vsm_code_page_offsets;
> > +
> >       /* How many vCPUs have VP index != vCPU index */
> >       atomic_t num_mismatched_vp_indexes;
> >
> > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > index 78d053042667..d4b1b53ea63d 100644
> > --- a/arch/x86/kvm/hyperv.c
> > +++ b/arch/x86/kvm/hyperv.c
> > @@ -259,7 +259,8 @@ static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr)
> >  static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
> >  {
> >       struct kvm *kvm = vcpu->kvm;
> > -     u8 instructions[9];
> > +     struct kvm_hv *hv = to_kvm_hv(kvm);
> > +     u8 instructions[0x30];
> >       int i = 0;
> >       u64 addr;
> >
> > @@ -285,6 +286,81 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
> >       /* ret */
> >       ((unsigned char *)instructions)[i++] = 0xc3;
> >
> > +     /* VTL call/return entries */
> > +     if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) {
> > +#ifdef CONFIG_X86_64
> > +             if (is_64_bit_mode(vcpu)) {
> > +                     /*
> > +                      * VTL call 64-bit entry prologue:
> > +                      *      mov %rcx, %rax
> > +                      *      mov $0x11, %ecx
> > +                      *      jmp 0:
>
> This isn't really 'jmp 0' as I first wondered, but actually a backward jump of 32 bytes (if I did the calculation correctly).
> This is very dangerous because the code that comes before it can change, and in fact I don't think that this
> offset is even correct now; on top of that, it depends on support for Xen hypercalls as well.

You're absolutely right. The offset is wrong as is, and the overall
approach might break in the future.

Another solution is to explicitly do the vmcall and avoid any jumps.
This seems to be what Hyper-V does:
https://hal.science/hal-03117362/document (Figure 8).
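
For illustration, roughly what I have in mind (a sketch only: it reuses
the instructions[]/i locals from patch_hypercall_page(), assumes 64-bit
Intel, and the vtl_call_offset field name is a stand-in):

	/* Self-contained VTL call entry, no backward jump. Record the
	 * offset so it can later be reported via
	 * HvRegisterVsmCodePageOffsets.
	 */
	hv->vsm_code_page_offsets.vtl_call_offset = i;
	instructions[i++] = 0x48;	/* mov %rcx, %rax */
	instructions[i++] = 0x89;
	instructions[i++] = 0xc8;
	instructions[i++] = 0xb9;	/* mov $0x11, %ecx (VTL call code) */
	instructions[i++] = 0x11;
	instructions[i++] = 0x00;
	instructions[i++] = 0x00;
	instructions[i++] = 0x00;
	instructions[i++] = 0x0f;	/* vmcall (vmmcall on AMD) */
	instructions[i++] = 0x01;
	instructions[i++] = 0xc1;
	instructions[i++] = 0xc3;	/* ret */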

> This can be fixed by calculating the offset at runtime; however, I am thinking:
>
>
> Since userspace will have to be aware of the offsets in this page, and since
> pretty much everything else is done in userspace, it might make sense to create
> the hypercall page in userspace.
>
> In fact, the fact that KVM currently overwrites the guest page is a violation of
> the HV spec.
>
> It's more correct, regardless of VTL, to do a userspace VM exit and let userspace put a memslot ("overlay")
> over the address, and put whatever userspace wants there, including the above code.

I agree we should be on the safe side and fully implement overlays. That
said, I suspect they are not actually necessary in practice (with or
without VSM support).

> Then we won't need the new ioctl as well.
>
> To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.

I thought about it at the time. It's not that simple though, we should
still let KVM set the hypercall bytecode, and other quirks like the Xen
one. Additionally, we have no way of knowing where they are going to be
located. We could do something like this, but it's not pretty:

  - Exit to user-space on HV_X64_MSR_HYPERCALL (it overrides the msr
    handling).
  - Setup the overlay.
  - Call HV_X64_MSR_HYPERCALL from user-space so KVM writes its share of
    the hypercall page.
  - Copy the VSM parts from user-space in an area we know to be safe.

We could maybe introduce an extension CAP that provides a safe offset
into the hypercall page for user-space to use?
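
For the record, a rough sketch of that dance from the VMM side (purely
illustrative; it assumes the userspace MSR exit actually takes precedence
over the in-kernel Hyper-V handling, which is the open question, and
map_hypercall_overlay()/patch_vtl_entries() are made-up helpers):

	if (run->exit_reason == KVM_EXIT_X86_WRMSR &&
	    run->msr.index == HV_X64_MSR_HYPERCALL) {
		uint64_t gpa = run->msr.data & ~0xfffull;
		struct {
			struct kvm_msrs hdr;
			struct kvm_msr_entry entry;
		} msrs = {
			.hdr.nmsrs = 1,
			.entry = { .index = HV_X64_MSR_HYPERCALL,
				   .data = run->msr.data },
		};

		/* Install the overlay memslot over the requested GPA. */
		map_hypercall_overlay(gpa);
		/* Replay the write so KVM writes its share of the page. */
		ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
		/* Copy the VTL call/return thunks into a known-safe area. */
		patch_vtl_entries(gpa);
		run->msr.error = 0;	/* complete the WRMSR */
	}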

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 06/33] KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs
  2023-11-28  7:14   ` Maxim Levitsky
@ 2023-12-01 16:31     ` Nicolas Saenz Julienne
  2023-12-05 15:02       ` Maxim Levitsky
  0 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-12-01 16:31 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Tue Nov 28, 2023 at 7:14 AM UTC, Maxim Levitsky wrote:
> On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> > HVCALL_SEND_IPI and HVCALL_SEND_IPI_EX allow targeting a specific VTL.
> > Honour the requests.
> >
> > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> > ---
> >  arch/x86/kvm/hyperv.c             | 24 +++++++++++++++++-------
> >  arch/x86/kvm/trace.h              | 20 ++++++++++++--------
> >  include/asm-generic/hyperv-tlfs.h |  6 ++++--
> >  3 files changed, 33 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > index d4b1b53ea63d..2cf430f6ddd8 100644
> > --- a/arch/x86/kvm/hyperv.c
> > +++ b/arch/x86/kvm/hyperv.c
> > @@ -2230,7 +2230,7 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
> >  }
> >
> >  static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
> > -                                 u64 *sparse_banks, u64 valid_bank_mask)
> > +                                 u64 *sparse_banks, u64 valid_bank_mask, int vtl)
> >  {
> >       struct kvm_lapic_irq irq = {
> >               .delivery_mode = APIC_DM_FIXED,
> > @@ -2245,6 +2245,9 @@ static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
> >                                           valid_bank_mask, sparse_banks))
> >                       continue;
> >
> > +             if (kvm_hv_get_active_vtl(vcpu) != vtl)
> > +                     continue;
>
> Do I understand correctly that this is a temporary limitation?
> In other words, can a vCPU running in VTL1 send an IPI to a vCPU running in VTL0,
> forcing the target vCPU to do an async switch to VTL1?
> I think that this is possible.


The diff is missing some context. See this simplified implementation
(when all_cpus is set in the parent function):

static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector, int vtl)
{
	[...]
	kvm_for_each_vcpu(i, vcpu, kvm) {
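		/* All VTLs' vCPUs live in this VM; only deliver to the target VTL */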
		if (kvm_hv_get_active_vtl(vcpu) != vtl)
			continue;

		kvm_apic_set_irq(vcpu, &irq, NULL);
	}
}

With the one vCPU per VTL approach, kvm_for_each_vcpu() will iterate
over *all* vCPUs regardless of their VTL. The IPI is targeted at a
specific VTL, hence the need to filter.

VTL1 -> VTL0 IPIs are supported and happen (although they are extremely
rare).

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-12-01 16:19     ` Nicolas Saenz Julienne
@ 2023-12-01 16:32       ` Sean Christopherson
  2023-12-01 16:50         ` Nicolas Saenz Julienne
  0 siblings, 1 reply; 108+ messages in thread
From: Sean Christopherson @ 2023-12-01 16:32 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Maxim Levitsky, kvm, linux-kernel, linux-hyperv, pbonzini,
	vkuznets, anelkz, graf, dwmw, jgowans, kys, haiyangz, decui, x86,
	linux-doc

On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> 
> I thought about it at the time. It's not that simple though, we should
> still let KVM set the hypercall bytecode, and other quirks like the Xen
> one.

Yeah, that Xen quirk is quite the killer.

Can you provide pseudo-assembly for what the final page is supposed to look like?
I'm struggling mightily to understand what this is actually trying to do.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-12-01 16:32       ` Sean Christopherson
@ 2023-12-01 16:50         ` Nicolas Saenz Julienne
  2023-12-01 17:47           ` Sean Christopherson
  0 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-12-01 16:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, kvm, linux-kernel, linux-hyperv, pbonzini,
	vkuznets, anelkz, graf, dwmw, jgowans, kys, haiyangz, decui, x86,
	linux-doc

On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> >
> > I thought about it at the time. It's not that simple though, we should
> > still let KVM set the hypercall bytecode, and other quirks like the Xen
> > one.
>
> Yeah, that Xen quirk is quite the killer.
>
> Can you provide pseudo-assembly for what the final page is supposed to look like?
> I'm struggling mightily to understand what this is actually trying to do.

I'll make it as simple as possible (disregard 32-bit support and that Xen
exists):

vmcall	     <-  Offset 0, regular Hyper-V hypercalls enter here
ret
mov rax,rcx  <-  VTL call hypercall enters here
mov rcx,0x11
vmcall
ret
mov rax,rcx  <-  VTL return hypercall enters here
mov rcx,0x12
vmcall
ret

rcx needs to be saved as it contains a "VTL call control input to the
hypervisor" (TLFS 15.6.1). I don't remember seeing it being used in
practice. Then, KVM expects the hypercall code in rcx, hence the
0x11/0x12 mov.

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-12-01 16:50         ` Nicolas Saenz Julienne
@ 2023-12-01 17:47           ` Sean Christopherson
  2023-12-01 18:15             ` Nicolas Saenz Julienne
  0 siblings, 1 reply; 108+ messages in thread
From: Sean Christopherson @ 2023-12-01 17:47 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Maxim Levitsky, kvm, linux-kernel, linux-hyperv, pbonzini,
	vkuznets, anelkz, graf, dwmw, jgowans, kys, haiyangz, decui, x86,
	linux-doc

On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > > > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> > >
> > > I thought about it at the time. It's not that simple though, we should
> > > still let KVM set the hypercall bytecode, and other quirks like the Xen
> > > one.
> >
> > Yeah, that Xen quirk is quite the killer.
> >
> > Can you provide pseudo-assembly for what the final page is supposed to look like?
> > I'm struggling mightily to understand what this is actually trying to do.
> 
> I'll make it as simple as possible (disregard 32-bit support and that Xen
> exists):
> 
> vmcall	     <-  Offset 0, regular Hyper-V hypercalls enter here
> ret
> mov rax,rcx  <-  VTL call hypercall enters here

I'm missing who/what defines "here" though.  What generates the CALL that points
at this exact offset?  If the exact offset is dictated in the TLFS, then aren't
we screwed with the whole Xen quirk, which inserts 5 bytes before that first VMCALL?

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-12-01 17:47           ` Sean Christopherson
@ 2023-12-01 18:15             ` Nicolas Saenz Julienne
  2023-12-05 19:21               ` Sean Christopherson
  0 siblings, 1 reply; 108+ messages in thread
From: Nicolas Saenz Julienne @ 2023-12-01 18:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, kvm, linux-kernel, linux-hyperv, pbonzini,
	vkuznets, anelkz, graf, dwmw, jgowans, kys, haiyangz, decui, x86,
	linux-doc

On Fri Dec 1, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
>
> On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > > > > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> > > >
> > > > I thought about it at the time. It's not that simple though, we should
> > > > still let KVM set the hypercall bytecode, and other quirks like the Xen
> > > > one.
> > >
> > > Yeah, that Xen quirk is quite the killer.
> > >
> > > Can you provide pseudo-assembly for what the final page is supposed to look like?
> > > I'm struggling mightily to understand what this is actually trying to do.
> >
> > I'll make it as simple as possible (disregard 32-bit support and that Xen
> > exists):
> >
> > vmcall             <-  Offset 0, regular Hyper-V hypercalls enter here
> > ret
> > mov rax,rcx  <-  VTL call hypercall enters here
>
> I'm missing who/what defines "here" though.  What generates the CALL that points
> at this exact offset?  If the exact offset is dictated in the TLFS, then aren't
> we screwed with the whole Xen quirk, which inserts 5 bytes before that first VMCALL?

Yes, sorry, I should've included some more context.

Here's a rundown (from memory) of how the first VTL call happens:
 - CPU0 starts running at VTL0.
 - Hyper-V enables VTL1 on the partition.
 - Hyper-V enables VTL1 on CPU0, but doesn't yet switch to it. It passes
   the initial VTL1 CPU state alongside the enablement hypercall
   arguments.
 - Hyper-V sets the Hypercall page overlay address through
   HV_X64_MSR_HYPERCALL. KVM fills it.
 - Hyper-V gets the VTL-call and VTL-return offsets into the hypercall
   page using the VP Register HvRegisterVsmCodePageOffsets (VP register
   handling is in user-space).
 - Hyper-V performs the first VTL-call, and has all it needs to move
   between VTL0/1.

Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 06/33] KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs
  2023-12-01 16:31     ` Nicolas Saenz Julienne
@ 2023-12-05 15:02       ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-12-05 15:02 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, kvm
  Cc: linux-kernel, linux-hyperv, pbonzini, seanjc, vkuznets, anelkz,
	graf, dwmw, jgowans, corbert, kys, haiyangz, decui, x86,
	linux-doc

On Fri, 2023-12-01 at 16:31 +0000, Nicolas Saenz Julienne wrote:
> On Tue Nov 28, 2023 at 7:14 AM UTC, Maxim Levitsky wrote:
> > On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> > > HVCALL_SEND_IPI and HVCALL_SEND_IPI_EX allow targeting a specific VTL.
> > > Honour the requests.
> > > 
> > > Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
> > > ---
> > >  arch/x86/kvm/hyperv.c             | 24 +++++++++++++++++-------
> > >  arch/x86/kvm/trace.h              | 20 ++++++++++++--------
> > >  include/asm-generic/hyperv-tlfs.h |  6 ++++--
> > >  3 files changed, 33 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > > index d4b1b53ea63d..2cf430f6ddd8 100644
> > > --- a/arch/x86/kvm/hyperv.c
> > > +++ b/arch/x86/kvm/hyperv.c
> > > @@ -2230,7 +2230,7 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
> > >  }
> > > 
> > >  static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
> > > -                                 u64 *sparse_banks, u64 valid_bank_mask)
> > > +                                 u64 *sparse_banks, u64 valid_bank_mask, int vtl)
> > >  {
> > >       struct kvm_lapic_irq irq = {
> > >               .delivery_mode = APIC_DM_FIXED,
> > > @@ -2245,6 +2245,9 @@ static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
> > >                                           valid_bank_mask, sparse_banks))
> > >                       continue;
> > > 
> > > +             if (kvm_hv_get_active_vtl(vcpu) != vtl)
> > > +                     continue;
> > 
> > Do I understand correctly that this is a temporary limitation?
> > In other words, can a vCPU running in VTL1 send an IPI to a vCPU running in VTL0,
> > forcing the target vCPU to do an async switch to VTL1?
> > I think that this is possible.
> 
> The diff is missing some context. See this simplified implementation
> (when all_cpus is set in the parent function):
> 
> static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector, int vtl)
> {
> 	[...]
> 	kvm_for_each_vcpu(i, vcpu, kvm) {
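> 		/* All VTLs' vCPUs live in this VM; only deliver to the target VTL */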
> 		if (kvm_hv_get_active_vtl(vcpu) != vtl)
> 			continue;
> 
> 		kvm_apic_set_irq(vcpu, &irq, NULL);
> 	}
> }
> 
> With the one vCPU per VTL approach, kvm_for_each_vcpu() will iterate
> over *all* vCPUs regardless of their VTL. The IPI is targeted at a
> specific VTL, hence the need to filter.
> 
> VTL1 -> VTL0 IPIs are supported and happen (although they are extremely
> rare).

Makes sense now, thanks!

Best regards,
	Maxim Levitsky

> 
> Nicolas
> 



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-12-01 18:15             ` Nicolas Saenz Julienne
@ 2023-12-05 19:21               ` Sean Christopherson
  2023-12-05 20:04                 ` Maxim Levitsky
  0 siblings, 1 reply; 108+ messages in thread
From: Sean Christopherson @ 2023-12-05 19:21 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Maxim Levitsky, kvm, linux-kernel, linux-hyperv, pbonzini,
	vkuznets, anelkz, graf, dwmw, jgowans, kys, haiyangz, decui, x86,
	linux-doc

On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> On Fri Dec 1, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > > > > > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> > > > >
> > > > > I thought about it at the time. It's not that simple though, we should
> > > > > still let KVM set the hypercall bytecode, and other quirks like the Xen
> > > > > one.
> > > >
> > > > Yeah, that Xen quirk is quite the killer.
> > > >
> > > > Can you provide pseudo-assembly for what the final page is supposed to look like?
> > > > I'm struggling mightily to understand what this is actually trying to do.
> > >
> > > I'll make it as simple as possible (disregard 32-bit support and that Xen
> > > exists):
> > >
> > > vmcall             <-  Offset 0, regular Hyper-V hypercalls enter here
> > > ret
> > > mov rax,rcx  <-  VTL call hypercall enters here
> >
> > I'm missing who/what defines "here" though.  What generates the CALL that points
> > at this exact offset?  If the exact offset is dictated in the TLFS, then aren't
> > we screwed with the whole Xen quirk, which inserts 5 bytes before that first VMCALL?
> 
> Yes, sorry, I should've included some more context.
> 
> Here's a rundown (from memory) of how the first VTL call happens:
>  - CPU0 starts running at VTL0.
>  - Hyper-V enables VTL1 on the partition.
>  - Hyper-V enables VTL1 on CPU0, but doesn't yet switch to it. It passes
>    the initial VTL1 CPU state alongside the enablement hypercall
>    arguments.
>  - Hyper-V sets the Hypercall page overlay address through
>    HV_X64_MSR_HYPERCALL. KVM fills it.
>  - Hyper-V gets the VTL-call and VTL-return offsets into the hypercall
>    page using the VP Register HvRegisterVsmCodePageOffsets (VP register
>    handling is in user-space).

Ah, so the guest sets the offsets by "writing" HvRegisterVsmCodePageOffsets via
a HvSetVpRegisters() hypercall.

I don't see a sane way to handle this in KVM if userspace handles HvSetVpRegisters().
E.g. if the guest requests offsets that don't leave enough room for KVM to shove
in its data, then presumably userspace needs to reject HvSetVpRegisters().  But
that requires userspace to know exactly how many bytes KVM is going to write at
each offset.

My vote is to have userspace do all the patching.  IIUC, all of this is going to
be mutually exclusive with kvm_xen_hypercall_enabled(), i.e. userspace doesn't
need to worry about setting RAX[31].  At that point, it's just VMCALL versus
VMMCALL, and userspace is more than capable of identifying whether it's running
on Intel or AMD.
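
Something like this (userspace pseudo-code, obviously not from the series):

#include <cpuid.h>
#include <stdint.h>
#include <string.h>

/* Pick the hypercall instruction bytes from the CPUID vendor string. */
static const uint8_t *hypercall_insn(void)
{
	static const uint8_t vmcall[]  = { 0x0f, 0x01, 0xc1 };	/* Intel */
	static const uint8_t vmmcall[] = { 0x0f, 0x01, 0xd9 };	/* AMD */
	unsigned int eax, ebx, ecx, edx;
	char vendor[13] = "";

	__get_cpuid(0, &eax, &ebx, &ecx, &edx);
	memcpy(vendor, &ebx, 4);
	memcpy(vendor + 4, &edx, 4);
	memcpy(vendor + 8, &ecx, 4);

	return strcmp(vendor, "AuthenticAMD") ? vmcall : vmmcall;
}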

>  - Hyper-V performs the first VTL-call, and has all it needs to move
>    between VTL0/1.
> 
> Nicolas

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-12-05 19:21               ` Sean Christopherson
@ 2023-12-05 20:04                 ` Maxim Levitsky
  2023-12-06  0:07                   ` Sean Christopherson
  0 siblings, 1 reply; 108+ messages in thread
From: Maxim Levitsky @ 2023-12-05 20:04 UTC (permalink / raw)
  To: Sean Christopherson, Nicolas Saenz Julienne
  Cc: kvm, linux-kernel, linux-hyperv, pbonzini, vkuznets, anelkz,
	graf, dwmw, jgowans, kys, haiyangz, decui, x86, linux-doc

On Tue, 2023-12-05 at 11:21 -0800, Sean Christopherson wrote:
> On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > On Fri Dec 1, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > > > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > > > > > > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> > > > > > 
> > > > > > I thought about it at the time. It's not that simple though, we should
> > > > > > still let KVM set the hypercall bytecode, and other quirks like the Xen
> > > > > > one.
> > > > > 
> > > > > Yeah, that Xen quirk is quite the killer.
> > > > > 
> > > > > Can you provide pseudo-assembly for what the final page is supposed to look like?
> > > > > I'm struggling mightily to understand what this is actually trying to do.
> > > > 
> > > > I'll make it as simple as possible (disregard 32-bit support and that Xen
> > > > exists):
> > > > 
> > > > vmcall             <-  Offset 0, regular Hyper-V hypercalls enter here
> > > > ret
> > > > mov rax,rcx  <-  VTL call hypercall enters here
> > > 
> > > I'm missing who/what defines "here" though.  What generates the CALL that points
> > > at this exact offset?  If the exact offset is dictated in the TLFS, then aren't
> > > we screwed with the whole Xen quirk, which inserts 5 bytes before that first VMCALL?
> > 
> > Yes, sorry, I should've included some more context.
> > 
> > Here's a rundown (from memory) of how the first VTL call happens:
> >  - CPU0 starts running at VTL0.
> >  - Hyper-V enables VTL1 on the partition.
> >  - Hyper-V enables VTL1 on CPU0, but doesn't yet switch to it. It passes
> >    the initial VTL1 CPU state alongside the enablement hypercall
> >    arguments.
> >  - Hyper-V sets the Hypercall page overlay address through
> >    HV_X64_MSR_HYPERCALL. KVM fills it.
> >  - Hyper-V gets the VTL-call and VTL-return offsets into the hypercall
> >    page using the VP Register HvRegisterVsmCodePageOffsets (VP register
> >    handling is in user-space).
> 
> Ah, so the guest sets the offsets by "writing" HvRegisterVsmCodePageOffsets via
> a HvSetVpRegisters() hypercall.

No, you didn't understand this correctly. 

The guest writes HV_X64_MSR_HYPERCALL, and in response Hyper-V fills the hypercall page,
including the VSM thunks.

Then the guest can _read_ the offsets Hyper-V chose there by issuing another hypercall.

In the current implementation,
the offsets that the kernel chooses are first exposed to userspace via a new ioctl, and then userspace
exposes these offsets to the guest via that 'another hypercall'
(reading a pseudo partition register, 'HvRegisterVsmCodePageOffsets').

I personally don't know for sure anymore if the userspace or kernel-based hypercall page is better
here; it's ugly regardless :(


Best regards,
	Maxim Levitsky

> 
> I don't see a sane way to handle this in KVM if userspace handles HvSetVpRegisters().
> E.g. if the guest requests offsets that don't leave enough room for KVM to shove
> in its data, then presumably userspace needs to reject HvSetVpRegisters().  But
> that requires userspace to know exactly how many bytes KVM is going to write at
> each offset.
> 
> My vote is to have userspace do all the patching.  IIUC, all of this is going to
> be mutually exclusive with kvm_xen_hypercall_enabled(), i.e. userspace doesn't
> need to worry about setting RAX[31].  At that point, it's just VMCALL versus
> VMMCALL, and userspace is more than capable of identifying whether it's running
> on Intel or AMD.
> 
> >  - Hyper-V performs the first VTL-call, and has all it needs to move
> >    between VTL0/1.
> > 
> > Nicolas



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-12-05 20:04                 ` Maxim Levitsky
@ 2023-12-06  0:07                   ` Sean Christopherson
  2023-12-06 16:19                     ` Maxim Levitsky
  0 siblings, 1 reply; 108+ messages in thread
From: Sean Christopherson @ 2023-12-06  0:07 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Nicolas Saenz Julienne, kvm, linux-kernel, linux-hyperv,
	pbonzini, vkuznets, anelkz, graf, dwmw, jgowans, kys, haiyangz,
	decui, x86, linux-doc

On Tue, Dec 05, 2023, Maxim Levitsky wrote:
> On Tue, 2023-12-05 at 11:21 -0800, Sean Christopherson wrote:
> > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > On Fri Dec 1, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > > > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > > > > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > > > > > > > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> > > > > > > 
> > > > > > > I thought about it at the time. It's not that simple though, we should
> > > > > > > still let KVM set the hypercall bytecode, and other quirks like the Xen
> > > > > > > one.
> > > > > > 
> > > > > > Yeah, that Xen quirk is quite the killer.
> > > > > > 
> > > > > > Can you provide pseudo-assembly for what the final page is supposed to look like?
> > > > > > I'm struggling mightily to understand what this is actually trying to do.
> > > > > 
> > > > > I'll make it as simple as possible (diregard 32bit support and that xen
> > > > > exists):
> > > > > 
> > > > > vmcall             <-  Offset 0, regular Hyper-V hypercalls enter here
> > > > > ret
> > > > > mov rax,rcx  <-  VTL call hypercall enters here
> > > > 
> > > > I'm missing who/what defines "here" though.  What generates the CALL that points
> > > > at this exact offset?  If the exact offset is dictated in the TLFS, then aren't
> > > > we screwed with the whole Xen quirk, which inserts 5 bytes before that first VMCALL?
> > > 
> > > Yes, sorry, I should've included some more context.
> > > 
> > > Here's a rundown (from memory) of how the first VTL call happens:
> > >  - CPU0 starts running at VTL0.
> > >  - Hyper-V enables VTL1 on the partition.
> > >  - Hyper-V enables VTL1 on CPU0, but doesn't yet switch to it. It passes
> > >    the initial VTL1 CPU state alongside the enablement hypercall
> > >    arguments.
> > >  - Hyper-V sets the Hypercall page overlay address through
> > >    HV_X64_MSR_HYPERCALL. KVM fills it.
> > >  - Hyper-V gets the VTL-call and VTL-return offsets into the hypercall
> > >    page using the VP Register HvRegisterVsmCodePageOffsets (VP register
> > >    handling is in user-space).
> > 
> > Ah, so the guest sets the offsets by "writing" HvRegisterVsmCodePageOffsets via
> > a HvSetVpRegisters() hypercall.
> 
> No, you didn't understand this correctly. 
> 
> The guest writes HV_X64_MSR_HYPERCALL, and in response Hyper-V fills

When people say "Hyper-V", do y'all mean "root partition"?  If so, can we just
say "root partition"?  Part of my confusion is that I don't instinctively know
whether things like "Hyper-V enables VTL1 on the partition" are talking about the
root partition (or I guess parent partition?) or the hypervisor.  Functionally it
probably doesn't matter, it's just hard to reconcile things with the TLFS, which
is written largely to describe the hypervisor's behavior.

> the hypercall page, including the VSM thunks.
>
> Then the guest can _read_ the offsets Hyper-V chose there by issuing another hypercall.

Hrm, now I'm really confused.  Ah, the TLFS contradicts itself.  The blurb for
AccessVpRegisters says:

  The partition can invoke the hypercalls HvSetVpRegisters and HvGetVpRegisters.

And HvSetVpRegisters confirms that requirement:

  The caller must either be the parent of the partition specified by PartitionId,
  or the partition specified must be “self” and the partition must have the
  AccessVpRegisters privilege

But it's absent from HvGetVpRegisters:

  The caller must be the parent of the partition specified by PartitionId or the
  partition specifying its own partition ID.

> In the current implementation, the offsets that the kernel chooses are first
> exposed to userspace via a new ioctl, and then userspace exposes these
> offsets to the guest via that 'another hypercall' (reading a pseudo partition
> register, 'HvRegisterVsmCodePageOffsets').
> 
> I personally don't know for sure anymore if the userspace or kernel-based
> hypercall page is better here; it's ugly regardless :(

Hrm.  Requiring userspace to intercept the WRMSR will be a mess because then KVM
will have zero knowledge of the hypercall page, e.g. userspace would be forced to
intercept HV_X64_MSR_GUEST_OS_ID as well.  That's not the end of the world, but
it's not exactly ideal either.

What if we exit to userspace with a new kvm_hyperv_exit reason that requires
completion?  I.e. punt to userspace if VSM is enabled, but still record the data
in KVM?  Ugh, but even that's a mess because kvm_hv_set_msr_pw() is deep in the
WRMSR emulation call stack and can't easily signal that an exit to userspace is
needed.  Blech.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
  2023-12-06  0:07                   ` Sean Christopherson
@ 2023-12-06 16:19                     ` Maxim Levitsky
  0 siblings, 0 replies; 108+ messages in thread
From: Maxim Levitsky @ 2023-12-06 16:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Nicolas Saenz Julienne, kvm, linux-kernel, linux-hyperv,
	pbonzini, vkuznets, anelkz, graf, dwmw, jgowans, kys, haiyangz,
	decui, x86, linux-doc

On Tue, 2023-12-05 at 16:07 -0800, Sean Christopherson wrote:
> On Tue, Dec 05, 2023, Maxim Levitsky wrote:
> > On Tue, 2023-12-05 at 11:21 -0800, Sean Christopherson wrote:
> > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > On Fri Dec 1, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> > > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > > On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > > > > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > > > > > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > > > > > > > > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> > > > > > > > 
> > > > > > > > I thought about it at the time. It's not that simple though, we should
> > > > > > > > still let KVM set the hypercall bytecode, and other quirks like the Xen
> > > > > > > > one.
> > > > > > > 
> > > > > > > Yeah, that Xen quirk is quite the killer.
> > > > > > > 
> > > > > > > Can you provide pseudo-assembly for what the final page is supposed to look like?
> > > > > > > I'm struggling mightily to understand what this is actually trying to do.
> > > > > > 
> > > > > > I'll make it as simple as possible (diregard 32bit support and that xen
> > > > > > exists):
> > > > > > 
> > > > > > vmcall             <-  Offset 0, regular Hyper-V hypercalls enter here
> > > > > > ret
> > > > > > mov rax,rcx  <-  VTL call hypercall enters here
> > > > > 
> > > > > I'm missing who/what defines "here" though.  What generates the CALL that points
> > > > > at this exact offset?  If the exact offset is dictated in the TLFS, then aren't
> > > > > we screwed with the whole Xen quirk, which inserts 5 bytes before that first VMCALL?
> > > > 
> > > > Yes, sorry, I should've included some more context.
> > > > 
> > > > Here's a rundown (from memory) of how the first VTL call happens:
> > > >  - CPU0 starts running at VTL0.
> > > >  - Hyper-V enables VTL1 on the partition.
> > > >  - Hyper-V enables VTL1 on CPU0, but doesn't yet switch to it. It passes
> > > >    the initial VTL1 CPU state alongside the enablement hypercall
> > > >    arguments.
> > > >  - Hyper-V sets the Hypercall page overlay address through
> > > >    HV_X64_MSR_HYPERCALL. KVM fills it.
> > > >  - Hyper-V gets the VTL-call and VTL-return offsets into the hypercall
> > > >    page using the VP Register HvRegisterVsmCodePageOffsets (VP register
> > > >    handling is in user-space).
> > > 
> > > Ah, so the guest sets the offsets by "writing" HvRegisterVsmCodePageOffsets via
> > > a HvSetVpRegisters() hypercall.
> > 
> > No, you didn't understand this correctly. 
> > 
> > The guest writes the HV_X64_MSR_HYPERCALL, and in the response hyperv fills
> 
> When people say "Hyper-V", do y'all mean "root partition"?  If so, can we just
> say "root partition"?  Part of my confusion is that I don't instinctively know
> whether things like "Hyper-V enables VTL1 on the partition" are talking about the
> root partition (or I guess parent partition?) or the hypervisor.  Functionally it
> probably doesn't matter, it's just hard to reconcile things with the TLFS, which
> is written largely to describe the hypervisor's behavior.
> 
> > the hypercall page, including the VSM thunks.
> > 
> > Then the guest can _read_ the offsets Hyper-V chose there by issuing another hypercall.
> 
> Hrm, now I'm really confused.  Ah, the TLFS contradicts itself.  The blurb for
> AccessVpRegisters says:
> 
>   The partition can invoke the hypercalls HvSetVpRegisters and HvGetVpRegisters.
> 
> And HvSetVpRegisters confirms that requirement:
> 
>   The caller must either be the parent of the partition specified by PartitionId,
>   or the partition specified must be “self” and the partition must have the
>   AccessVpRegisters privilege
> 
> But it's absent from HvGetVpRegisters:
> 
>   The caller must be the parent of the partition specified by PartitionId or the
>   partition specifying its own partition ID.

Yes, it is indeed very strange that a partition would do a hypercall to read its own
registers - but then the 'register' is also not really a register but more of a 'hack', and I guess
they allowed it in this particular case. That is why I wrote the 'another hypercall'
thing, because it is very strange that they (ab)used HvGetVpRegisters for that.


But regardless of the above, guests (root partition or any other partition) do the
VTL calls, and in order to do a VTL call, that guest has to know the hypercall page offsets,
and for that the guest has to do the HvGetVpRegisters hypercall first.
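
For reference, the offsets come back packed in a single 64-bit register
value; roughly this layout (12-bit fields per the TLFS, field names are my
guess at what the series puts in hyperv-tlfs.h):

union hv_register_vsm_code_page_offsets {
	u64 as_u64;
	struct {
		u64 vtl_call_offset : 12;
		u64 vtl_return_offset : 12;
		u64 reserved : 40;
	} __packed;
};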

> 
> > In the current implementation, the offsets that the kernel chooses are first
> > exposed to userspace via a new ioctl, and then userspace exposes these
> > offsets to the guest via that 'another hypercall' (reading a pseudo partition
> > register, 'HvRegisterVsmCodePageOffsets').
> > 
> > I personally don't know for sure anymore if the userspace or kernel-based
> > hypercall page is better here; it's ugly regardless :(
> 
> Hrm.  Requiring userspace to intercept the WRMSR will be a mess because then KVM
> will have zero knowledge of the hypercall page, e.g. userspace would be forced to
> intercept HV_X64_MSR_GUEST_OS_ID as well.  That's not the end of the world, but
> it's not exactly ideal either.
> 
> What if we exit to userspace with a new kvm_hyperv_exit reason that requires
> completion? 

BTW the other option is to do the whole thing in the kernel - the offset bug in the hypercall page
can easily be solved with a variable, and then the kernel can also intercept the HvGetVpRegisters
hypercall and return these offsets for HvRegisterVsmCodePageOffsets, while for all
other VP registers it can still exit to userspace - that way we also avoid adding a new ioctl,
and have the whole thing in one place.

All of the above can even be done unconditionally (or be conditionally tied to a Kconfig option),
because it doesn't add much overhead, nor should it break backward compatibility - I don't think
Hyper-V guests rely on the hypervisor not touching the hypercall page beyond the few bytes that KVM
currently writes.
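
For the HvGetVpRegisters part, something along these lines in the
hypercall dispatch (just a sketch: HVCALL_GET_VP_REGISTERS is the real
TLFS call code, but the helpers, the register-constant name and the
userspace-exit label are made up):

	case HVCALL_GET_VP_REGISTERS:
		/* Answer HvRegisterVsmCodePageOffsets in-kernel... */
		if (kvm_hv_vsm_enabled(kvm) &&
		    kvm_hv_vp_reg_name(&hc) == HV_REGISTER_VSM_CODE_PAGE_OFFSETS) {
			kvm_hv_set_vp_reg_result(&hc, hv->vsm_code_page_offsets.as_u64);
			ret = HV_STATUS_SUCCESS;
			break;
		}
		/* ...and keep punting every other VP register to userspace. */
		goto hypercall_userspace_exit;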

Best regards,
	Maxim Levitsky


>  I.e. punt to userspace if VSM is enabled, but still record the data
> in KVM?  Ugh, but even that's a mess because kvm_hv_set_msr_pw() is deep in the
> WRMSR emulation call stack and can't easily signal that an exit to userspace is
> needed.  Blech.
> 



^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2023-12-06 16:20 UTC | newest]

Thread overview: 108+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-08 11:17 [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Nicolas Saenz Julienne
2023-11-08 11:17 ` [RFC 01/33] KVM: x86: Decouple lapic.h from hyperv.h Nicolas Saenz Julienne
2023-11-08 16:11   ` Sean Christopherson
2023-11-08 11:17 ` [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS Nicolas Saenz Julienne
2023-11-08 12:11   ` Alexander Graf
2023-11-08 17:47   ` Sean Christopherson
2023-11-10 18:46     ` Nicolas Saenz Julienne
2023-11-28  6:56   ` Maxim Levitsky
2023-12-01 15:25     ` Nicolas Saenz Julienne
2023-11-08 11:17 ` [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support Nicolas Saenz Julienne
2023-11-08 11:44   ` Alexander Graf
2023-11-08 12:11     ` Vitaly Kuznetsov
2023-11-08 12:16       ` Alexander Graf
2023-11-28  6:57         ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 04/33] KVM: x86: hyper-v: Move hypercall page handling into separate function Nicolas Saenz Julienne
2023-11-28  7:01   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page Nicolas Saenz Julienne
2023-11-08 11:53   ` Alexander Graf
2023-11-08 14:10     ` Nicolas Saenz Julienne
2023-11-28  7:08   ` Maxim Levitsky
2023-11-28 16:33     ` Sean Christopherson
2023-12-01 16:19     ` Nicolas Saenz Julienne
2023-12-01 16:32       ` Sean Christopherson
2023-12-01 16:50         ` Nicolas Saenz Julienne
2023-12-01 17:47           ` Sean Christopherson
2023-12-01 18:15             ` Nicolas Saenz Julienne
2023-12-05 19:21               ` Sean Christopherson
2023-12-05 20:04                 ` Maxim Levitsky
2023-12-06  0:07                   ` Sean Christopherson
2023-12-06 16:19                     ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 06/33] KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs Nicolas Saenz Julienne
2023-11-28  7:14   ` Maxim Levitsky
2023-12-01 16:31     ` Nicolas Saenz Julienne
2023-12-05 15:02       ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 07/33] KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_VSM Nicolas Saenz Julienne
2023-11-28  7:16   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 08/33] KVM: x86: Don't use hv_timer if CAP_HYPERV_VSM enabled Nicolas Saenz Julienne
2023-11-28  7:21   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers Nicolas Saenz Julienne
2023-11-08 12:21   ` Alexander Graf
2023-11-08 14:04     ` Nicolas Saenz Julienne
2023-11-28  7:25   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 10/33] KVM: x86: hyper-v: Introduce KVM_HV_GET_VSM_STATE Nicolas Saenz Julienne
2023-11-28  7:26   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 11/33] KVM: x86: hyper-v: Handle GET/SET_VP_REGISTER hcall in user-space Nicolas Saenz Julienne
2023-11-08 12:14   ` Alexander Graf
2023-11-28  7:26     ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 12/33] KVM: x86: hyper-v: Handle VSM hcalls " Nicolas Saenz Julienne
2023-11-28  7:28   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 13/33] KVM: Allow polling vCPUs for events Nicolas Saenz Julienne
2023-11-28  7:30   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 14/33] KVM: x86: Add VTL to the MMU role Nicolas Saenz Julienne
2023-11-08 17:26   ` Sean Christopherson
2023-11-10 18:52     ` Nicolas Saenz Julienne
2023-11-28  7:34       ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 15/33] KVM: x86/mmu: Introduce infrastructure to handle non-executable faults Nicolas Saenz Julienne
2023-11-28  7:34   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 16/33] KVM: x86/mmu: Expose R/W/X flags during memory fault exits Nicolas Saenz Julienne
2023-11-28  7:36   ` Maxim Levitsky
2023-11-28 16:31     ` Sean Christopherson
2023-11-08 11:17 ` [RFC 17/33] KVM: x86/mmu: Allow setting memory attributes if VSM enabled Nicolas Saenz Julienne
2023-11-28  7:39   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array Nicolas Saenz Julienne
2023-11-08 16:59   ` Sean Christopherson
2023-11-28  7:41   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 19/33] KVM: x86: Decouple kvm_range_has_memory_attributes() " Nicolas Saenz Julienne
2023-11-28  7:42   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 20/33] KVM: x86/mmu: Decouple hugepage_has_attrs() " Nicolas Saenz Julienne
2023-11-28  7:43   ` Maxim Levitsky
2023-11-08 11:17 ` [RFC 21/33] KVM: Pass memory attribute array as a MMU notifier argument Nicolas Saenz Julienne
2023-11-08 17:08   ` Sean Christopherson
2023-11-08 11:17 ` [RFC 22/33] KVM: Decouple kvm_ioctl_set_mem_attributes() from kvm's mem_attr_array Nicolas Saenz Julienne
2023-11-08 11:17 ` [RFC 23/33] KVM: Expose memory attribute helper functions unanimously Nicolas Saenz Julienne
2023-11-08 11:17 ` [RFC 24/33] KVM: x86: hyper-v: Introduce KVM VTL device Nicolas Saenz Julienne
2023-11-08 11:17 ` [RFC 25/33] KVM: Introduce a set of new memory attributes Nicolas Saenz Julienne
2023-11-08 12:30   ` Alexander Graf
2023-11-08 16:43     ` Sean Christopherson
2023-11-08 11:17 ` [RFC 26/33] KVM: x86: hyper-vsm: Allow setting per-VTL " Nicolas Saenz Julienne
2023-11-28  7:44   ` Maxim Levitsky
2023-11-08 11:18 ` [RFC 27/33] KVM: x86/mmu/hyper-v: Validate memory faults against per-VTL memprots Nicolas Saenz Julienne
2023-11-28  7:46   ` Maxim Levitsky
2023-11-08 11:18 ` [RFC 28/33] x86/hyper-v: Introduce memory intercept message structure Nicolas Saenz Julienne
2023-11-28  7:53   ` Maxim Levitsky
2023-11-08 11:18 ` [RFC 29/33] KVM: VMX: Save instruction length on EPT violation Nicolas Saenz Julienne
2023-11-08 12:40   ` Alexander Graf
2023-11-08 16:15     ` Sean Christopherson
2023-11-08 17:11       ` Alexander Graf
2023-11-08 17:20   ` Sean Christopherson
2023-11-08 17:27     ` Alexander Graf
2023-11-08 18:19       ` Jim Mattson
2023-11-08 11:18 ` [RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request Nicolas Saenz Julienne
2023-11-08 12:45   ` Alexander Graf
2023-11-08 13:38     ` Nicolas Saenz Julienne
2023-11-28  8:19       ` Maxim Levitsky
2023-11-08 11:18 ` [RFC 31/33] KVM: x86: hyper-v: Inject intercept on VTL memory protection fault Nicolas Saenz Julienne
2023-11-08 11:18 ` [RFC 32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS Nicolas Saenz Julienne
2023-11-08 12:49   ` Alexander Graf
2023-11-08 13:44     ` Nicolas Saenz Julienne
2023-11-08 11:18 ` [RFC 33/33] Documentation: KVM: Introduce "Emulating Hyper-V VSM with KVM" Nicolas Saenz Julienne
2023-11-28  8:19   ` Maxim Levitsky
2023-11-08 11:40 ` [RFC 0/33] KVM: x86: hyperv: Introduce VSM support Alexander Graf
2023-11-08 14:41   ` Nicolas Saenz Julienne
2023-11-08 16:55 ` Sean Christopherson
2023-11-08 18:33   ` Sean Christopherson
2023-11-10 17:56     ` Nicolas Saenz Julienne
2023-11-10 19:32       ` Sean Christopherson
2023-11-11 11:55         ` Nicolas Saenz Julienne
2023-11-10 19:04   ` Nicolas Saenz Julienne
