KVM Archive on lore.kernel.org
* [PATCH v3 0/3] Allow user space to restrict and augment MSR emulation
@ 2020-07-31 21:49 Alexander Graf
  2020-07-31 21:49 ` [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space Alexander Graf
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Alexander Graf @ 2020-07-31 21:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, KarimAllah Raslan,
	Aaron Lewis, kvm, linux-doc, linux-kernel

While trying to add support for the MSR_CORE_THREAD_COUNT MSR in KVM,
I realized that we were still in a world where user space has no control
over what happens with MSR emulation in KVM.

That is bad for multiple reasons. In my case, I wanted to emulate the
MSR in user space, because it's a CPU specific register that does not
exist on older CPUs and only contains package-level informational data, so
it's a natural fit for user space to provide it.

However, it is also bad on a platform compatibility level. Currently,
KVM has no way to expose different MSRs based on the selected target CPU
type.

This patch set introduces a way for user space to indicate to KVM which
MSRs should be handled in kernel space. With that, we can solve part of
the platform compatibility story. Or at least we can avoid handling AMD
specific MSRs on an Intel platform and vice versa.

In addition, it introduces a way for user space to get into the loop
when an MSR access would generate a #GP fault, such as when KVM finds an
MSR that is not handled by the in-kernel MSR emulation or when the guest
is trying to access reserved registers.

In combination with the allow list, the user space trapping allows us
to emulate arbitrary MSRs in user space, paving the way for target CPU
specific MSR implementations from user space.
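
To make the intended flow more tangible, here is a rough sketch (not part of
the patches) of how a VMM could consume the trapping side. The vm_fd, vcpu_fd
and run variables as well as the emulate_rdmsr()/emulate_wrmsr() helpers are
placeholders for whatever the VMM already has:

struct kvm_enable_cap cap = {
	.cap = KVM_CAP_X86_USER_SPACE_MSR,
	.args[0] = 1,
};

/* Ask KVM to bounce #GP generating MSR accesses to us */
ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

for (;;) {
	ioctl(vcpu_fd, KVM_RUN, NULL);

	switch (run->exit_reason) {
	case KVM_EXIT_X86_RDMSR:
		/* Fill in data, or set error = 1 to have KVM inject a #GP */
		run->msr.error = !emulate_rdmsr(run->msr.index, &run->msr.data);
		break;
	case KVM_EXIT_X86_WRMSR:
		run->msr.error = !emulate_wrmsr(run->msr.index, run->msr.data);
		break;
	/* ... other exit reasons ... */
	}
}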

v1 -> v2:

  - s/ETRAP_TO_USER_SPACE/ENOENT/g
  - deflect all #GP injection events to user space, not just unknown MSRs.
    That way we can also deflect allowlist errors later
  - fix emulator case
  - new patch: KVM: x86: Introduce allow list for MSR emulation
  - new patch: KVM: selftests: Add test for user space MSR handling

v2 -> v3:

  - return r if r == X86EMUL_IO_NEEDED
  - s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
  - s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
  - Use complete_userspace_io logic instead of reply field
  - Simplify trapping code
  - document flags for KVM_X86_ADD_MSR_ALLOWLIST
  - generalize exit path, always unlock when returning
  - s/KVM_CAP_ADD_MSR_ALLOWLIST/KVM_CAP_X86_MSR_ALLOWLIST/g
  - Add KVM_X86_CLEAR_MSR_ALLOWLIST
  - Add test to clear allowlist
  - Adjust to reply-less API
  - Fix asserts
  - Actually trap on MSR_IA32_POWER_CTL writes

Alexander Graf (3):
  KVM: x86: Deflect unknown MSR accesses to user space
  KVM: x86: Introduce allow list for MSR emulation
  KVM: selftests: Add test for user space MSR handling

 Documentation/virt/kvm/api.rst                | 153 +++++++++++
 arch/x86/include/asm/kvm_host.h               |  16 ++
 arch/x86/include/uapi/asm/kvm.h               |  15 ++
 arch/x86/kvm/emulate.c                        |  18 +-
 arch/x86/kvm/x86.c                            | 241 +++++++++++++++++-
 include/trace/events/kvm.h                    |   2 +-
 include/uapi/linux/kvm.h                      |  15 ++
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/user_msr_test.c      | 221 ++++++++++++++++
 9 files changed, 675 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/user_msr_test.c

-- 
2.17.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879





* [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space
  2020-07-31 21:49 [PATCH v3 0/3] Allow user space to restrict and augment MSR emulation Alexander Graf
@ 2020-07-31 21:49 ` Alexander Graf
  2020-07-31 23:36   ` Jim Mattson
  2020-08-03 11:27   ` Vitaly Kuznetsov
  2020-07-31 21:49 ` [PATCH v3 2/3] KVM: x86: Introduce allow list for MSR emulation Alexander Graf
  2020-07-31 21:49 ` [PATCH v3 3/3] KVM: selftests: Add test for user space MSR handling Alexander Graf
  2 siblings, 2 replies; 11+ messages in thread
From: Alexander Graf @ 2020-07-31 21:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, KarimAllah Raslan,
	Aaron Lewis, kvm, linux-doc, linux-kernel

MSRs are weird. Some of them are normal control registers, such as EFER.
Some however are registers that really are model specific, not very
interesting to virtualization workloads, and not performance critical.
Others again are really just windows into package configuration.

Out of these MSRs, only the first category is necessary to implement in
kernel space. Rarely accessed MSRs, MSRs that should be fine tuned against
certain CPU models and MSRs that contain information on the package level
are much better suited for user space to process. However, over time we have
accumulated a lot of MSRs that are not the first category, but still handled
by in-kernel KVM code.

This patch adds a generic interface to handle WRMSR and RDMSR from user
space. With this, any future MSR that is part of the latter categories can
be handled in user space.

Furthermore, it allows us to replace the existing "ignore_msrs" logic with
something that applies per-VM rather than on the full system. That way you
can run production VMs in parallel to experimental ones where you don't care
about proper MSR handling.
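
As a sketch of that last point (not part of this patch; handle_msr_exit() is
a made-up helper invoked from the VMM's run loop), a per-VM "ignore_msrs"
could be as simple as:

static void handle_msr_exit(struct kvm_run *run)
{
	/* Reads of unhandled MSRs return 0 */
	if (run->exit_reason == KVM_EXIT_X86_RDMSR)
		run->msr.data = 0;

	/* Writes (KVM_EXIT_X86_WRMSR) are silently dropped */

	/* error stays 0, so no #GP gets injected -- but only for this VM */
	run->msr.error = 0;
}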

Signed-off-by: Alexander Graf <graf@amazon.com>

---

v1 -> v2:

  - s/ETRAP_TO_USER_SPACE/ENOENT/g
  - deflect all #GP injection events to user space, not just unknown MSRs.
    That way we can also deflect allowlist errors later
  - fix emulator case

v2 -> v3:

  - return r if r == X86EMUL_IO_NEEDED
  - s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
  - s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
  - Use complete_userspace_io logic instead of reply field
  - Simplify trapping code
---
 Documentation/virt/kvm/api.rst  |  62 +++++++++++++++++++
 arch/x86/include/asm/kvm_host.h |   6 ++
 arch/x86/kvm/emulate.c          |  18 +++++-
 arch/x86/kvm/x86.c              | 106 ++++++++++++++++++++++++++++++--
 include/trace/events/kvm.h      |   2 +-
 include/uapi/linux/kvm.h        |  10 +++
 6 files changed, 197 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 320788f81a05..79c3e2fdfae4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5155,6 +5155,35 @@ Note that KVM does not skip the faulting instruction as it does for
 KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
 if it decides to decode and emulate the instruction.
 
+::
+
+		/* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
+		struct {
+			__u8 error;
+			__u8 pad[3];
+			__u32 index;
+			__u64 data;
+		} msr;
+
+Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
+enabled, MSR accesses that would cause KVM's kernel code to inject a #GP will
+instead trigger a KVM_EXIT_X86_RDMSR exit for reads and a KVM_EXIT_X86_WRMSR
+exit for writes.
+
+For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
+wants to read. To respond to this request with a successful read, user space
+writes the respective data into the "data" field and must continue guest
+execution to ensure the read data is transferred into guest register state.
+
+If the RDMSR request was unsuccessful, user space indicates that with a "1" in
+the "error" field. This will inject a #GP into the guest when the VCPU is
+executed again.
+
+For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
+wants to write. Once finished processing the event, user space must continue
+vCPU execution. If the MSR write was unsuccessful, user space also sets the
+"error" field to "1".
+
 ::
 
 		/* Fix the size of the union. */
@@ -5844,6 +5873,28 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows
 the maximum halt time to specified on a per-VM basis, effectively overriding
 the module parameter for the target VM.
 
+7.21 KVM_CAP_X86_USER_SPACE_MSR
+-------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is 1 if user space MSR handling is enabled, 0 otherwise
+:Returns: 0 on success; -1 on error
+
+This capability enables trapping of RDMSR and WRMSR instructions that would
+otherwise trigger a #GP inside the guest into user space.
+
+When a guest requests to read or write an MSR, KVM may not implement all MSRs
+that are relevant to a respective system. It also does not differentiate by
+CPU type.
+
+To allow more fine grained control over MSR handling, user space may enable
+this capability. With it enabled, MSR accesses that would usually trigger
+a #GP event inside the guest by KVM will instead trigger KVM_EXIT_X86_RDMSR
+and KVM_EXIT_X86_WRMSR exit notifications which user space can then handle to
+implement model specific MSR handling and/or user notifications to inform
+a user that an MSR was not handled.
+
 8. Other capabilities.
 ======================
 
@@ -6151,3 +6202,14 @@ KVM can therefore start protected VMs.
 This capability governs the KVM_S390_PV_COMMAND ioctl and the
 KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
 guests when the state change is invalid.
+
+8.24 KVM_CAP_X86_USER_SPACE_MSR
+-------------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports deflection of MSR reads and
+writes to user space. It can be enabled on a VM level. If enabled, MSR
+accesses that would usually trigger a #GP by KVM into the guest will
+instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
+KVM_EXIT_X86_WRMSR exit notifications.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be5363b21540..809eed0dbdea 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -829,6 +829,9 @@ struct kvm_vcpu_arch {
 
 	/* AMD MSRC001_0015 Hardware Configuration */
 	u64 msr_hwcr;
+
+	/* User space is handling an MSR request */
+	bool pending_user_msr;
 };
 
 struct kvm_lpage_info {
@@ -1002,6 +1005,9 @@ struct kvm_arch {
 	bool guest_can_read_msr_platform_info;
 	bool exception_payload_enabled;
 
+	/* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
+	bool user_space_msr_enabled;
+
 	struct kvm_pmu_event_filter *pmu_event_filter;
 	struct task_struct *nx_lpage_recovery_thread;
 };
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d0e2825ae617..744ab9c92b73 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3689,11 +3689,18 @@ static int em_dr_write(struct x86_emulate_ctxt *ctxt)
 
 static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
 {
+	u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
 	u64 msr_data;
+	int r;
 
 	msr_data = (u32)reg_read(ctxt, VCPU_REGS_RAX)
 		| ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
-	if (ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data))
+	r = ctxt->ops->set_msr(ctxt, msr_index, msr_data);
+
+	if (r == X86EMUL_IO_NEEDED)
+		return r;
+
+	if (r)
 		return emulate_gp(ctxt, 0);
 
 	return X86EMUL_CONTINUE;
@@ -3701,9 +3708,16 @@ static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
 
 static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
 {
+	u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
 	u64 msr_data;
+	int r;
+
+	r = ctxt->ops->get_msr(ctxt, msr_index, &msr_data);
+
+	if (r == X86EMUL_IO_NEEDED)
+		return r;
 
-	if (ctxt->ops->get_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), &msr_data))
+	if (r)
 		return emulate_gp(ctxt, 0);
 
 	*reg_write(ctxt, VCPU_REGS_RAX) = (u32)msr_data;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f83b28..24c72250f6df 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1549,12 +1549,75 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
 }
 EXPORT_SYMBOL_GPL(kvm_set_msr);
 
+static int complete_emulated_msr(struct kvm_vcpu *vcpu, bool is_read)
+{
+	BUG_ON(!vcpu->arch.pending_user_msr);
+
+	if (vcpu->run->msr.error) {
+		kvm_inject_gp(vcpu, 0);
+	} else if (is_read) {
+		kvm_rax_write(vcpu, (u32)vcpu->run->msr.data);
+		kvm_rdx_write(vcpu, vcpu->run->msr.data >> 32);
+	}
+
+	return kvm_skip_emulated_instruction(vcpu);
+}
+
+static int complete_emulated_rdmsr(struct kvm_vcpu *vcpu)
+{
+	return complete_emulated_msr(vcpu, true);
+}
+
+static int complete_emulated_wrmsr(struct kvm_vcpu *vcpu)
+{
+	return complete_emulated_msr(vcpu, false);
+}
+
+static int kvm_get_msr_user_space(struct kvm_vcpu *vcpu, u32 index)
+{
+	if (!vcpu->kvm->arch.user_space_msr_enabled)
+		return 0;
+
+	vcpu->run->exit_reason = KVM_EXIT_X86_RDMSR;
+	vcpu->run->msr.error = 0;
+	vcpu->run->msr.index = index;
+	vcpu->arch.pending_user_msr = true;
+	vcpu->arch.complete_userspace_io = complete_emulated_rdmsr;
+
+	return 1;
+}
+
+static int kvm_set_msr_user_space(struct kvm_vcpu *vcpu, u32 index, u64 data)
+{
+	if (!vcpu->kvm->arch.user_space_msr_enabled)
+		return 0;
+
+	vcpu->run->exit_reason = KVM_EXIT_X86_WRMSR;
+	vcpu->run->msr.error = 0;
+	vcpu->run->msr.index = index;
+	vcpu->run->msr.data = data;
+	vcpu->arch.pending_user_msr = true;
+	vcpu->arch.complete_userspace_io = complete_emulated_wrmsr;
+
+	return 1;
+}
+
 int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu)
 {
 	u32 ecx = kvm_rcx_read(vcpu);
 	u64 data;
+	int r;
+
+	r = kvm_get_msr(vcpu, ecx, &data);
 
-	if (kvm_get_msr(vcpu, ecx, &data)) {
+	/* MSR read failed? See if we should ask user space */
+	if (r && kvm_get_msr_user_space(vcpu, ecx)) {
+		/* Bounce to user space */
+		return 0;
+	}
+
+	/* MSR read failed? Inject a #GP */
+	if (r) {
 		trace_kvm_msr_read_ex(ecx);
 		kvm_inject_gp(vcpu, 0);
 		return 1;
@@ -1572,8 +1635,18 @@ int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu)
 {
 	u32 ecx = kvm_rcx_read(vcpu);
 	u64 data = kvm_read_edx_eax(vcpu);
+	int r;
+
+	r = kvm_set_msr(vcpu, ecx, data);
 
-	if (kvm_set_msr(vcpu, ecx, data)) {
+	/* MSR write failed? See if we should ask user space */
+	if (r && kvm_set_msr_user_space(vcpu, ecx, data)) {
+		/* Bounce to user space */
+		return 0;
+	}
+
+	/* MSR write failed? Inject a #GP */
+	if (r) {
 		trace_kvm_msr_write_ex(ecx, data);
 		kvm_inject_gp(vcpu, 0);
 		return 1;
@@ -3476,6 +3549,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_MSR_PLATFORM_INFO:
 	case KVM_CAP_EXCEPTION_PAYLOAD:
 	case KVM_CAP_SET_GUEST_DEBUG:
+	case KVM_CAP_X86_USER_SPACE_MSR:
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4990,6 +5064,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		kvm->arch.exception_payload_enabled = cap->args[0];
 		r = 0;
 		break;
+	case KVM_CAP_X86_USER_SPACE_MSR:
+		kvm->arch.user_space_msr_enabled = cap->args[0];
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -6319,13 +6397,33 @@ static void emulator_set_segment(struct x86_emulate_ctxt *ctxt, u16 selector,
 static int emulator_get_msr(struct x86_emulate_ctxt *ctxt,
 			    u32 msr_index, u64 *pdata)
 {
-	return kvm_get_msr(emul_to_vcpu(ctxt), msr_index, pdata);
+	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+	int r;
+
+	r = kvm_get_msr(vcpu, msr_index, pdata);
+
+	if (r && kvm_get_msr_user_space(vcpu, msr_index)) {
+		/* Bounce to user space */
+		return X86EMUL_IO_NEEDED;
+	}
+
+	return r;
 }
 
 static int emulator_set_msr(struct x86_emulate_ctxt *ctxt,
 			    u32 msr_index, u64 data)
 {
-	return kvm_set_msr(emul_to_vcpu(ctxt), msr_index, data);
+	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+	int r;
+
+	r = kvm_set_msr(emul_to_vcpu(ctxt), msr_index, data);
+
+	if (r && kvm_set_msr_user_space(vcpu, msr_index, data)) {
+		/* Bounce to user space */
+		return X86EMUL_IO_NEEDED;
+	}
+
+	return r;
 }
 
 static u64 emulator_get_smbase(struct x86_emulate_ctxt *ctxt)
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 9417a34aad08..26cfb0fa8e7e 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -17,7 +17,7 @@
 	ERSN(NMI), ERSN(INTERNAL_ERROR), ERSN(OSI), ERSN(PAPR_HCALL),	\
 	ERSN(S390_UCONTROL), ERSN(WATCHDOG), ERSN(S390_TSCH), ERSN(EPR),\
 	ERSN(SYSTEM_EVENT), ERSN(S390_STSI), ERSN(IOAPIC_EOI),          \
-	ERSN(HYPERV), ERSN(ARM_NISV)
+	ERSN(HYPERV), ERSN(ARM_NISV), ERSN(X86_RDMSR), ERSN(X86_WRMSR)
 
 TRACE_EVENT(kvm_userspace_exit,
 	    TP_PROTO(__u32 reason, int errno),
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fdf30316582..13fc7de1eb50 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -248,6 +248,8 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
 #define KVM_EXIT_ARM_NISV         28
+#define KVM_EXIT_X86_RDMSR        29
+#define KVM_EXIT_X86_WRMSR        30
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -412,6 +414,13 @@ struct kvm_run {
 			__u64 esr_iss;
 			__u64 fault_ipa;
 		} arm_nisv;
+		/* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
+		struct {
+			__u8 error;
+			__u8 pad[3];
+			__u32 index;
+			__u64 data;
+		} msr;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1031,6 +1040,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SECURE_GUEST 181
 #define KVM_CAP_HALT_POLL 182
 #define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_X86_USER_SPACE_MSR 184
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.17.1





* [PATCH v3 2/3] KVM: x86: Introduce allow list for MSR emulation
  2020-07-31 21:49 [PATCH v3 0/3] Allow user space to restrict and augment MSR emulation Alexander Graf
  2020-07-31 21:49 ` [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space Alexander Graf
@ 2020-07-31 21:49 ` Alexander Graf
  2020-08-03 11:37   ` Vitaly Kuznetsov
  2020-07-31 21:49 ` [PATCH v3 3/3] KVM: selftests: Add test for user space MSR handling Alexander Graf
  2 siblings, 1 reply; 11+ messages in thread
From: Alexander Graf @ 2020-07-31 21:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, KarimAllah Raslan,
	Aaron Lewis, kvm, linux-doc, linux-kernel

It's not desirable to have all MSRs always handled by KVM kernel space. Some
MSRs would be useful to handle in user space to either emulate behavior (like
uCode updates) or differentiate whether they are valid based on the CPU model.

To allow user space to specify which MSRs it wants to see handled by KVM,
this patch introduces a new ioctl to push allow lists of bitmaps into
KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
denied MSR events to user space to operate on.

If no allowlist is populated, MSR handling stays identical to before.
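
As a usage sketch (not part of this patch; vm_fd is a placeholder and using
MSR_EFER is an arbitrary choice), user space could restrict the 0xc0000000
range to in-kernel handling of EFER reads only, roughly like this:

struct kvm_msr_allowlist *al;
size_t bitmap_bytes = KVM_MSR_ALLOWLIST_MAX_LEN;
unsigned int idx = MSR_EFER - 0xc0000000;

/* calloc() leaves all bits 0, i.e. everything in the range denied */
al = calloc(1, sizeof(*al) + bitmap_bytes);
al->flags = KVM_MSR_ALLOW_READ;
al->base  = 0xc0000000;
al->nmsrs = bitmap_bytes * 8;

/* Bit N corresponds to MSR (base + N); allow EFER reads only */
al->bitmap[idx / 8] |= 1 << (idx % 8);

ioctl(vm_fd, KVM_X86_ADD_MSR_ALLOWLIST, al);

A real VMM would of course also cover writes and the other MSR ranges it
wants KVM to keep handling in kernel space.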

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Alexander Graf <graf@amazon.com>

---

v2 -> v3:

  - document flags for KVM_X86_ADD_MSR_ALLOWLIST
  - generalize exit path, always unlock when returning
  - s/KVM_CAP_ADD_MSR_ALLOWLIST/KVM_CAP_X86_MSR_ALLOWLIST/g
  - Add KVM_X86_CLEAR_MSR_ALLOWLIST
---
 Documentation/virt/kvm/api.rst  |  91 +++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h |  10 +++
 arch/x86/include/uapi/asm/kvm.h |  15 ++++
 arch/x86/kvm/x86.c              | 135 ++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h        |   5 ++
 5 files changed, 256 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 79c3e2fdfae4..d611ddd326fc 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4697,6 +4697,82 @@ KVM_PV_VM_VERIFY
   Verify the integrity of the unpacked image. Only if this succeeds,
   KVM is allowed to start protected VCPUs.
 
+4.126 KVM_X86_ADD_MSR_ALLOWLIST
+-------------------------------
+
+:Capability: KVM_CAP_X86_MSR_ALLOWLIST
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_msr_allowlist
+:Returns: 0 on success, < 0 on error
+
+::
+
+  struct kvm_msr_allowlist {
+         __u32 flags;
+         __u32 nmsrs; /* number of msrs in bitmap */
+         __u32 base;  /* base address for the MSRs bitmap */
+         __u32 pad;
+
+         __u8 bitmap[0]; /* a set bit allows the operation set in flags */
+  };
+
+flags values:
+
+KVM_MSR_ALLOW_READ
+
+  Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
+  indicates that a read should immediately fail, while a 1 indicates that
+  a read should be handled by the normal KVM MSR emulation logic.
+
+KVM_MSR_ALLOW_WRITE
+
+  Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
+  indicates that a write should immediately fail, while a 1 indicates that
+  a write should be handled by the normal KVM MSR emulation logic.
+
+KVM_MSR_ALLOW_READ | KVM_MSR_ALLOW_WRITE
+
+  Filter both read and write accesses to MSRs using the given bitmap. A 0
+  in the bitmap indicates that both reads and writes should immediately fail,
+  while a 1 indicates that reads and writes should be handled by the normal
+  KVM MSR emulation logic.
+
+This ioctl allows user space to define a set of bitmaps of MSR ranges to
+specify whether a certain MSR access is allowed or not.
+
+If this ioctl has never been invoked, MSR accesses are not guarded and the
+old KVM in-kernel emulation behavior is fully preserved.
+
+As soon as the first allow list was specified, only allowed MSR accesses
+are permitted inside of KVM's MSR code.
+
+Each allowlist specifies a range of MSRs to potentially allow access to.
+The range covers MSR indexes [base .. base+nmsrs). The flags field
+indicates whether reads, writes or both reads and writes are permitted
+by setting a 1 bit in the bitmap for the corresponding MSR index.
+
+If an MSR access is not permitted through the allow list, it generates a
+#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
+allows user space to deflect and potentially handle various MSR accesses
+into user space.
+
+4.127 KVM_X86_CLEAR_MSR_ALLOWLIST
+---------------------------------
+
+:Capability: KVM_CAP_X86_MSR_ALLOWLIST
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: none
+:Returns: 0
+
+This ioctl resets all internal MSR allow lists. After this call, no allow
+list is present and the guest would execute as if no allow lists were set,
+so all MSRs are considered allowed and thus handled by the in-kernel MSR
+emulation logic.
+
+No vCPU may be in running state when calling this ioctl.
+
 
 5. The kvm_run structure
 ========================
@@ -6213,3 +6289,18 @@ writes to user space. It can be enabled on a VM level. If enabled, MSR
 accesses that would usually trigger a #GP by KVM into the guest will
 instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
 KVM_EXIT_X86_WRMSR exit notifications.
+
+8.25 KVM_CAP_X86_MSR_ALLOWLIST
+------------------------------
+
+:Architectures: x86
+
+This capability indicates that KVM supports emulation of only select MSR
+registers. With this capability exposed, KVM exports two new VM ioctls:
+KVM_X86_ADD_MSR_ALLOWLIST which user space can call to specify bitmaps of MSR
+ranges that KVM should emulate in kernel space and KVM_X86_CLEAR_MSR_ALLOWLIST
+which user space can call to remove all MSR allow lists from the VM context.
+
+In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
+trap and emulate MSRs that are outside of the scope of KVM as well as
+limit the attack surface on KVM's MSR emulation code.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 809eed0dbdea..21358ed4e590 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -904,6 +904,13 @@ struct kvm_hv {
 	struct kvm_hv_syndbg hv_syndbg;
 };
 
+struct msr_bitmap_range {
+	u32 flags;
+	u32 nmsrs;
+	u32 base;
+	unsigned long *bitmap;
+};
+
 enum kvm_irqchip_mode {
 	KVM_IRQCHIP_NONE,
 	KVM_IRQCHIP_KERNEL,       /* created with KVM_CREATE_IRQCHIP */
@@ -1008,6 +1015,9 @@ struct kvm_arch {
 	/* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
 	bool user_space_msr_enabled;
 
+	struct msr_bitmap_range msr_allowlist_ranges[10];
+	int msr_allowlist_ranges_count;
+
 	struct kvm_pmu_event_filter *pmu_event_filter;
 	struct task_struct *nx_lpage_recovery_thread;
 };
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 0780f97c1850..c33fb1d72d52 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -192,6 +192,21 @@ struct kvm_msr_list {
 	__u32 indices[0];
 };
 
+#define KVM_MSR_ALLOW_READ  (1 << 0)
+#define KVM_MSR_ALLOW_WRITE (1 << 1)
+
+/* Maximum size of the bitmap in bytes */
+#define KVM_MSR_ALLOWLIST_MAX_LEN 0x600
+
+/* for KVM_X86_ADD_MSR_ALLOWLIST */
+struct kvm_msr_allowlist {
+	__u32 flags;
+	__u32 nmsrs; /* number of msrs in bitmap */
+	__u32 base;  /* base address for the MSRs bitmap */
+	__u32 pad;
+
+	__u8 bitmap[0]; /* a set bit allows the operation set in flags */
+};
 
 struct kvm_cpuid_entry {
 	__u32 function;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 24c72250f6df..7a2be00a3512 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1472,6 +1472,29 @@ void kvm_enable_efer_bits(u64 mask)
 }
 EXPORT_SYMBOL_GPL(kvm_enable_efer_bits);
 
+static bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
+{
+	struct msr_bitmap_range *ranges = vcpu->kvm->arch.msr_allowlist_ranges;
+	u32 count = vcpu->kvm->arch.msr_allowlist_ranges_count;
+	u32 i;
+
+	/* MSR allowlist not set up, allow everything */
+	if (!count)
+		return true;
+
+	for (i = 0; i < count; i++) {
+		u32 start = ranges[i].base;
+		u32 end = start + ranges[i].nmsrs;
+		int flags = ranges[i].flags;
+		unsigned long *bitmap = ranges[i].bitmap;
+
+		if ((index >= start) && (index < end) && (flags & type))
+			return !!test_bit(index - start, bitmap);
+	}
+
+	return false;
+}
+
 /*
  * Write @data into the MSR specified by @index.  Select MSR specific fault
  * checks are bypassed if @host_initiated is %true.
@@ -1483,6 +1506,9 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
 {
 	struct msr_data msr;
 
+	if (!host_initiated && !kvm_msr_allowed(vcpu, index, KVM_MSR_ALLOW_WRITE))
+		return -ENOENT;
+
 	switch (index) {
 	case MSR_FS_BASE:
 	case MSR_GS_BASE:
@@ -1528,6 +1554,9 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
 	struct msr_data msr;
 	int ret;
 
+	if (!host_initiated && !kvm_msr_allowed(vcpu, index, KVM_MSR_ALLOW_READ))
+		return -ENOENT;
+
 	msr.index = index;
 	msr.host_initiated = host_initiated;
 
@@ -3550,6 +3579,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_EXCEPTION_PAYLOAD:
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_X86_USER_SPACE_MSR:
+	case KVM_CAP_X86_MSR_ALLOWLIST:
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -5075,6 +5105,101 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 	return r;
 }
 
+static bool msr_range_overlaps(struct kvm *kvm, struct msr_bitmap_range *range)
+{
+	struct msr_bitmap_range *ranges = kvm->arch.msr_allowlist_ranges;
+	u32 i, count = kvm->arch.msr_allowlist_ranges_count;
+
+	for (i = 0; i < count; i++) {
+		u32 start = max(range->base, ranges[i].base);
+		u32 end = min(range->base + range->nmsrs,
+			      ranges[i].base + ranges[i].nmsrs);
+
+		if ((start < end) && (range->flags & ranges[i].flags))
+			return true;
+	}
+
+	return false;
+}
+
+static int kvm_vm_ioctl_add_msr_allowlist(struct kvm *kvm, void __user *argp)
+{
+	struct msr_bitmap_range *ranges = kvm->arch.msr_allowlist_ranges;
+	struct kvm_msr_allowlist __user *user_msr_allowlist = argp;
+	struct msr_bitmap_range range;
+	struct kvm_msr_allowlist kernel_msr_allowlist;
+	unsigned long *bitmap = NULL;
+	size_t bitmap_size;
+	int r = 0;
+
+	if (copy_from_user(&kernel_msr_allowlist, user_msr_allowlist,
+			   sizeof(kernel_msr_allowlist))) {
+		r = -EFAULT;
+		goto out;
+	}
+
+	bitmap_size = BITS_TO_LONGS(kernel_msr_allowlist.nmsrs) * sizeof(long);
+	if (bitmap_size > KVM_MSR_ALLOWLIST_MAX_LEN) {
+		r = -EINVAL;
+		goto out;
+	}
+
+	bitmap = memdup_user(user_msr_allowlist->bitmap, bitmap_size);
+	if (IS_ERR(bitmap)) {
+		r = PTR_ERR(bitmap);
+		goto out;
+	}
+
+	range = (struct msr_bitmap_range) {
+		.flags = kernel_msr_allowlist.flags,
+		.base = kernel_msr_allowlist.base,
+		.nmsrs = kernel_msr_allowlist.nmsrs,
+		.bitmap = bitmap,
+	};
+
+	if (range.flags & ~(KVM_MSR_ALLOW_READ | KVM_MSR_ALLOW_WRITE)) {
+		r = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * Protect from concurrent calls to this function that could trigger
+	 * a TOCTOU violation on kvm->arch.msr_allowlist_ranges_count.
+	 */
+	mutex_lock(&kvm->lock);
+
+	if (kvm->arch.msr_allowlist_ranges_count >=
+	    ARRAY_SIZE(kvm->arch.msr_allowlist_ranges)) {
+		r = -E2BIG;
+		goto out_locked;
+	}
+
+	if (msr_range_overlaps(kvm, &range)) {
+		r = -EINVAL;
+		goto out_locked;
+	}
+
+	/* Everything ok, add this range identifier to our global pool */
+	ranges[kvm->arch.msr_allowlist_ranges_count++] = range;
+
+out_locked:
+	mutex_unlock(&kvm->lock);
+out:
+	if (r)
+		kfree(bitmap);
+
+	return r;
+}
+
+static int kvm_vm_ioctl_clear_msr_allowlist(struct kvm *kvm)
+{
+	mutex_lock(&kvm->lock);
+	kvm->arch.msr_allowlist_ranges_count = 0;
+	mutex_unlock(&kvm->lock);
+
+	return 0;
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
 		       unsigned int ioctl, unsigned long arg)
 {
@@ -5381,6 +5506,12 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	case KVM_SET_PMU_EVENT_FILTER:
 		r = kvm_vm_ioctl_set_pmu_event_filter(kvm, argp);
 		break;
+	case KVM_X86_ADD_MSR_ALLOWLIST:
+		r = kvm_vm_ioctl_add_msr_allowlist(kvm, argp);
+		break;
+	case KVM_X86_CLEAR_MSR_ALLOWLIST:
+		r = kvm_vm_ioctl_clear_msr_allowlist(kvm);
+		break;
 	default:
 		r = -ENOTTY;
 	}
@@ -10086,6 +10217,8 @@ void kvm_arch_pre_destroy_vm(struct kvm *kvm)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	int i;
+
 	if (current->mm == kvm->mm) {
 		/*
 		 * Free memory regions allocated on behalf of userspace,
@@ -10102,6 +10235,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	}
 	if (kvm_x86_ops.vm_destroy)
 		kvm_x86_ops.vm_destroy(kvm);
+	for (i = 0; i < kvm->arch.msr_allowlist_ranges_count; i++)
+		kfree(kvm->arch.msr_allowlist_ranges[i].bitmap);
 	kvm_pic_destroy(kvm);
 	kvm_ioapic_destroy(kvm);
 	kvm_free_vcpus(kvm);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13fc7de1eb50..4d6bb06e0fb1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1041,6 +1041,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_HALT_POLL 182
 #define KVM_CAP_ASYNC_PF_INT 183
 #define KVM_CAP_X86_USER_SPACE_MSR 184
+#define KVM_CAP_X86_MSR_ALLOWLIST 185
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1542,6 +1543,10 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_S390_PROTECTED */
 #define KVM_S390_PV_COMMAND		_IOWR(KVMIO, 0xc5, struct kvm_pv_cmd)
 
+/* Available with KVM_CAP_X86_MSR_ALLOWLIST */
+#define KVM_X86_ADD_MSR_ALLOWLIST	_IOW(KVMIO,  0xc6, struct kvm_msr_allowlist)
+#define KVM_X86_CLEAR_MSR_ALLOWLIST	_IO(KVMIO,  0xc7)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
-- 
2.17.1





* [PATCH v3 3/3] KVM: selftests: Add test for user space MSR handling
  2020-07-31 21:49 [PATCH v3 0/3] Allow user space to restrict and augment MSR emulation Alexander Graf
  2020-07-31 21:49 ` [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space Alexander Graf
  2020-07-31 21:49 ` [PATCH v3 2/3] KVM: x86: Introduce allow list for MSR emulation Alexander Graf
@ 2020-07-31 21:49 ` Alexander Graf
  2 siblings, 0 replies; 11+ messages in thread
From: Alexander Graf @ 2020-07-31 21:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, KarimAllah Raslan,
	Aaron Lewis, kvm, linux-doc, linux-kernel

Now that we have the ability to handle MSRs from user space and also to
select which ones we do want to prevent in-kernel KVM code from handling,
let's add a selftest to showcase and verify the API.

Signed-off-by: Alexander Graf <graf@amazon.com>

---

v2 -> v3:

  - s/KVM_CAP_ADD_MSR_ALLOWLIST/KVM_CAP_X86_MSR_ALLOWLIST/g
  - Add test to clear allowlist
  - Adjust to reply-less API
  - Fix asserts
  - Actually trap on MSR_IA32_POWER_CTL writes
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/user_msr_test.c      | 221 ++++++++++++++++++
 2 files changed, 222 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/user_msr_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 4a166588d99f..80d5c348354c 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -55,6 +55,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
 TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
 TEST_GEN_PROGS_x86_64 += x86_64/debug_regs
+TEST_GEN_PROGS_x86_64 += x86_64/user_msr_test
 TEST_GEN_PROGS_x86_64 += clear_dirty_log_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
diff --git a/tools/testing/selftests/kvm/x86_64/user_msr_test.c b/tools/testing/selftests/kvm/x86_64/user_msr_test.c
new file mode 100644
index 000000000000..7b149424690d
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/user_msr_test.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * tests for KVM_CAP_X86_USER_SPACE_MSR and KVM_X86_ADD_MSR_ALLOWLIST
+ *
+ * Copyright (C) 2020, Amazon Inc.
+ *
+ * This is a functional test to verify that we can deflect MSR events
+ * into user space.
+ */
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+
+#include "test_util.h"
+
+#include "kvm_util.h"
+#include "processor.h"
+
+#define VCPU_ID                  5
+
+u32 msr_reads, msr_writes;
+
+struct range_desc {
+	struct kvm_msr_allowlist allow;
+	void (*populate)(struct kvm_msr_allowlist *range);
+};
+
+static void populate_c0000000_read(struct kvm_msr_allowlist *range)
+{
+	u8 *bitmap = range->bitmap;
+	u32 idx = MSR_SYSCALL_MASK & (KVM_MSR_ALLOWLIST_MAX_LEN - 1);
+
+	bitmap[idx / 8] &= ~(1 << (idx % 8));
+}
+
+static void populate_00000000_write(struct kvm_msr_allowlist *range)
+{
+	u8 *bitmap = range->bitmap;
+	u32 idx = MSR_IA32_POWER_CTL & (KVM_MSR_ALLOWLIST_MAX_LEN - 1);
+
+	bitmap[idx / 8] &= ~(1 << (idx % 8));
+}
+
+struct range_desc ranges[] = {
+	{
+		.allow = {
+			.flags = KVM_MSR_ALLOW_READ,
+			.base = 0x00000000,
+			.nmsrs = KVM_MSR_ALLOWLIST_MAX_LEN * BITS_PER_BYTE,
+		},
+	}, {
+		.allow = {
+			.flags = KVM_MSR_ALLOW_WRITE,
+			.base = 0x00000000,
+			.nmsrs = KVM_MSR_ALLOWLIST_MAX_LEN * BITS_PER_BYTE,
+		},
+		.populate = populate_00000000_write,
+	}, {
+		.allow = {
+			.flags = KVM_MSR_ALLOW_READ | KVM_MSR_ALLOW_WRITE,
+			.base = 0x40000000,
+			.nmsrs = KVM_MSR_ALLOWLIST_MAX_LEN * BITS_PER_BYTE,
+		},
+	}, {
+		.allow = {
+			.flags = KVM_MSR_ALLOW_READ,
+			.base = 0xc0000000,
+			.nmsrs = KVM_MSR_ALLOWLIST_MAX_LEN * BITS_PER_BYTE,
+		},
+		.populate = populate_c0000000_read,
+	}, {
+		.allow = {
+			.flags = KVM_MSR_ALLOW_WRITE,
+			.base = 0xc0000000,
+			.nmsrs = KVM_MSR_ALLOWLIST_MAX_LEN * BITS_PER_BYTE,
+		},
+	},
+};
+
+static void guest_msr_calls(bool trapped)
+{
+	/* This goes into the in-kernel emulation */
+	wrmsr(MSR_SYSCALL_MASK, 0);
+
+	if (trapped) {
+		/* This goes into user space emulation */
+		GUEST_ASSERT(rdmsr(MSR_SYSCALL_MASK) == MSR_SYSCALL_MASK);
+	} else {
+		GUEST_ASSERT(rdmsr(MSR_SYSCALL_MASK) != MSR_SYSCALL_MASK);
+	}
+
+	/* If trapped == true, this goes into user space emulation */
+	wrmsr(MSR_IA32_POWER_CTL, 0x1234);
+
+	/* This goes into the in-kernel emulation */
+	rdmsr(MSR_IA32_POWER_CTL);
+}
+
+static void guest_code(void)
+{
+	guest_msr_calls(true);
+
+	/*
+	 * Disable allow listing, so that the kernel
+	 * handles everything in the next round
+	 */
+	GUEST_SYNC(0);
+
+	guest_msr_calls(false);
+
+	GUEST_DONE();
+}
+
+static int handle_ucall(struct kvm_vm *vm)
+{
+	struct ucall uc;
+
+	switch (get_ucall(vm, VCPU_ID, &uc)) {
+	case UCALL_ABORT:
+		TEST_FAIL("Guest assertion not met");
+		break;
+	case UCALL_SYNC:
+		vm_ioctl(vm, KVM_X86_CLEAR_MSR_ALLOWLIST, NULL);
+		break;
+	case UCALL_DONE:
+		return 1;
+	default:
+		TEST_FAIL("Unknown ucall %lu", uc.cmd);
+	}
+
+	return 0;
+}
+
+static void handle_rdmsr(struct kvm_run *run)
+{
+	run->msr.data = run->msr.index;
+	msr_reads++;
+}
+
+static void handle_wrmsr(struct kvm_run *run)
+{
+	/* ignore */
+	msr_writes++;
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_enable_cap cap = {
+		.cap = KVM_CAP_X86_USER_SPACE_MSR,
+		.args[0] = 1,
+	};
+	struct kvm_vm *vm;
+	struct kvm_run *run;
+	int rc;
+	int i;
+
+	/* Tell stdout not to buffer its content */
+	setbuf(stdout, NULL);
+
+	/* Create VM */
+	vm = vm_create_default(VCPU_ID, 0, guest_code);
+	vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
+	run = vcpu_state(vm, VCPU_ID);
+
+	rc = kvm_check_cap(KVM_CAP_X86_USER_SPACE_MSR);
+	TEST_ASSERT(rc, "KVM_CAP_X86_USER_SPACE_MSR is available");
+	vm_enable_cap(vm, &cap);
+
+	rc = kvm_check_cap(KVM_CAP_X86_MSR_ALLOWLIST);
+	TEST_ASSERT(rc, "KVM_CAP_X86_MSR_ALLOWLIST is available");
+
+	/* Set up MSR allowlist */
+	for (i = 0; i < ARRAY_SIZE(ranges); i++) {
+		struct kvm_msr_allowlist *a = &ranges[i].allow;
+		u32 bitmap_size = a->nmsrs / BITS_PER_BYTE;
+		struct kvm_msr_allowlist *range = malloc(sizeof(*a) + bitmap_size);
+
+		TEST_ASSERT(range, "range alloc failed (%ld bytes)\n", sizeof(*a) + bitmap_size);
+
+		*range = *a;
+
+		/* Allow everything by default */
+		memset(range->bitmap, 0xff, bitmap_size);
+
+		if (ranges[i].populate)
+			ranges[i].populate(range);
+
+		vm_ioctl(vm, KVM_X86_ADD_MSR_ALLOWLIST, range);
+	}
+
+	while (1) {
+		rc = _vcpu_run(vm, VCPU_ID);
+
+		TEST_ASSERT(rc == 0, "vcpu_run failed: %d\n", rc);
+
+		switch (run->exit_reason) {
+		case KVM_EXIT_X86_RDMSR:
+			handle_rdmsr(run);
+			break;
+		case KVM_EXIT_X86_WRMSR:
+			handle_wrmsr(run);
+			break;
+		case KVM_EXIT_IO:
+			if (handle_ucall(vm))
+				goto done;
+			break;
+		}
+
+	}
+
+done:
+	TEST_ASSERT(msr_reads == 1, "Handled 1 rdmsr in user space");
+	TEST_ASSERT(msr_writes == 1, "Handled 1 wrmsr in user space");
+
+	kvm_vm_free(vm);
+
+	return 0;
+}
-- 
2.17.1





* Re: [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space
  2020-07-31 21:49 ` [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space Alexander Graf
@ 2020-07-31 23:36   ` Jim Mattson
  2020-08-03 10:08     ` Alexander Graf
  2020-08-03 11:27   ` Vitaly Kuznetsov
  1 sibling, 1 reply; 11+ messages in thread
From: Jim Mattson @ 2020-07-31 23:36 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, KarimAllah Raslan,
	Aaron Lewis, kvm list, linux-doc, LKML

On Fri, Jul 31, 2020 at 2:50 PM Alexander Graf <graf@amazon.com> wrote:
>
> MSRs are weird. Some of them are normal control registers, such as EFER.
> Some however are registers that really are model specific, not very
> interesting to virtualization workloads, and not performance critical.
> Others again are really just windows into package configuration.
>
> Out of these MSRs, only the first category is necessary to implement in
> kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
> certain CPU models and MSRs that contain information on the package level
> are much better suited for user space to process. However, over time we have
> accumulated a lot of MSRs that are not the first category, but still handled
> by in-kernel KVM code.
>
> This patch adds a generic interface to handle WRMSR and RDMSR from user
> space. With this, any future MSR that is part of the latter categories can
> be handled in user space.
>
> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
> something that applies per-VM rather than on the full system. That way you
> can run productive VMs in parallel to experimental ones where you don't care
> about proper MSR handling.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
>
> ---
>
> v1 -> v2:
>
>   - s/ETRAP_TO_USER_SPACE/ENOENT/g
>   - deflect all #GP injection events to user space, not just unknown MSRs.
>     That was we can also deflect allowlist errors later
>   - fix emulator case
>
> v2 -> v3:
>
>   - return r if r == X86EMUL_IO_NEEDED
>   - s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
>   - s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
>   - Use complete_userspace_io logic instead of reply field
>   - Simplify trapping code
> ---
>  Documentation/virt/kvm/api.rst  |  62 +++++++++++++++++++
>  arch/x86/include/asm/kvm_host.h |   6 ++
>  arch/x86/kvm/emulate.c          |  18 +++++-
>  arch/x86/kvm/x86.c              | 106 ++++++++++++++++++++++++++++++--
>  include/trace/events/kvm.h      |   2 +-
>  include/uapi/linux/kvm.h        |  10 +++
>  6 files changed, 197 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 320788f81a05..79c3e2fdfae4 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst

The new exit reasons should probably be mentioned here (around line 4866):

.. note::

      For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
      KVM_EXIT_EPR the corresponding

operations are complete (and guest state is consistent) only after userspace
has re-entered the kernel with KVM_RUN.  The kernel side will first finish
incomplete operations and then check for pending signals.  Userspace
can re-enter the guest with an unmasked signal pending to complete
pending operations.

Other than that, my remaining comments are all nits. Feel free to ignore them.

> +static int kvm_get_msr_user_space(struct kvm_vcpu *vcpu, u32 index)

Return bool rather than int?

> +{
> +       if (!vcpu->kvm->arch.user_space_msr_enabled)
> +               return 0;
> +
> +       vcpu->run->exit_reason = KVM_EXIT_X86_RDMSR;
> +       vcpu->run->msr.error = 0;

Should we clear 'pad' in case anyone can think of a reason to use this
space to extend the API in the future?

> +       vcpu->run->msr.index = index;
> +       vcpu->arch.pending_user_msr = true;
> +       vcpu->arch.complete_userspace_io = complete_emulated_rdmsr;

complete_userspace_io could perhaps be renamed to
complete_userspace_emulation (in a separate commit).

> +
> +       return 1;
> +}
> +
> +static int kvm_set_msr_user_space(struct kvm_vcpu *vcpu, u32 index, u64 data)

Return bool rather than int?

> +{
> +       if (!vcpu->kvm->arch.user_space_msr_enabled)
> +               return 0;
> +
> +       vcpu->run->exit_reason = KVM_EXIT_X86_WRMSR;
> +       vcpu->run->msr.error = 0;

Same question about 'pad' as above.

> +       vcpu->run->msr.index = index;
> +       vcpu->run->msr.data = data;
> +       vcpu->arch.pending_user_msr = true;
> +       vcpu->arch.complete_userspace_io = complete_emulated_wrmsr;
> +
> +       return 1;
> +}
> +

Reviewed-by: Jim Mattson <jmattson@google.com>


* Re: [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space
  2020-07-31 23:36   ` Jim Mattson
@ 2020-08-03 10:08     ` Alexander Graf
  0 siblings, 0 replies; 11+ messages in thread
From: Alexander Graf @ 2020-08-03 10:08 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, KarimAllah Raslan,
	Aaron Lewis, kvm list, linux-doc, LKML



On 01.08.20 01:36, Jim Mattson wrote:
> 
> On Fri, Jul 31, 2020 at 2:50 PM Alexander Graf <graf@amazon.com> wrote:
>>
>> MSRs are weird. Some of them are normal control registers, such as EFER.
>> Some however are registers that really are model specific, not very
>> interesting to virtualization workloads, and not performance critical.
>> Others again are really just windows into package configuration.
>>
>> Out of these MSRs, only the first category is necessary to implement in
>> kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
>> certain CPU models and MSRs that contain information on the package level
>> are much better suited for user space to process. However, over time we have
>> accumulated a lot of MSRs that are not the first category, but still handled
>> by in-kernel KVM code.
>>
>> This patch adds a generic interface to handle WRMSR and RDMSR from user
>> space. With this, any future MSR that is part of the latter categories can
>> be handled in user space.
>>
>> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
>> something that applies per-VM rather than on the full system. That way you
>> can run productive VMs in parallel to experimental ones where you don't care
>> about proper MSR handling.
>>
>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>
>> ---
>>
>> v1 -> v2:
>>
>>    - s/ETRAP_TO_USER_SPACE/ENOENT/g
>>    - deflect all #GP injection events to user space, not just unknown MSRs.
>>      That was we can also deflect allowlist errors later
>>    - fix emulator case
>>
>> v2 -> v3:
>>
>>    - return r if r == X86EMUL_IO_NEEDED
>>    - s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
>>    - s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
>>    - Use complete_userspace_io logic instead of reply field
>>    - Simplify trapping code
>> ---
>>   Documentation/virt/kvm/api.rst  |  62 +++++++++++++++++++
>>   arch/x86/include/asm/kvm_host.h |   6 ++
>>   arch/x86/kvm/emulate.c          |  18 +++++-
>>   arch/x86/kvm/x86.c              | 106 ++++++++++++++++++++++++++++++--
>>   include/trace/events/kvm.h      |   2 +-
>>   include/uapi/linux/kvm.h        |  10 +++
>>   6 files changed, 197 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 320788f81a05..79c3e2fdfae4 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
> 
> The new exit reasons should probably be mentioned here (around line 4866):
> 
> .. note::
> 
>        For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
>        KVM_EXIT_EPR the corresponding
> 
> operations are complete (and guest state is consistent) only after userspace
> has re-entered the kernel with KVM_RUN.  The kernel side will first finish
> incomplete operations and then check for pending signals.  Userspace
> can re-enter the guest with an unmasked signal pending to complete
> pending operations.

Great catch, thanks! Updated to also include the two new exit reasons.

> 
> Other than that, my remaining comments are all nits. Feel free to ignore them.
> 
>> +static int kvm_get_msr_user_space(struct kvm_vcpu *vcpu, u32 index)
> 
> Return bool rather than int?

I'm not a big fan of bool returning APIs unless they have an "is" in 
their name. In this case, the most readable path forward would probably 
be an enum:

enum kvm_msr_user_space_retval {
     KVM_MSR_IN_KERNEL,
     KVM_MSR_BOUNCE_TO_USER_SPACE,
};

and then use that in the checks. But that adds a lot of boilerplate for
a fully internal API of only a few dozen LOC. I don't think it's worth it.

> 
>> +{
>> +       if (!vcpu->kvm->arch.user_space_msr_enabled)
>> +               return 0;
>> +
>> +       vcpu->run->exit_reason = KVM_EXIT_X86_RDMSR;
>> +       vcpu->run->msr.error = 0;
> 
> Should we clear 'pad' in case anyone can think of a reason to use this
> space to extend the API in the future?

It can't hurt I guess.

> 
>> +       vcpu->run->msr.index = index;
>> +       vcpu->arch.pending_user_msr = true;
>> +       vcpu->arch.complete_userspace_io = complete_emulated_rdmsr;
> 
> complete_userspace_io could perhaps be renamed to
> complete_userspace_emulation (in a separate commit).

I think the complicated part of complete_userspace_io is to know it 
exists and understand how it works. Once you grasp these two bits, the 
name is just an artifact and IMHO easy enough to apply "beyond I/O".

> 
>> +
>> +       return 1;
>> +}
>> +
>> +static int kvm_set_msr_user_space(struct kvm_vcpu *vcpu, u32 index, u64 data)
> 
> Return bool rather than int?

Same replies as above :). I did get fed up with the amount of 
duplication though and created a generalized function in v4 that gets 
called by kvm_get/set_msr_user_space() to ensure that all fields are 
always set.

> 
>> +{
>> +       if (!vcpu->kvm->arch.user_space_msr_enabled)
>> +               return 0;
>> +
>> +       vcpu->run->exit_reason = KVM_EXIT_X86_WRMSR;
>> +       vcpu->run->msr.error = 0;
> 
> Same question about 'pad' as above.
> 
>> +       vcpu->run->msr.index = index;
>> +       vcpu->run->msr.data = data;
>> +       vcpu->arch.pending_user_msr = true;
>> +       vcpu->arch.complete_userspace_io = complete_emulated_wrmsr;
>> +
>> +       return 1;
>> +}
>> +
> 
> Reviewed-by: Jim Mattson <jmattson@google.com>

Thanks a bunch for the review :)


Alex




* Re: [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space
  2020-07-31 21:49 ` [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space Alexander Graf
  2020-07-31 23:36   ` Jim Mattson
@ 2020-08-03 11:27   ` Vitaly Kuznetsov
  2020-08-03 11:34     ` Alexander Graf
  1 sibling, 1 reply; 11+ messages in thread
From: Vitaly Kuznetsov @ 2020-08-03 11:27 UTC (permalink / raw)
  To: Alexander Graf, Paolo Bonzini
  Cc: Jonathan Corbet, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel, KarimAllah Raslan, Aaron Lewis, kvm, linux-doc,
	linux-kernel

Alexander Graf <graf@amazon.com> writes:

> MSRs are weird. Some of them are normal control registers, such as EFER.
> Some however are registers that really are model specific, not very
> interesting to virtualization workloads, and not performance critical.
> Others again are really just windows into package configuration.
>
> Out of these MSRs, only the first category is necessary to implement in
> kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
> certain CPU models and MSRs that contain information on the package level
> are much better suited for user space to process. However, over time we have
> accumulated a lot of MSRs that are not the first category, but still handled
> by in-kernel KVM code.
>
> This patch adds a generic interface to handle WRMSR and RDMSR from user
> space. With this, any future MSR that is part of the latter categories can
> be handled in user space.
>
> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
> something that applies per-VM rather than on the full system. That way you
> can run productive VMs in parallel to experimental ones where you don't care
> about proper MSR handling.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
>
> ---
>
> v1 -> v2:
>
>   - s/ETRAP_TO_USER_SPACE/ENOENT/g
>   - deflect all #GP injection events to user space, not just unknown MSRs.
>     That was we can also deflect allowlist errors later
>   - fix emulator case
>
> v2 -> v3:
>
>   - return r if r == X86EMUL_IO_NEEDED
>   - s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
>   - s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
>   - Use complete_userspace_io logic instead of reply field
>   - Simplify trapping code
> ---
>  Documentation/virt/kvm/api.rst  |  62 +++++++++++++++++++
>  arch/x86/include/asm/kvm_host.h |   6 ++
>  arch/x86/kvm/emulate.c          |  18 +++++-
>  arch/x86/kvm/x86.c              | 106 ++++++++++++++++++++++++++++++--
>  include/trace/events/kvm.h      |   2 +-
>  include/uapi/linux/kvm.h        |  10 +++
>  6 files changed, 197 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 320788f81a05..79c3e2fdfae4 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5155,6 +5155,35 @@ Note that KVM does not skip the faulting instruction as it does for
>  KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
>  if it decides to decode and emulate the instruction.
>  
> +::
> +
> +		/* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
> +		struct {
> +			__u8 error;
> +			__u8 pad[3];
> +			__u32 index;
> +			__u64 data;
> +		} msr;
> +
> +Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
> +enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
> +will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
> +exit for writes.
> +
> +For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
> +wants to read. To respond to this request with a successful read, user space
> +writes the respective data into the "data" field and must continue guest
> +execution to ensure the read data is transferred into guest register state.
> +
> +If the RDMSR request was unsuccessful, user space indicates that with a "1" in
> +the "error" field. This will inject a #GP into the guest when the VCPU is
> +executed again.
> +
> +For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
> +wants to write. Once finished processing the event, user space must continue
> +vCPU execution. If the MSR write was unsuccessful, user space also sets the
> +"error" field to "1".
> +
>  ::
>  
>  		/* Fix the size of the union. */
> @@ -5844,6 +5873,28 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows
>  the maximum halt time to specified on a per-VM basis, effectively overriding
>  the module parameter for the target VM.
>  
> +7.21 KVM_CAP_X86_USER_SPACE_MSR
> +-------------------------------
> +
> +:Architectures: x86
> +:Target: VM
> +:Parameters: args[0] is 1 if user space MSR handling is enabled, 0 otherwise
> +:Returns: 0 on success; -1 on error
> +
> +This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
> +into user space.
> +
> +When a guest requests to read or write an MSR, KVM may not implement all MSRs
> +that are relevant to a respective system. It also does not differentiate by
> +CPU type.
> +
> +To allow more fine grained control over MSR handling, user space may enable
> +this capability. With it enabled, MSR accesses that would usually trigger
> +a #GP event inside the guest by KVM will instead trigger KVM_EXIT_X86_RDMSR
> +and KVM_EXIT_X86_WRMSR exit notifications which user space can then handle to
> +implement model specific MSR handling and/or user notifications to inform
> +a user that an MSR was not handled.
> +
>  8. Other capabilities.
>  ======================
>  
> @@ -6151,3 +6202,14 @@ KVM can therefore start protected VMs.
>  This capability governs the KVM_S390_PV_COMMAND ioctl and the
>  KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
>  guests when the state change is invalid.
> +
> +8.24 KVM_CAP_X86_USER_SPACE_MSR
> +----------------------------
> +
> +:Architectures: x86
> +
> +This capability indicates that KVM supports deflection of MSR reads and
> +writes to user space. It can be enabled on a VM level. If enabled, MSR
> +accesses that would usually trigger a #GP by KVM into the guest will
> +instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
> +KVM_EXIT_X86_WRMSR exit notifications.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..809eed0dbdea 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -829,6 +829,9 @@ struct kvm_vcpu_arch {
>  
>  	/* AMD MSRC001_0015 Hardware Configuration */
>  	u64 msr_hwcr;
> +
> +	/* User space is handling an MSR request */
> +	bool pending_user_msr;
>  };
>  
>  struct kvm_lpage_info {
> @@ -1002,6 +1005,9 @@ struct kvm_arch {
>  	bool guest_can_read_msr_platform_info;
>  	bool exception_payload_enabled;
>  
> +	/* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
> +	bool user_space_msr_enabled;
> +
>  	struct kvm_pmu_event_filter *pmu_event_filter;
>  	struct task_struct *nx_lpage_recovery_thread;
>  };
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index d0e2825ae617..744ab9c92b73 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -3689,11 +3689,18 @@ static int em_dr_write(struct x86_emulate_ctxt *ctxt)
>  
>  static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
>  {
> +	u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
>  	u64 msr_data;
> +	int r;
>  
>  	msr_data = (u32)reg_read(ctxt, VCPU_REGS_RAX)
>  		| ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
> -	if (ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data))
> +	r = ctxt->ops->set_msr(ctxt, msr_index, msr_data);
> +
> +	if (r == X86EMUL_IO_NEEDED)
> +		return r;
> +
> +	if (r)
>  		return emulate_gp(ctxt, 0);
>  
>  	return X86EMUL_CONTINUE;
> @@ -3701,9 +3708,16 @@ static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
>  
>  static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
>  {
> +	u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
>  	u64 msr_data;
> +	int r;
> +
> +	r = ctxt->ops->get_msr(ctxt, msr_index, &msr_data);
> +
> +	if (r == X86EMUL_IO_NEEDED)
> +		return r;
>  
> -	if (ctxt->ops->get_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), &msr_data))
> +	if (r)
>  		return emulate_gp(ctxt, 0);
>  
>  	*reg_write(ctxt, VCPU_REGS_RAX) = (u32)msr_data;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 88c593f83b28..24c72250f6df 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1549,12 +1549,75 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_msr);
>  
> +static int complete_emulated_msr(struct kvm_vcpu *vcpu, bool is_read)
> +{
> +	BUG_ON(!vcpu->arch.pending_user_msr);
> +
> +	if (vcpu->run->msr.error) {
> +		kvm_inject_gp(vcpu, 0);
> +	} else if (is_read) {
> +		kvm_rax_write(vcpu, (u32)vcpu->run->msr.data);
> +		kvm_rdx_write(vcpu, vcpu->run->msr.data >> 32);
> +	}
> +
> +	return kvm_skip_emulated_instruction(vcpu);
> +}
> +
> +static int complete_emulated_rdmsr(struct kvm_vcpu *vcpu)
> +{
> +	return complete_emulated_msr(vcpu, true);
> +}
> +
> +static int complete_emulated_wrmsr(struct kvm_vcpu *vcpu)
> +{
> +	return complete_emulated_msr(vcpu, false);
> +}
> +
> +static int kvm_get_msr_user_space(struct kvm_vcpu *vcpu, u32 index)
> +{
> +	if (!vcpu->kvm->arch.user_space_msr_enabled)
> +		return 0;
> +
> +	vcpu->run->exit_reason = KVM_EXIT_X86_RDMSR;
> +	vcpu->run->msr.error = 0;
> +	vcpu->run->msr.index = index;
> +	vcpu->arch.pending_user_msr = true;
> +	vcpu->arch.complete_userspace_io = complete_emulated_rdmsr;
> +
> +	return 1;
> +}
> +
> +static int kvm_set_msr_user_space(struct kvm_vcpu *vcpu, u32 index, u64 data)
> +{
> +	if (!vcpu->kvm->arch.user_space_msr_enabled)
> +		return 0;
> +
> +	vcpu->run->exit_reason = KVM_EXIT_X86_WRMSR;
> +	vcpu->run->msr.error = 0;
> +	vcpu->run->msr.index = index;
> +	vcpu->run->msr.data = data;
> +	vcpu->arch.pending_user_msr = true;
> +	vcpu->arch.complete_userspace_io = complete_emulated_wrmsr;

I'm probably missing something but where do we reset
vcpu->arch.pending_user_msr? Shouldn't it be done in
complete_emulated_msr()?

> +
> +	return 1;
> +}
> +
>  int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu)
>  {
>  	u32 ecx = kvm_rcx_read(vcpu);
>  	u64 data;
> +	int r;
> +
> +	r = kvm_get_msr(vcpu, ecx, &data);
>  
> -	if (kvm_get_msr(vcpu, ecx, &data)) {
> +	/* MSR read failed? See if we should ask user space */
> +	if (r && kvm_get_msr_user_space(vcpu, ecx)) {
> +		/* Bounce to user space */
> +		return 0;
> +	}
> +
> +	/* MSR read failed? Inject a #GP */
> +	if (r) {
>  		trace_kvm_msr_read_ex(ecx);
>  		kvm_inject_gp(vcpu, 0);
>  		return 1;
> @@ -1572,8 +1635,18 @@ int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu)
>  {
>  	u32 ecx = kvm_rcx_read(vcpu);
>  	u64 data = kvm_read_edx_eax(vcpu);
> +	int r;
> +
> +	r = kvm_set_msr(vcpu, ecx, data);
>  
> -	if (kvm_set_msr(vcpu, ecx, data)) {
> +	/* MSR write failed? See if we should ask user space */
> +	if (r && kvm_set_msr_user_space(vcpu, ecx, data)) {
> +		/* Bounce to user space */
> +		return 0;
> +	}
> +
> +	/* MSR write failed? Inject a #GP */
> +	if (r) {
>  		trace_kvm_msr_write_ex(ecx, data);
>  		kvm_inject_gp(vcpu, 0);
>  		return 1;
> @@ -3476,6 +3549,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_MSR_PLATFORM_INFO:
>  	case KVM_CAP_EXCEPTION_PAYLOAD:
>  	case KVM_CAP_SET_GUEST_DEBUG:
> +	case KVM_CAP_X86_USER_SPACE_MSR:
>  		r = 1;
>  		break;
>  	case KVM_CAP_SYNC_REGS:
> @@ -4990,6 +5064,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		kvm->arch.exception_payload_enabled = cap->args[0];
>  		r = 0;
>  		break;
> +	case KVM_CAP_X86_USER_SPACE_MSR:
> +		kvm->arch.user_space_msr_enabled = cap->args[0];
> +		r = 0;
> +		break;
>  	default:
>  		r = -EINVAL;
>  		break;
> @@ -6319,13 +6397,33 @@ static void emulator_set_segment(struct x86_emulate_ctxt *ctxt, u16 selector,
>  static int emulator_get_msr(struct x86_emulate_ctxt *ctxt,
>  			    u32 msr_index, u64 *pdata)
>  {
> -	return kvm_get_msr(emul_to_vcpu(ctxt), msr_index, pdata);
> +	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
> +	int r;
> +
> +	r = kvm_get_msr(vcpu, msr_index, pdata);
> +
> +	if (r && kvm_get_msr_user_space(vcpu, msr_index)) {
> +		/* Bounce to user space */
> +		return X86EMUL_IO_NEEDED;
> +	}
> +
> +	return r;
>  }
>  
>  static int emulator_set_msr(struct x86_emulate_ctxt *ctxt,
>  			    u32 msr_index, u64 data)
>  {
> -	return kvm_set_msr(emul_to_vcpu(ctxt), msr_index, data);
> +	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
> +	int r;
> +
> +	r = kvm_set_msr(emul_to_vcpu(ctxt), msr_index, data);
> +
> +	if (r && kvm_set_msr_user_space(vcpu, msr_index, data)) {
> +		/* Bounce to user space */
> +		return X86EMUL_IO_NEEDED;
> +	}
> +
> +	return r;
>  }
>  
>  static u64 emulator_get_smbase(struct x86_emulate_ctxt *ctxt)
> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> index 9417a34aad08..26cfb0fa8e7e 100644
> --- a/include/trace/events/kvm.h
> +++ b/include/trace/events/kvm.h
> @@ -17,7 +17,7 @@
>  	ERSN(NMI), ERSN(INTERNAL_ERROR), ERSN(OSI), ERSN(PAPR_HCALL),	\
>  	ERSN(S390_UCONTROL), ERSN(WATCHDOG), ERSN(S390_TSCH), ERSN(EPR),\
>  	ERSN(SYSTEM_EVENT), ERSN(S390_STSI), ERSN(IOAPIC_EOI),          \
> -	ERSN(HYPERV), ERSN(ARM_NISV)
> +	ERSN(HYPERV), ERSN(ARM_NISV), ERSN(X86_RDMSR), ERSN(X86_WRMSR)
>  
>  TRACE_EVENT(kvm_userspace_exit,
>  	    TP_PROTO(__u32 reason, int errno),
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4fdf30316582..13fc7de1eb50 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -248,6 +248,8 @@ struct kvm_hyperv_exit {
>  #define KVM_EXIT_IOAPIC_EOI       26
>  #define KVM_EXIT_HYPERV           27
>  #define KVM_EXIT_ARM_NISV         28
> +#define KVM_EXIT_X86_RDMSR        29
> +#define KVM_EXIT_X86_WRMSR        30
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -412,6 +414,13 @@ struct kvm_run {
>  			__u64 esr_iss;
>  			__u64 fault_ipa;
>  		} arm_nisv;
> +		/* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
> +		struct {
> +			__u8 error;
> +			__u8 pad[3];
> +			__u32 index;
> +			__u64 data;
> +		} msr;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -1031,6 +1040,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_SECURE_GUEST 181
>  #define KVM_CAP_HALT_POLL 182
>  #define KVM_CAP_ASYNC_PF_INT 183
> +#define KVM_CAP_X86_USER_SPACE_MSR 184
>  
>  #ifdef KVM_CAP_IRQ_ROUTING

-- 
Vitaly


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space
  2020-08-03 11:27   ` Vitaly Kuznetsov
@ 2020-08-03 11:34     ` Alexander Graf
  0 siblings, 0 replies; 11+ messages in thread
From: Alexander Graf @ 2020-08-03 11:34 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Paolo Bonzini
  Cc: Jonathan Corbet, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel, KarimAllah Raslan, Aaron Lewis, kvm, linux-doc,
	linux-kernel



On 03.08.20 13:27, Vitaly Kuznetsov wrote:
> Alexander Graf <graf@amazon.com> writes:
> 
>> MSRs are weird. Some of them are normal control registers, such as EFER.
>> Some however are registers that really are model specific, not very
>> interesting to virtualization workloads, and not performance critical.
>> Others again are really just windows into package configuration.
>>
>> Out of these MSRs, only the first category is necessary to implement in
>> kernel space. Rarely accessed MSRs, MSRs that should be fine tuned against
>> certain CPU models and MSRs that contain information on the package level
>> are much better suited for user space to process. However, over time we have
>> accumulated a lot of MSRs that are not the first category, but still handled
>> by in-kernel KVM code.
>>
>> This patch adds a generic interface to handle WRMSR and RDMSR from user
>> space. With this, any future MSR that is part of the latter categories can
>> be handled in user space.
>>
>> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
>> something that applies per-VM rather than on the full system. That way you
>> can run production VMs in parallel to experimental ones where you don't care
>> about proper MSR handling.
>>
>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>
>> ---
>>
>> v1 -> v2:
>>
>>    - s/ETRAP_TO_USER_SPACE/ENOENT/g
>>    - deflect all #GP injection events to user space, not just unknown MSRs.
>>      That way we can also deflect allowlist errors later
>>    - fix emulator case
>>
>> v2 -> v3:
>>
>>    - return r if r == X86EMUL_IO_NEEDED
>>    - s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
>>    - s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
>>    - Use complete_userspace_io logic instead of reply field
>>    - Simplify trapping code
>> ---
>>   Documentation/virt/kvm/api.rst  |  62 +++++++++++++++++++
>>   arch/x86/include/asm/kvm_host.h |   6 ++
>>   arch/x86/kvm/emulate.c          |  18 +++++-
>>   arch/x86/kvm/x86.c              | 106 ++++++++++++++++++++++++++++++--
>>   include/trace/events/kvm.h      |   2 +-
>>   include/uapi/linux/kvm.h        |  10 +++
>>   6 files changed, 197 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 320788f81a05..79c3e2fdfae4 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -5155,6 +5155,35 @@ Note that KVM does not skip the faulting instruction as it does for
>>   KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
>>   if it decides to decode and emulate the instruction.
>>
>> +::
>> +
>> +             /* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
>> +             struct {
>> +                     __u8 error;
>> +                     __u8 pad[3];
>> +                     __u32 index;
>> +                     __u64 data;
>> +             } msr;
>> +
>> +Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
>> +enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
>> +will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
>> +exit for writes.
>> +
>> +For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
>> +wants to read. To respond to this request with a successful read, user space
>> +writes the respective data into the "data" field and must continue guest
>> +execution to ensure the read data is transferred into guest register state.
>> +
>> +If the RDMSR request was unsuccessful, user space indicates that with a "1" in
>> +the "error" field. This will inject a #GP into the guest when the VCPU is
>> +executed again.
>> +
>> +For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
>> +wants to write. Once finished processing the event, user space must continue
>> +vCPU execution. If the MSR write was unsuccessful, user space also sets the
>> +"error" field to "1".
>> +
>>   ::
>>
>>                /* Fix the size of the union. */
>> @@ -5844,6 +5873,28 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows
>>   the maximum halt time to specified on a per-VM basis, effectively overriding
>>   the module parameter for the target VM.
>>
>> +7.21 KVM_CAP_X86_USER_SPACE_MSR
>> +-------------------------------
>> +
>> +:Architectures: x86
>> +:Target: VM
>> +:Parameters: args[0] is 1 if user space MSR handling is enabled, 0 otherwise
>> +:Returns: 0 on success; -1 on error
>> +
>> +This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
>> +into user space.
>> +
>> +When a guest requests to read or write an MSR, KVM may not implement all MSRs
>> +that are relevant to a respective system. It also does not differentiate by
>> +CPU type.
>> +
>> +To allow more fine grained control over MSR handling, user space may enable
>> +this capability. With it enabled, MSR accesses that would usually trigger
>> +a #GP event inside the guest by KVM will instead trigger KVM_EXIT_X86_RDMSR
>> +and KVM_EXIT_X86_WRMSR exit notifications which user space can then handle to
>> +implement model specific MSR handling and/or user notifications to inform
>> +a user that an MSR was not handled.
>> +
>>   8. Other capabilities.
>>   ======================
>>
>> @@ -6151,3 +6202,14 @@ KVM can therefore start protected VMs.
>>   This capability governs the KVM_S390_PV_COMMAND ioctl and the
>>   KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
>>   guests when the state change is invalid.
>> +
>> +8.24 KVM_CAP_X86_USER_SPACE_MSR
>> +-------------------------------
>> +
>> +:Architectures: x86
>> +
>> +This capability indicates that KVM supports deflection of MSR reads and
>> +writes to user space. It can be enabled on a VM level. If enabled, MSR
>> +accesses that would usually trigger a #GP by KVM into the guest will
>> +instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
>> +KVM_EXIT_X86_WRMSR exit notifications.
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index be5363b21540..809eed0dbdea 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -829,6 +829,9 @@ struct kvm_vcpu_arch {
>>
>>        /* AMD MSRC001_0015 Hardware Configuration */
>>        u64 msr_hwcr;
>> +
>> +     /* User space is handling an MSR request */
>> +     bool pending_user_msr;
>>   };
>>
>>   struct kvm_lpage_info {
>> @@ -1002,6 +1005,9 @@ struct kvm_arch {
>>        bool guest_can_read_msr_platform_info;
>>        bool exception_payload_enabled;
>>
>> +     /* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
>> +     bool user_space_msr_enabled;
>> +
>>        struct kvm_pmu_event_filter *pmu_event_filter;
>>        struct task_struct *nx_lpage_recovery_thread;
>>   };
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index d0e2825ae617..744ab9c92b73 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -3689,11 +3689,18 @@ static int em_dr_write(struct x86_emulate_ctxt *ctxt)
>>
>>   static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
>>   {
>> +     u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
>>        u64 msr_data;
>> +     int r;
>>
>>        msr_data = (u32)reg_read(ctxt, VCPU_REGS_RAX)
>>                | ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
>> -     if (ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data))
>> +     r = ctxt->ops->set_msr(ctxt, msr_index, msr_data);
>> +
>> +     if (r == X86EMUL_IO_NEEDED)
>> +             return r;
>> +
>> +     if (r)
>>                return emulate_gp(ctxt, 0);
>>
>>        return X86EMUL_CONTINUE;
>> @@ -3701,9 +3708,16 @@ static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
>>
>>   static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
>>   {
>> +     u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
>>        u64 msr_data;
>> +     int r;
>> +
>> +     r = ctxt->ops->get_msr(ctxt, msr_index, &msr_data);
>> +
>> +     if (r == X86EMUL_IO_NEEDED)
>> +             return r;
>>
>> -     if (ctxt->ops->get_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), &msr_data))
>> +     if (r)
>>                return emulate_gp(ctxt, 0);
>>
>>        *reg_write(ctxt, VCPU_REGS_RAX) = (u32)msr_data;
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 88c593f83b28..24c72250f6df 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1549,12 +1549,75 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
>>   }
>>   EXPORT_SYMBOL_GPL(kvm_set_msr);
>>
>> +static int complete_emulated_msr(struct kvm_vcpu *vcpu, bool is_read)
>> +{
>> +     BUG_ON(!vcpu->arch.pending_user_msr);
>> +
>> +     if (vcpu->run->msr.error) {
>> +             kvm_inject_gp(vcpu, 0);
>> +     } else if (is_read) {
>> +             kvm_rax_write(vcpu, (u32)vcpu->run->msr.data);
>> +             kvm_rdx_write(vcpu, vcpu->run->msr.data >> 32);
>> +     }
>> +
>> +     return kvm_skip_emulated_instruction(vcpu);
>> +}
>> +
>> +static int complete_emulated_rdmsr(struct kvm_vcpu *vcpu)
>> +{
>> +     return complete_emulated_msr(vcpu, true);
>> +}
>> +
>> +static int complete_emulated_wrmsr(struct kvm_vcpu *vcpu)
>> +{
>> +     return complete_emulated_msr(vcpu, false);
>> +}
>> +
>> +static int kvm_get_msr_user_space(struct kvm_vcpu *vcpu, u32 index)
>> +{
>> +     if (!vcpu->kvm->arch.user_space_msr_enabled)
>> +             return 0;
>> +
>> +     vcpu->run->exit_reason = KVM_EXIT_X86_RDMSR;
>> +     vcpu->run->msr.error = 0;
>> +     vcpu->run->msr.index = index;
>> +     vcpu->arch.pending_user_msr = true;
>> +     vcpu->arch.complete_userspace_io = complete_emulated_rdmsr;
>> +
>> +     return 1;
>> +}
>> +
>> +static int kvm_set_msr_user_space(struct kvm_vcpu *vcpu, u32 index, u64 data)
>> +{
>> +     if (!vcpu->kvm->arch.user_space_msr_enabled)
>> +             return 0;
>> +
>> +     vcpu->run->exit_reason = KVM_EXIT_X86_WRMSR;
>> +     vcpu->run->msr.error = 0;
>> +     vcpu->run->msr.index = index;
>> +     vcpu->run->msr.data = data;
>> +     vcpu->arch.pending_user_msr = true;
>> +     vcpu->arch.complete_userspace_io = complete_emulated_wrmsr;
> 
> I'm probably missing something but where do we reset
> vcpu->arch.pending_user_msr? Shouldn't it be done in
> complete_emulated_msr()?

It's even worse than that: We don't need it at all. I'll remove it for v4.


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/3] KVM: x86: Introduce allow list for MSR emulation
  2020-07-31 21:49 ` [PATCH v3 2/3] KVM: x86: Introduce allow list for MSR emulation Alexander Graf
@ 2020-08-03 11:37   ` Vitaly Kuznetsov
  2020-08-03 20:50     ` Alexander Graf
  0 siblings, 1 reply; 11+ messages in thread
From: Vitaly Kuznetsov @ 2020-08-03 11:37 UTC (permalink / raw)
  To: Alexander Graf, Paolo Bonzini
  Cc: Jonathan Corbet, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel, KarimAllah Raslan, Aaron Lewis, kvm, linux-doc,
	linux-kernel

Alexander Graf <graf@amazon.com> writes:

> It's not desirable to have all MSRs always handled by KVM kernel space. Some
> MSRs would be useful to handle in user space to either emulate behavior (like
> uCode updates) or differentiate whether they are valid based on the CPU model.
>
> To allow user space to specify which MSRs it wants to see handled by KVM,
> this patch introduces a new ioctl to push allow lists of bitmaps into
> KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
> With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
> denied MSR events to user space to operate on.
>
> If no allowlist is populated, MSR handling stays identical to before.
>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> Signed-off-by: Alexander Graf <graf@amazon.com>
>
> ---
>
> v2 -> v3:
>
>   - document flags for KVM_X86_ADD_MSR_ALLOWLIST
>   - generalize exit path, always unlock when returning
>   - s/KVM_CAP_ADD_MSR_ALLOWLIST/KVM_CAP_X86_MSR_ALLOWLIST/g
>   - Add KVM_X86_CLEAR_MSR_ALLOWLIST
> ---
>  Documentation/virt/kvm/api.rst  |  91 +++++++++++++++++++++
>  arch/x86/include/asm/kvm_host.h |  10 +++
>  arch/x86/include/uapi/asm/kvm.h |  15 ++++
>  arch/x86/kvm/x86.c              | 135 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h        |   5 ++
>  5 files changed, 256 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 79c3e2fdfae4..d611ddd326fc 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4697,6 +4697,82 @@ KVM_PV_VM_VERIFY
>    Verify the integrity of the unpacked image. Only if this succeeds,
>    KVM is allowed to start protected VCPUs.
>  
> +4.126 KVM_X86_ADD_MSR_ALLOWLIST
> +-------------------------------
> +
> +:Capability: KVM_CAP_X86_MSR_ALLOWLIST
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_msr_allowlist
> +:Returns: 0 on success, < 0 on error
> +
> +::
> +
> +  struct kvm_msr_allowlist {
> +         __u32 flags;
> +         __u32 nmsrs; /* number of msrs in bitmap */
> +         __u32 base;  /* base address for the MSRs bitmap */
> +         __u32 pad;
> +
> +         __u8 bitmap[0]; /* a set bit allows that the operation set in flags */
> +  };
> +
> +flags values:
> +
> +KVM_MSR_ALLOW_READ
> +
> +  Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
> +  indicates that a read should immediately fail, while a 1 indicates that
> +  a read should be handled by the normal KVM MSR emulation logic.
> +
> +KVM_MSR_ALLOW_WRITE
> +
> +  Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
> +  indicates that a write should immediately fail, while a 1 indicates that
> +  a write should be handled by the normal KVM MSR emulation logic.
> +
> +KVM_MSR_ALLOW_READ | KVM_MSR_ALLOW_WRITE
> +

Should we probably say what KVM_MSR_ALLOW_READ/KVM_MSR_ALLOW_WRITE are
equal to? (1 << 0, 1 << 1)? 

> +  Filter both read and write accesses to MSRs using the given bitmap. A 0
> +  in the bitmap indicates that both reads and writes should immediately fail,
> +  while a 1 indicates that reads and writes should be handled by the normal
> +  KVM MSR emulation logic.
> +
> +This ioctl allows user space to define a set of bitmaps of MSR ranges to
> +specify whether a certain MSR access is allowed or not.
> +
> +If this ioctl has never been invoked, MSR accesses are not guarded and the
> +old KVM in-kernel emulation behavior is fully preserved.
> +
> +As soon as the first allow list was specified, only allowed MSR accesses
> +are permitted inside of KVM's MSR code.
> +
> +Each allowlist specifies a range of MSRs to potentially allow access on.
> +The range goes from MSR index [base .. base+nmsrs]. The flags field
> +indicates whether reads, writes or both reads and writes are permitted
> +by setting a 1 bit in the bitmap for the corresponding MSR index.
> +
> +If an MSR access is not permitted through the allow list, it generates a
> +#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
> +allows user space to deflect and potentially handle various MSR accesses
> +into user space.
> +
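A usage sketch (not part of the patch; vm_fd is assumed to come from
KVM_CREATE_VM and allocation failures are not checked). It allows reads and
writes to the 64 MSRs starting at index 0 and then drops all lists again:

	struct kvm_msr_allowlist *al;
	size_t bitmap_bytes = 64 / 8;		/* one bit per MSR, nmsrs = 64 */

	al = calloc(1, sizeof(*al) + bitmap_bytes);
	al->flags = KVM_MSR_ALLOW_READ | KVM_MSR_ALLOW_WRITE;
	al->nmsrs = 64;
	al->base = 0;
	memset(al->bitmap, 0xff, bitmap_bytes);	/* a set bit allows the access */

	if (ioctl(vm_fd, KVM_X86_ADD_MSR_ALLOWLIST, al) < 0)
		perror("KVM_X86_ADD_MSR_ALLOWLIST");

	/* Removing all lists restores the pre-allowlist behaviour: */
	ioctl(vm_fd, KVM_X86_CLEAR_MSR_ALLOWLIST, 0);
	free(al);
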
> +4.124 KVM_X86_CLEAR_MSR_ALLOWLIST
> +---------------------------------
> +
> +:Capability: KVM_CAP_X86_MSR_ALLOWLIST
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: none
> +:Returns: 0
> +
> +This ioctl resets all internal MSR allow lists. After this call, no allow
> +list is present and the guest would execute as if no allow lists were set,
> +so all MSRs are considered allowed and thus handled by the in-kernel MSR
> +emulation logic.
> +
> +No vCPU may be in running state when calling this ioctl.
> +
>  
>  5. The kvm_run structure
>  ========================
> @@ -6213,3 +6289,18 @@ writes to user space. It can be enabled on a VM level. If enabled, MSR
>  accesses that would usually trigger a #GP by KVM into the guest will
>  instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
>  KVM_EXIT_X86_WRMSR exit notifications.
> +
> +8.25 KVM_CAP_X86_MSR_ALLOWLIST
> +------------------------------
> +
> +:Architectures: x86
> +
> +This capability indicates that KVM supports emulation of only select MSR
> +registers. With this capability exposed, KVM exports two new VM ioctls:
> +KVM_X86_ADD_MSR_ALLOWLIST which user space can call to specify bitmaps of MSR
> +ranges that KVM should emulate in kernel space and KVM_X86_CLEAR_MSR_ALLOWLIST
> +which user space can call to remove all MSR allow lists from the VM context.
> +
> +In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
> +trap and emulate MSRs that are outside of the scope of KVM as well as
> +limit the attack surface on KVM's MSR emulation code.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 809eed0dbdea..21358ed4e590 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -904,6 +904,13 @@ struct kvm_hv {
>  	struct kvm_hv_syndbg hv_syndbg;
>  };
>  
> +struct msr_bitmap_range {
> +	u32 flags;
> +	u32 nmsrs;
> +	u32 base;
> +	unsigned long *bitmap;
> +};
> +
>  enum kvm_irqchip_mode {
>  	KVM_IRQCHIP_NONE,
>  	KVM_IRQCHIP_KERNEL,       /* created with KVM_CREATE_IRQCHIP */
> @@ -1008,6 +1015,9 @@ struct kvm_arch {
>  	/* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
>  	bool user_space_msr_enabled;
>  
> +	struct msr_bitmap_range msr_allowlist_ranges[10];
> +	int msr_allowlist_ranges_count;
> +
>  	struct kvm_pmu_event_filter *pmu_event_filter;
>  	struct task_struct *nx_lpage_recovery_thread;
>  };
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 0780f97c1850..c33fb1d72d52 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -192,6 +192,21 @@ struct kvm_msr_list {
>  	__u32 indices[0];
>  };
>  
> +#define KVM_MSR_ALLOW_READ  (1 << 0)
> +#define KVM_MSR_ALLOW_WRITE (1 << 1)
> +
> +/* Maximum size of the bitmap in bytes */
> +#define KVM_MSR_ALLOWLIST_MAX_LEN 0x600
> +
> +/* for KVM_X86_ADD_MSR_ALLOWLIST */
> +struct kvm_msr_allowlist {
> +	__u32 flags;
> +	__u32 nmsrs; /* number of msrs in bitmap */
> +	__u32 base;  /* base address for the MSRs bitmap */
> +	__u32 pad;
> +
> +	__u8 bitmap[0]; /* a set bit allows that the operation set in flags */
> +};
>  
>  struct kvm_cpuid_entry {
>  	__u32 function;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 24c72250f6df..7a2be00a3512 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1472,6 +1472,29 @@ void kvm_enable_efer_bits(u64 mask)
>  }
>  EXPORT_SYMBOL_GPL(kvm_enable_efer_bits);
>  
> +static bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
> +{
> +	struct msr_bitmap_range *ranges = vcpu->kvm->arch.msr_allowlist_ranges;
> +	u32 count = vcpu->kvm->arch.msr_allowlist_ranges_count;
> +	u32 i;
> +
> +	/* MSR allowlist not set up, allow everything */
> +	if (!count)
> +		return true;
> +
> +	for (i = 0; i < count; i++) {
> +		u32 start = ranges[i].base;
> +		u32 end = start + ranges[i].nmsrs;
> +		int flags = ranges[i].flags;

u32 flags?

> +		unsigned long *bitmap = ranges[i].bitmap;
> +
> +		if ((index >= start) && (index < end) && (flags & type))
> +			return !!test_bit(index - start, bitmap);
> +	}
> +
> +	return false;
> +}
> +
>  /*
>   * Write @data into the MSR specified by @index.  Select MSR specific fault
>   * checks are bypassed if @host_initiated is %true.
> @@ -1483,6 +1506,9 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
>  {
>  	struct msr_data msr;
>  
> +	if (!host_initiated && !kvm_msr_allowed(vcpu, index, KVM_MSR_ALLOW_WRITE))
> +		return -ENOENT;
> +
>  	switch (index) {
>  	case MSR_FS_BASE:
>  	case MSR_GS_BASE:
> @@ -1528,6 +1554,9 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
>  	struct msr_data msr;
>  	int ret;
>  
> +	if (!host_initiated && !kvm_msr_allowed(vcpu, index, KVM_MSR_ALLOW_READ))
> +		return -ENOENT;
> +
>  	msr.index = index;
>  	msr.host_initiated = host_initiated;
>  
> @@ -3550,6 +3579,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_EXCEPTION_PAYLOAD:
>  	case KVM_CAP_SET_GUEST_DEBUG:
>  	case KVM_CAP_X86_USER_SPACE_MSR:
> +	case KVM_CAP_X86_MSR_ALLOWLIST:
>  		r = 1;
>  		break;
>  	case KVM_CAP_SYNC_REGS:
> @@ -5075,6 +5105,101 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  	return r;
>  }
>  
> +static bool msr_range_overlaps(struct kvm *kvm, struct msr_bitmap_range *range)
> +{
> +	struct msr_bitmap_range *ranges = kvm->arch.msr_allowlist_ranges;
> +	u32 i, count = kvm->arch.msr_allowlist_ranges_count;
> +
> +	for (i = 0; i < count; i++) {
> +		u32 start = max(range->base, ranges[i].base);
> +		u32 end = min(range->base + range->nmsrs,
> +			      ranges[i].base + ranges[i].nmsrs);
> +
> +		if ((start < end) && (range->flags & ranges[i].flags))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static int kvm_vm_ioctl_add_msr_allowlist(struct kvm *kvm, void __user *argp)
> +{
> +	struct msr_bitmap_range *ranges = kvm->arch.msr_allowlist_ranges;
> +	struct kvm_msr_allowlist __user *user_msr_allowlist = argp;
> +	struct msr_bitmap_range range;
> +	struct kvm_msr_allowlist kernel_msr_allowlist;
> +	unsigned long *bitmap = NULL;
> +	size_t bitmap_size;
> +	int r = 0;
> +
> +	if (copy_from_user(&kernel_msr_allowlist, user_msr_allowlist,
> +			   sizeof(kernel_msr_allowlist))) {
> +		r = -EFAULT;
> +		goto out;
> +	}
> +
> +	bitmap_size = BITS_TO_LONGS(kernel_msr_allowlist.nmsrs) * sizeof(long);
> +	if (bitmap_size > KVM_MSR_ALLOWLIST_MAX_LEN) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +
> +	bitmap = memdup_user(user_msr_allowlist->bitmap, bitmap_size);
> +	if (IS_ERR(bitmap)) {
> +		r = PTR_ERR(bitmap);
> +		goto out;
> +	}
> +
> +	range = (struct msr_bitmap_range) {
> +		.flags = kernel_msr_allowlist.flags,
> +		.base = kernel_msr_allowlist.base,
> +		.nmsrs = kernel_msr_allowlist.nmsrs,
> +		.bitmap = bitmap,
> +	};
> +
> +	if (range.flags & ~(KVM_MSR_ALLOW_READ | KVM_MSR_ALLOW_WRITE)) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Protect from concurrent calls to this function that could trigger
> +	 * a TOCTOU violation on kvm->arch.msr_allowlist_ranges_count.
> +	 */
> +	mutex_lock(&kvm->lock);
> +
> +	if (kvm->arch.msr_allowlist_ranges_count >=
> +	    ARRAY_SIZE(kvm->arch.msr_allowlist_ranges)) {
> +		r = -E2BIG;
> +		goto out_locked;
> +	}
> +
> +	if (msr_range_overlaps(kvm, &range)) {
> +		r = -EINVAL;
> +		goto out_locked;
> +	}
> +
> +	/* Everything ok, add this range identifier to our global pool */
> +	ranges[kvm->arch.msr_allowlist_ranges_count++] = range;
> +
> +out_locked:
> +	mutex_unlock(&kvm->lock);
> +out:
> +	if (r)
> +		kfree(bitmap);
> +
> +	return r;
> +}
> +
> +static int kvm_vm_ioctl_clear_msr_allowlist(struct kvm *kvm)
> +{
> +	mutex_lock(&kvm->lock);
> +	kvm->arch.msr_allowlist_ranges_count = 0;
> +	mutex_unlock(&kvm->lock);

Are we also supposed to kfree() bitmaps here?

> +
> +	return 0;
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp,
>  		       unsigned int ioctl, unsigned long arg)
>  {
> @@ -5381,6 +5506,12 @@ long kvm_arch_vm_ioctl(struct file *filp,
>  	case KVM_SET_PMU_EVENT_FILTER:
>  		r = kvm_vm_ioctl_set_pmu_event_filter(kvm, argp);
>  		break;
> +	case KVM_X86_ADD_MSR_ALLOWLIST:
> +		r = kvm_vm_ioctl_add_msr_allowlist(kvm, argp);
> +		break;
> +	case KVM_X86_CLEAR_MSR_ALLOWLIST:
> +		r = kvm_vm_ioctl_clear_msr_allowlist(kvm);
> +		break;
>  	default:
>  		r = -ENOTTY;
>  	}
> @@ -10086,6 +10217,8 @@ void kvm_arch_pre_destroy_vm(struct kvm *kvm)
>  
>  void kvm_arch_destroy_vm(struct kvm *kvm)
>  {
> +	int i;
> +
>  	if (current->mm == kvm->mm) {
>  		/*
>  		 * Free memory regions allocated on behalf of userspace,
> @@ -10102,6 +10235,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	}
>  	if (kvm_x86_ops.vm_destroy)
>  		kvm_x86_ops.vm_destroy(kvm);
> +	for (i = 0; i < kvm->arch.msr_allowlist_ranges_count; i++)
> +		kfree(kvm->arch.msr_allowlist_ranges[i].bitmap);
>  	kvm_pic_destroy(kvm);
>  	kvm_ioapic_destroy(kvm);
>  	kvm_free_vcpus(kvm);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 13fc7de1eb50..4d6bb06e0fb1 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1041,6 +1041,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_HALT_POLL 182
>  #define KVM_CAP_ASYNC_PF_INT 183
>  #define KVM_CAP_X86_USER_SPACE_MSR 184
> +#define KVM_CAP_X86_MSR_ALLOWLIST 185
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1542,6 +1543,10 @@ struct kvm_pv_cmd {
>  /* Available with KVM_CAP_S390_PROTECTED */
>  #define KVM_S390_PV_COMMAND		_IOWR(KVMIO, 0xc5, struct kvm_pv_cmd)
>  
> +/* Available with KVM_CAP_X86_MSR_ALLOWLIST */
> +#define KVM_X86_ADD_MSR_ALLOWLIST	_IOW(KVMIO,  0xc6, struct kvm_msr_allowlist)
> +#define KVM_X86_CLEAR_MSR_ALLOWLIST	_IO(KVMIO,  0xc7)
> +
>  /* Secure Encrypted Virtualization command */
>  enum sev_cmd_id {
>  	/* Guest initialization commands */

-- 
Vitaly


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/3] KVM: x86: Introduce allow list for MSR emulation
  2020-08-03 11:37   ` Vitaly Kuznetsov
@ 2020-08-03 20:50     ` Alexander Graf
  2020-08-03 21:23       ` Sean Christopherson
  0 siblings, 1 reply; 11+ messages in thread
From: Alexander Graf @ 2020-08-03 20:50 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Paolo Bonzini
  Cc: Jonathan Corbet, Sean Christopherson, Wanpeng Li, Jim Mattson,
	Joerg Roedel, KarimAllah Raslan, Aaron Lewis, kvm, linux-doc,
	linux-kernel



On 03.08.20 13:37, Vitaly Kuznetsov wrote:
> 
> Alexander Graf <graf@amazon.com> writes:
> 
>> It's not desirable to have all MSRs always handled by KVM kernel space. Some
>> MSRs would be useful to handle in user space to either emulate behavior (like
>> uCode updates) or differentiate whether they are valid based on the CPU model.
>>
>> To allow user space to specify which MSRs it wants to see handled by KVM,
>> this patch introduces a new ioctl to push allow lists of bitmaps into
>> KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
>> With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
>> denied MSR events to user space to operate on.
>>
>> If no allowlist is populated, MSR handling stays identical to before.
>>
>> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>
>> ---
>>
>> v2 -> v3:
>>
>>    - document flags for KVM_X86_ADD_MSR_ALLOWLIST
>>    - generalize exit path, always unlock when returning
>>    - s/KVM_CAP_ADD_MSR_ALLOWLIST/KVM_CAP_X86_MSR_ALLOWLIST/g
>>    - Add KVM_X86_CLEAR_MSR_ALLOWLIST
>> ---
>>   Documentation/virt/kvm/api.rst  |  91 +++++++++++++++++++++
>>   arch/x86/include/asm/kvm_host.h |  10 +++
>>   arch/x86/include/uapi/asm/kvm.h |  15 ++++
>>   arch/x86/kvm/x86.c              | 135 ++++++++++++++++++++++++++++++++
>>   include/uapi/linux/kvm.h        |   5 ++
>>   5 files changed, 256 insertions(+)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 79c3e2fdfae4..d611ddd326fc 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -4697,6 +4697,82 @@ KVM_PV_VM_VERIFY
>>     Verify the integrity of the unpacked image. Only if this succeeds,
>>     KVM is allowed to start protected VCPUs.
>>
>> +4.126 KVM_X86_ADD_MSR_ALLOWLIST
>> +-------------------------------
>> +
>> +:Capability: KVM_CAP_X86_MSR_ALLOWLIST
>> +:Architectures: x86
>> +:Type: vm ioctl
>> +:Parameters: struct kvm_msr_allowlist
>> +:Returns: 0 on success, < 0 on error
>> +
>> +::
>> +
>> +  struct kvm_msr_allowlist {
>> +         __u32 flags;
>> +         __u32 nmsrs; /* number of msrs in bitmap */
>> +         __u32 base;  /* base address for the MSRs bitmap */
>> +         __u32 pad;
>> +
>> +         __u8 bitmap[0]; /* a set bit allows that the operation set in flags */
>> +  };
>> +
>> +flags values:
>> +
>> +KVM_MSR_ALLOW_READ
>> +
>> +  Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
>> +  indicates that a read should immediately fail, while a 1 indicates that
>> +  a read should be handled by the normal KVM MSR emulation logic.
>> +
>> +KVM_MSR_ALLOW_WRITE
>> +
>> +  Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
>> +  indicates that a write should immediately fail, while a 1 indicates that
>> +  a write should be handled by the normal KVM MSR emulation logic.
>> +
>> +KVM_MSR_ALLOW_READ | KVM_MSR_ALLOW_WRITE
>> +
> 
> Should we probably say what KVM_MSR_ALLOW_READ/KVM_MSR_ALLOW_WRITE are
> equal to? (1 << 0, 1 << 1)?
> 
>> +  Filter both read and write accesses to MSRs using the given bitmap. A 0
>> +  in the bitmap indicates that both reads and writes should immediately fail,
>> +  while a 1 indicates that reads and writes should be handled by the normal
>> +  KVM MSR emulation logic.
>> +
>> +This ioctl allows user space to define a set of bitmaps of MSR ranges to
>> +specify whether a certain MSR access is allowed or not.
>> +
>> +If this ioctl has never been invoked, MSR accesses are not guarded and the
>> +old KVM in-kernel emulation behavior is fully preserved.
>> +
>> +As soon as the first allow list was specified, only allowed MSR accesses
>> +are permitted inside of KVM's MSR code.
>> +
>> +Each allowlist specifies a range of MSRs to potentially allow access on.
>> +The range goes from MSR index [base .. base+nmsrs]. The flags field
>> +indicates whether reads, writes or both reads and writes are permitted
>> +by setting a 1 bit in the bitmap for the corresponding MSR index.
>> +
>> +If an MSR access is not permitted through the allow list, it generates a
>> +#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
>> +allows user space to deflect and potentially handle various MSR accesses
>> +into user space.
>> +
>> +4.124 KVM_X86_CLEAR_MSR_ALLOWLIST
>> +---------------------------------
>> +
>> +:Capability: KVM_CAP_X86_MSR_ALLOWLIST
>> +:Architectures: x86
>> +:Type: vcpu ioctl
>> +:Parameters: none
>> +:Returns: 0
>> +
>> +This ioctl resets all internal MSR allow lists. After this call, no allow
>> +list is present and the guest would execute as if no allow lists were set,
>> +so all MSRs are considered allowed and thus handled by the in-kernel MSR
>> +emulation logic.
>> +
>> +No vCPU may be in running state when calling this ioctl.
>> +
>>
>>   5. The kvm_run structure
>>   ========================
>> @@ -6213,3 +6289,18 @@ writes to user space. It can be enabled on a VM level. If enabled, MSR
>>   accesses that would usually trigger a #GP by KVM into the guest will
>>   instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
>>   KVM_EXIT_X86_WRMSR exit notifications.
>> +
>> +8.25 KVM_CAP_X86_MSR_ALLOWLIST
>> +------------------------------
>> +
>> +:Architectures: x86
>> +
>> +This capability indicates that KVM supports emulation of only select MSR
>> +registers. With this capability exposed, KVM exports two new VM ioctls:
>> +KVM_X86_ADD_MSR_ALLOWLIST which user space can call to specify bitmaps of MSR
>> +ranges that KVM should emulate in kernel space and KVM_X86_CLEAR_MSR_ALLOWLIST
>> +which user space can call to remove all MSR allow lists from the VM context.
>> +
>> +In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
>> +trap and emulate MSRs that are outside of the scope of KVM as well as
>> +limit the attack surface on KVM's MSR emulation code.
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 809eed0dbdea..21358ed4e590 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -904,6 +904,13 @@ struct kvm_hv {
>>        struct kvm_hv_syndbg hv_syndbg;
>>   };
>>
>> +struct msr_bitmap_range {
>> +     u32 flags;
>> +     u32 nmsrs;
>> +     u32 base;
>> +     unsigned long *bitmap;
>> +};
>> +
>>   enum kvm_irqchip_mode {
>>        KVM_IRQCHIP_NONE,
>>        KVM_IRQCHIP_KERNEL,       /* created with KVM_CREATE_IRQCHIP */
>> @@ -1008,6 +1015,9 @@ struct kvm_arch {
>>        /* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
>>        bool user_space_msr_enabled;
>>
>> +     struct msr_bitmap_range msr_allowlist_ranges[10];
>> +     int msr_allowlist_ranges_count;
>> +
>>        struct kvm_pmu_event_filter *pmu_event_filter;
>>        struct task_struct *nx_lpage_recovery_thread;
>>   };
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index 0780f97c1850..c33fb1d72d52 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -192,6 +192,21 @@ struct kvm_msr_list {
>>        __u32 indices[0];
>>   };
>>
>> +#define KVM_MSR_ALLOW_READ  (1 << 0)
>> +#define KVM_MSR_ALLOW_WRITE (1 << 1)
>> +
>> +/* Maximum size of the bitmap in bytes */
>> +#define KVM_MSR_ALLOWLIST_MAX_LEN 0x600
>> +
>> +/* for KVM_X86_ADD_MSR_ALLOWLIST */
>> +struct kvm_msr_allowlist {
>> +     __u32 flags;
>> +     __u32 nmsrs; /* number of msrs in bitmap */
>> +     __u32 base;  /* base address for the MSRs bitmap */
>> +     __u32 pad;
>> +
>> +     __u8 bitmap[0]; /* a set bit allows that the operation set in flags */
>> +};
>>
>>   struct kvm_cpuid_entry {
>>        __u32 function;
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 24c72250f6df..7a2be00a3512 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1472,6 +1472,29 @@ void kvm_enable_efer_bits(u64 mask)
>>   }
>>   EXPORT_SYMBOL_GPL(kvm_enable_efer_bits);
>>
>> +static bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
>> +{
>> +     struct msr_bitmap_range *ranges = vcpu->kvm->arch.msr_allowlist_ranges;
>> +     u32 count = vcpu->kvm->arch.msr_allowlist_ranges_count;
>> +     u32 i;
>> +
>> +     /* MSR allowlist not set up, allow everything */
>> +     if (!count)
>> +             return true;
>> +
>> +     for (i = 0; i < count; i++) {
>> +             u32 start = ranges[i].base;
>> +             u32 end = start + ranges[i].nmsrs;
>> +             int flags = ranges[i].flags;
> 
> u32 flags?

Yes, much better :).

> 
>> +             unsigned long *bitmap = ranges[i].bitmap;
>> +
>> +             if ((index >= start) && (index < end) && (flags & type))
>> +                     return !!test_bit(index - start, bitmap);
>> +     }
>> +
>> +     return false;
>> +}
>> +
>>   /*
>>    * Write @data into the MSR specified by @index.  Select MSR specific fault
>>    * checks are bypassed if @host_initiated is %true.
>> @@ -1483,6 +1506,9 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
>>   {
>>        struct msr_data msr;
>>
>> +     if (!host_initiated && !kvm_msr_allowed(vcpu, index, KVM_MSR_ALLOW_WRITE))
>> +             return -ENOENT;
>> +
>>        switch (index) {
>>        case MSR_FS_BASE:
>>        case MSR_GS_BASE:
>> @@ -1528,6 +1554,9 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
>>        struct msr_data msr;
>>        int ret;
>>
>> +     if (!host_initiated && !kvm_msr_allowed(vcpu, index, KVM_MSR_ALLOW_READ))
>> +             return -ENOENT;
>> +
>>        msr.index = index;
>>        msr.host_initiated = host_initiated;
>>
>> @@ -3550,6 +3579,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>        case KVM_CAP_EXCEPTION_PAYLOAD:
>>        case KVM_CAP_SET_GUEST_DEBUG:
>>        case KVM_CAP_X86_USER_SPACE_MSR:
>> +     case KVM_CAP_X86_MSR_ALLOWLIST:
>>                r = 1;
>>                break;
>>        case KVM_CAP_SYNC_REGS:
>> @@ -5075,6 +5105,101 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>        return r;
>>   }
>>
>> +static bool msr_range_overlaps(struct kvm *kvm, struct msr_bitmap_range *range)
>> +{
>> +     struct msr_bitmap_range *ranges = kvm->arch.msr_allowlist_ranges;
>> +     u32 i, count = kvm->arch.msr_allowlist_ranges_count;
>> +
>> +     for (i = 0; i < count; i++) {
>> +             u32 start = max(range->base, ranges[i].base);
>> +             u32 end = min(range->base + range->nmsrs,
>> +                           ranges[i].base + ranges[i].nmsrs);
>> +
>> +             if ((start < end) && (range->flags & ranges[i].flags))
>> +                     return true;
>> +     }
>> +
>> +     return false;
>> +}
>> +
>> +static int kvm_vm_ioctl_add_msr_allowlist(struct kvm *kvm, void __user *argp)
>> +{
>> +     struct msr_bitmap_range *ranges = kvm->arch.msr_allowlist_ranges;
>> +     struct kvm_msr_allowlist __user *user_msr_allowlist = argp;
>> +     struct msr_bitmap_range range;
>> +     struct kvm_msr_allowlist kernel_msr_allowlist;
>> +     unsigned long *bitmap = NULL;
>> +     size_t bitmap_size;
>> +     int r = 0;
>> +
>> +     if (copy_from_user(&kernel_msr_allowlist, user_msr_allowlist,
>> +                        sizeof(kernel_msr_allowlist))) {
>> +             r = -EFAULT;
>> +             goto out;
>> +     }
>> +
>> +     bitmap_size = BITS_TO_LONGS(kernel_msr_allowlist.nmsrs) * sizeof(long);
>> +     if (bitmap_size > KVM_MSR_ALLOWLIST_MAX_LEN) {
>> +             r = -EINVAL;
>> +             goto out;
>> +     }
>> +
>> +     bitmap = memdup_user(user_msr_allowlist->bitmap, bitmap_size);
>> +     if (IS_ERR(bitmap)) {
>> +             r = PTR_ERR(bitmap);
>> +             goto out;
>> +     }
>> +
>> +     range = (struct msr_bitmap_range) {
>> +             .flags = kernel_msr_allowlist.flags,
>> +             .base = kernel_msr_allowlist.base,
>> +             .nmsrs = kernel_msr_allowlist.nmsrs,
>> +             .bitmap = bitmap,
>> +     };
>> +
>> +     if (range.flags & ~(KVM_MSR_ALLOW_READ | KVM_MSR_ALLOW_WRITE)) {
>> +             r = -EINVAL;
>> +             goto out;
>> +     }
>> +
>> +     /*
>> +      * Protect from concurrent calls to this function that could trigger
>> +      * a TOCTOU violation on kvm->arch.msr_allowlist_ranges_count.
>> +      */
>> +     mutex_lock(&kvm->lock);
>> +
>> +     if (kvm->arch.msr_allowlist_ranges_count >=
>> +         ARRAY_SIZE(kvm->arch.msr_allowlist_ranges)) {
>> +             r = -E2BIG;
>> +             goto out_locked;
>> +     }
>> +
>> +     if (msr_range_overlaps(kvm, &range)) {
>> +             r = -EINVAL;
>> +             goto out_locked;
>> +     }
>> +
>> +     /* Everything ok, add this range identifier to our global pool */
>> +     ranges[kvm->arch.msr_allowlist_ranges_count++] = range;
>> +
>> +out_locked:
>> +     mutex_unlock(&kvm->lock);
>> +out:
>> +     if (r)
>> +             kfree(bitmap);
>> +
>> +     return r;
>> +}
>> +
>> +static int kvm_vm_ioctl_clear_msr_allowlist(struct kvm *kvm)
>> +{
>> +     mutex_lock(&kvm->lock);
>> +     kvm->arch.msr_allowlist_ranges_count = 0;
>> +     mutex_unlock(&kvm->lock);
> 
> Are we also supposed to kfree() bitmaps here?

Phew. Yes, because without the kfree() we're leaking memory. 
Unfortunately if I just put in a kfree() here, we may allow a 
concurrently executing vCPU to access already free'd memory.

So I'll also add locking around the range check. Let's hope it won't 
regress performance too much.
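
Purely as an illustration of the shape this could take (not the actual v4
change), the clear path with the free folded in might look like:

	static int kvm_vm_ioctl_clear_msr_allowlist(struct kvm *kvm)
	{
		int i;

		mutex_lock(&kvm->lock);
		for (i = 0; i < kvm->arch.msr_allowlist_ranges_count; i++)
			kfree(kvm->arch.msr_allowlist_ranges[i].bitmap);
		kvm->arch.msr_allowlist_ranges_count = 0;
		mutex_unlock(&kvm->lock);

		return 0;
	}

together with taking kvm->lock (or an equivalent scheme) around the lookup in
kvm_msr_allowed(), so that a vCPU can never walk a bitmap that is being freed
concurrently.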


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/3] KVM: x86: Introduce allow list for MSR emulation
  2020-08-03 20:50     ` Alexander Graf
@ 2020-08-03 21:23       ` Sean Christopherson
  0 siblings, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2020-08-03 21:23 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Vitaly Kuznetsov, Paolo Bonzini, Jonathan Corbet, Wanpeng Li,
	Jim Mattson, Joerg Roedel, KarimAllah Raslan, Aaron Lewis, kvm,
	linux-doc, linux-kernel

On Mon, Aug 03, 2020 at 10:50:53PM +0200, Alexander Graf wrote:
> 
> On 03.08.20 13:37, Vitaly Kuznetsov wrote:
> >>+static int kvm_vm_ioctl_clear_msr_allowlist(struct kvm *kvm)
> >>+{
> >>+     mutex_lock(&kvm->lock);
> >>+     kvm->arch.msr_allowlist_ranges_count = 0;
> >>+     mutex_unlock(&kvm->lock);
> >
> >Are we also supposed to kfree() bitmaps here?
> 
> Phew. Yes, because without the kfree() we're leaking memory. Unfortunately
> if I just put in a kfree() here, we may allow a concurrently executing vCPU
> to access already free'd memory.
> 
> So I'll also add locking around the range check. Let's hope it won't regress
> performance too much.

What about using KVM's SRCU to protect the list?  The only thing I'm not 100%
on is whether holding kvm->lock across synchronize_srcu() is safe from a lock
inversion perspective.  I'm pretty sure KVM doesn't try to acquire kvm->lock
after grabbing SRCU, but that's hard to audit and there aren't any existing
flows that invoke synchronize_srcu() while holding kvm->lock.
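
For reference, a rough sketch of how that could look (illustrative only;
idx, allowed and bitmap are local placeholders):

	/* vCPU / reader side: */
	idx = srcu_read_lock(&vcpu->kvm->srcu);
	allowed = kvm_msr_allowed(vcpu, index, type);
	srcu_read_unlock(&vcpu->kvm->srcu, idx);

	/* VM ioctl / writer side: detach the ranges under kvm->lock, then
	 * wait for readers to finish before freeing the bitmaps:
	 */
	synchronize_srcu(&kvm->srcu);
	kfree(bitmap);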

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread

Thread overview: 11+ messages
2020-07-31 21:49 [PATCH v3 0/3] Allow user space to restrict and augment MSR emulation Alexander Graf
2020-07-31 21:49 ` [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space Alexander Graf
2020-07-31 23:36   ` Jim Mattson
2020-08-03 10:08     ` Alexander Graf
2020-08-03 11:27   ` Vitaly Kuznetsov
2020-08-03 11:34     ` Alexander Graf
2020-07-31 21:49 ` [PATCH v3 2/3] KVM: x86: Introduce allow list for MSR emulation Alexander Graf
2020-08-03 11:37   ` Vitaly Kuznetsov
2020-08-03 20:50     ` Alexander Graf
2020-08-03 21:23       ` Sean Christopherson
2020-07-31 21:49 ` [PATCH v3 3/3] KVM: selftests: Add test for user space MSR handling Alexander Graf
