All of lore.kernel.org
 help / color / mirror / Atom feed
* [RESEND RFC 0/2] Paravirtualized Control Register pinning
@ 2019-12-20 19:26 John Andersen
  2019-12-20 19:27 ` [RESEND RFC 1/2] KVM: X86: Add CR pin MSRs John Andersen
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: John Andersen @ 2019-12-20 19:26 UTC (permalink / raw)
  To: tglx, mingo, bp, x86, pbonzini
  Cc: hpa, sean.j.christopherson, vkuznets, wanpengli, jmattson, joro,
	linux-kernel, kvm, John Andersen

Paravirtualized Control Register pinning is a strengthened version of
existing protections on the Write Protect, Supervisor Mode Execution /
Access Protection, and User-Mode Instruction Prevention bits. The
existing protections prevent native_write_cr*() functions from writing
values which disable those bits. This patchset prevents any guest
writes to control registers from disabling pinned bits, not just writes
from native_write_cr*(). This stops attackers within the guest from
using ROP to disable protection bits.

https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/

The protection is implemented by adding MSRs to KVM which contain the
bits that are allowed to be pinned, and the bits which are pinned. The
guest or userspace can enable bit pinning by reading MSRs to check
which bits are allowed to be pinned, and then writing MSRs to set which
bits they want pinned.
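
For illustration, a minimal guest-side sketch of that handshake, using
the MSR names added by this series (CR4 only, error handling omitted):

    u64 allowed;

    /* Discover which CR4 bits KVM is able to pin. */
    rdmsrl(MSR_KVM_CR4_PIN_ALLOWED, allowed);
    /* Request pinning of SMEP, if it is pinnable. */
    wrmsrl(MSR_KVM_CR4_PINNED, X86_CR4_SMEP & allowed);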

Other hypervisors such as Hyper-V have implemented similar protections
for control registers and MSRs, which security researchers have found
effective.

https://www.abatchy.com/2018/01/kernel-exploitation-4

We add a CR pin feature bit to the KVM CPUID, read-only MSRs which
guests use to identify which bits they may request be pinned, and
CR pinned MSRs which contain the pinned bits. Guests can request that
KVM pin bits within control register 0 or 4 via the CR pinned MSRs.
Writes to the MSRs fail if they include bits that aren't allowed to be
pinned. Host userspace may clear or modify pinned bits at any time.
Once pinned bits are set, the guest may pin more allowed bits, but may
never clear pinned bits.

In the event that the guest vCPU attempts to disable any of the pinned
bits, the vCPU that issued the write is sent a general protection
fault, and the register is left unchanged.
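
Concretely, with CR4.SMEP pinned, even a ROP chain that jumps straight
to a MOV-to-CR4 gadget now faults. In C terms (illustrative only):

    /*
     * With CR4.SMEP pinned this write is refused: KVM injects #GP(0)
     * into the vCPU and CR4 keeps its old value, instead of SMEP
     * being silently disabled.
     */
    native_write_cr4(native_read_cr4() & ~X86_CR4_SMEP);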

Pinning is not active when running in SMM. Entering SMM disables pinned
bits; writes to control registers within SMM would therefore trigger
general protection faults if pinning were enforced.

The guest may never read pinned bits. If an attacker were to read the
CR pinned MSRs, they might decide to perform another attack which would
not cause a general protection fault.

Should userspace expose the CR pinning CPUID feature bit, it must zero CR
pinned MSRs on reboot. If it does not, it runs the risk of having the
guest enable pinning and subsequently cause general protection faults on
next boot due to early boot code setting control registers to values
which do not contain the pinned bits.
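
As a hedged sketch of the userspace side (illustrative, not the actual
QEMU patches; vcpu_fd stands in for a vCPU file descriptor), zeroing
the pinned MSRs on reset could look like:

    struct {
            struct kvm_msrs hdr;
            struct kvm_msr_entry entries[2];
    } msrs = {
            .hdr.nmsrs = 2,
            .entries = {
                    { .index = MSR_KVM_CR0_PINNED, .data = 0 },
                    { .index = MSR_KVM_CR4_PINNED, .data = 0 },
            },
    };

    /* Host-initiated writes may clear pinned bits; guest writes may not. */
    ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);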

When running with KVM guest support and paravirtualized CR pinning
enabled, paravirtualized and existing pinning are set up at the same
point on the boot CPU. Non-boot CPUs set up pinning upon identification.

Guests using the kexec system call currently do not support
paravirtualized control register pinning. Early boot code writes known
good values to the control registers, and those values do not contain
the protected bits, because CPU feature identification is done at a
later time, when the kernel properly checks whether it can enable
protections.

Most distributions enable kexec. However, kexec could be made
disableable at boot time. In that case, if a user has disabled kexec at
boot time, the guest will request that paravirtualized control register
pinning be enabled. This would expand the userbase to users of major
distributions.

Paravirtualized CR pinning will likely be incompatible with kexec for
the foreseeable future. Early boot code could possibly be changed to
not clear protected bits. However, a kernel that requests CR bits be
pinned can't know if the kernel it's kexecing has been updated to not
clear protected bits. This would result in the kernel being kexec'd
almost immediately receiving a general protection fault.

Security conscious kernel configurations disable kexec already, per KSPP
guidelines. Projects such as Kata Containers, AWS Lambda, ChromeOS
Termina, and others using KVM to virtualize Linux will benefit from
this protection.

The usage of SMM in SeaBIOS was explored as a way to communicate to KVM
that a reboot has occurred and it should zero the pinned bits. When
using QEMU and SeaBIOS, SMM initialization occurs on reboot. However,
prior to SMM initialization, BIOS writes zero values to CR0, causing a
general protection fault to be sent to the guest before SMM can signal
that the machine has booted.

Pinning of sensitive CR bits has already been implemented to protect
against exploits directly calling native_write_cr*(). The current
protection cannot stop ROP attacks which jump directly to a MOV CR
instruction. Guests running with paravirtualized CR pinning are now
protected against the use of ROP to disable CR bits. The same bits that
are being pinned natively may be pinned via the CR pinned MSRs. These
bits are WP in CR0, and SMEP, SMAP, and UMIP in CR4.

Future patches could protect bits in MSRs in a similar fashion. The NXE
bit of the EFER MSR is a prime candidate.

John Andersen (2):
  KVM: X86: Add CR pin MSRs
  X86: Use KVM CR pin MSRs

 Documentation/virt/kvm/msr.txt       | 38 +++++++++++++++++++++++
 arch/x86/Kconfig                     |  9 ++++++
 arch/x86/include/asm/kvm_host.h      |  2 ++
 arch/x86/include/asm/kvm_para.h      | 10 +++++++
 arch/x86/include/uapi/asm/kvm_para.h |  5 ++++
 arch/x86/kernel/cpu/common.c         |  5 ++++
 arch/x86/kernel/kvm.c                | 17 +++++++++++
 arch/x86/kvm/cpuid.c                 |  3 +-
 arch/x86/kvm/x86.c                   | 45 ++++++++++++++++++++++++++++
 9 files changed, 133 insertions(+), 1 deletion(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RESEND RFC 1/2] KVM: X86: Add CR pin MSRs
  2019-12-20 19:26 [RESEND RFC 0/2] Paravirtualized Control Register pinning John Andersen
@ 2019-12-20 19:27 ` John Andersen
  2019-12-20 19:27 ` [RESEND RFC 2/2] X86: Use KVM " John Andersen
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: John Andersen @ 2019-12-20 19:27 UTC (permalink / raw)
  To: tglx, mingo, bp, x86, pbonzini
  Cc: hpa, sean.j.christopherson, vkuznets, wanpengli, jmattson, joro,
	linux-kernel, kvm, John Andersen

Add a CR pin feature bit to the KVM CPUID. Add read-only MSRs to KVM
which guests use to identify which bits they may request be pinned. Add
CR pinned MSRs to KVM. Allow guests to request that KVM pin certain
bits within control register 0 or 4 via the CR pinned MSRs. Writes to
the MSRs fail if they include bits which aren't allowed. Host userspace
may clear or modify pinned bits at any time. Once pinned bits are set,
the guest may pin additional allowed bits, but not clear any. The guest
may never read pinned bits. If an attacker were to read the CR pinned
MSRs, they might decide to perform another attack which would not cause
a general protection fault.

In the event that the guest vCPU attempts to disable any of the pinned
bits, send that vCPU a general protection fault, and leave the register
unchanged. Entering SMM unconditionally clears various CR0/4 bits, some
of which may be pinned by the OS. To avoid triggering shutdown on SMIs,
pinning isn't enforced when the vCPU is running in SMM.

Should userspace expose the CR pinning CPUID feature bit, it must zero
CR pinned MSRs on reboot. If it does not, it runs the risk of having the
guest enable pinning and subsequently cause general protection faults on
next boot due to early boot code setting control registers to values
which do not contain the pinned bits. Userspace is responsible for
migrating the contents of the CR* pinned MSRs. If userspace fails to
migrate the MSRs, the protection will no longer be active.
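
As an illustrative sketch (same caveats and wrapper struct as the reset
sketch in the cover letter), the source side of a migration might do:

    /* Leave .data zeroed; KVM_GET_MSRS fills it in. */
    msrs.entries[0].index = MSR_KVM_CR0_PINNED;
    msrs.entries[1].index = MSR_KVM_CR4_PINNED;

    /* Host-initiated reads succeed even though the guest may not read. */
    ioctl(vcpu_fd, KVM_GET_MSRS, &msrs);
    /* ...send entries[].data to the destination, then KVM_SET_MSRS there. */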

Pinning of sensitive CR bits has already been implemented to protect
against exploits directly calling native_write_cr*(). The current
protection cannot stop ROP attacks which jump directly to a MOV CR
instruction.

https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/

Guests running with paravirtualized CR pinning are now protected against
the use of ROP to disable CR bits. The same bits that are being pinned
natively may be pinned via the CR pinned MSRs. These bits are WP in CR0,
and SMEP, SMAP, and UMIP in CR4.

Other hypervisors such as Hyper-V have implemented similar protections
for control registers and MSRs, which security researchers have found
effective.

https://www.abatchy.com/2018/01/kernel-exploitation-4

Future patches could implement similar MSR and hypercall combinations
to protect bits in MSRs. The NXE bit of the EFER MSR is a prime
candidate.

Patches for QEMU are required to expose the CR pin CPUID feature bit,
as well as to clear the MSRs on reboot and enable migration.

https://github.com/qemu/qemu/commit/e7a0ff8a8dcde1ef2b83a9d93129614f512752ae
https://github.com/qemu/qemu/commit/7e8c770c91616ae8d2d6b15bcc2865be594c8852

Signed-off-by: John Andersen <john.s.andersen@intel.com>
---
 Documentation/virt/kvm/msr.txt       | 38 +++++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h      |  2 ++
 arch/x86/include/uapi/asm/kvm_para.h |  5 ++++
 arch/x86/kvm/cpuid.c                 |  3 +-
 arch/x86/kvm/x86.c                   | 45 ++++++++++++++++++++++++++++
 5 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/msr.txt b/Documentation/virt/kvm/msr.txt
index df1f4338b3ca..c78efd5bcfc1 100644
--- a/Documentation/virt/kvm/msr.txt
+++ b/Documentation/virt/kvm/msr.txt
@@ -282,3 +282,41 @@ MSR_KVM_POLL_CONTROL: 0x4b564d05
 	KVM guests can request the host not to poll on HLT, for example if
 	they are performing polling themselves.
 
+MSR_KVM_CR0_PIN_ALLOWED: 0x4b564d06
+MSR_KVM_CR4_PIN_ALLOWED: 0x4b564d07
+	Read only registers informing the guest which bits may be pinned for
+	each control register respectively via the CR pinned MSRs.
+
+	data: Bits which may be pinned.
+
+	Attempting to pin bits other than these will result in a failure when
+	writing to the respective CR pinned MSR.
+
+	Bits which are allowed to be pinned are WP for CR0 and SMEP, SMAP, and
+	UMIP for CR4.
+
+MSR_KVM_CR0_PINNED: 0x4b564d08
+MSR_KVM_CR4_PINNED: 0x4b564d09
+	Used to configure pinned bits in control registers.
+
+	data: Bits to be pinned.
+
+	Fails if data contains bits which are not allowed to be pinned. Bits
+	which are allowed to be pinned can be found by reading the CR pin
+	allowed MSRs.
+
+	The MSRs are read/write for host userspace, and write-only for the
+	guest.
+
+	Once set to a non-zero value, the guest cannot clear any of the bits
+	that have been pinned to 1. The guest can set more bits to 1, so long
+	as those bits appear in the allowed MSR.
+
+	Host userspace may clear or change pinned bits at any point. Host
+	userspace must clear pinned bits on reboot.
+
+	The MSR enables bit pinning for control registers. Pinning is active
+	when the guest is not in SMM. If the guest attempts to write values to
+	cr* where bits differ from pinned bits, the write will fail and the
+	guest will be sent a general protection fault.
+
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b79cd6aa4075..ee8da4191920 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -562,10 +562,12 @@ struct kvm_vcpu_arch {
 
 	unsigned long cr0;
 	unsigned long cr0_guest_owned_bits;
+	unsigned long cr0_pinned;
 	unsigned long cr2;
 	unsigned long cr3;
 	unsigned long cr4;
 	unsigned long cr4_guest_owned_bits;
+	unsigned long cr4_pinned;
 	unsigned long cr8;
 	u32 pkru;
 	u32 hflags;
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 2a8e0b6b9805..e6c61e455adf 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -31,6 +31,7 @@
 #define KVM_FEATURE_PV_SEND_IPI	11
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
+#define KVM_FEATURE_CR_PIN		14
 
 #define KVM_HINTS_REALTIME      0
 
@@ -50,6 +51,10 @@
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
 #define MSR_KVM_POLL_CONTROL	0x4b564d05
+#define MSR_KVM_CR0_PIN_ALLOWED	0x4b564d06
+#define MSR_KVM_CR4_PIN_ALLOWED	0x4b564d07
+#define MSR_KVM_CR0_PINNED	0x4b564d08
+#define MSR_KVM_CR4_PINNED	0x4b564d09
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index cfafa320a8cf..19fb49753442 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -712,7 +712,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function,
 			     (1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
 			     (1 << KVM_FEATURE_PV_SEND_IPI) |
 			     (1 << KVM_FEATURE_POLL_CONTROL) |
-			     (1 << KVM_FEATURE_PV_SCHED_YIELD);
+			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
+			     (1 << KVM_FEATURE_CR_PIN);
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ed167e039e5..eb1640ada8b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -753,6 +753,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	if ((cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PE))
 		return 1;
 
+	if (!is_smm(vcpu) && (cr0 ^ old_cr0) & vcpu->arch.cr0_pinned)
+		return 1;
+
 	if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
 #ifdef CONFIG_X86_64
 		if ((vcpu->arch.efer & EFER_LME)) {
@@ -916,6 +919,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	if (kvm_valid_cr4(vcpu, cr4))
 		return 1;
 
+	if (!is_smm(vcpu) && (cr4 ^ old_cr4) & vcpu->arch.cr4_pinned)
+		return 1;
+
 	if (is_long_mode(vcpu)) {
 		if (!(cr4 & X86_CR4_PAE))
 			return 1;
@@ -1234,6 +1240,10 @@ static const u32 emulated_msrs_all[] = {
 
 	MSR_K7_HWCR,
 	MSR_KVM_POLL_CONTROL,
+	MSR_KVM_CR0_PIN_ALLOWED,
+	MSR_KVM_CR4_PIN_ALLOWED,
+	MSR_KVM_CR0_PINNED,
+	MSR_KVM_CR4_PINNED,
 };
 
 static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
@@ -2621,6 +2631,9 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
 }
 
+#define KVM_CR0_PIN_ALLOWED	(X86_CR0_WP)
+#define KVM_CR4_PIN_ALLOWED	(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP)
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	bool pr = false;
@@ -2811,6 +2824,22 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vcpu->arch.msr_kvm_poll_control = data;
 		break;
 
+	case MSR_KVM_CR0_PINNED:
+		if (data & ~KVM_CR0_PIN_ALLOWED)
+			return 1;
+		if (msr_info->host_initiated)
+			vcpu->arch.cr0_pinned = data;
+		else
+			vcpu->arch.cr0_pinned |= data;
+		break;
+	case MSR_KVM_CR4_PINNED:
+		if (data & ~KVM_CR4_PIN_ALLOWED)
+			return 1;
+		if (msr_info->host_initiated)
+			vcpu->arch.cr4_pinned = data;
+		else
+			vcpu->arch.cr4_pinned |= data;
+		break;
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
@@ -3054,6 +3083,22 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_KVM_POLL_CONTROL:
 		msr_info->data = vcpu->arch.msr_kvm_poll_control;
 		break;
+	case MSR_KVM_CR0_PIN_ALLOWED:
+		msr_info->data = KVM_CR0_PIN_ALLOWED;
+		break;
+	case MSR_KVM_CR4_PIN_ALLOWED:
+		msr_info->data = KVM_CR4_PIN_ALLOWED;
+		break;
+	case MSR_KVM_CR0_PINNED:
+		if (!msr_info->host_initiated)
+			return 1;
+		msr_info->data = vcpu->arch.cr0_pinned;
+		break;
+	case MSR_KVM_CR4_PINNED:
+		if (!msr_info->host_initiated)
+			return 1;
+		msr_info->data = vcpu->arch.cr4_pinned;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RESEND RFC 2/2] X86: Use KVM CR pin MSRs
  2019-12-20 19:26 [RESEND RFC 0/2] Paravirtualized Control Register pinning John Andersen
  2019-12-20 19:27 ` [RESEND RFC 1/2] KVM: X86: Add CR pin MSRs John Andersen
@ 2019-12-20 19:27 ` John Andersen
  2019-12-23  7:39   ` Andy Lutomirski
  2019-12-24  6:45   ` kbuild test robot
  2019-12-21 13:59 ` [RESEND RFC 0/2] Paravirtualized Control Register pinning Paolo Bonzini
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: John Andersen @ 2019-12-20 19:27 UTC (permalink / raw)
  To: tglx, mingo, bp, x86, pbonzini
  Cc: hpa, sean.j.christopherson, vkuznets, wanpengli, jmattson, joro,
	linux-kernel, kvm, John Andersen

Strengthen existing control register pinning when running
paravirtualized under KVM. Check which bits KVM supports pinning for
each control register and only pin supported bits which are already
pinned via the existing native protection. Write to KVM CR0 and CR4
pinned MSRs to enable pinning.

Initiate KVM-assisted pinning directly following the setup of native
pinning on the boot CPU. For non-boot CPUs, initiate paravirtualized
pinning on CPU identification.

Identification of non-boot CPUs takes place after the boot CPU has set
up native CR pinning. Therefore, non-boot CPUs access pinned bits set up
by the boot CPU and request that those be pinned. All CPUs request
paravirtualized pinning of the same bits which are already pinned
natively.

Guests using the kexec system call currently do not support
paravirtualized control register pinning. Early boot code writes known
good values to the control registers, and those values do not contain
the protected bits, because CPU feature identification is done at a
later time, when the kernel properly checks whether it can enable
protections.

Signed-off-by: John Andersen <john.s.andersen@intel.com>
---
 arch/x86/Kconfig                |  9 +++++++++
 arch/x86/include/asm/kvm_para.h | 10 ++++++++++
 arch/x86/kernel/cpu/common.c    |  5 +++++
 arch/x86/kernel/kvm.c           | 17 +++++++++++++++++
 4 files changed, 41 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8ef85139553f..f5c61e3bd0c9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -839,6 +839,15 @@ config PARAVIRT_TIME_ACCOUNTING
 config PARAVIRT_CLOCK
 	bool
 
+config PARAVIRT_CR_PIN
+       bool "Paravirtual bit pinning for CR0 and CR4"
+       depends on KVM_GUEST && !KEXEC
+       help
+         Select this option to have the virtualised guest request that the
+         hypervisor disallow it from disabling protections set in control
+         registers. The hypervisor will prevent exploits from disabling
+         features such as SMEP, SMAP, UMIP, and WP.
+
 config JAILHOUSE_GUEST
 	bool "Jailhouse non-root cell support"
 	depends on X86_64 && PCI
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 9b4df6eaa11a..a7b48e43d2dc 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -102,6 +102,16 @@ static inline void kvm_spinlock_init(void)
 }
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
 
+#ifdef CONFIG_PARAVIRT_CR_PIN
+void kvm_setup_paravirt_cr_pinning(unsigned long cr0_pinned_bits,
+				   unsigned long cr4_pinned_bits);
+#else
+static inline void kvm_setup_paravirt_cr_pinning(unsigned long cr0_pinned_bits,
+						 unsigned long cr4_pinned_bits)
+{
+}
+#endif /* CONFIG_PARAVIRT_CR_PIN */
+
 #else /* CONFIG_KVM_GUEST */
 #define kvm_async_pf_task_wait(T, I) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fffe21945374..e6112abb7115 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -20,6 +20,7 @@
 #include <linux/smp.h>
 #include <linux/io.h>
 #include <linux/syscore_ops.h>
+#include <linux/kvm_para.h>
 
 #include <asm/stackprotector.h>
 #include <asm/perf_event.h>
@@ -435,6 +436,8 @@ static void __init setup_cr_pinning(void)
 	mask = (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP);
 	cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & mask;
 	static_key_enable(&cr_pinning.key);
+
+	kvm_setup_paravirt_cr_pinning(X86_CR0_WP, cr4_pinned_bits);
 }
 
 /*
@@ -1597,6 +1600,8 @@ void identify_secondary_cpu(struct cpuinfo_x86 *c)
 	mtrr_ap_init();
 	validate_apic_and_package_id(c);
 	x86_spec_ctrl_setup_ap();
+
+	kvm_setup_paravirt_cr_pinning(X86_CR0_WP, cr4_pinned_bits);
 }
 
 static __init int setup_noclflush(char *arg)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 32ef1ee733b7..b8404cd9f318 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -858,6 +858,23 @@ void __init kvm_spinlock_init(void)
 
 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
 
+#ifdef CONFIG_PARAVIRT_CR_PIN
+void kvm_setup_paravirt_cr_pinning(unsigned long cr0_pinned_bits,
+				   unsigned long cr4_pinned_bits)
+{
+	u64 mask;
+
+	if (!kvm_para_has_feature(KVM_FEATURE_CR_PIN))
+		return;
+
+	rdmsrl(MSR_KVM_CR0_PIN_ALLOWED, mask);
+	wrmsrl(MSR_KVM_CR0_PINNED, cr0_pinned_bits & mask);
+
+	rdmsrl(MSR_KVM_CR4_PIN_ALLOWED, mask);
+	wrmsrl(MSR_KVM_CR4_PINNED, cr4_pinned_bits & mask);
+}
+#endif
+
 #ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
 
 static void kvm_disable_host_haltpoll(void *i)
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-20 19:26 [RESEND RFC 0/2] Paravirtualized Control Register pinning John Andersen
  2019-12-20 19:27 ` [RESEND RFC 1/2] KVM: X86: Add CR pin MSRs John Andersen
  2019-12-20 19:27 ` [RESEND RFC 2/2] X86: Use KVM " John Andersen
@ 2019-12-21 13:59 ` Paolo Bonzini
  2019-12-23 17:28   ` Andersen, John S
  2019-12-23 14:30 ` Liran Alon
  2019-12-23 14:48 ` Liran Alon
  4 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2019-12-21 13:59 UTC (permalink / raw)
  To: John Andersen, tglx, mingo, bp, x86
  Cc: hpa, sean.j.christopherson, vkuznets, wanpengli, jmattson, joro,
	linux-kernel, kvm

On 20/12/19 20:26, John Andersen wrote:
> Paravirtualized CR pinning will likely be incompatible with kexec for
> the foreseeable future. Early boot code could possibly be changed to
> not clear protected bits. However, a kernel that requests CR bits be
> pinned can't know if the kernel it's kexecing has been updated to not
> clear protected bits. This would result in the kernel being kexec'd
> almost immediately receiving a general protection fault.
> 
> Security conscious kernel configurations disable kexec already, per KSPP
> guidelines. Projects such as Kata Containers, AWS Lambda, ChromeOS
> Termina, and others using KVM to virtualize Linux will benefit from
> this protection.
> 
> The usage of SMM in SeaBIOS was explored as a way to communicate to KVM
> that a reboot has occurred and it should zero the pinned bits. When
> using QEMU and SeaBIOS, SMM initialization occurs on reboot. However,
> prior to SMM initialization, BIOS writes zero values to CR0, causing a
> general protection fault to be sent to the guest before SMM can signal
> that the machine has booted.

SMM is optional; I think it makes sense to leave it to userspace to
reset pinning (including for the case of triple faults), while INIT,
which is handled within KVM, would keep it active.

> Pinning of sensitive CR bits has already been implemented to protect
> against exploits directly calling native_write_cr*(). The current
> protection cannot stop ROP attacks which jump directly to a MOV CR
> instruction. Guests running with paravirtualized CR pinning are now
> protected against the use of ROP to disable CR bits. The same bits that
> are being pinned natively may be pinned via the CR pinned MSRs. These
> bits are WP in CR0, and SMEP, SMAP, and UMIP in CR4.
> 
> Future patches could protect bits in MSRs in a similar fashion. The NXE
> bit of the EFER MSR is a prime candidate.

Please include patches for either kvm-unit-tests or
tools/testing/selftests/kvm that test the functionality.

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 2/2] X86: Use KVM CR pin MSRs
  2019-12-20 19:27 ` [RESEND RFC 2/2] X86: Use KVM " John Andersen
@ 2019-12-23  7:39   ` Andy Lutomirski
  2019-12-23 12:06     ` Borislav Petkov
  2019-12-24 21:18     ` Andersen, John S
  2019-12-24  6:45   ` kbuild test robot
  1 sibling, 2 replies; 22+ messages in thread
From: Andy Lutomirski @ 2019-12-23  7:39 UTC (permalink / raw)
  To: John Andersen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, X86 ML,
	Paolo Bonzini, H. Peter Anvin, Christopherson, Sean J,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, LKML,
	kvm list

On Fri, Dec 20, 2019 at 11:27 AM John Andersen
<john.s.andersen@intel.com> wrote:
>
> Strengthen existing control register pinning when running
> paravirtualized under KVM. Check which bits KVM supports pinning for
> each control register and only pin supported bits which are already
> pinned via the existing native protection. Write to KVM CR0 and CR4
> pinned MSRs to enable pinning.
>
> Initiate KVM assisted pinning directly following the setup of native
> pinning on boot CPU. For non-boot CPUs initiate paravirtualized pinning
> on CPU identification.
>
> Identification of non-boot CPUs takes place after the boot CPU has setup
> native CR pinning. Therefore, non-boot CPUs access pinned bits setup by
> the boot CPU and request that those be pinned. All CPUs request
> paravirtualized pinning of the same bits which are already pinned
> natively.
>
> Guests using the kexec system call currently do not support
> paravirtualized control register pinning. This is due to early boot
> code writing known good values to control registers, these values do
> not contain the protected bits. This is due to CPU feature
> identification being done at a later time, when the kernel properly
> checks if it can enable protections.

Is hibernation supported?  How about suspend-to-RAM?

FWIW, I think that handling these details through Kconfig is the wrong
choice.  Distribution kernels should enable this, and they're not
going to turn off kexec.  Arguably kexec should be made to work --
there is no fundamental reason that kexec should need to fiddle with
CR0.WP, for example.  But a boot option could also work as a
short-term option.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 2/2] X86: Use KVM CR pin MSRs
  2019-12-23  7:39   ` Andy Lutomirski
@ 2019-12-23 12:06     ` Borislav Petkov
  2019-12-24 21:18     ` Andersen, John S
  1 sibling, 0 replies; 22+ messages in thread
From: Borislav Petkov @ 2019-12-23 12:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Andersen, Thomas Gleixner, Ingo Molnar, X86 ML,
	Paolo Bonzini, H. Peter Anvin, Christopherson, Sean J,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, LKML,
	kvm list

On Sun, Dec 22, 2019 at 11:39:19PM -0800, Andy Lutomirski wrote:
> FWIW, I think that handling these details through Kconfig is the wrong
> choice.  Distribution kernels should enable this, and they're not
> going to turn off kexec.

Nope, the other way around is way more likely.

> Arguably kexec should be made to work -- there is no fundamental
> reason that kexec should need to fiddle with CR0.WP, for example. But
> a boot option could also work as a short-term option.

The problem with short-term solutions is that they become immutable
once people start using them. So it better be done right from the very
beginning, before it gets exposed.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-20 19:26 [RESEND RFC 0/2] Paravirtualized Control Register pinning John Andersen
                   ` (2 preceding siblings ...)
  2019-12-21 13:59 ` [RESEND RFC 0/2] Paravirtualized Control Register pinning Paolo Bonzini
@ 2019-12-23 14:30 ` Liran Alon
  2019-12-24 22:56   ` Liran Alon
  2019-12-25  2:04   ` Andy Lutomirski
  2019-12-23 14:48 ` Liran Alon
  4 siblings, 2 replies; 22+ messages in thread
From: Liran Alon @ 2019-12-23 14:30 UTC (permalink / raw)
  To: John Andersen
  Cc: Thomas Gleixner, mingo, bp, x86, pbonzini, hpa,
	sean.j.christopherson, vkuznets, wanpengli, jmattson, joro,
	linux-kernel, kvm



> On 20 Dec 2019, at 21:26, John Andersen <john.s.andersen@intel.com> wrote:
> 
> Paravirtualized Control Register pinning is a strengthened version of
> existing protections on the Write Protect, Supervisor Mode Execution /
> Access Protection, and User-Mode Instruction Prevention bits. The
> existing protections prevent native_write_cr*() functions from writing
> values which disable those bits. This patchset prevents any guest
> writes to control registers from disabling pinned bits, not just writes
> from native_write_cr*(). This stops attackers within the guest from
> using ROP to disable protection bits.
> 
> https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/
> 
> The protection is implemented by adding MSRs to KVM which contain the
> bits that are allowed to be pinned, and the bits which are pinned. The
> guest or userspace can enable bit pinning by reading MSRs to check
> which bits are allowed to be pinned, and then writing MSRs to set which
> bits they want pinned.
> 
> Other hypervisors such as HyperV have implemented similar protections
> for Control Registers and MSRs; which security researchers have found
> effective.
> 
> https://www.abatchy.com/2018/01/kernel-exploitation-4
> 

I think it’s important to mention how Hyper-V implements this protection as it is done in a very different architecture.

Hyper-V implements a set of PV APIs named VSM (Virtual Secure Mode) aimed at allowing a guest (partition) to separate itself into multiple security domains called VTLs (Virtual Trust Levels).
The VSM API exposes an interface to higher VTLs to control the execution of lower VTLs. In theory, VSM supports up to 16 VTLs, but Windows VBS (Virtualization Based Security), which is
the only current technology which utilises VSM, uses only 2 VTLs: VTL0 for most of OS execution (Normal-Mode) and VTL1 for secure OS execution (Secure-Mode).

A higher VTL controls execution of lower VTLs via the following VSM mechanisms:
1) Memory Access Protections: Allows a higher VTL to restrict memory access to physical pages, either making them inaccessible or limiting them to certain permissions.
2) Secure Intercepts: Allows a higher VTL to request that the hypervisor intercept certain events in lower VTLs for handling by the higher VTL. This includes access to system registers (e.g. CRs & MSRs).

VBS uses the above-mentioned mechanisms as follows:
a) Credential Guard: Prevents pass-the-hash attacks. Done by a VTL1 trustlet encrypting credentials with an encryption key stored in VTL1-only-accessible memory.
b) HVCI (Hypervisor-based Code-Integrity): Prevents execution of unsigned code. Done by marking all EPT entries with NX until their signature is verified by a VTL1 service; once verified, EPT entries are marked RO+X.
(HVCI also supports efficiently enforcing code-signing only on Ring0 code by utilising the Intel MBEC or AMD GMET CPU features, which allow setting the NX-bit on EPT entries based on guest CPL.)
c) KDP (Kernel Data Protection): Marks certain pages after initialisation as read-only in the VTL0 EPT.
d) kCFG (Kernel Control-Flow Guard): VTL1 protects a bitmap, specifying valid indirect branch targets, by making it read-only in the VTL0 EPT.
e) HyperGuard: VTL1 uses the “Secure Intercepts” mechanism to prevent VTL0 from modifying important system registers, including CR0 & CR4 as done by this patch.
    HyperGuard also implements a mechanism named NPIEP (Non-Privileged Instruction Execution Prevention) that prevents VTL0 Ring3 from executing SIDT/SGDT/SLDT/STR to leak Ring0 addresses.

To sum up: in Hyper-V, the hypervisor exposes a relatively thin API that allows a guest to partition itself into multiple security domains (enforced by virtualization).
Using this framework, it’s possible to implement multiple OS-level protection mechanisms. Only one of them is pinning certain registers to specific values, as done by this patch.

Therefore, as I also tried to say at the recent KVM Forum, I think KVM should consider exposing a VSM-like API to the guest to allow various guest OSes,
including Linux, to implement VBS-like features. To decide what this API should look like, we need to have a broader discussion with Linux
security maintainers and KVM maintainers on which security features we would like to implement using such an API and what their architecture should be.
Then, we can implement this API in KVM and start to gradually introduce more security features in Linux which utilise this API.

Once Linux has security features implemented with this new KVM API, we could also consider implementing them on top of other similar hypervisor APIs,
such as Hyper-V VSM, to make Linux more secure when running, for example, on Microsoft Azure compute instances.

Therefore, I see this patch as a short-term solution to quickly gain real security value on a very specific issue.
But if we are serious about improving Linux security using virtualization, we should have this broader discussion.

-Liran


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-20 19:26 [RESEND RFC 0/2] Paravirtualized Control Register pinning John Andersen
                   ` (3 preceding siblings ...)
  2019-12-23 14:30 ` Liran Alon
@ 2019-12-23 14:48 ` Liran Alon
  2019-12-23 17:09   ` Paolo Bonzini
  2019-12-24 19:44   ` Andersen, John S
  4 siblings, 2 replies; 22+ messages in thread
From: Liran Alon @ 2019-12-23 14:48 UTC (permalink / raw)
  To: John Andersen
  Cc: tglx, mingo, bp, x86, pbonzini, hpa, sean.j.christopherson,
	vkuznets, wanpengli, jmattson, joro, linux-kernel, kvm



> On 20 Dec 2019, at 21:26, John Andersen <john.s.andersen@intel.com> wrote:
> 
> Pinning is not active when running in SMM. Entering SMM disables pinned
> bits, writes to control registers within SMM would therefore trigger
> general protection faults if pinning was enforced.

For compatibility reasons, it’s reasonable that pinning won’t be active when running in SMM.
However, I do think we should not allow vSMM code to change pinned values when returning from SMM.
This would prevent a vulnerable vSMI handler from modifying the vSMM state-area to modify CR4 when running outside of vSMM.
I believe in this case it’s legit to just forcibly restore the original CR0/CR4 pinned values, ignoring vSMM changes.

> 
> The guest may never read pinned bits. If an attacker were to read the
> CR pinned MSRs, they might decide to perform another attack which would
> not cause a general protection fault.

I disagree with this statement.
An attacker knows which system it is attacking and can deduce from that which bits are pinned…
Therefore, protecting these MSRs from being read by the guest is not important at all.

> 
> Should userspace expose the CR pinning CPUID feature bit, it must zero CR
> pinned MSRs on reboot. If it does not, it runs the risk of having the
> guest enable pinning and subsequently cause general protection faults on
> next boot due to early boot code setting control registers to values
> which do not contain the pinned bits.

Why reset CR pinned MSRs by userspace instead of KVM INIT handling?

> 
> When running with KVM guest support and paravirtualized CR pinning
> enabled, paravirtualized and existing pinning are setup at the same
> point on the boot CPU. Non-boot CPUs setup pinning upon identification.
> 
> Guests using the kexec system call currently do not support
> paravirtualized control register pinning. This is due to early boot
> code writing known good values to control registers, these values do
> not contain the protected bits. This is due to CPU feature
> identification being done at a later time, when the kernel properly
> checks if it can enable protections.
> 
> Most distributions enable kexec. However, kexec could be made boot time
> disableable. In this case if a user has disabled kexec at boot time
> the guest will request that paravirtualized control register pinning
> be enabled. This would expand the userbase to users of major
> distributions.
> 
> Paravirtualized CR pinning will likely be incompatible with kexec for
> the foreseeable future. Early boot code could possibly be changed to
> not clear protected bits. However, a kernel that requests CR bits be
> pinned can't know if the kernel it's kexecing has been updated to not
> clear protected bits. This would result in the kernel being kexec'd
> almost immediately receiving a general protection fault.

Instead of disabling kexec entirely, I think it makes more sense to invent
some generic mechanism by which the new kernel can describe to the old kernel
a set of flags specifying which feature hand-overs it supports, one of them
being pinned CRs.

For example, isn’t this also relevant for IOMMU DMA protection?
i.e. Doesn’t the old kernel need to know whether it should disable or enable IOMMU DMAR
before kexec'ing to the new kernel? Similar to the EDK2 IOMMU DMA protection hand-over?

-Liran


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-23 14:48 ` Liran Alon
@ 2019-12-23 17:09   ` Paolo Bonzini
  2019-12-23 17:27     ` Andersen, John S
  2019-12-23 17:28     ` Liran Alon
  2019-12-24 19:44   ` Andersen, John S
  1 sibling, 2 replies; 22+ messages in thread
From: Paolo Bonzini @ 2019-12-23 17:09 UTC (permalink / raw)
  To: Liran Alon, John Andersen
  Cc: tglx, mingo, bp, x86, hpa, sean.j.christopherson, vkuznets,
	wanpengli, jmattson, joro, linux-kernel, kvm

On 23/12/19 15:48, Liran Alon wrote:
>> Should userspace expose the CR pinning CPUID feature bit, it must zero CR
>> pinned MSRs on reboot. If it does not, it runs the risk of having the
>> guest enable pinning and subsequently cause general protection faults on
>> next boot due to early boot code setting control registers to values
>> which do not contain the pinned bits.
> 
> Why reset CR pinned MSRs by userspace instead of KVM INIT handling?

Most MSRs are not reset by INIT, are they?

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-23 17:09   ` Paolo Bonzini
@ 2019-12-23 17:27     ` Andersen, John S
  2019-12-23 17:28     ` Liran Alon
  1 sibling, 0 replies; 22+ messages in thread
From: Andersen, John S @ 2019-12-23 17:27 UTC (permalink / raw)
  To: liran.alon, pbonzini
  Cc: jmattson, joro, bp, x86, vkuznets, hpa, mingo, tglx, kvm,
	Christopherson, Sean J, wanpengli, linux-kernel

On Mon, 2019-12-23 at 18:09 +0100, Paolo Bonzini wrote:
> On 23/12/19 15:48, Liran Alon wrote:
> > > Should userspace expose the CR pinning CPUID feature bit, it must
> > > zero CR
> > > pinned MSRs on reboot. If it does not, it runs the risk of having
> > > the
> > > guest enable pinning and subsequently cause general protection
> > > faults on
> > > next boot due to early boot code setting control registers to
> > > values
> > > which do not contain the pinned bits.
> > 
> > Why reset CR pinned MSRs by userspace instead of KVM INIT handling?
> 
> Most MSRs are not reset by INIT, are they?
> 

As far as I can tell, KVM doesn't know when the guest has rebooted.
Userspace uses the sregs and set-MSRs ioctls to reset state.
kvm_vcpu_reset is called on non-boot CPUs. kvm_vcpu_init isn't called
on reboot.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-23 17:09   ` Paolo Bonzini
  2019-12-23 17:27     ` Andersen, John S
@ 2019-12-23 17:28     ` Liran Alon
  2019-12-23 17:46       ` Paolo Bonzini
  1 sibling, 1 reply; 22+ messages in thread
From: Liran Alon @ 2019-12-23 17:28 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: John Andersen, tglx, mingo, bp, x86, hpa, sean.j.christopherson,
	vkuznets, wanpengli, jmattson, joro, linux-kernel, kvm



> On 23 Dec 2019, at 19:09, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 23/12/19 15:48, Liran Alon wrote:
>>> Should userspace expose the CR pinning CPUID feature bit, it must zero CR
>>> pinned MSRs on reboot. If it does not, it runs the risk of having the
>>> guest enable pinning and subsequently cause general protection faults on
>>> next boot due to early boot code setting control registers to values
>>> which do not contain the pinned bits.
>> 
>> Why reset CR pinned MSRs by userspace instead of KVM INIT handling?
> 
> Most MSRs are not reset by INIT, are they?
> 
> Paolo
> 

MSR_KVM_SYSTEM_TIME saved in vcpu->arch.time is reset at kvmclock_reset() which is called by kvm_vcpu_reset() (KVM INIT handler).
In addition, vmx_vcpu_reset(), called from kvm_vcpu_reset(), also resets multiple MSRs such as: MSR_IA32_SPEC_CTRL (vmx->spec_ctrl) and MSR_IA32_UMWAIT_CONTROL (msr_ia32_umwait_control).

Having said that, I see indeed that most of the MSRs are being set by QEMU in kvm_put_msrs() when level >= KVM_PUT_RESET_STATE,
which is triggered by qemu_system_reset() -> cpu_synchronize_all_post_reset -> cpu_synchronize_post_reset() -> kvm_cpu_synchronize_post_reset().

So given current design, OK I agree with you that CR pinned MSRs should be zeroed by userspace VMM.

It does seem kinda weird to me, though, that part of the CPU state is initialised in the KVM INIT handler and part of it in the userspace VMM.
It could lead to inconsistent (i.e. diverging from spec) CPU behaviour.

-Liran

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-21 13:59 ` [RESEND RFC 0/2] Paravirtualized Control Register pinning Paolo Bonzini
@ 2019-12-23 17:28   ` Andersen, John S
  0 siblings, 0 replies; 22+ messages in thread
From: Andersen, John S @ 2019-12-23 17:28 UTC (permalink / raw)
  To: tglx, pbonzini, mingo, bp, x86
  Cc: jmattson, joro, vkuznets, hpa, linux-kernel, kvm, wanpengli,
	Christopherson, Sean J

On Sat, 2019-12-21 at 14:59 +0100, Paolo Bonzini wrote:
> On 20/12/19 20:26, John Andersen wrote:
> > Paravirtualized CR pinning will likely be incompatible with kexec
> > for
> > the foreseeable future. Early boot code could possibly be changed
> > to
> > not clear protected bits. However, a kernel that requests CR bits
> > be
> > pinned can't know if the kernel it's kexecing has been updated to
> > not
> > clear protected bits. This would result in the kernel being kexec'd
> > almost immediately receiving a general protection fault.
> > 
> > Security conscious kernel configurations disable kexec already, per
> > KSPP
> > guidelines. Projects such as Kata Containers, AWS Lambda, ChromeOS
> > Termina, and others using KVM to virtualize Linux will benefit from
> > this protection.
> > 
> > The usage of SMM in SeaBIOS was explored as a way to communicate to
> > KVM
> > that a reboot has occurred and it should zero the pinned bits. When
> > using QEMU and SeaBIOS, SMM initialization occurs on reboot.
> > However,
> > prior to SMM initialization, BIOS writes zero values to CR0,
> > causing a
> > general protection fault to be sent to the guest before SMM can
> > signal
> > that the machine has booted.
> 
> SMM is optional; I think it makes sense to leave it to userspace to
> reset pinning (including for the case of triple faults), while INIT
> which is handled within KVM would keep it active.
> 
> > Pinning of sensitive CR bits has already been implemented to
> > protect
> > against exploits directly calling native_write_cr*(). The current
> > protection cannot stop ROP attacks which jump directly to a MOV CR
> > instruction. Guests running with paravirtualized CR pinning are now
> > protected against the use of ROP to disable CR bits. The same bits
> > that
> > are being pinned natively may be pinned via the CR pinned MSRs.
> > These
> > bits are WP in CR0, and SMEP, SMAP, and UMIP in CR4.
> > 
> > Future patches could protect bits in MSRs in a similar fashion. The
> > NXE
> > bit of the EFER MSR is a prime candidate.
> 
> Please include patches for either kvm-unit-tests or
> tools/testing/selftests/kvm that test the functionality.
> 

Will do

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-23 17:28     ` Liran Alon
@ 2019-12-23 17:46       ` Paolo Bonzini
  2019-12-23 22:49         ` Liran Alon
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2019-12-23 17:46 UTC (permalink / raw)
  To: Liran Alon
  Cc: John Andersen, tglx, mingo, bp, x86, hpa, sean.j.christopherson,
	vkuznets, wanpengli, jmattson, joro, linux-kernel, kvm

On 23/12/19 18:28, Liran Alon wrote:
>>> Why reset CR pinned MSRs by userspace instead of KVM INIT
>>> handling?
>> Most MSRs are not reset by INIT, are they?
>> 
>> Paolo
>> 
> MSR_KVM_SYSTEM_TIME saved in vcpu->arch.time is reset at
> kvmclock_reset() which is called by kvm_vcpu_reset() (KVM INIT
> handler). In addition, vmx_vcpu_reset(), called from
> kvm_vcpu_reset(), also resets multiple MSRs such as:
> MSR_IA32_SPEC_CTRL (vmx->spec_ctrl) and MSR_IA32_UMWAIT_CONTROL
> (msr_ia32_umwait_control).

These probably can be removed, since they are zero at startup and at
least SPEC_CTRL is documented[1] to be unaffected by INIT.  However, I
couldn't find information about UMWAIT_CONTROL.

> Having said that, I see indeed that most of MSRs are being set by
> QEMU in kvm_put_msrs() when level >= KVM_PUT_RESET_STATE. When is
> triggered by qemu_system_reset() -> cpu_synchronize_all_post_reset ->
> cpu_synchronize_post_reset() -> kvm_cpu_synchronize_post_reset().
> 
> So given current design, OK I agree with you that CR pinned MSRs
> should be zeroed by userspace VMM.
> 
> It does though seems kinda weird to me that part of CPU state is
> initialised on KVM INIT handler and part of it in userspace VMM. It
> could lead to inconsistent (i.e. diverging from spec) CPU behaviour.

The reason for that is that even on real hardware INIT does not touch
most MSRs:

  9.1 Initialization overview

  Asserting the INIT# pin on the processor invokes a similar response to
  a hardware reset. The major difference is that during an INIT, the
  internal caches, MSRs, MTRRs, and x87 FPU state are left unchanged
  (although, the TLBs and BTB are invalidated as with a hardware reset).
  An INIT provides a method for switching from protected to real-address
  mode while maintaining the contents of the internal caches.

Paolo

[1]
https://software.intel.com/security-software-guidance/api-app/sites/default/files/336996-Speculative-Execution-Side-Channel-Mitigations.pdf


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-23 17:46       ` Paolo Bonzini
@ 2019-12-23 22:49         ` Liran Alon
  0 siblings, 0 replies; 22+ messages in thread
From: Liran Alon @ 2019-12-23 22:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: John Andersen, tglx, mingo, bp, x86, hpa, sean.j.christopherson,
	vkuznets, wanpengli, jmattson, joro, linux-kernel, kvm



> On 23 Dec 2019, at 19:46, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 23/12/19 18:28, Liran Alon wrote:
>>>> Why reset CR pinned MSRs by userspace instead of KVM INIT
>>>> handling?
>>> Most MSRs are not reset by INIT, are they?
>>> 
>>> Paolo
>>> 
>> MSR_KVM_SYSTEM_TIME saved in vcpu->arch.time is reset at
>> kvmclock_reset() which is called by kvm_vcpu_reset() (KVM INIT
>> handler). In addition, vmx_vcpu_reset(), called from
>> kvm_vcpu_reset(), also resets multiple MSRs such as:
>> MSR_IA32_SPEC_CTRL (vmx->spec_ctrl) and MSR_IA32_UMWAIT_CONTROL
>> (msr_ia32_umwait_control).
> 
> These probably can be removed, since they are zero at startup and at
> least SPEC_CTRL is documented[1] to be unaffected by INIT.  However, I
> couldn't find information about UMWAIT_CONTROL.
> 
>> Having said that, I see indeed that most of MSRs are being set by
>> QEMU in kvm_put_msrs() when level >= KVM_PUT_RESET_STATE. When is
>> triggered by qemu_system_reset() -> cpu_synchronize_all_post_reset ->
>> cpu_synchronize_post_reset() -> kvm_cpu_synchronize_post_reset().
>> 
>> So given current design, OK I agree with you that CR pinned MSRs
>> should be zeroed by userspace VMM.
>> 
>> It does though seems kinda weird to me that part of CPU state is
>> initialised on KVM INIT handler and part of it in userspace VMM. It
>> could lead to inconsistent (i.e. diverging from spec) CPU behaviour.
> 
> The reason for that is the even on real hardware INIT does not touch
> most MSRs:
> 
>  9.1 Initialization overview
> 
>  Asserting the INIT# pin on the processor invokes a similar response to
>  a hardware reset. The major difference is that during an INIT, the
>  internal caches, MSRs, MTRRs, and x87 FPU state are left unchanged
>  (although, the TLBs and BTB are invalidated as with a hardware reset).
>  An INIT provides a method for switching from protected to real-address
>  mode while maintaining the contents of the internal caches.
> 
> Paolo
> 
> [1]
> https://software.intel.com/security-software-guidance/api-app/sites/default/files/336996-Speculative-Execution-Side-Channel-Mitigations.pdf
> 

Oh right. That’s true.
There is a difference between power-up or asserting the RESET# pin, and asserting the INIT# pin or receiving an INIT IPI.
The former resets all of the processor state while the latter resets only part of it, as Intel SDM section 9.1 indeed describes.

So basically, because KVM is only aware of the INIT IPI but not of power-up and RESET# pin assertion events,
the job of emulating the latter events is currently implemented in the userspace VMM.

Having said that, one could argue that because KVM implements the vCPU, it should just expose a relevant ioctl
for the userspace VMM to trigger a “full vCPU state reset” on power-up or RESET# pin assertion events.
i.e. Making the userspace VMM only the one which emulates the peripheral hardware that triggers the event,
but not implementing the vCPU emulation that responds to this event. I think this would have resulted in a cleaner design.

But we are diverting from original patch-series discussion so if we want to discuss this further, let’s open a separate email thread on that. :)

-Liran


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 2/2] X86: Use KVM CR pin MSRs
  2019-12-20 19:27 ` [RESEND RFC 2/2] X86: Use KVM " John Andersen
  2019-12-23  7:39   ` Andy Lutomirski
@ 2019-12-24  6:45   ` kbuild test robot
  1 sibling, 0 replies; 22+ messages in thread
From: kbuild test robot @ 2019-12-24  6:45 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2148 bytes --]

Hi John,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on kvm/linux-next]
[also build test ERROR on tip/x86/core tip/auto-latest v5.5-rc3 next-20191219]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/John-Andersen/KVM-X86-Add-CR-pin-MSRs/20191224-032618
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git linux-next
config: x86_64-defconfig (attached as .config)
compiler: gcc-7 (Debian 7.5.0-3) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/x86/kernel/cpu/common.c: In function 'setup_cr_pinning':
>> arch/x86/kernel/cpu/common.c:440:2: error: implicit declaration of function 'kvm_setup_paravirt_cr_pinning'; did you mean 'setup_cr_pinning'? [-Werror=implicit-function-declaration]
     kvm_setup_paravirt_cr_pinning(X86_CR0_WP, cr4_pinned_bits);
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     setup_cr_pinning
   cc1: some warnings being treated as errors

vim +440 arch/x86/kernel/cpu/common.c

   426	
   427	/*
   428	 * Once CPU feature detection is finished (and boot params have been
   429	 * parsed), record any of the sensitive CR bits that are set, and
   430	 * enable CR pinning.
   431	 */
   432	static void __init setup_cr_pinning(void)
   433	{
   434		unsigned long mask;
   435	
   436		mask = (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP);
   437		cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & mask;
   438		static_key_enable(&cr_pinning.key);
   439	
 > 440		kvm_setup_paravirt_cr_pinning(X86_CR0_WP, cr4_pinned_bits);
   441	}
   442	
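
[The likely cause: patch 2 declares kvm_setup_paravirt_cr_pinning() (and
its !CONFIG_PARAVIRT_CR_PIN stub) inside the CONFIG_KVM_GUEST section of
asm/kvm_para.h, while cpu/common.c is built for every config. A hedged,
untested sketch of a fix is to make the declaration visible
unconditionally:

    #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_CR_PIN)
    void kvm_setup_paravirt_cr_pinning(unsigned long cr0_pinned_bits,
                                       unsigned long cr4_pinned_bits);
    #else
    static inline void kvm_setup_paravirt_cr_pinning(unsigned long cr0_pinned_bits,
                                                     unsigned long cr4_pinned_bits)
    {
    }
    #endif
]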

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org Intel Corporation

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 28644 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-23 14:48 ` Liran Alon
  2019-12-23 17:09   ` Paolo Bonzini
@ 2019-12-24 19:44   ` Andersen, John S
  2019-12-24 20:35     ` Liran Alon
  1 sibling, 1 reply; 22+ messages in thread
From: Andersen, John S @ 2019-12-24 19:44 UTC (permalink / raw)
  To: liran.alon
  Cc: jmattson, joro, bp, x86, vkuznets, hpa, mingo, tglx, kvm,
	pbonzini, Christopherson, Sean J, wanpengli, linux-kernel

On Mon, 2019-12-23 at 16:48 +0200, Liran Alon wrote:
> > On 20 Dec 2019, at 21:26, John Andersen <john.s.andersen@intel.com>
> > wrote:
> > 
> > Pinning is not active when running in SMM. Entering SMM disables
> > pinned
> > bits, writes to control registers within SMM would therefore
> > trigger
> > general protection faults if pinning was enforced.
> 
> For compatibility reasons, it’s reasonable that pinning won’t be
> active when running in SMM.
> However, I do think we should not allow vSMM code to change pinned
> values when returning back from SMM.
> This would prevent a vulnerable vSMI handler from modifying vSMM
> state-area to modify CR4 when running outside of vSMM.
> I believe in this case it’s legit to just forcibly restore original
> CR0/CR4 pinned values. Ignoring vSMM changes.
> 

In em_rsm, could we just OR in the value of the PINNED MSRs right
before the final return?
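
Something like the following rough sketch? Since the emulator is
deliberately insulated from struct kvm_vcpu, this assumes a new
(hypothetical) x86_emulate_ops hook, get_cr_pinned(), to reach the
pinned values:

    /*
     * Rough sketch, at the end of em_rsm(): re-assert the pinned bits
     * regardless of what the SMM state-save area held.
     */
    ctxt->ops->set_cr(ctxt, 0, ctxt->ops->get_cr(ctxt, 0) |
                               ctxt->ops->get_cr_pinned(ctxt, 0));
    ctxt->ops->set_cr(ctxt, 4, ctxt->ops->get_cr(ctxt, 4) |
                               ctxt->ops->get_cr_pinned(ctxt, 4));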

> > The guest may never read pinned bits. If an attacker were to read
> > the
> > CR pinned MSRs, they might decide to perform another attack which
> > would
> > not cause a general protection fault.
> 
> I disagree with this statement.
> An attacker knows what is the system it is attacking and can deduce
> by that which bits it pinned…
> Therefore, protecting from guest reading these is not important at
> all.
> 

Sure, I'll make it readable.

> > Should userspace expose the CR pinning CPUID feature bit, it must
> > zero CR
> > pinned MSRs on reboot. If it does not, it runs the risk of having
> > the
> > guest enable pinning and subsequently cause general protection
> > faults on
> > next boot due to early boot code setting control registers to
> > values
> > which do not contain the pinned bits.
> 
> Why reset CR pinned MSRs by userspace instead of KVM INIT handling?
> 
> > When running with KVM guest support and paravirtualized CR pinning
> > enabled, paravirtualized and existing pinning are setup at the same
> > point on the boot CPU. Non-boot CPUs setup pinning upon
> > identification.
> > 
> > Guests using the kexec system call currently do not support
> > paravirtualized control register pinning. This is due to early boot
> > code writing known good values to control registers, these values
> > do
> > not contain the protected bits. This is due to CPU feature
> > identification being done at a later time, when the kernel properly
> > checks if it can enable protections.
> > 
> > Most distributions enable kexec. However, kexec could be made boot
> > time
> > disableable. In this case if a user has disabled kexec at boot time
> > the guest will request that paravirtualized control register
> > pinning
> > be enabled. This would expand the userbase to users of major
> > distributions.
> > 
> > Paravirtualized CR pinning will likely be incompatible with kexec
> > for
> > the foreseeable future. Early boot code could possibly be changed
> > to
> > not clear protected bits. However, a kernel that requests CR bits
> > be
> > pinned can't know if the kernel it's kexecing has been updated to
> > not
> > clear protected bits. This would result in the kernel being kexec'd
> > almost immediately receiving a general protection fault.
> 
> Instead of disabling kexec entirely, I think it makes more sense to
> invent some generic mechanism by which the new kernel can describe to
> the old kernel a set of flags specifying which feature hand-overs it
> supports. One of them being pinned CRs.
> 
> For example, isn’t this also relevant for IOMMU DMA protection?
> i.e. Doesn’t old kernel need to know if it should disable or enable
> IOMMU DMAR
> before kexec to new kernel? Similar to EDK2 IOMMU DMA protection
> hand-over?

Great idea.

Making kexec work will require changes to these files and maybe more:

arch/x86/boot/compressed/head_64.S
arch/x86/kernel/head_64.S
arch/x86/kernel/relocate_kernel_64.S

My previous attempts at this showed different results when running
virtualized vs. unvirtualized, specifically different behavior with the
SMAP and UMIP bits.

This would be a longer process though, as validating that everything
still works both in VMs and on physical hosts will be required. As it
stands this patchset could pick up a fairly large userbase via the
virtualized container projects. Should we pursue kexec in this patchset
or a later one?

Thanks,
John

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-24 19:44   ` Andersen, John S
@ 2019-12-24 20:35     ` Liran Alon
  2019-12-24 21:17       ` Andersen, John S
  0 siblings, 1 reply; 22+ messages in thread
From: Liran Alon @ 2019-12-24 20:35 UTC (permalink / raw)
  To: Andersen, John S
  Cc: jmattson, joro, bp, x86, vkuznets, hpa, mingo, tglx, kvm,
	pbonzini, Christopherson, Sean J, wanpengli, linux-kernel



> On 24 Dec 2019, at 21:44, Andersen, John S <john.s.andersen@intel.com> wrote:
> 
> On Mon, 2019-12-23 at 16:48 +0200, Liran Alon wrote:
>>> On 20 Dec 2019, at 21:26, John Andersen <john.s.andersen@intel.com>
>>> wrote:
>>> 
>>> Pinning is not active when running in SMM. Entering SMM disables
>>> pinned
>>> bits, writes to control registers within SMM would therefore
>>> trigger
>>> general protection faults if pinning was enforced.
>> 
>> For compatibility reasons, it’s reasonable that pinning won’t be
>> active when running in SMM.
>> However, I do think we should not allow vSMM code to change pinned
>> values when returning back from SMM.
>> This would prevent a vulnerable vSMI handler from modifying vSMM
>> state-area to modify CR4 when running outside of vSMM.
>> I believe in this case it’s legit to just forcibly restore original
>> CR0/CR4 pinned values. Ignoring vSMM changes.
>> 
> 
> In em_rsm could we just OR with the value of the PINNED MSRs right
> before the final return?

Not exactly.

If I understand correctly, the proposed mechanism should also allow pinning specific
system register (e.g. CR0/CR4) bits to either the cleared or the set state, not necessarily set,
as kvm_set_cr0() & kvm_set_cr4() were modified to only check whether a pinned bit changed value.

Therefore, you should modify enter_smm() to save current pinned bits values
and then silently restore their values on em_rsm().
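
Roughly something like this (a sketch only -- the cr4_pinned_mask and
smm_cr4_pinned_vals field names are made up for illustration, they are
not the names the patch defines):

static void smm_save_pinned(struct kvm_vcpu *vcpu)
{
	/* Remember the values the pinned CR4 bits had before entering
	 * SMM (CR0 would be handled analogously). */
	vcpu->arch.smm_cr4_pinned_vals = kvm_read_cr4(vcpu) &
					 vcpu->arch.cr4_pinned_mask;
}

static unsigned long smm_restore_pinned(struct kvm_vcpu *vcpu,
					unsigned long cr4)
{
	/* Force every pinned bit back to its pre-SMM value, whether it
	 * was pinned set or pinned cleared; a plain OR would only
	 * handle bits pinned to the set state. */
	return (cr4 & ~vcpu->arch.cr4_pinned_mask) |
	       vcpu->arch.smm_cr4_pinned_vals;
}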

> 
>>> The guest may never read pinned bits. If an attacker were to read
>>> the
>>> CR pinned MSRs, they might decide to perform another attack which
>>> would
>>> not cause a general protection fault.
>> 
>> I disagree with this statement.
>> An attacker knows what system it is attacking and can deduce from
>> that which bits it pinned…
>> Therefore, protecting these from being read by the guest is not
>> important at all.
>> 
> 
> Sure, I'll make it readable.
> 
>>> Should userspace expose the CR pinning CPUID feature bit, it must
>>> zero CR
>>> pinned MSRs on reboot. If it does not, it runs the risk of having
>>> the
>>> guest enable pinning and subsequently cause general protection
>>> faults on
>>> next boot due to early boot code setting control registers to
>>> values
>>> which do not contain the pinned bits.
>> 
>> Why reset CR pinned MSRs by userspace instead of KVM INIT handling?
>> 
>>> When running with KVM guest support and paravirtualized CR pinning
>>> enabled, paravirtualized and existing pinning are setup at the same
>>> point on the boot CPU. Non-boot CPUs setup pinning upon
>>> identification.
>>> 
>>> Guests using the kexec system call currently do not support
>>> paravirtualized control register pinning. This is due to early boot
>>> code writing known good values to control registers, these values
>>> do
>>> not contain the protected bits. This is due to CPU feature
>>> identification being done at a later time, when the kernel properly
>>> checks if it can enable protections.
>>> 
>>> Most distributions enable kexec. However, kexec could be made boot
>>> time
>>> disableable. In this case if a user has disabled kexec at boot time
>>> the guest will request that paravirtualized control register
>>> pinning
>>> be enabled. This would expand the userbase to users of major
>>> distributions.
>>> 
>>> Paravirtualized CR pinning will likely be incompatible with kexec
>>> for
>>> the foreseeable future. Early boot code could possibly be changed
>>> to
>>> not clear protected bits. However, a kernel that requests CR bits
>>> be
>>> pinned can't know if the kernel it's kexecing has been updated to
>>> not
>>> clear protected bits. This would result in the kernel being kexec'd
>>> almost immediately receiving a general protection fault.
>> 
>> Instead of disabling kexec entirely, I think it makes more sense to
>> invent some generic mechanism by which the new kernel can describe
>> to the old kernel a set of flags specifying which feature hand-overs
>> it supports. One of them being pinned CRs.
>> 
>> For example, isn’t this also relevant for IOMMU DMA protection?
>> i.e. Doesn’t old kernel need to know if it should disable or enable
>> IOMMU DMAR
>> before kexec to new kernel? Similar to EDK2 IOMMU DMA protection
>> hand-over?
> 
> Great idea.
> 
> Making kexec work will require changes to these files and maybe more:
> 
> arch/x86/boot/compressed/head_64.S
> arch/x86/kernel/head_64.S
> arch/x86/kernel/relocate_kernel_64.S
> 
> My previous attempts at this showed different results when running
> virtualized vs. unvirtualized, specifically different behavior with
> the SMAP and UMIP bits.

I didn’t understand why there should be different results when running virtualized or not.

What I suggested is to just add metadata in a vmlinux ELF section that will be designated to
describe vmlinux handover capabilities.

Then, kexec functionality can be modified to read this section before loading & jumping
to new kernel vmlinux.

In the context of this patch-set, this section will specify a flag of if new vmlinux supports
CR0/CR4 pinning hand-over. If not and current kernel already pinned these values,
kexec should just return failure.

Just for example, in the context of IOMMU DMA protection hand-over, kexec will
use another flag to know if the new vmlinux supports IOMMU DMA protection hand-over,
and if not, make sure to disable the IOMMU before jumping to the new vmlinux.
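
Schematically, something like this on the kexec load path (purely
illustrative -- none of these flag or helper names exist today):

#define HANDOVER_CR_PINNING	(1ULL << 0)	/* invented flag */
#define HANDOVER_IOMMU_DMAR	(1ULL << 1)	/* invented flag */

/* 'flags' would be read from the hand-over ELF note of the new
 * vmlinux; cr_pinning_active() and iommu_disable_for_kexec() are
 * hypothetical helpers. */
static int kexec_check_handover(u64 flags)
{
	/* CRs already pinned but the new kernel can't take that over:
	 * refuse to load rather than fault right after the jump. */
	if (cr_pinning_active() && !(flags & HANDOVER_CR_PINNING))
		return -EINVAL;

	/* Analogous for IOMMU DMA protection: disable DMAR before
	 * jumping if the new kernel doesn't support the hand-over. */
	if (!(flags & HANDOVER_IOMMU_DMAR))
		iommu_disable_for_kexec();

	return 0;
}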

> 
> This would be a longer process though, as validating that everything
> still works both in VMs and on physical hosts will be required. As it
> stands this patchset could pick up a fairly large userbase via the
> virtualized container projects. Should we pursue kexec in this patchset
> or a later one?

In my opinion, this should be handled as part of this patch-series.
Otherwise, you basically introduce a kernel change that breaks existing functionality
in an unpredictable manner for the user.

i.e. It would be OK if the kernel only allowed the system register pinning functionality
when the user had also disabled kexec at boot-time. But it’s not OK for
kernel behaviour to change such that kexec suddenly crashes because the kernel was upgraded
and now uses system register pinning.

My personal opinion though is that kexec should first be enhanced to read hand-over metadata
as I described, and only then this patch-series applied, which also modifies kexec to define and use
one of the hand-over bits as I mentioned above. But I would like to hear additional opinions on this.

In addition, I would like to understand the Linux security maintainers’ point of view on the comment
I mentioned in another reply regarding how this mechanism is implemented in Hyper-V.

-Liran

> 
> Thanks,
> John


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-24 20:35     ` Liran Alon
@ 2019-12-24 21:17       ` Andersen, John S
  0 siblings, 0 replies; 22+ messages in thread
From: Andersen, John S @ 2019-12-24 21:17 UTC (permalink / raw)
  To: liran.alon
  Cc: jmattson, joro, bp, x86, vkuznets, hpa, mingo, tglx, kvm,
	pbonzini, wanpengli, Christopherson, Sean J, linux-kernel

On Tue, 2019-12-24 at 22:35 +0200, Liran Alon wrote:
> > On 24 Dec 2019, at 21:44, Andersen, John S <
> > john.s.andersen@intel.com> wrote:
> > In em_rsm could we just OR with the value of the PINNED MSRs right
> > before the final return?
> 
> Not exactly.
> 
> If I understand correctly, the proposed mechanism should also allow
> pinning specific system register (e.g. CR0/CR4) bits to either the
> cleared or the set state, not necessarily set, as kvm_set_cr0() &
> kvm_set_cr4() were modified to only check whether a pinned bit
> changed value.
> 
> Therefore, you should modify enter_smm() to save current pinned bits
> values
> and then silently restore their values on em_rsm().
> 

That's true. Sounds good.

> > 
> > Making kexec work will require changes to these files and maybe
> > more:
> > 
> > arch/x86/boot/compressed/head_64.S
> > arch/x86/kernel/head_64.S
> > arch/x86/kernel/relocate_kernel_64.S
> > 
> > My previous attempts at this showed different results when running
> > virtualized vs. unvirtualized, specifically different behavior with
> > the SMAP and UMIP bits.
> 
> I didn’t understand why there should be different results when
> running virtualized or not.
> 

I think it's either a bug in KVM's enforcement of SMAP and UMIP, or a
bug in the hardware I'm on. If I do the same as below but with SMAP
and UMIP, the physical host doesn't boot, but the virtualized host
boots fine and can kexec without issue.

diff --git a/arch/x86/boot/compressed/head_64.S
b/arch/x86/boot/compressed/head_64.S
index 537f7d2bfb17a..dc50634fa674e 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -126,7 +126,7 @@ ENTRY(startup_32)
 
 	/* Enable PAE mode */
 	movl	%cr4, %eax
-	orl	$X86_CR4_PAE, %eax
+	orl	$(X86_CR4_PAE | X86_CR4_SMEP), %eax
 	movl	%eax, %cr4
 
  /*
@@ -614,10 +614,10 @@ ENTRY(trampoline_32bit_src)
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	$(X86_CR4_PAE | X86_CR4_SMEP), %eax
 	cmpl	$0, %edx
 	jz	1f
-	orl	$X86_CR4_LA57, %eax
+	orl	$(X86_CR4_LA57 | X86_CR4_SMEP), %eax
 1:
 	movl	%eax, %cr4
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index f3d3e9646a99b..d57ce48a40b5f 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -122,7 +122,7 @@ ENTRY(secondary_startup_64)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movl	$(X86_CR4_PAE | X86_CR4_PGE | X86_CR4_SMEP), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
diff --git a/arch/x86/kernel/relocate_kernel_64.S
b/arch/x86/kernel/relocate_kernel_64.S
index b0e558b473f9b..38acd530622cc 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -129,7 +129,7 @@ identity_mapped:
 	 *  - physical address extension enabled
 	 *  - 5-level paging, if it was enabled before
 	 */
-	movl	$X86_CR4_PAE, %eax
+	movl	$(X86_CR4_PAE | X86_CR4_SMEP), %eax
 	testq	$X86_CR4_LA57, %r13
 	jz	1f
 	orl	$X86_CR4_LA57, %eax


I realize this isn't the right way to do this. But it was just my first
attempt at it. We should probably just change it to be something along
the lines of, if the bit was already set, keep it set. I probably just
need to go further down that road.
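
In C terms the idea is just a read-modify-write (sketch only -- the
real change would be in the asm above, where C isn't available, and
cr4_set_boot_bits() is an invented name):

static inline void cr4_set_boot_bits(unsigned long bits)
{
	/* Preserve whatever protection bits (SMEP/SMAP/UMIP) are
	 * already set instead of loading a fixed constant, and only
	 * add the bits this boot path actually needs. */
	native_write_cr4(native_read_cr4() | bits);
}

e.g. cr4_set_boot_bits(X86_CR4_PAE) instead of writing $X86_CR4_PAE
directly.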


> What I suggested is to just add metadata in a vmlinux ELF section
> that will be designated to
> describe vmlinux handover capabilities.
> 
> Then, kexec functionality can be modified to read this section before
> loading & jumping
> to new kernel vmlinux.
> 
> In the context of this patch-set, this section will specify a flag of
> if new vmlinux supports
> CR0/CR4 pinning hand-over. If not and current kernel already pinned
> these values,
> kexec should just return failure.
> 
> Just for example, in the context of IOMMU DMA protection hand-over,
> kexec will use another flag to know if the new vmlinux supports IOMMU
> DMA protection hand-over, and if not, make sure to disable the IOMMU
> before jumping to the new vmlinux.
> 

Ah okay I'm sorry I misunderstood.

> > This would be a longer process though, as validating that everything
> > still works both in VMs and on physical hosts will be required. As it
> > stands this patchset could pick up a fairly large userbase via the
> > virtualized container projects. Should we pursue kexec in this
> > patchset or a later one?
> 
> In my opinion, this should be handled as part of this patch-series.
> Otherwise, you basically introduce a kernel change that breaks
> existing functionality in an unpredictable manner for the user.
> 
> i.e. It would be OK if the kernel only allowed the system register
> pinning functionality when the user had also disabled kexec at
> boot-time. But it’s not OK for kernel behaviour to change such that
> kexec suddenly crashes because the kernel was upgraded and now uses
> system register pinning.
> 
> My personal opinion though is that kexec should first be enhanced to
> read hand-over metadata as I described, and only then this
> patch-series applied, which also modifies kexec to define and use one
> of the hand-over bits as I mentioned above. But I would like to hear
> additional opinions on this.
> 

I will check this out. Thanks.

> In addition, I would like to understand the Linux security
> maintainers’ point of view on the comment I mentioned in another
> reply regarding how this mechanism is implemented in Hyper-V.
> 

I agree. I did see Hyper-V's interface definitions and didn't understand
it to be that simple. It seemed like there was a significant amount of
pre-existing hypercall infrastructure that it was a part of. My
understanding was that we'd have to replicate that, and then only
implement the subset that we care about. It seemed like overkill to get
this enabled. The goal here has been to get this CR protection in as a
simple interface. It would of course be nice to have all those
protections as well though.

Thanks,
John


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 2/2] X86: Use KVM CR pin MSRs
  2019-12-23  7:39   ` Andy Lutomirski
  2019-12-23 12:06     ` Borislav Petkov
@ 2019-12-24 21:18     ` Andersen, John S
  1 sibling, 0 replies; 22+ messages in thread
From: Andersen, John S @ 2019-12-24 21:18 UTC (permalink / raw)
  To: luto
  Cc: jmattson, joro, bp, x86, vkuznets, hpa, mingo, tglx, kvm,
	pbonzini, Christopherson, Sean J, wanpengli, linux-kernel

On Sun, 2019-12-22 at 23:39 -0800, Andy Lutomirski wrote:
> On Fri, Dec 20, 2019 at 11:27 AM John Andersen
> <john.s.andersen@intel.com> wrote:
> > Strengthen existing control register pinning when running
> > paravirtualized under KVM. Check which bits KVM supports pinning
> > for
> > each control register and only pin supported bits which are already
> > pinned via the existing native protection. Write to KVM CR0 and CR4
> > pinned MSRs to enable pinning.
> > 
> > Initiate KVM assisted pinning directly following the setup of
> > native
> > pinning on boot CPU. For non-boot CPUs initiate paravirtualized
> > pinning
> > on CPU identification.
> > 
> > Identification of non-boot CPUs takes place after the boot CPU has
> > setup
> > native CR pinning. Therefore, non-boot CPUs access pinned bits
> > setup by
> > the boot CPU and request that those be pinned. All CPUs request
> > paravirtualized pinning of the same bits which are already pinned
> > natively.
> > 
> > Guests using the kexec system call currently do not support
> > paravirtualized control register pinning. This is due to early boot
> > code writing known good values to control registers, these values
> > do
> > not contain the protected bits. This is due to CPU feature
> > identification being done at a later time, when the kernel properly
> > checks if it can enable protections.
> 
> Is hibernation supported?  How about suspend-to-RAM?
> 

Something is writing to CR4 during resume which is breaking
hibernation. Unfortunately I hadn't been able to get my hibernation
test working before sending this out. I will investigate.

> FWIW, I think that handling these details through Kconfig is the
> wrong
> choice.  Distribution kernels should enable this, and they're not
> going to turn off kexec.  Arguably kexec should be made to work --
> there is no fundamental reason that kexec should need to fiddle with
> CR0.WP, for example.  But a boot option could also work as a
> short-term option.

Given the situation with hibernation, I think I'll implement the kexec
discovery Liran suggested, and then investigate the hibernate situation
further.

Thanks,
John

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-23 14:30 ` Liran Alon
@ 2019-12-24 22:56   ` Liran Alon
  2019-12-25  2:04   ` Andy Lutomirski
  1 sibling, 0 replies; 22+ messages in thread
From: Liran Alon @ 2019-12-24 22:56 UTC (permalink / raw)
  To: Liran Alon, kernel-hardening
  Cc: John Andersen, Thomas Gleixner, mingo, bp, x86, pbonzini, hpa,
	sean.j.christopherson, vkuznets, wanpengli, jmattson, joro,
	linux-kernel, kvm

+kernel-hardening mailing list.

> On 23 Dec 2019, at 16:30, Liran Alon <liran.alon@oracle.com> wrote:
> 
> 
> 
>> On 20 Dec 2019, at 21:26, John Andersen <john.s.andersen@intel.com> wrote:
>> 
>> Paravirtualized Control Register pinning is a strengthened version of
>> existing protections on the Write Protect, Supervisor Mode Execution /
>> Access Protection, and User-Mode Instruction Prevention bits. The
>> existing protections prevent native_write_cr*() functions from writing
>> values which disable those bits. This patchset prevents any guest
>> writes to control registers from disabling pinned bits, not just writes
>> from native_write_cr*(). This stops attackers within the guest from
>> using ROP to disable protection bits.
>> 
>> https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/
>> 
>> The protection is implemented by adding MSRs to KVM which contain the
>> bits that are allowed to be pinned, and the bits which are pinned. The
>> guest or userspace can enable bit pinning by reading MSRs to check
>> which bits are allowed to be pinned, and then writing MSRs to set which
>> bits they want pinned.
>> 
>> Other hypervisors such as HyperV have implemented similar protections
>> for Control Registers and MSRs; which security researchers have found
>> effective.
>> 
>> https://www.abatchy.com/2018/01/kernel-exploitation-4
>> 
> 
> I think it’s important to mention how Hyper-V implements this protection as it is done in a very different architecture.
> 
> Hyper-V implements a set of PV APIs named VSM (Virtual Secure Mode) aimed to allow a guest (partition) to separate itself into multiple security domains called VTLs (Virtual Trust Levels).
> The VSM API exposes an interface to higher VTLs to control the execution of lower VTLs. In theory, VSM supports up to 16 VTLs, but Windows VBS (Virtualization Based Security), which is
> the only current technology utilising VSM, uses only 2 VTLs: VTL0 for most OS execution (Normal-Mode) and VTL1 for secure OS execution (Secure-Mode).
> 
> Higher VTL controls execution of lower VTL by the following VSM mechanisms:
> 1) Memory Access Protections: Allows higher VTL to restrict memory access to physical pages. Either making them inaccessible or limited to certain permissions.
> 2) Secure Intercepts: Allows a higher VTL to request hypervisor to intercept certain events in lower VTLs for handling by higher VTL. This includes access to system registers (e.g. CRs & MSRs).
> 
> VBS uses the above-mentioned mechanisms as follows:
> a) Credentials Guard: Prevents pass-the-hash attacks. Done by encrypting credentials using a VTL1 trustlet to encrypt them by an encryption-key stored in VTL1-only accessible memory.
> b) HVCI (Hypervisor-based Code-Integrity): Prevents execution of unsigned code. Done by marking all EPT entries with NX until signature verified by VTL1 service. Once verified, mark EPT entries as RO+X.
> (HVCI also supports enforcing code-signing only on Ring0 code efficiently by utilising Intel MBEC or AMD GMET CPU features. Which allows setting NX-bit on EPT entries based on guest CPL).
> c) KDP (Kernel Data Protection): Marks certain pages after initialisation as read-only on VTL0 EPT.
> d) kCFG (Kernel Control-Flow Guard): VTL1 protects the bitmap specifying valid indirect branch targets by marking it read-only in the VTL0 EPT.
> e) HyperGuard: VTL1 uses the “Secure Intercepts” mechanism to prevent VTL0 from modifying important system registers, including CR0 & CR4 as done by this patch.
>    HyperGuard also implements a mechanism named NPIEP (Non-Privileged Instruction Execution Prevention) that prevents VTL0 Ring3 from executing SIDT/SGDT/SLDT/STR to leak Ring0 addresses.
> 
> To sum up, in Hyper-V the hypervisor exposes a relatively thin API to allow a guest to partition itself into multiple security domains (enforced by virtualization).
> Using this framework, it’s possible to implement multiple OS-level protection mechanisms. Only one of them is pinning certain registers to specific values as done by this patch.
> 
> Therefore, as I also tried to say at the recent KVM Forum, I think KVM should consider exposing a VSM-like API to guests to allow various guest OSes,
> including Linux, to implement VBS-like features. To decide what this API should look like, we need to have a broader discussion with Linux
> security maintainers and KVM maintainers on which security features we would like to implement using such an API and what their architecture should be.
> Then, we can implement this API in KVM and start to gradually introduce more security features in Linux which utilise this API.
> 
> Once Linux will have security features implemented with this new KVM API, we could also consider implementing them on top of other similar hypervisor APIs
> such as Hyper-V VSM. To achieve, for example, Linux being more secure when running on Microsoft Azure compute instances.
> 
> Therefore, I see this patch as a short-term solution to quickly gain real security value on a very specific issue.
> But if we are serious about improving Linux security using Virtualization, we should have this more broad discussion.
> 
> -Liran
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-23 14:30 ` Liran Alon
  2019-12-24 22:56   ` Liran Alon
@ 2019-12-25  2:04   ` Andy Lutomirski
  2019-12-25 13:05     ` Liran Alon
  1 sibling, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2019-12-25  2:04 UTC (permalink / raw)
  To: Liran Alon
  Cc: John Andersen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	X86 ML, Paolo Bonzini, H. Peter Anvin, Christopherson, Sean J,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, LKML,
	kvm list

On Mon, Dec 23, 2019 at 6:31 AM Liran Alon <liran.alon@oracle.com> wrote:
>
>
>
> > On 20 Dec 2019, at 21:26, John Andersen <john.s.andersen@intel.com> wrote:
> >
> > Paravirtualized Control Register pinning is a strengthened version of
> > existing protections on the Write Protect, Supervisor Mode Execution /
> > Access Protection, and User-Mode Instruction Prevention bits. The
> > existing protections prevent native_write_cr*() functions from writing
> > values which disable those bits. This patchset prevents any guest
> > writes to control registers from disabling pinned bits, not just writes
> > from native_write_cr*(). This stops attackers within the guest from
> > using ROP to disable protection bits.
> >
> > https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/
> >
> > The protection is implemented by adding MSRs to KVM which contain the
> > bits that are allowed to be pinned, and the bits which are pinned. The
> > guest or userspace can enable bit pinning by reading MSRs to check
> > which bits are allowed to be pinned, and then writing MSRs to set which
> > bits they want pinned.
> >
> > Other hypervisors such as HyperV have implemented similar protections
> > for Control Registers and MSRs; which security researchers have found
> > effective.
> >
> > https://www.abatchy.com/2018/01/kernel-exploitation-4
> >
>
> I think it’s important to mention how Hyper-V implements this protection as it is done in a very different architecture.
>
> Hyper-V implements a set of PV APIs named VSM (Virtual Secure Mode) aimed to allow a guest (partition) to separate itself into multiple security domains called VTLs (Virtual Trust Levels).
> The VSM API exposes an interface to higher VTLs to control the execution of lower VTLs. In theory, VSM supports up to 16 VTLs, but Windows VBS (Virtualization Based Security), which is
> the only current technology utilising VSM, uses only 2 VTLs: VTL0 for most OS execution (Normal-Mode) and VTL1 for secure OS execution (Secure-Mode).
>
> Higher VTL controls execution of lower VTL by the following VSM mechanisms:
> 1) Memory Access Protections: Allows higher VTL to restrict memory access to physical pages. Either making them inaccessible or limited to certain permissions.
> 2) Secure Intercepts: Allows a higher VTL to request hypervisor to intercept certain events in lower VTLs for handling by higher VTL. This includes access to system registers (e.g. CRs & MSRs).
>
> VBS uses the above-mentioned mechanisms as follows:
> a) Credentials Guard: Prevents pass-the-hash attacks. Done by encrypting credentials using a VTL1 trustlet to encrypt them by an encryption-key stored in VTL1-only accessible memory.
> b) HVCI (Hypervisor-based Code-Integrity): Prevents execution of unsigned code. Done by marking all EPT entries with NX until signature verified by VTL1 service. Once verified, mark EPT entries as RO+X.
> (HVCI also supports enforcing code-signing only on Ring0 code efficiently by utilising Intel MBEC or AMD GMET CPU features. Which allows setting NX-bit on EPT entries based on guest CPL).
> c) KDP (Kernel Data Protection): Marks certain pages after initialisation as read-only on VTL0 EPT.
> d) kCFG (Kernel Control-Flow Guard): VTL1 protects the bitmap specifying valid indirect branch targets by marking it read-only in the VTL0 EPT.
> e) HyperGuard: VTL1 uses the “Secure Intercepts” mechanism to prevent VTL0 from modifying important system registers, including CR0 & CR4 as done by this patch.
>     HyperGuard also implements a mechanism named NPIEP (Non-Privileged Instruction Execution Prevention) that prevents VTL0 Ring3 from executing SIDT/SGDT/SLDT/STR to leak Ring0 addresses.
>
> To sum up, in Hyper-V the hypervisor exposes a relatively thin API to allow a guest to partition itself into multiple security domains (enforced by virtualization).
> Using this framework, it’s possible to implement multiple OS-level protection mechanisms. Only one of them is pinning certain registers to specific values as done by this patch.
>
> Therefore, as I also tried to say at the recent KVM Forum, I think KVM should consider exposing a VSM-like API to guests to allow various guest OSes,
> including Linux, to implement VBS-like features. To decide what this API should look like, we need to have a broader discussion with Linux
> security maintainers and KVM maintainers on which security features we would like to implement using such an API and what their architecture should be.
> Then, we can implement this API in KVM and start to gradually introduce more security features in Linux which utilise this API.

How about having KVM implement the VSM API directly?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning
  2019-12-25  2:04   ` Andy Lutomirski
@ 2019-12-25 13:05     ` Liran Alon
  0 siblings, 0 replies; 22+ messages in thread
From: Liran Alon @ 2019-12-25 13:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Andersen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	X86 ML, Paolo Bonzini, H. Peter Anvin, Christopherson, Sean J,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, LKML,
	kvm list



> On 25 Dec 2019, at 4:04, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Dec 23, 2019 at 6:31 AM Liran Alon <liran.alon@oracle.com> wrote:
>> 
>> 
>> 
>>> On 20 Dec 2019, at 21:26, John Andersen <john.s.andersen@intel.com> wrote:
>>> 
>>> Paravirtualized Control Register pinning is a strengthened version of
>>> existing protections on the Write Protect, Supervisor Mode Execution /
>>> Access Protection, and User-Mode Instruction Prevention bits. The
>>> existing protections prevent native_write_cr*() functions from writing
>>> values which disable those bits. This patchset prevents any guest
>>> writes to control registers from disabling pinned bits, not just writes
>>> from native_write_cr*(). This stops attackers within the guest from
>>> using ROP to disable protection bits.
>>> 
>>> https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/
>>> 
>>> The protection is implemented by adding MSRs to KVM which contain the
>>> bits that are allowed to be pinned, and the bits which are pinned. The
>>> guest or userspace can enable bit pinning by reading MSRs to check
>>> which bits are allowed to be pinned, and then writing MSRs to set which
>>> bits they want pinned.
>>> 
>>> Other hypervisors such as HyperV have implemented similar protections
>>> for Control Registers and MSRs; which security researchers have found
>>> effective.
>>> 
>>> https://www.abatchy.com/2018/01/kernel-exploitation-4
>>> 
>> 
>> I think it’s important to mention how Hyper-V implements this protection as it is done in a very different architecture.
>> 
>> Hyper-V implements a set of PV APIs named VSM (Virtual Secure Mode) aimed to allow a guest (partition) to separate itself into multiple security domains called VTLs (Virtual Trust Levels).
>> The VSM API exposes an interface to higher VTLs to control the execution of lower VTLs. In theory, VSM supports up to 16 VTLs, but Windows VBS (Virtualization Based Security), which is
>> the only current technology utilising VSM, uses only 2 VTLs: VTL0 for most OS execution (Normal-Mode) and VTL1 for secure OS execution (Secure-Mode).
>> 
>> Higher VTL controls execution of lower VTL by the following VSM mechanisms:
>> 1) Memory Access Protections: Allows higher VTL to restrict memory access to physical pages. Either making them inaccessible or limited to certain permissions.
>> 2) Secure Intercepts: Allows a higher VTL to request hypervisor to intercept certain events in lower VTLs for handling by higher VTL. This includes access to system registers (e.g. CRs & MSRs).
>> 
>> VBS uses the above-mentioned mechanisms as follows:
>> a) Credentials Guard: Prevents pass-the-hash attacks. Done by encrypting credentials using a VTL1 trustlet to encrypt them by an encryption-key stored in VTL1-only accessible memory.
>> b) HVCI (Hypervisor-based Code-Integrity): Prevents execution of unsigned code. Done by marking all EPT entries with NX until signature verified by VTL1 service. Once verified, mark EPT entries as RO+X.
>> (HVCI also supports enforcing code-signing only on Ring0 code efficiently by utilising Intel MBEC or AMD GMET CPU features. Which allows setting NX-bit on EPT entries based on guest CPL).
>> c) KDP (Kernel Data Protection): Marks certain pages after initialisation as read-only on VTL0 EPT.
>> d) kCFG (Kernel Control-Flow Guard): VTL1 protects the bitmap specifying valid indirect branch targets by marking it read-only in the VTL0 EPT.
>> e) HyperGuard: VTL1 uses the “Secure Intercepts” mechanism to prevent VTL0 from modifying important system registers, including CR0 & CR4 as done by this patch.
>>    HyperGuard also implements a mechanism named NPIEP (Non-Privileged Instruction Execution Prevention) that prevents VTL0 Ring3 from executing SIDT/SGDT/SLDT/STR to leak Ring0 addresses.
>> 
>> To sum up, in Hyper-V the hypervisor exposes a relatively thin API to allow a guest to partition itself into multiple security domains (enforced by virtualization).
>> Using this framework, it’s possible to implement multiple OS-level protection mechanisms. Only one of them is pinning certain registers to specific values as done by this patch.
>> 
>> Therefore, as I also tried to say at the recent KVM Forum, I think KVM should consider exposing a VSM-like API to guests to allow various guest OSes,
>> including Linux, to implement VBS-like features. To decide what this API should look like, we need to have a broader discussion with Linux
>> security maintainers and KVM maintainers on which security features we would like to implement using such an API and what their architecture should be.
>> Then, we can implement this API in KVM and start to gradually introduce more security features in Linux which utilise this API.
> 
> How about having KVM implement the VSM API directly?

The Hyper-V VSM API is tightly coupled to the rest of the Hyper-V PV interface. Therefore, KVM could only implement the VSM API as-is
as part of its Hyper-V PV interface emulation. Because we don’t wish to expose the Hyper-V PV interface by default
to all KVM guests, KVM should have its own variant providing similar capabilities.

In addition, in my opinion there are some bad design choices in the VSM API I haven’t mentioned in my previous message, which a KVM
VSM-like API would maybe want to do differently to avoid those mistakes. For example, the VSM API by design assumes that a given VTL
is superior and controls every aspect of all VTLs lower than it. In Windows VBS, this has caused securekernel (Sk) running in VTL1 to
be part of the TCB and therefore significantly enlarge it. In contrast, for example, QubesOS splits the OS into security-domains that
each have well-defined capabilities, but none has the full capabilities VTL1 has in VBS. Therefore, it preserves only the hypervisor
in the TCB, as it should.

Having said that, I am already working on a patch-series to enhance KVM’s Hyper-V PV interface implementation to also include VSM.
As I mentioned at the recent KVM Forum, I wish to do so so that modern Windows OSes with VBS support running on top of KVM
do not need to run Hyper-V inside the KVM guest (i.e. leading to a nested-virtualization workload). When Windows detects it’s already
running as a guest on top of Hyper-V with VSM support, it uses the underlying hypervisor’s VSM API to implement VBS without loading
Hyper-V inside the guest, therefore improving the performance & semantics of Windows VBS guests on top of KVM.

Note, though, that my work of implementing VSM in KVM’s Hyper-V PV interface implementation isn’t related to the discussion here.
Which is: how should Linux be modified to take advantage of a VSM-like API to implement security mitigation features such as those
I described above, which Windows VBS implements on top of such an API. Deciding on the design of those features will also guide
what the KVM PV VSM-like API we implement should look like.

-Liran


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-12-25 13:06 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-20 19:26 [RESEND RFC 0/2] Paravirtualized Control Register pinning John Andersen
2019-12-20 19:27 ` [RESEND RFC 1/2] KVM: X86: Add CR pin MSRs John Andersen
2019-12-20 19:27 ` [RESEND RFC 2/2] X86: Use KVM " John Andersen
2019-12-23  7:39   ` Andy Lutomirski
2019-12-23 12:06     ` Borislav Petkov
2019-12-24 21:18     ` Andersen, John S
2019-12-24  6:45   ` kbuild test robot
2019-12-21 13:59 ` [RESEND RFC 0/2] Paravirtualized Control Register pinning Paolo Bonzini
2019-12-23 17:28   ` Andersen, John S
2019-12-23 14:30 ` Liran Alon
2019-12-24 22:56   ` Liran Alon
2019-12-25  2:04   ` Andy Lutomirski
2019-12-25 13:05     ` Liran Alon
2019-12-23 14:48 ` Liran Alon
2019-12-23 17:09   ` Paolo Bonzini
2019-12-23 17:27     ` Andersen, John S
2019-12-23 17:28     ` Liran Alon
2019-12-23 17:46       ` Paolo Bonzini
2019-12-23 22:49         ` Liran Alon
2019-12-24 19:44   ` Andersen, John S
2019-12-24 20:35     ` Liran Alon
2019-12-24 21:17       ` Andersen, John S

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.