* [kvm PATCH v5 0/4] shrink vcpu_vmx down to order 2
@ 2018-10-31 13:26 Marc Orr
  2018-10-31 13:26 ` [kvm PATCH v5 1/4] kvm: x86: Use task structs fpu field for user Marc Orr
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Marc Orr @ 2018-10-31 13:26 UTC (permalink / raw)
  To: kvm, jmattson, rientjes, konrad.wilk, linux-mm, akpm, pbonzini,
	rkrcmar, willy, sean.j.christopherson, dave.hansen, kernellwp
  Cc: Marc Orr

Compared to the last version of these patches, I've acted on Dave
Hansen's suggestions to get rid of redundant fpu storage and move it out
of the kvm_vcpu_arch struct.

For now, I've left the vmalloc patches in the series, but we might end
up dropping them. Thus, I have not responded to Sean Christopherson's
review of those patches yet.

Marc Orr (4):
  kvm: x86: Use task structs fpu field for user
  kvm: x86: Dynamically allocate guest_fpu
  kvm: vmx: refactor vmx_msrs struct for vmalloc
  kvm: vmx: use vmalloc() to allocate vcpus

 arch/x86/include/asm/kvm_host.h |  10 ++--
 arch/x86/kvm/svm.c              |  10 ++++
 arch/x86/kvm/vmx.c              | 102 +++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c              |  49 ++++++++++-----
 virt/kvm/kvm_main.c             |  28 +++++----
 5 files changed, 159 insertions(+), 40 deletions(-)

-- 
2.19.1.568.g152ad8e336-goog


* [kvm PATCH v5 1/4] kvm: x86: Use task structs fpu field for user
  2018-10-31 13:26 [kvm PATCH v5 0/4] shrink vcpu_vmx down to order 2 Marc Orr
@ 2018-10-31 13:26 ` Marc Orr
  2018-10-31 13:26 ` [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu Marc Orr
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Marc Orr @ 2018-10-31 13:26 UTC (permalink / raw)
  To: kvm, jmattson, rientjes, konrad.wilk, linux-mm, akpm, pbonzini,
	rkrcmar, willy, sean.j.christopherson, dave.hansen, kernellwp
  Cc: Marc Orr

Previously, x86's instantiation of 'struct kvm_vcpu_arch' included a
user_fpu field, used to save and restore the FPU state of the userspace
side of KVM, which differs from the guest's FPU state. However, this
field is redundant with the 'struct fpu' field, named fpu, that is
already embedded in the task struct via the thread field. Thus, this
patch removes the user_fpu field from the kvm_vcpu_arch struct and uses
the task struct's fpu field in its place.

This change is significant because the fpu struct is actually quite
large. For example, on the system used to develop this patch, this
change reduces the size of the vcpu_vmx struct from 23680 bytes down to
19520 bytes, when building the kernel with kvmconfig. This reduction in
the size of the vcpu_vmx struct moves us closer to being able to
allocate the struct at order 2, rather than order 3.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Marc Orr <marcorr@google.com>
---
 arch/x86/include/asm/kvm_host.h | 7 +++----
 arch/x86/kvm/x86.c              | 4 ++--
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55e51ff7e421..ebb1d7a755d4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -601,16 +601,15 @@ struct kvm_vcpu_arch {
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
-	 * In vcpu_run, we switch between the user and guest FPU contexts.
-	 * While running a VCPU, the VCPU thread will have the guest FPU
-	 * context.
+	 * In vcpu_run, we switch between the user, maintained in the
+	 * task_struct struct, and guest FPU contexts. While running a VCPU,
+	 * the VCPU thread will have the guest FPU context.
 	 *
 	 * Note that while the PKRU state lives inside the fpu registers,
 	 * it is switched out separately at VMENTER and VMEXIT time. The
 	 * "guest_fpu" state here contains the guest FPU context, with the
 	 * host PRKU bits.
 	 */
-	struct fpu user_fpu;
 	struct fpu guest_fpu;
 
 	u64 xcr0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bdcb5babfb68..ff77514f7367 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7999,7 +7999,7 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
 static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 {
 	preempt_disable();
-	copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
+	copy_fpregs_to_fpstate(&current->thread.fpu);
 	/* PKRU is separately restored in kvm_x86_ops->run.  */
 	__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
 				~XFEATURE_MASK_PKRU);
@@ -8012,7 +8012,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
 	preempt_disable();
 	copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
-	copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
+	copy_kernel_to_fpregs(&current->thread.fpu.state);
 	preempt_enable();
 	++vcpu->stat.fpu_reload;
 	trace_kvm_fpu(0);
-- 
2.19.1.568.g152ad8e336-goog


* [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu
  2018-10-31 13:26 [kvm PATCH v5 0/4] shrink vcpu_vmx down to order 2 Marc Orr
  2018-10-31 13:26 ` [kvm PATCH v5 1/4] kvm: x86: Use task structs fpu field for user Marc Orr
@ 2018-10-31 13:26 ` Marc Orr
  2018-10-31 14:11   ` Dave Hansen
  2018-10-31 13:26 ` [kvm PATCH v5 3/4] kvm: vmx: refactor vmx_msrs struct for vmalloc Marc Orr
  2018-10-31 13:26 ` [kvm PATCH v5 4/4] kvm: vmx: use vmalloc() to allocate vcpus Marc Orr
  3 siblings, 1 reply; 15+ messages in thread
From: Marc Orr @ 2018-10-31 13:26 UTC (permalink / raw)
  To: kvm, jmattson, rientjes, konrad.wilk, linux-mm, akpm, pbonzini,
	rkrcmar, willy, sean.j.christopherson, dave.hansen, kernellwp
  Cc: Marc Orr

Previously, the guest_fpu field was embedded in the kvm_vcpu_arch
struct. Unfortunately, the field is quite large (e.g., 4352 bytes on my
current setup). This bloats the kvm_vcpu_arch struct for x86 into an
order 3 memory allocation, which can become a problem on overcommitted
machines. Thus, this patch moves the fpu state outside of the
kvm_vcpu_arch struct.

With this patch applied, the vcpu_vmx struct is reduced to 15168 bytes
on my setup when building the kernel with kvmconfig.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Marc Orr <marcorr@google.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/svm.c              | 10 ++++++++
 arch/x86/kvm/vmx.c              | 10 ++++++++
 arch/x86/kvm/x86.c              | 45 +++++++++++++++++++++++----------
 4 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ebb1d7a755d4..c8a2a263f91f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -610,7 +610,7 @@ struct kvm_vcpu_arch {
 	 * "guest_fpu" state here contains the guest FPU context, with the
 	 * host PRKU bits.
 	 */
-	struct fpu guest_fpu;
+	struct fpu *guest_fpu;
 
 	u64 xcr0;
 	u64 guest_supported_xcr0;
@@ -1194,6 +1194,7 @@ struct kvm_arch_async_pf {
 };
 
 extern struct kvm_x86_ops *kvm_x86_ops;
+extern struct kmem_cache *x86_fpu_cache;
 
 #define __KVM_HAVE_ARCH_VM_ALLOC
 static inline struct kvm *kvm_arch_alloc_vm(void)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f416f5c7f2ae..ac0c52ca22c6 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2121,6 +2121,13 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 		goto out;
 	}
 
+	svm->vcpu.arch.guest_fpu = kmem_cache_zalloc(x86_fpu_cache, GFP_KERNEL);
+	if (!svm->vcpu.arch.guest_fpu) {
+		printk(KERN_ERR "kvm: failed to allocate vcpu's fpu\n");
+		err = -ENOMEM;
+		goto free_partial_svm;
+	}
+
 	err = kvm_vcpu_init(&svm->vcpu, kvm, id);
 	if (err)
 		goto free_svm;
@@ -2180,6 +2187,8 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 uninit:
 	kvm_vcpu_uninit(&svm->vcpu);
 free_svm:
+	kmem_cache_free(x86_fpu_cache, svm->vcpu.arch.guest_fpu);
+free_partial_svm:
 	kmem_cache_free(kvm_vcpu_cache, svm);
 out:
 	return ERR_PTR(err);
@@ -2194,6 +2203,7 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
 	__free_page(virt_to_page(svm->nested.hsave));
 	__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
 	kvm_vcpu_uninit(vcpu);
+	kmem_cache_free(x86_fpu_cache, svm->vcpu.arch.guest_fpu);
 	kmem_cache_free(kvm_vcpu_cache, svm);
 	/*
 	 * The vmcb page can be recycled, causing a false negative in
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index abeeb45d1c33..4078cf15a4b0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -11476,6 +11476,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 	free_loaded_vmcs(vmx->loaded_vmcs);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
+	kmem_cache_free(x86_fpu_cache, vmx->vcpu.arch.guest_fpu);
 	kmem_cache_free(kvm_vcpu_cache, vmx);
 }
 
@@ -11489,6 +11490,13 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 	if (!vmx)
 		return ERR_PTR(-ENOMEM);
 
+	vmx->vcpu.arch.guest_fpu = kmem_cache_zalloc(x86_fpu_cache, GFP_KERNEL);
+	if (!vmx->vcpu.arch.guest_fpu) {
+		printk(KERN_ERR "kvm: failed to allocate vcpu's fpu\n");
+		err = -ENOMEM;
+		goto free_partial_vcpu;
+	}
+
 	vmx->vpid = allocate_vpid();
 
 	err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
@@ -11576,6 +11584,8 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 	kvm_vcpu_uninit(&vmx->vcpu);
 free_vcpu:
 	free_vpid(vmx->vpid);
+	kmem_cache_free(x86_fpu_cache, vmx->vcpu.arch.guest_fpu);
+free_partial_vcpu:
 	kmem_cache_free(kvm_vcpu_cache, vmx);
 	return ERR_PTR(err);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ff77514f7367..420516f0749a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -213,6 +213,9 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 
 u64 __read_mostly host_xcr0;
 
+struct kmem_cache *x86_fpu_cache;
+EXPORT_SYMBOL_GPL(x86_fpu_cache);
+
 static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt);
 
 static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
@@ -3635,7 +3638,7 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
 
 static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 {
-	struct xregs_state *xsave = &vcpu->arch.guest_fpu.state.xsave;
+	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
 	u64 xstate_bv = xsave->header.xfeatures;
 	u64 valid;
 
@@ -3677,7 +3680,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 
 static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 {
-	struct xregs_state *xsave = &vcpu->arch.guest_fpu.state.xsave;
+	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
 	u64 xstate_bv = *(u64 *)(src + XSAVE_HDR_OFFSET);
 	u64 valid;
 
@@ -3725,7 +3728,7 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
 		fill_xsave((u8 *) guest_xsave->region, vcpu);
 	} else {
 		memcpy(guest_xsave->region,
-			&vcpu->arch.guest_fpu.state.fxsave,
+			&vcpu->arch.guest_fpu->state.fxsave,
 			sizeof(struct fxregs_state));
 		*(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
 			XFEATURE_MASK_FPSSE;
@@ -3755,7 +3758,7 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 		if (xstate_bv & ~XFEATURE_MASK_FPSSE ||
 			mxcsr & ~mxcsr_feature_mask)
 			return -EINVAL;
-		memcpy(&vcpu->arch.guest_fpu.state.fxsave,
+		memcpy(&vcpu->arch.guest_fpu->state.fxsave,
 			guest_xsave->region, sizeof(struct fxregs_state));
 	}
 	return 0;
@@ -6819,10 +6822,23 @@ int kvm_arch_init(void *opaque)
 	}
 
 	r = -ENOMEM;
+	x86_fpu_cache = kmem_cache_create_usercopy(
+				"x86_fpu",
+				sizeof(struct fpu),
+				__alignof__(struct fpu),
+				SLAB_ACCOUNT,
+				offsetof(struct fpu, state),
+				sizeof_field(struct fpu, state),
+				NULL);
+	if (!x86_fpu_cache) {
+		printk(KERN_ERR "kvm: failed to allocate cache for x86 fpu\n");
+		goto out;
+	}
+
 	shared_msrs = alloc_percpu(struct kvm_shared_msrs);
 	if (!shared_msrs) {
 		printk(KERN_ERR "kvm: failed to allocate percpu kvm_shared_msrs\n");
-		goto out;
+		goto out_free_x86_fpu_cache;
 	}
 
 	r = kvm_mmu_module_init();
@@ -6855,6 +6871,8 @@ int kvm_arch_init(void *opaque)
 
 out_free_percpu:
 	free_percpu(shared_msrs);
+out_free_x86_fpu_cache:
+	kmem_cache_destroy(x86_fpu_cache);
 out:
 	return r;
 }
@@ -6878,6 +6896,7 @@ void kvm_arch_exit(void)
 	kvm_x86_ops = NULL;
 	kvm_mmu_module_exit();
 	free_percpu(shared_msrs);
+	kmem_cache_destroy(x86_fpu_cache);
 }
 
 int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
@@ -8001,7 +8020,7 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 	preempt_disable();
 	copy_fpregs_to_fpstate(&current->thread.fpu);
 	/* PKRU is separately restored in kvm_x86_ops->run.  */
-	__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
+	__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu->state,
 				~XFEATURE_MASK_PKRU);
 	preempt_enable();
 	trace_kvm_fpu(1);
@@ -8011,7 +8030,7 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
 	preempt_disable();
-	copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
+	copy_fpregs_to_fpstate(vcpu->arch.guest_fpu);
 	copy_kernel_to_fpregs(&current->thread.fpu.state);
 	preempt_enable();
 	++vcpu->stat.fpu_reload;
@@ -8506,7 +8525,7 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 
 	vcpu_load(vcpu);
 
-	fxsave = &vcpu->arch.guest_fpu.state.fxsave;
+	fxsave = &vcpu->arch.guest_fpu->state.fxsave;
 	memcpy(fpu->fpr, fxsave->st_space, 128);
 	fpu->fcw = fxsave->cwd;
 	fpu->fsw = fxsave->swd;
@@ -8526,7 +8545,7 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 
 	vcpu_load(vcpu);
 
-	fxsave = &vcpu->arch.guest_fpu.state.fxsave;
+	fxsave = &vcpu->arch.guest_fpu->state.fxsave;
 
 	memcpy(fxsave->st_space, fpu->fpr, 128);
 	fxsave->cwd = fpu->fcw;
@@ -8582,9 +8601,9 @@ static int sync_regs(struct kvm_vcpu *vcpu)
 
 static void fx_init(struct kvm_vcpu *vcpu)
 {
-	fpstate_init(&vcpu->arch.guest_fpu.state);
+	fpstate_init(&vcpu->arch.guest_fpu->state);
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		vcpu->arch.guest_fpu.state.xsave.header.xcomp_bv =
+		vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
 			host_xcr0 | XSTATE_COMPACTION_ENABLED;
 
 	/*
@@ -8708,11 +8727,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		 */
 		if (init_event)
 			kvm_put_guest_fpu(vcpu);
-		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
+		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
 					XFEATURE_MASK_BNDREGS);
 		if (mpx_state_buffer)
 			memset(mpx_state_buffer, 0, sizeof(struct mpx_bndreg_state));
-		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
+		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
 					XFEATURE_MASK_BNDCSR);
 		if (mpx_state_buffer)
 			memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
-- 
2.19.1.568.g152ad8e336-goog


* [kvm PATCH v5 3/4] kvm: vmx: refactor vmx_msrs struct for vmalloc
  2018-10-31 13:26 [kvm PATCH v5 0/4] shrink vcpu_vmx down to order 2 Marc Orr
  2018-10-31 13:26 ` [kvm PATCH v5 1/4] kvm: x86: Use task structs fpu field for user Marc Orr
  2018-10-31 13:26 ` [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu Marc Orr
@ 2018-10-31 13:26 ` Marc Orr
  2018-10-31 14:12   ` Dave Hansen
  2018-10-31 13:26 ` [kvm PATCH v5 4/4] kvm: vmx: use vmalloc() to allocate vcpus Marc Orr
  3 siblings, 1 reply; 15+ messages in thread
From: Marc Orr @ 2018-10-31 13:26 UTC (permalink / raw)
  To: kvm, jmattson, rientjes, konrad.wilk, linux-mm, akpm, pbonzini,
	rkrcmar, willy, sean.j.christopherson, dave.hansen, kernellwp
  Cc: Marc Orr

Previously, the vmx_msrs struct relied on being embedded in a struct
that is backed by the direct map (e.g., memory allocated with
kmalloc()). Specifically, this enabled the virtual addresses associated
with the struct to be translated to physical addresses. However, we'd
like to refactor the host struct, vcpu_vmx, to be allocated with
vmalloc(), so that allocation will succeed when contiguous physical
memory is scarce.

Thus, this patch refactors how vmx_msrs is declared and allocated, to
ensure that it can be mapped to the physical address space even when
vmx_msrs resides within a vmalloc()'d struct.

Signed-off-by: Marc Orr <marcorr@google.com>
---
 arch/x86/kvm/vmx.c | 57 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4078cf15a4b0..315cf4b5f262 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -970,8 +970,25 @@ static inline int pi_test_sn(struct pi_desc *pi_desc)
 
 struct vmx_msrs {
 	unsigned int		nr;
-	struct vmx_msr_entry	val[NR_AUTOLOAD_MSRS];
+	struct vmx_msr_entry	*val;
 };
+struct kmem_cache *vmx_msr_entry_cache;
+
+/*
+ * To prevent vmx_msr_entry array from crossing a page boundary, require:
+ * sizeof(*vmx_msrs.vmx_msr_entry.val) to be a power of two. This is guaranteed
+ * through compile-time asserts that:
+ *   - NR_AUTOLOAD_MSRS * sizeof(struct vmx_msr_entry) is a power of two
+ *   - NR_AUTOLOAD_MSRS * sizeof(struct vmx_msr_entry) <= PAGE_SIZE
+ *   - The allocation of vmx_msrs.vmx_msr_entry.val is aligned to its size.
+ */
+#define CHECK_POWER_OF_TWO(val) \
+	BUILD_BUG_ON_MSG(!((val) && !((val) & ((val) - 1))), \
+	#val " is not a power of two.")
+#define CHECK_INTRA_PAGE(val) do { \
+		CHECK_POWER_OF_TWO(val); \
+		BUILD_BUG_ON(!(val <= PAGE_SIZE)); \
+	} while (0)
 
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
@@ -11497,6 +11514,19 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 		goto free_partial_vcpu;
 	}
 
+	vmx->msr_autoload.guest.val =
+		kmem_cache_zalloc(vmx_msr_entry_cache, GFP_KERNEL);
+	if (!vmx->msr_autoload.guest.val) {
+		err = -ENOMEM;
+		goto free_fpu;
+	}
+	vmx->msr_autoload.host.val =
+		kmem_cache_zalloc(vmx_msr_entry_cache, GFP_KERNEL);
+	if (!vmx->msr_autoload.host.val) {
+		err = -ENOMEM;
+		goto free_msr_autoload_guest;
+	}
+
 	vmx->vpid = allocate_vpid();
 
 	err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
@@ -11584,6 +11614,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 	kvm_vcpu_uninit(&vmx->vcpu);
 free_vcpu:
 	free_vpid(vmx->vpid);
+	kmem_cache_free(vmx_msr_entry_cache, vmx->msr_autoload.host.val);
+free_msr_autoload_guest:
+	kmem_cache_free(vmx_msr_entry_cache, vmx->msr_autoload.guest.val);
+free_fpu:
 	kmem_cache_free(x86_fpu_cache, vmx->vcpu.arch.guest_fpu);
 free_partial_vcpu:
 	kmem_cache_free(kvm_vcpu_cache, vmx);
@@ -15163,6 +15197,10 @@ module_exit(vmx_exit);
 static int __init vmx_init(void)
 {
 	int r;
+	size_t vmx_msr_entry_size =
+		sizeof(struct vmx_msr_entry) * NR_AUTOLOAD_MSRS;
+
+	CHECK_INTRA_PAGE(vmx_msr_entry_size);
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	/*
@@ -15194,9 +15232,21 @@ static int __init vmx_init(void)
 #endif
 
 	r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
-		     __alignof__(struct vcpu_vmx), THIS_MODULE);
+		__alignof__(struct vcpu_vmx), THIS_MODULE);
 	if (r)
 		return r;
+	/*
+	 * A vmx_msr_entry array resides exclusively within the kernel. Thus,
+	 * use kmem_cache_create_usercopy(), with the usersize argument set to
+	 * ZERO, to blacklist copying vmx_msr_entry to/from user space.
+	 */
+	vmx_msr_entry_cache =
+		kmem_cache_create_usercopy("vmx_msr_entry", vmx_msr_entry_size,
+				  vmx_msr_entry_size, SLAB_ACCOUNT, 0, 0, NULL);
+	if (!vmx_msr_entry_cache) {
+		r = -ENOMEM;
+		goto out;
+	}
 
 	/*
 	 * Must be called after kvm_init() so enable_ept is properly set
@@ -15220,5 +15270,8 @@ static int __init vmx_init(void)
 	vmx_check_vmcs12_offsets();
 
 	return 0;
+out:
+	kvm_exit();
+	return r;
 }
 module_init(vmx_init);
-- 
2.19.1.568.g152ad8e336-goog


* [kvm PATCH v5 4/4] kvm: vmx: use vmalloc() to allocate vcpus
  2018-10-31 13:26 [kvm PATCH v5 0/4] shrink vcpu_vmx down to order 2 Marc Orr
                   ` (2 preceding siblings ...)
  2018-10-31 13:26 ` [kvm PATCH v5 3/4] kvm: vmx: refactor vmx_msrs struct for vmalloc Marc Orr
@ 2018-10-31 13:26 ` Marc Orr
  3 siblings, 0 replies; 15+ messages in thread
From: Marc Orr @ 2018-10-31 13:26 UTC (permalink / raw)
  To: kvm, jmattson, rientjes, konrad.wilk, linux-mm, akpm, pbonzini,
	rkrcmar, willy, sean.j.christopherson, dave.hansen, kernellwp
  Cc: Marc Orr

Previously, vcpus were allocated through the kmem_cache_zalloc() API,
which requires the underlying physical memory to be contiguous.
Because the x86 vcpu struct, struct vcpu_vmx, is relatively large
(e.g., currently 47680 bytes on my setup), it can become hard to find
contiguous memory.

At the same time, the comments in the code indicate that the primary
reason for using the kmem_cache_zalloc() API is to align the memory
rather than to provide physical contiguity.

Thus, this patch updates the vcpu allocation logic for vmx to use the
vmalloc() API.

Signed-off-by: Marc Orr <marcorr@google.com>
---
 arch/x86/kvm/vmx.c  | 37 ++++++++++++++++++++++++++++++-------
 virt/kvm/kvm_main.c | 28 ++++++++++++++++------------
 2 files changed, 46 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 315cf4b5f262..af651540ee45 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -898,7 +898,14 @@ struct nested_vmx {
 #define POSTED_INTR_ON  0
 #define POSTED_INTR_SN  1
 
-/* Posted-Interrupt Descriptor */
+/*
+ * Posted-Interrupt Descriptor
+ *
+ * Note, the physical address of this structure is used by VMX. Furthermore, the
+ * translation code assumes that the entire pi_desc struct resides within a
+ * single page, which will be true because the struct is 64 bytes and 64-byte
+ * aligned.
+ */
 struct pi_desc {
 	u32 pir[8];     /* Posted interrupt requested */
 	union {
@@ -6633,6 +6640,14 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
 	}
 
 	if (kvm_vcpu_apicv_active(&vmx->vcpu)) {
+		/*
+		 * Note, pi_desc is contained within a single
+		 * page because the struct is 64 bytes and 64-byte aligned.
+		 */
+		phys_addr_t pi_desc_phys =
+			page_to_phys(vmalloc_to_page(&vmx->pi_desc)) +
+			(u64)&vmx->pi_desc % PAGE_SIZE;
+
 		vmcs_write64(EOI_EXIT_BITMAP0, 0);
 		vmcs_write64(EOI_EXIT_BITMAP1, 0);
 		vmcs_write64(EOI_EXIT_BITMAP2, 0);
@@ -6641,7 +6656,7 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
 		vmcs_write16(GUEST_INTR_STATUS, 0);
 
 		vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR);
-		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
+		vmcs_write64(POSTED_INTR_DESC_ADDR, pi_desc_phys);
 	}
 
 	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
@@ -11494,13 +11509,18 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(x86_fpu_cache, vmx->vcpu.arch.guest_fpu);
-	kmem_cache_free(kvm_vcpu_cache, vmx);
+	kmem_cache_free(vmx_msr_entry_cache, vmx->msr_autoload.guest.val);
+	kmem_cache_free(vmx_msr_entry_cache, vmx->msr_autoload.host.val);
+	vfree(vmx);
 }
 
 static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 {
 	int err;
-	struct vcpu_vmx *vmx = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
+	struct vcpu_vmx *vmx =
+		__vmalloc(sizeof(struct vcpu_vmx),
+			  GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT,
+			  PAGE_KERNEL);
 	unsigned long *msr_bitmap;
 	int cpu;
 
@@ -11620,7 +11640,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 free_fpu:
 	kmem_cache_free(x86_fpu_cache, vmx->vcpu.arch.guest_fpu);
 free_partial_vcpu:
-	kmem_cache_free(kvm_vcpu_cache, vmx);
+	vfree(vmx);
 	return ERR_PTR(err);
 }
 
@@ -15231,8 +15251,11 @@ static int __init vmx_init(void)
 	}
 #endif
 
-	r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
-		__alignof__(struct vcpu_vmx), THIS_MODULE);
+	/*
+	 * Disable kmem cache; vmalloc will be used instead
+	 * to avoid OOM'ing when memory is available but not contiguous.
+	 */
+	r = kvm_init(&vmx_x86_ops, 0, 0, THIS_MODULE);
 	if (r)
 		return r;
 	/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 786ade1843a2..8b979e7c3ecd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4038,18 +4038,22 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 		goto out_free_2;
 	register_reboot_notifier(&kvm_reboot_notifier);
 
-	/* A kmem cache lets us meet the alignment requirements of fx_save. */
-	if (!vcpu_align)
-		vcpu_align = __alignof__(struct kvm_vcpu);
-	kvm_vcpu_cache =
-		kmem_cache_create_usercopy("kvm_vcpu", vcpu_size, vcpu_align,
-					   SLAB_ACCOUNT,
-					   offsetof(struct kvm_vcpu, arch),
-					   sizeof_field(struct kvm_vcpu, arch),
-					   NULL);
-	if (!kvm_vcpu_cache) {
-		r = -ENOMEM;
-		goto out_free_3;
+	/*
+	 * When vcpu_size is zero,
+	 * architecture-specific code manages its own vcpu allocation.
+	 */
+	kvm_vcpu_cache = NULL;
+	if (vcpu_size) {
+		if (!vcpu_align)
+			vcpu_align = __alignof__(struct kvm_vcpu);
+		kvm_vcpu_cache = kmem_cache_create_usercopy(
+			"kvm_vcpu", vcpu_size, vcpu_align, SLAB_ACCOUNT,
+			offsetof(struct kvm_vcpu, arch),
+			sizeof_field(struct kvm_vcpu, arch), NULL);
+		if (!kvm_vcpu_cache) {
+			r = -ENOMEM;
+			goto out_free_3;
+		}
 	}
 
 	r = kvm_async_pf_init();
-- 
2.19.1.568.g152ad8e336-goog


* Re: [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu
  2018-10-31 13:26 ` [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu Marc Orr
@ 2018-10-31 14:11   ` Dave Hansen
  2018-10-31 21:13     ` Marc Orr
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2018-10-31 14:11 UTC (permalink / raw)
  To: Marc Orr, kvm, jmattson, rientjes, konrad.wilk, linux-mm, akpm,
	pbonzini, rkrcmar, willy, sean.j.christopherson, dave.hansen,
	kernellwp

On 10/31/18 6:26 AM, Marc Orr wrote:
>  	r = -ENOMEM;
> +	x86_fpu_cache = kmem_cache_create_usercopy(
> +				"x86_fpu",
> +				sizeof(struct fpu),
> +				__alignof__(struct fpu),
> +				SLAB_ACCOUNT,
> +				offsetof(struct fpu, state),
> +				sizeof_field(struct fpu, state),
> +				NULL);

We should basically never be using sizeof(struct fpu), anywhere.  As you
saw, it's about a page in size, but the actual hardware FPU structure
can be as small as ~500 bytes or as big as ~3k.  Doing it this way is a
pretty unnecessary waste of memory because sizeof(struct fpu) is sized
for the worst-case (largest) possible XSAVE buffer that we support on
*any* CPU.  It will also get way worse if anyone ever throws a bunch
more state into the XSAVE area and we need to make it way bigger.

If you want a kmem cache for this, I'd suggest creating a cache which is
the size of the host XSAVE buffer.  That can be found in a variable
called 'fpu_kernel_xstate_size'.  I'm assuming here that the guest FPU
is going to support a strict subset of host kernel XSAVE states.

The other alternative is to calculate the actual size of the XSAVE
buffer that the guest needs.  You can do that from the values that KVM
sets to limit guest XCR0 values (the name of the control field is
escaping me at the moment).
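
For illustration, a rough sketch of the first option -- untested, and
assuming fpu_kernel_xstate_size is visible to KVM at this point and that
only the 'state' area needs to be whitelisted for usercopy:

	/*
	 * Sketch only: size the cache by the struct fpu header plus the
	 * host's XSAVE buffer size, rather than the worst-case
	 * sizeof(struct fpu).
	 */
	x86_fpu_cache = kmem_cache_create_usercopy(
				"x86_fpu",
				offsetof(struct fpu, state) + fpu_kernel_xstate_size,
				__alignof__(struct fpu),
				SLAB_ACCOUNT,
				offsetof(struct fpu, state),
				fpu_kernel_xstate_size,
				NULL);
	if (!x86_fpu_cache)
		printk(KERN_ERR "kvm: failed to allocate cache for x86 fpu\n");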


* Re: [kvm PATCH v5 3/4] kvm: vmx: refactor vmx_msrs struct for vmalloc
  2018-10-31 13:26 ` [kvm PATCH v5 3/4] kvm: vmx: refactor vmx_msrs struct for vmalloc Marc Orr
@ 2018-10-31 14:12   ` Dave Hansen
  2018-10-31 14:15     ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2018-10-31 14:12 UTC (permalink / raw)
  To: Marc Orr, kvm, jmattson, rientjes, konrad.wilk, linux-mm, akpm,
	pbonzini, rkrcmar, willy, sean.j.christopherson, dave.hansen,
	kernellwp

On 10/31/18 6:26 AM, Marc Orr wrote:
> +/*
> + * To prevent vmx_msr_entry array from crossing a page boundary, require:
> + * sizeof(*vmx_msrs.vmx_msr_entry.val) to be a power of two. This is guaranteed
> + * through compile-time asserts that:
> + *   - NR_AUTOLOAD_MSRS * sizeof(struct vmx_msr_entry) is a power of two
> + *   - NR_AUTOLOAD_MSRS * sizeof(struct vmx_msr_entry) <= PAGE_SIZE
> + *   - The allocation of vmx_msrs.vmx_msr_entry.val is aligned to its size.
> + */

Why do we need to prevent them from crossing a page boundary?


* Re: [kvm PATCH v5 3/4] kvm: vmx: refactor vmx_msrs struct for vmalloc
  2018-10-31 14:12   ` Dave Hansen
@ 2018-10-31 14:15     ` Sean Christopherson
  2018-10-31 14:19       ` Marc Orr
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2018-10-31 14:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Marc Orr, kvm, jmattson, rientjes, konrad.wilk, linux-mm, akpm,
	pbonzini, rkrcmar, willy, dave.hansen, kernellwp

On Wed, Oct 31, 2018 at 07:12:16AM -0700, Dave Hansen wrote:
> On 10/31/18 6:26 AM, Marc Orr wrote:
> > +/*
> > + * To prevent vmx_msr_entry array from crossing a page boundary, require:
> > + * sizeof(*vmx_msrs.vmx_msr_entry.val) to be a power of two. This is guaranteed
> > + * through compile-time asserts that:
> > + *   - NR_AUTOLOAD_MSRS * sizeof(struct vmx_msr_entry) is a power of two
> > + *   - NR_AUTOLOAD_MSRS * sizeof(struct vmx_msr_entry) <= PAGE_SIZE
> > + *   - The allocation of vmx_msrs.vmx_msr_entry.val is aligned to its size.
> > + */
> 
> Why do we need to prevent them from crossing a page boundary?

The VMCS takes the physical address of the load/store lists.  I
requested that this information be added to the changelog.  Marc
deferred addressing my comments since there's a decent chance
patches 3/4 and 4/4 will be dropped in the end.


* Re: [kvm PATCH v5 3/4] kvm: vmx: refactor vmx_msrs struct for vmalloc
  2018-10-31 14:15     ` Sean Christopherson
@ 2018-10-31 14:19       ` Marc Orr
  0 siblings, 0 replies; 15+ messages in thread
From: Marc Orr @ 2018-10-31 14:19 UTC (permalink / raw)
  To: sean.j.christopherson
  Cc: dave.hansen, kvm, Jim Mattson, David Rientjes,
	Konrad Rzeszutek Wilk, linux-mm, akpm, pbonzini, rkrcmar, willy,
	dave.hansen, Wanpeng Li

On Wed, Oct 31, 2018 at 7:15 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Wed, Oct 31, 2018 at 07:12:16AM -0700, Dave Hansen wrote:
> > On 10/31/18 6:26 AM, Marc Orr wrote:
> > > +/*
> > > + * To prevent vmx_msr_entry array from crossing a page boundary, require:
> > > + * sizeof(*vmx_msrs.vmx_msr_entry.val) to be a power of two. This is guaranteed
> > > + * through compile-time asserts that:
> > > + *   - NR_AUTOLOAD_MSRS * sizeof(struct vmx_msr_entry) is a power of two
> > > + *   - NR_AUTOLOAD_MSRS * sizeof(struct vmx_msr_entry) <= PAGE_SIZE
> > > + *   - The allocation of vmx_msrs.vmx_msr_entry.val is aligned to its size.
> > > + */
> >
> > Why do we need to prevent them from crossing a page boundary?
>
> The VMCS takes the physical address of the load/store lists.  I
> requested that this information be added to the changelog.  Marc
> deferred addressing my comments since there's a decent chance
> patches 3/4 and 4/4 will be dropped in the end.

Exactly. And the code (in these patches) that maps these virtual
addresses to physical addresses operates at page granularity, so it
will break for any object that crosses a page boundary.
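
For reference, the translation in these patches follows this general
pattern (a simplified sketch, not the exact code; 'obj' here is a
hypothetical pointer into a vmalloc()'d allocation):

	/* Sketch: translate a vmalloc()'d address to a physical address. */
	phys_addr_t phys = page_to_phys(vmalloc_to_page(obj)) +
			   offset_in_page(obj);

The result is only correct for bytes on the same page as 'obj', hence
the page-boundary restriction.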


* Re: [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu
  2018-10-31 14:11   ` Dave Hansen
@ 2018-10-31 21:13     ` Marc Orr
  2018-10-31 21:21       ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Marc Orr @ 2018-10-31 21:13 UTC (permalink / raw)
  To: dave.hansen
  Cc: kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk,
	linux-mm, akpm, pbonzini, rkrcmar, willy, sean.j.christopherson,
	dave.hansen, Wanpeng Li

> We should basically never be using sizeof(struct fpu), anywhere.  As you
> saw, it's about a page in size, but the actual hardware FPU structure
> can be as small as ~500 bytes or as big as ~3k.  Doing it this way is a
> pretty unnecessary waste of memory because sizeof(struct fpu) is sized
> for the worst-case (largest) possible XSAVE buffer that we support on
> *any* CPU.  It will also get way worse if anyone ever throws a bunch
> more state into the XSAVE area and we need to make it way bigger.
>
> If you want a kmem cache for this, I'd suggest creating a cache which is
> the size of the host XSAVE buffer.  That can be found in a variable
> called 'fpu_kernel_xstate_size'.  I'm assuming here that the guest FPU
> is going to support a strict subset of host kernel XSAVE states.


This suggestion sounds good. Though, I have one uncertainty. KVM
explicitly casts guest_fpu.state to an fxregs_state in a few places
(e.g., the ioctls). Yet, I see a code path in
fpu__init_system_xstate_size_legacy() that sets fpu_kernel_xstate_size
to sizeof(struct fregs_state). Will this cause problems? You mentioned
that the fpu's state field is expected to range from ~500 bytes to
~3k, which implies that it should never get set to sizeof(struct
fregs_state). But I want to confirm.

>
>
> The other alternative is to calculate the actual size of the XSAVE
> buffer that the guest needs.  You can do that from the values that KVM
> sets to limit guest XCR0 values (the name of the control field is
> escaping me at the moment).


* Re: [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu
  2018-10-31 21:13     ` Marc Orr
@ 2018-10-31 21:21       ` Dave Hansen
  2018-10-31 21:24         ` Marc Orr
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2018-10-31 21:21 UTC (permalink / raw)
  To: Marc Orr
  Cc: kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk,
	linux-mm, akpm, pbonzini, rkrcmar, willy, sean.j.christopherson,
	dave.hansen, Wanpeng Li

On 10/31/18 2:13 PM, Marc Orr wrote:
> KVM explicitly cast guest_fpu.state as a fxregs_state in a few
> places (e.g., the ioctls). Yet, I see a code path in 
> fpu__init_system_xstate_size_legacy() that sets
> fpu_kernel_xstate_size to sizeof(struct fregs_state). Will this cause
> problems? You mentioned that the fpu's state field is expected to
> range from ~500 bytes to ~3k, which implies that it should never get
> set to sizeof(struct fregs_state). But I want to confirm.

It can get set to sizeof(struct fregs_state) for systems where XSAVE is
not in use.  I was neglecting to mention those when I said the "~500
byte" number.

My point was that it can vary wildly and that any static allocation
scheme will waste lots of memory when we have small hardware-supported
buffers.


* Re: [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu
  2018-10-31 21:21       ` Dave Hansen
@ 2018-10-31 21:24         ` Marc Orr
  2018-10-31 21:30           ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Marc Orr @ 2018-10-31 21:24 UTC (permalink / raw)
  To: dave.hansen
  Cc: kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk,
	linux-mm, akpm, pbonzini, rkrcmar, willy, sean.j.christopherson,
	dave.hansen, Wanpeng Li

> It can get set to sizeof(struct fregs_state) for systems where XSAVE is
> not in use.  I was neglecting to mention those when I said the "~500
> byte" number.
>
> My point was that it can vary wildly and that any static allocation
> scheme will waste lots of memory when we have small hardware-supported
> buffers.

Got it. Then I think we need to set the size for the kmem cache to
max(fpu_kernel_xstate_size, sizeof(fxregs_state)), unless I'm missing
something. I'll send out a version of the patch that does this in a
bit. Thanks!


* Re: [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu
  2018-10-31 21:24         ` Marc Orr
@ 2018-10-31 21:30           ` Dave Hansen
  2018-10-31 21:39             ` Marc Orr
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2018-10-31 21:30 UTC (permalink / raw)
  To: Marc Orr
  Cc: kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk,
	linux-mm, akpm, pbonzini, rkrcmar, willy, sean.j.christopherson,
	dave.hansen, Wanpeng Li

On 10/31/18 2:24 PM, Marc Orr wrote:
>> It can get set to sizeof(struct fregs_state) for systems where XSAVE is
>> not in use.  I was neglecting to mention those when I said the "~500
>> byte" number.
>>
>> My point was that it can vary wildly and that any static allocation
>> scheme will waste lots of memory when we have small hardware-supported
>> buffers.
> 
> Got it. Then I think we need to set the size for the kmem cache to
> max(fpu_kernel_xstate_size, sizeof(fxregs_state)), unless I'm missing
> something. I'll send out a version of the patch that does this in a
> bit. Thanks!

Despite its name, fpu_kernel_xstate_size *should* always be the "size of
the hardware buffer we need to back 'struct fpu'".  That's true for all
of the various formats we support: XSAVE, fxregs, swregs, etc...

fpu__init_system_xstate_size_legacy() does that when XSAVE itself is not
in play.


* Re: [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu
  2018-10-31 21:30           ` Dave Hansen
@ 2018-10-31 21:39             ` Marc Orr
  2018-10-31 21:44               ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Marc Orr @ 2018-10-31 21:39 UTC (permalink / raw)
  To: dave.hansen
  Cc: kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk,
	linux-mm, akpm, pbonzini, rkrcmar, willy, sean.j.christopherson,
	dave.hansen, Wanpeng Li

On Wed, Oct 31, 2018 at 2:30 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/31/18 2:24 PM, Marc Orr wrote:
> >> It can get set to sizeof(struct fregs_state) for systems where XSAVE is
> >> not in use.  I was neglecting to mention those when I said the "~500
> >> byte" number.
> >>
> >> My point was that it can vary wildly and that any static allocation
> >> scheme will waste lots of memory when we have small hardware-supported
> >> buffers.
> >
> > Got it. Then I think we need to set the size for the kmem cache to
> > max(fpu_kernel_xstate_size, sizeof(fxregs_state)), unless I'm missing
> > something. I'll send out a version of the patch that does this in a
> > bit. Thanks!
>
> Despite its name, fpu_kernel_xstate_size *should* always be the "size of
> the hardware buffer we need to back 'struct fpu'".  That's true for all
> of the various formats we support: XSAVE, fxregs, swregs, etc...
>
> fpu__init_system_xstate_size_legacy() does that when XSAVE itself is not
> in play.

That makes sense. But my specific concern is the code I've copied
below, from arch/x86/kvm/x86.c. Notice that on a system where
guest_fpu.state is an fregs_state, this code would generate garbage for
some fields. With the new code we're talking about, it will cause
memory corruption. But maybe it's not possible to run this code on a
system with an fregs_state, because such systems would predate VMX?

int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
{
	struct fxregs_state *fxsave;

	vcpu_load(vcpu);

	fxsave = &vcpu->arch.guest_fpu->state.fxsave;
	memcpy(fpu->fpr, fxsave->st_space, 128);
	fpu->fcw = fxsave->cwd;
	fpu->fsw = fxsave->swd;
	fpu->ftwx = fxsave->twd;
	fpu->last_opcode = fxsave->fop;
	fpu->last_ip = fxsave->rip;
	fpu->last_dp = fxsave->rdp;
	memcpy(fpu->xmm, fxsave->xmm_space, sizeof fxsave->xmm_space);

	vcpu_put(vcpu);
	return 0;
}


* Re: [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu
  2018-10-31 21:39             ` Marc Orr
@ 2018-10-31 21:44               ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2018-10-31 21:44 UTC (permalink / raw)
  To: Marc Orr
  Cc: kvm, Jim Mattson, David Rientjes, Konrad Rzeszutek Wilk,
	linux-mm, akpm, pbonzini, rkrcmar, willy, sean.j.christopherson,
	dave.hansen, Wanpeng Li

On 10/31/18 2:39 PM, Marc Orr wrote:
> That makes sense. But my specific concern is the code I've copied
> below, from arch/x86/kvm/x86.c. Notice on a system where
> guest_fpu.state is a fregs_state, this code would generate garbage for
> some fields. With the new code we're talking about, it will cause
> memory corruption. But maybe it's not possible to run this code on a
> system with an fregs_state, because such systems would predate VMX?

Ahh, got it.

So, you *can* clear X86_FEATURE_* bits from the kernel command-line, so
it's theoretically possible to have a system that supports VMX, but
doesn't support a modern FPU.  It's obviously not well tested. :)

The KVM code you pasted, to be "correct", should probably be checking
X86_FEATURE_FXSR and X86_FEATURE_FPU *somewhere*.
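
For illustration, a rough sketch of such a check -- untested, and
assuming that rejecting the ioctl with -EINVAL is acceptable on systems
without FXSAVE-format state:

	/* Sketch: guard the fxsave accessors on hosts lacking FPU/FXSR. */
	if (!boot_cpu_has(X86_FEATURE_FPU) || !boot_cpu_has(X86_FEATURE_FXSR))
		return -EINVAL;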


end of thread

Thread overview: 15+ messages
2018-10-31 13:26 [kvm PATCH v5 0/4] shrink vcpu_vmx down to order 2 Marc Orr
2018-10-31 13:26 ` [kvm PATCH v5 1/4] kvm: x86: Use task structs fpu field for user Marc Orr
2018-10-31 13:26 ` [kvm PATCH v5 2/4] kvm: x86: Dynamically allocate guest_fpu Marc Orr
2018-10-31 14:11   ` Dave Hansen
2018-10-31 21:13     ` Marc Orr
2018-10-31 21:21       ` Dave Hansen
2018-10-31 21:24         ` Marc Orr
2018-10-31 21:30           ` Dave Hansen
2018-10-31 21:39             ` Marc Orr
2018-10-31 21:44               ` Dave Hansen
2018-10-31 13:26 ` [kvm PATCH v5 3/4] kvm: vmx: refactor vmx_msrs struct for vmalloc Marc Orr
2018-10-31 14:12   ` Dave Hansen
2018-10-31 14:15     ` Sean Christopherson
2018-10-31 14:19       ` Marc Orr
2018-10-31 13:26 ` [kvm PATCH v5 4/4] kvm: vmx: use vmalloc() to allocate vcpus Marc Orr
