Re: [PATCH v3 1/3] KVM: SVM: use vmsave/vmload for saving/restoring additional host state

From: Tom Lendacky <thomas.lendacky@amd.com>
To: Michael Roth <michael.roth@amd.com>, kvm@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 1/3] KVM: SVM: use vmsave/vmload for saving/restoring additional host state
Date: Thu, 7 Jan 2021 09:39:54 -0600	[thread overview]
Message-ID: <ec0a6376-9879-2277-9943-0d73339aadbb@amd.com> (raw)
In-Reply-To: <20210105143749.557054-2-michael.roth@amd.com>

On 1/5/21 8:37 AM, Michael Roth wrote:
> Using a guest workload which simply issues 'hlt' in a tight loop to
> generate VMEXITs, it was observed (on a recent EPYC processor) that a
> significant amount of the VMEXIT overhead measured on the host was the
> result of MSR reads/writes in svm_vcpu_load/svm_vcpu_put according to
> perf:
> 
>    67.49%--kvm_arch_vcpu_ioctl_run
>            |
>            |--23.13%--vcpu_put
>            |          kvm_arch_vcpu_put
>            |          |
>            |          |--21.31%--native_write_msr
>            |          |
>            |           --1.27%--svm_set_cr4
>            |
>            |--16.11%--vcpu_load
>            |          |
>            |           --15.58%--kvm_arch_vcpu_load
>            |                     |
>            |                     |--13.97%--svm_set_cr4
>            |                     |          |
>            |                     |          |--12.64%--native_read_msr
> 
> Most of these MSRs relate to 'syscall'/'sysenter' and segment bases, and
> can be saved/restored using 'vmsave'/'vmload' instructions rather than
> explicit MSR reads/writes. In doing so there is a significant reduction
> in the svm_vcpu_load/svm_vcpu_put overhead measured for the above
> workload:
> 
>    50.92%--kvm_arch_vcpu_ioctl_run
>            |
>            |--19.28%--disable_nmi_singlestep
>            |
>            |--13.68%--vcpu_load
>            |          kvm_arch_vcpu_load
>            |          |
>            |          |--9.19%--svm_set_cr4
>            |          |          |
>            |          |           --6.44%--native_read_msr
>            |          |
>            |           --3.55%--native_write_msr
>            |
>            |--6.05%--kvm_inject_nmi
>            |--2.80%--kvm_sev_es_mmio_read
>            |--2.19%--vcpu_put
>            |          |
>            |           --1.25%--kvm_arch_vcpu_put
>            |                     native_write_msr
> 
> Quantifying this further, if we look at the raw cycle counts for a
> normal iteration of the above workload (according to 'rdtscp'),
> kvm_arch_vcpu_ioctl_run() takes ~4600 cycles from start to finish with
> the current behavior. Using 'vmsave'/'vmload', this is reduced to
> ~2800 cycles, a savings of 39%.
> 
> While this approach doesn't seem to manifest in any noticeable
> improvement for more realistic workloads like UnixBench, netperf, and
> kernel builds, likely due to their exit paths generally involving IO
> with comparatively high latencies, it does improve overall overhead
> of KVM_RUN significantly, which may still be noticeable for certain
> situations. It also simplifies some aspects of the code.
> 
> With this change, explicit save/restore is no longer needed for the
> following host MSRs, since they are documented[1] as being part of the
> VMCB State Save Area:
> 
>    MSR_STAR, MSR_LSTAR, MSR_CSTAR,
>    MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE,
>    MSR_IA32_SYSENTER_CS,
>    MSR_IA32_SYSENTER_ESP,
>    MSR_IA32_SYSENTER_EIP,
>    MSR_FS_BASE, MSR_GS_BASE
> 
> and only the following MSR needs individual handling in
> svm_vcpu_put/svm_vcpu_load:
> 
>    MSR_TSC_AUX
> 
> We could drop the host_save_user_msrs array/loop and instead handle
> MSR read/write of MSR_TSC_AUX directly, but we leave that for now as
> a potential follow-up.
> 
> Since 'vmsave'/'vmload' also handles the LDTR and FS/GS segment
> registers (and associated hidden state)[2], some of the code
> previously used to handle this is no longer needed, so we drop it
> as well.
> 
> The first public release of the SVM spec[3] also documents the same
> handling for the host state in question, so we make these changes
> unconditionally.
> 
> Also worth noting is that we 'vmsave' to the same page that is
> subsequently used by 'vmrun' to record some host additional state. This
> is okay, since, in accordance with the spec[2], the additional state
> written to the page by 'vmrun' does not overwrite any fields written by
> 'vmsave'. This has also been confirmed through testing (for the above
> CPU, at least).
> 
> [1] AMD64 Architecture Programmer's Manual, Rev 3.33, Volume 2, Appendix B, Table B-2
> [2] AMD64 Architecture Programmer's Manual, Rev 3.31, Volume 3, Chapter 4, VMSAVE/VMLOAD
> [3] Secure Virtual Machine Architecture Reference Manual, Rev 3.01
> 
> Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> ---
>   arch/x86/kvm/svm/svm.c     | 36 +++++++-----------------------------
>   arch/x86/kvm/svm/svm.h     | 19 +------------------
>   arch/x86/kvm/svm/vmenter.S | 10 ++++++++++
>   3 files changed, 18 insertions(+), 47 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 941e5251e13f..7a7e9b7d47a7 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1420,16 +1420,12 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	if (sev_es_guest(svm->vcpu.kvm)) {
>   		sev_es_vcpu_load(svm, cpu);
>   	} else {
> -#ifdef CONFIG_X86_64
> -		rdmsrl(MSR_GS_BASE, to_svm(vcpu)->host.gs_base);
> -#endif
> -		savesegment(fs, svm->host.fs);
> -		savesegment(gs, svm->host.gs);
> -		svm->host.ldt = kvm_read_ldt();
> -
>   		for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
>   			rdmsrl(host_save_user_msrs[i].index,
>   			       svm->host_user_msrs[i]);
> +
> +		asm volatile(__ex("vmsave %%"_ASM_AX)
> +			     : : "a" (page_to_phys(sd->save_area)) : "memory");
>   	}
>   
>   	if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) {
> @@ -1461,17 +1457,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
>   	if (sev_es_guest(svm->vcpu.kvm)) {
>   		sev_es_vcpu_put(svm);
>   	} else {
> -		kvm_load_ldt(svm->host.ldt);
> -#ifdef CONFIG_X86_64
> -		loadsegment(fs, svm->host.fs);
> -		wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gsbase);
> -		load_gs_index(svm->host.gs);
> -#else
> -#ifdef CONFIG_X86_32_LAZY_GS
> -		loadsegment(gs, svm->host.gs);
> -#endif
> -#endif
> -
>   		for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
>   			wrmsrl(host_save_user_msrs[i].index,
>   			       svm->host_user_msrs[i]);
> @@ -3675,7 +3660,7 @@ static fastpath_t svm_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
>   	return EXIT_FASTPATH_NONE;
>   }
>   
> -void __svm_vcpu_run(unsigned long vmcb_pa, unsigned long *regs);
> +void __svm_vcpu_run(unsigned long vmcb_pa, unsigned long *regs, unsigned long hostsa_pa);

There was a follow on fix patch to remove this forward declaration since, 
for SEV-ES, I had moved it into svm.h without deleting it here. I'm not 
sure when it will hit Paolo's tree.

Thanks,
Tom

>   
>   static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>   					struct vcpu_svm *svm)
> @@ -3703,16 +3688,9 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>   	if (sev_es_guest(svm->vcpu.kvm)) {
>   		__svm_sev_es_vcpu_run(svm->vmcb_pa);
>   	} else {
> -		__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs);
> -
> -#ifdef CONFIG_X86_64
> -		native_wrmsrl(MSR_GS_BASE, svm->host.gs_base);
> -#else
> -		loadsegment(fs, svm->host.fs);
> -#ifndef CONFIG_X86_32_LAZY_GS
> -		loadsegment(gs, svm->host.gs);
> -#endif
> -#endif
> +		__svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&svm->vcpu.arch.regs,
> +			       page_to_phys(per_cpu(svm_data,
> +						    vcpu->cpu)->save_area));
>   	}
>   
>   	/*
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 5431e6335e2e..1f4460508036 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -27,17 +27,6 @@ static const struct svm_host_save_msrs {
>   	u32 index;		/* Index of the MSR */
>   	bool sev_es_restored;	/* True if MSR is restored on SEV-ES VMEXIT */
>   } host_save_user_msrs[] = {
> -#ifdef CONFIG_X86_64
> -	{ .index = MSR_STAR,			.sev_es_restored = true },
> -	{ .index = MSR_LSTAR,			.sev_es_restored = true },
> -	{ .index = MSR_CSTAR,			.sev_es_restored = true },
> -	{ .index = MSR_SYSCALL_MASK,		.sev_es_restored = true },
> -	{ .index = MSR_KERNEL_GS_BASE,		.sev_es_restored = true },
> -	{ .index = MSR_FS_BASE,			.sev_es_restored = true },
> -#endif
> -	{ .index = MSR_IA32_SYSENTER_CS,	.sev_es_restored = true },
> -	{ .index = MSR_IA32_SYSENTER_ESP,	.sev_es_restored = true },
> -	{ .index = MSR_IA32_SYSENTER_EIP,	.sev_es_restored = true },
>   	{ .index = MSR_TSC_AUX,			.sev_es_restored = false },
>   };
>   #define NR_HOST_SAVE_USER_MSRS ARRAY_SIZE(host_save_user_msrs)
> @@ -130,12 +119,6 @@ struct vcpu_svm {
>   	u64 next_rip;
>   
>   	u64 host_user_msrs[NR_HOST_SAVE_USER_MSRS];
> -	struct {
> -		u16 fs;
> -		u16 gs;
> -		u16 ldt;
> -		u64 gs_base;
> -	} host;
>   
>   	u64 spec_ctrl;
>   	/*
> @@ -595,6 +578,6 @@ void sev_es_vcpu_put(struct vcpu_svm *svm);
>   /* vmenter.S */
>   
>   void __svm_sev_es_vcpu_run(unsigned long vmcb_pa);
> -void __svm_vcpu_run(unsigned long vmcb_pa, unsigned long *regs);
> +void __svm_vcpu_run(unsigned long vmcb_pa, unsigned long *regs, unsigned long hostsa_pa);
>   
>   #endif
> diff --git a/arch/x86/kvm/svm/vmenter.S b/arch/x86/kvm/svm/vmenter.S
> index 6feb8c08f45a..89f4e8e7bf0e 100644
> --- a/arch/x86/kvm/svm/vmenter.S
> +++ b/arch/x86/kvm/svm/vmenter.S
> @@ -33,6 +33,7 @@
>    * __svm_vcpu_run - Run a vCPU via a transition to SVM guest mode
>    * @vmcb_pa:	unsigned long
>    * @regs:	unsigned long * (to guest registers)
> + * @hostsa_pa:	unsigned long
>    */
>   SYM_FUNC_START(__svm_vcpu_run)
>   	push %_ASM_BP
> @@ -47,6 +48,9 @@ SYM_FUNC_START(__svm_vcpu_run)
>   #endif
>   	push %_ASM_BX
>   
> +	/* Save @hostsa_pa */
> +	push %_ASM_ARG3
> +
>   	/* Save @regs. */
>   	push %_ASM_ARG2
>   
> @@ -154,6 +158,12 @@ SYM_FUNC_START(__svm_vcpu_run)
>   	xor %r15d, %r15d
>   #endif
>   
> +	/* "POP" @hostsa_pa to RAX. */
> +	pop %_ASM_AX
> +
> +	/* Restore host user state and FS/GS base */
> +	vmload %_ASM_AX
> +
>   	pop %_ASM_BX
>   
>   #ifdef CONFIG_X86_64
>