Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system

From: Sean Christopherson <seanjc@google.com>
To: Andrei Vagin <avagin@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Wanpeng Li <wanpengli@tencent.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Jianfeng Tan <henry.tjf@antfin.com>,
	Adin Scannell <ascannell@google.com>,
	Konstantin Bogomolov <bogomolov@google.com>,
	Etienne Perot <eperot@google.com>,
	Andy Lutomirski <luto@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>
Subject: Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
Date: Tue, 26 Jul 2022 15:10:34 +0000
Message-ID: <YuAD6qY+F2nuGm62@google.com>
In-Reply-To: <CAEWA0a4hrRb5HYLqa1Q47=guY6TLsWSJ_zxNjOXXV2jCjUekUA@mail.gmail.com>

On Tue, Jul 26, 2022, Andrei Vagin wrote:
> On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > +x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.
> >
> > On Fri, Jul 22, 2022, Andrei Vagin wrote:
> > > Another option is the KVM platform. In this case, the Sentry (gVisor
> > > kernel) can run in a guest ring0 and create/manage multiple address
> > > spaces. Its performance is much better than the ptrace one, but it is
> > > still not great compared with the native performance. This change
> > > optimizes the most critical part, which is the syscall overhead.
> >
> > What exactly is the source of the syscall overhead,
> 
> Here are perf traces for two cases: when "guest" syscalls are executed via
> hypercalls and when syscalls are executed by the user-space VMM:
> https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e
> 
> And here are two tests that I use to collect these traces:
> https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99
> 
> If we compare these traces, we can find that in the second case, we spend extra
> time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put,
> syscall_exit_to_user_mode.

So of those, I think the only path a robust implementation can actually avoid,
without significantly whittling down the allowed set of syscalls, is
syscall_exit_to_user_mode().

The bulk of vcpu_put() is vmx_prepare_switch_to_host(), and KVM needs to run
through that before calling out of KVM.  E.g. arch_prctl(ARCH_GET_GS) will read the
wrong GS.base if MSR_KERNEL_GS_BASE isn't restored.  And that necessitates
calling vmx_prepare_switch_to_guest() when resuming the vCPU.
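
For reference, this is roughly what the userspace side of that example looks
like.  A minimal sketch, assuming x86-64 Linux with the uapi headers installed;
read_gs_base() and write_gs_base() are made-up helper names, not an existing
API.  The value returned by ARCH_GET_GS is only meaningful if the host's
MSR_KERNEL_GS_BASE has been restored by the time the syscall runs.

```c
#define _GNU_SOURCE
#include <asm/prctl.h>     /* ARCH_GET_GS, ARCH_SET_GS (x86-64 only) */
#include <sys/syscall.h>   /* SYS_arch_prctl */
#include <unistd.h>        /* syscall() */

/*
 * Hypothetical helper: store the calling thread's GS base in *base.
 * Returns 0 on success, -1 on error (errno set by syscall()).
 */
int read_gs_base(unsigned long *base)
{
	return (int)syscall(SYS_arch_prctl, ARCH_GET_GS, base);
}

/*
 * Hypothetical helper: set the calling thread's GS base.
 * Returns 0 on success, -1 on error (errno set by syscall()).
 */
int write_gs_base(unsigned long base)
{
	return (int)syscall(SYS_arch_prctl, ARCH_SET_GS, base);
}
```

If the syscall were issued while guest state is still loaded in
MSR_KERNEL_GS_BASE, a set followed by a get like this could hand back the
guest's value rather than the task's.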

FPU state, i.e. fpu_swap_kvm_fpstate(), is likely a similar story: there's bound
to be a syscall that accesses user FPU state and will do the wrong thing if guest
state is loaded.

For gVisor, that's all presumably a non-issue because it uses a small set of
syscalls (or has guest==host state?), but for a common KVM feature it's problematic.

> > and what alternatives have been explored?  Making arbitrary syscalls from
> > within KVM is mildly terrifying.
> 
> "mildly terrifying" is a good phrase in this case :). If I were in your place,
> I would think about it similarly.
> 
> I understand these concerns about calling syscalls from the KVM code, and this
> is why I hide this feature under a separate capability that can be enabled
> explicitly.
> 
> We can think about restricting the list of system calls that this hypercall can
> execute. In the user-space changes for gVisor, we have a list of system calls
> that are not executed via this hypercall.

Can you provide that list?

> But it has downsides:
> * Each sentry system call triggers the full exit to hr3.
> * Each vmenter/vmexit requires triggering a signal, which is expensive.

Can you explain this one?  I didn't quite follow what this is referring to.

> * It doesn't allow supporting Confidential Computing (SEV-ES/SGX). The Sentry
>   has to be fully enclosed in a VM to be able to support these technologies.

Speaking of SGX, this reminds me a lot of Graphene, SCONE, etc., which IIRC
tackled the "syscalls are crazy expensive" problem by using a message queue and
a dedicated task outside of the enclave to handle syscalls.  Would something like
that work, or is having to burn a pCPU (or more) to handle syscalls in the host a
non-starter?
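
To make the message-queue idea concrete, here is a rough single-producer,
single-consumer sketch.  Everything in it (struct sys_ring, ring_submit(),
ring_service_one(), ring_reap(), the ring size) is made up for illustration and
is not taken from Graphene, SCONE, or gVisor.  The submitter would be the
sandboxed kernel, and ring_service_one() is what the dedicated host task would
call in a loop, which is where the burned pCPU comes from.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <sys/syscall.h>   /* SYS_* numbers */
#include <unistd.h>        /* syscall() */

#define RING_SLOTS 8u      /* hypothetical size, power of two */

enum slot_state { SLOT_FREE, SLOT_PENDING, SLOT_DONE };

struct sys_req {
	_Atomic int state;   /* enum slot_state; zero-init means SLOT_FREE */
	long nr;             /* syscall number */
	long args[6];        /* syscall arguments */
	long ret;            /* raw return value of syscall() */
};

struct sys_ring {
	struct sys_req slots[RING_SLOTS];
	unsigned int head;   /* next slot the submitter fills */
	unsigned int tail;   /* next slot the handler services */
};

/* Submitter side: queue a request.  Returns the slot index, or -1 if full. */
int ring_submit(struct sys_ring *r, long nr, const long args[6])
{
	unsigned int idx = r->head % RING_SLOTS;
	struct sys_req *req = &r->slots[idx];

	if (atomic_load_explicit(&req->state, memory_order_acquire) != SLOT_FREE)
		return -1;

	req->nr = nr;
	for (int i = 0; i < 6; i++)
		req->args[i] = args[i];

	/* Release: nr/args must be visible before the handler sees PENDING. */
	atomic_store_explicit(&req->state, SLOT_PENDING, memory_order_release);
	r->head++;
	return (int)idx;
}

/* Handler side: execute one pending request.  Returns 1 if work was done. */
int ring_service_one(struct sys_ring *r)
{
	struct sys_req *req = &r->slots[r->tail % RING_SLOTS];

	if (atomic_load_explicit(&req->state, memory_order_acquire) != SLOT_PENDING)
		return 0;

	/* syscall() returns -1 and sets errno on failure; only ret is kept. */
	req->ret = syscall(req->nr, req->args[0], req->args[1], req->args[2],
			   req->args[3], req->args[4], req->args[5]);

	/* Release: ret must be visible before the submitter sees DONE. */
	atomic_store_explicit(&req->state, SLOT_DONE, memory_order_release);
	r->tail++;
	return 1;
}

/* Submitter side: collect a result.  Returns 1 and frees the slot if done. */
int ring_reap(struct sys_ring *r, int slot, long *ret)
{
	struct sys_req *req = &r->slots[slot];

	if (atomic_load_explicit(&req->state, memory_order_acquire) != SLOT_DONE)
		return 0;

	*ret = req->ret;
	atomic_store_explicit(&req->state, SLOT_FREE, memory_order_release);
	return 1;
}
```

The submitter never leaves its own context to issue the syscall, it only polls
for SLOT_DONE, so the cost shifts from exits to the handler task spinning (or
being woken) on the host.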