Re: [PATCH v2 3/3] kvm: x86: Emulate hypercall instead of fixing hypercall instruction

From: Sean Christopherson <seanjc@google.com>
To: Hou Wenlong <houwenlong93@linux.alibaba.com>
Cc: kvm@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)"
	<x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
	"open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)" 
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 3/3] kvm: x86: Emulate hypercall instead of fixing hypercall instruction
Date: Thu, 16 Sep 2021 16:00:17 +0000	[thread overview]
Message-ID: <YUNqEeWg32kNwfO8@google.com> (raw)
In-Reply-To: <7893d13e11648be0326834248fcb943088fb0b76.1631188011.git.houwenlong93@linux.alibaba.com>

On Thu, Sep 09, 2021, Hou Wenlong wrote:
> It is guest's resposibility to use right instruction for hypercall,
> hypervisor could emulate wrong instruction instead of modifying
> guest's instruction.
> 
> Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
> ---
> @@ -8747,16 +8747,15 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
>  
> -static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
> +static int emulator_hypercall(struct x86_emulate_ctxt *ctxt)
>  {
> +	int ret;
>  	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
> -	char instruction[3];
> -	unsigned long rip = kvm_rip_read(vcpu);
> -
> -	static_call(kvm_x86_patch_hypercall)(vcpu, instruction);
>  
> -	return emulator_write_emulated(ctxt, rip, instruction, 3,
> -		&ctxt->exception);
> +	ret = kvm_emulate_hypercall_noskip(vcpu);

I have mixed feelings on calling out of the emulator to do all this work.  One on
hand, I think it's a somewhat silly, arbitrary boundary.  On the other hand, reading
and writing GPRs directly means the emulation context holds stale data, especially
in the case where the hypercall triggers an exit to userspace.

> +	if (ret)
> +		return X86EMUL_CONTINUE;
> +	return X86EMUL_IO_NEEDED;

Unfortunately, simply returning X86EMUL_IO_NEEDED is not sufficient to handle the
KVM_HC_MAP_GPA_RANGE case.  x86_emulate_instruction() doesn't directly act on
X86EMUL_IO_NEEDED, because x86_emulate_insn() doesn't return X86EMUL_*, it returns
EMULATION_FAILED, EMULATION_OK, etc...  x86_emulate_instruction() instead looks
for other signals to differentiate between exception injection, PIO, MMIO, etc...

The IO_NEEDED path would also need to provide an alternative complete_userspace_io
callback, otherwise complete_hypercall_exit() will attempt to skip the instruction
using e.g. vmcs.VM_EXIT_INSTRUCTION_LEN and likely send the guest into the weeds.

In the prior patch, having kvm_emulate_hypercall_noskip() skip the CPL check is
definitely wrong, and skipping Xen/Hyper-V hypercall support is odd, though arguably
correct since Xen/Hyper-V hypercalls should never hit this path.

All of the above are solvable problems, but there is a non-trivial cost in doing
so, especially looking ahead to TDX support, which also needs/wants to split
kvm_emulate_hypercall() but in slightly different ways[*].

I 100% agree that patching the hypercall instruction is awful.  There are myriad
fatal issues with the current approach:

  1. Patches using an emulated guest write, which will fail if RIP is not mapped
     writable, and even injects a #PF into the guest on failure.

  2. Doesn't ensure the write is "atomic", e.g. a hypercall that splits a page
     boundary will be handled as two separate writes, which means that a partial,
     corrupted instruction can be observed by a separate vCPU.

  3. Doesn't serialize other CPU cores after updating the code stream.

  4. Completely fails to account for the case where KVM is emulating due to invalid
     guest state with unrestricted_guest=0.  Patching and retrying the instruction
     will result in vCPU getting stuck in an infinite loop.

But, the "support" _so_ awful, especially #1, that there's practically zero chance
that a modern guest kernel can rely on KVM to patch the guest.  E.g. patching fails
on modern Linux due to kernel code being mapped NX (barring admin override).  This
was addressed in the Linux guest back in 2014 by commit c1118b3602c2 ("x86: kvm: use
alternatives for VMCALL vs. VMMCALL if kernel text is read-only").

In other words, odds are very good that only old Linux guest kernels rely on KVM's
patching.  Because of that, my preference is to keep the patching, do our best to
make it suck a little less, and aim to completely remove the patching entirely at
some point in the future.

For #1, inject a #UD instead of #PF if the write fails.  The guest will still likely
die, but the failure signature is much friendlier to debuggers.

For #2 and #3, do nothing as fixing those is non-trivial.

For #4, inject a #UD if the hypercall instruction is already the "right" instruction.
I.e. retroactively define KVM's ABI to be that KVM hypercalls have undefined behavior
in Big Real Mode and other modes that trigger invalid guest state (since the hypercalls
will still work if unrestricted_guest=1).  This can't be an ABI breakage since it does
not work today and can't possibly have ever worked in the past.

I have mostly-complete to do the above (I ran afoul of the NX thing), I'll hopefully get
them out next week.

[*] https://lkml.kernel.org/r/9e1e66787c50232391e20cb4b3d1c5b249e3f910.1625186503.git.isaku.yamahata@intel.com