From: Liran Alon <liran.alon@oracle.com>
To: Sean Christopherson <sean.j.christopherson@intel.com>,
	Forrest Yuan Yu <yuanyu@google.com>
Cc: kvm@vger.kernel.org
Subject: Re: [PATCH RFC 1/1] KVM: x86: add KVM_HC_UCALL hypercall
Date: Sat, 2 May 2020 04:05:06 +0300	[thread overview]
Message-ID: <49fea649-9376-f8f8-1718-72672926e1bf@oracle.com> (raw)
In-Reply-To: <20200501204552.GD4760@linux.intel.com>


On 01/05/2020 23:45, Sean Christopherson wrote:
> On Fri, May 01, 2020 at 11:51:47AM -0700, Forrest Yuan Yu wrote:
>> The purpose of this new hypercall is to exchange message between
>> guest and hypervisor. For example, a guest may want to ask hypervisor
>> to harden security by setting restricted access permission on guest
>> SLAT entry. In this case, the guest can use this hypercall to send
>>
>> a message to the hypervisor which will do its job and send back
>> anything it wants the guest to know.
> Hrm, so this reintroduces KVM_EXIT_HYPERCALL without justifying _why_ it
> needs to be reintroduced.  I'm not familiar with the history, but the
> comments in the documentation advise "use KVM_EXIT_IO or KVM_EXIT_MMIO".
Both of these options have the disadvantage of requiring instruction 
emulation (although a trivial one for PIO), which enlarges the host 
attack surface.
I think this is one of the reasons why Hyper-V defined the doorbell kick 
of their PV devices (e.g. NetVSC/StorVSC) as a hypercall instead of 
PIO/MMIO.
(This is currently not very relevant, as KVM's instruction emulator is 
not optional and is not even offloaded to host userspace. But it is 
relevant for the future...)
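
To illustrate why PIO emulation is comparatively trivial: the OUT family is a handful of single-byte opcodes, with the port taken either from an imm8 or from DX. A hypothetical sketch of a decoder for just the doorbell-kick case (this is illustrative only and is not KVM's actual emulator; the struct and function names are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: decoding the OUT family is nearly table-free.
 * Opcodes: 0xE6 OUT imm8, AL;  0xE7 OUT imm8, eAX;
 *          0xEE OUT DX,  AL;   0xEF OUT DX,  eAX. */
struct pio_op {
	bool     valid;
	bool     port_in_dx;  /* port taken from DX vs. imm8 */
	uint8_t  size;        /* access size in bytes */
	uint8_t  len;         /* instruction length */
	uint16_t imm_port;    /* valid only when !port_in_dx */
};

static struct pio_op decode_out(const uint8_t *insn)
{
	struct pio_op op = { 0 };

	switch (insn[0]) {
	case 0xE6: op = (struct pio_op){ true, false, 1, 2, insn[1] }; break;
	case 0xE7: op = (struct pio_op){ true, false, 4, 2, insn[1] }; break;
	case 0xEE: op = (struct pio_op){ true, true,  1, 1, 0 }; break;
	case 0xEF: op = (struct pio_op){ true, true,  4, 1, 0 }; break;
	default: break; /* anything else needs the full emulator */
	}
	return op;
}
```

Contrast with MMIO, where the faulting instruction can be almost any load/store form, prefixes and all, which is where the real attack surface lives.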
>
> Off the top of my head, IO and/or MMIO has a few advantages:
>
>    - Allows the guest kernel to delegate permissions to guest userspace,
>      whereas KVM restricts hypercalls to CPL0.
>    - Allows "pass-through", whereas VMCALL is unconditionally forwarded to
>      L1.
>    - Is vendor agnostic, e.g. VMX and SVM recognized different opcodes for
>      VMCALL vs VMMCALL.
I agree with all the above (I believe similar rationale led VMware to 
design their Backdoor PIO interface).

Also note that AWS recently introduced the Nitro Enclave PV device, which 
is also de-facto a PV control-plane interface between guest and host 
userspace.
Why couldn't a similar approach have been used here?
(The capability is exposed on a per-VM basis by attaching the PV device 
to the VM, the communication interface is device-specific, and no KVM 
changes are needed, only host userspace changes.)
>   
>> Signed-off-by: Forrest Yuan Yu <yuanyu@google.com>
>> ---
>> diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
>> index 01b081f6e7ea..ff313f6827bf 100644
>> --- a/Documentation/virt/kvm/cpuid.rst
>> +++ b/Documentation/virt/kvm/cpuid.rst
>> @@ -86,6 +86,9 @@ KVM_FEATURE_PV_SCHED_YIELD        13          guest checks this feature bit
>>                                                 before using paravirtualized
>>                                                 sched yield.
>>   
>> +KVM_FEATURE_UCALL                 14          guest checks this feature bit
>> +                                              before calling hypercall ucall.
> Why make the UCALL a KVM CPUID feature?  I can understand wanting to query
> KVM support from host userspace, but presumably the guest will care about
> capabilities built on top of the UCALL, not the UCALL itself.
I agree with this.
In the PV device approach, device detection by the guest serves as the 
capability discovery.
>
>> +
>>   KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24          host will warn if no guest-side
>>                                                 per-cpu warps are expeced in
>>                                                 kvmclock
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index c5835f9cb9ad..388a4f89464d 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3385,6 +3385,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>   	case KVM_CAP_GET_MSR_FEATURES:
>>   	case KVM_CAP_MSR_PLATFORM_INFO:
>>   	case KVM_CAP_EXCEPTION_PAYLOAD:
>> +	case KVM_CAP_UCALL:
> For whatever reason I have a mental block with UCALL, can we call this
> KVM_CAP_USERSPACE_HYPERCALL?
+1
>
>>   		r = 1;
>>   		break;
>>   	case KVM_CAP_SYNC_REGS:
>> @@ -4895,6 +4896,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>   		kvm->arch.exception_payload_enabled = cap->args[0];
>>   		r = 0;
>>   		break;
>> +	case KVM_CAP_UCALL:
>> +		kvm->arch.hypercall_ucall_enabled = cap->args[0];
>> +		r = 0;
>> +		break;
>>   	default:
>>   		r = -EINVAL;
>>   		break;
>> @@ -7554,6 +7559,19 @@ static void kvm_sched_yield(struct kvm *kvm, unsigned long dest_id)
>>   		kvm_vcpu_yield_to(target);
>>   }
>>   
>> +static int complete_hypercall(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 ret = vcpu->run->hypercall.ret;
>> +
>> +	if (!is_64_bit_mode(vcpu))
> Do we really anticipate adding support in 32-bit guests?  Honest question.
>
>> +		ret = (u32)ret;
>> +	kvm_rax_write(vcpu, ret);
>> +
>> +	++vcpu->stat.hypercalls;
>> +
>> +	return kvm_skip_emulated_instruction(vcpu);
>> +}
>> +
>>   int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>>   {
>>   	unsigned long nr, a0, a1, a2, a3, ret;
>> @@ -7605,17 +7623,26 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>>   		kvm_sched_yield(vcpu->kvm, a0);
>>   		ret = 0;
>>   		break;
>> +	case KVM_HC_UCALL:
>> +		if (vcpu->kvm->arch.hypercall_ucall_enabled) {
>> +			vcpu->run->hypercall.nr = nr;
>> +			vcpu->run->hypercall.args[0] = a0;
>> +			vcpu->run->hypercall.args[1] = a1;
>> +			vcpu->run->hypercall.args[2] = a2;
>> +			vcpu->run->hypercall.args[3] = a3;
> If performance is a justification for a more direct userspace exit, why
> limit it to just four parameters?  E.g. why not copy all registers to
> kvm_sync_regs and reverse the process on the way back in?
I don't think performance should be relevant for a hypercall interface. 
It's a control-plane path.
If a fast path is required, the guest should use this interface to 
coordinate a separate fast path (e.g. via a ring buffer on some guest 
memory page).
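
A minimal sketch of the kind of fast path meant here: a single-producer/single-consumer ring living on a shared guest page, whose location the control-plane interface advertises exactly once. The layout and names below are hypothetical, not any existing KVM or PV-device ABI:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fast-path ring on a shared guest page.  The control-plane
 * call (the proposed hypercall, or a PV device register) would advertise
 * this page's GPA once; data then flows with no further control-plane
 * exits.  Single producer, single consumer. */
#define RING_SLOTS 64  /* power of two so index masking works */

struct ring {
	uint32_t head;            /* written by producer only */
	uint32_t tail;            /* written by consumer only */
	uint64_t slot[RING_SLOTS];
};

static bool ring_push(struct ring *r, uint64_t msg)
{
	uint32_t head = r->head;

	if (head - r->tail == RING_SLOTS)
		return false;                       /* full */
	r->slot[head & (RING_SLOTS - 1)] = msg;
	/* Publish the slot before the new head becomes visible. */
	__atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
	return true;
}

static bool ring_pop(struct ring *r, uint64_t *msg)
{
	uint32_t tail = r->tail;

	if (__atomic_load_n(&r->head, __ATOMIC_ACQUIRE) == tail)
		return false;                       /* empty */
	*msg = r->slot[tail & (RING_SLOTS - 1)];
	__atomic_store_n(&r->tail, tail + 1, __ATOMIC_RELEASE);
	return true;
}
```

The point being: once such a ring exists, the control-plane interface's own latency is irrelevant, which is why performance arguments shouldn't drive the hypercall design.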

Anyway, these kinds of questions are another reason why I agree with Sean 
that using a PV device seems preferable.
Instead of forcing a general userspace hypercall interface standard, one 
could just implement whatever device-specific PV device one wants in host 
userspace.

BTW, in QEMU's VMPort implementation, userspace calls 
cpu_synchronize_state(), which de-facto syncs tons of vCPU state from KVM 
to userspace, not just the GP registers.
Because it's a slow path, that's considered fine. :P
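
For reference, the userspace half of the exit proposed below would look roughly like this. The struct is a trimmed stand-in for the fields of kvm_run the patch touches (the real definition lives in <linux/kvm.h>), and handle_exit() plus its arg0-selects-a-sub-op policy are purely hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Values from the uapi headers; reproduced here so the stand-in is
 * self-contained. */
#define KVM_EXIT_HYPERCALL 3
#define KVM_ENOSYS         1000

/* Trimmed stand-in for the kvm_run fields the patch uses. */
struct kvm_run_stub {
	uint32_t exit_reason;
	struct {
		uint64_t nr;
		uint64_t args[4];
		uint64_t ret;
	} hypercall;
};

/* Hypothetical host-userspace dispatch: on the new exit, service the
 * request and stash the return value in run->hypercall.ret, which
 * complete_hypercall() would load back into RAX on the next KVM_RUN.
 * Example policy only: args[0] selects a sub-op, unknown ops fail. */
static void handle_exit(struct kvm_run_stub *run)
{
	if (run->exit_reason != KVM_EXIT_HYPERCALL)
		return;

	if (run->hypercall.args[0] == 0)
		run->hypercall.ret = 0;
	else
		run->hypercall.ret = (uint64_t)-KVM_ENOSYS;
}
```

Note this dispatch is exactly the per-device logic a PV device's MMIO/PIO handler would contain anyway, which is why the PV-device route needs no KVM changes at all.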

-Liran

>
>> +			vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
>> +			vcpu->arch.complete_userspace_io = complete_hypercall;
>> +			return 0; // message is going to userspace
>> +		}
>> +		ret = -KVM_ENOSYS;
>> +		break;
>>   	default:
>>   		ret = -KVM_ENOSYS;
>>   		break;
>>   	}
>>   out:
>> -	if (!op_64_bit)
>> -		ret = (u32)ret;
>> -	kvm_rax_write(vcpu, ret);
>> -
>> -	++vcpu->stat.hypercalls;
>> -	return kvm_skip_emulated_instruction(vcpu);
>> +	vcpu->run->hypercall.ret = ret;
>> +	return complete_hypercall(vcpu);
>>   }
>>   EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
