From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jike Song <jike.song@intel.com>
Subject: Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT
 - a Mediated ...)
Date: Wed, 27 Jan 2016 13:43:47 +0800
Message-ID: <56A85913.1020506@intel.com>
References: <569C5071.6080004@intel.com> <1453092476.32741.67.camel@redhat.com>
	<569CA8AD.6070200@intel.com>
	<1453143919.32741.169.camel@redhat.com>
	<569F4C86.2070501@intel.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F786B4B@SHSMSX101.ccr.corp.intel.com>
	<56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com>
	<56A72313.9030009@intel.com> <56A77D2D.40109@gmail.com>
	<1453826249.26652.54.camel@redhat.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F78EAEA@SHSMSX101.ccr.corp.intel.com>
	<1453844613.18049.1.camel@redhat.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F78EB95@SHSMSX101.ccr.corp.intel.com>
	<1453846073.18049.3.camel@redhat.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F78ECBB@SHSMSX101.ccr.corp.intel.com>
	<1453847250.18049.5.camel@redhat.com>
	<AADFC41AFE54684AB9EE6CBC0274A5D15F78ED63@SHSMSX101.ccr.corp.intel.com>
	<1453848975.18049.7.camel@redhat.com> <56A821AD.5090606@intel.com>
	<1453864068.3107.3.camel@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: "Tian, Kevin" <kevin.tian@intel.com>,
	Yang Zhang <yang.zhang.wz@gmail.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Lv, Zhiyuan" <zhiyuan.lv@intel.com>,
	"Ruan, Shuai" <shuai.ruan@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	qemu-devel <qemu-devel@nongnu.org>,
	"igvt-g@lists.01.org" <igvt-g@ml01.01.org>,
	Neo Jia <cjia@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mga01.intel.com ([192.55.52.88]:32603 "EHLO mga01.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751162AbcA0Fnh (ORCPT <rfc822;kvm@vger.kernel.org>);
	Wed, 27 Jan 2016 00:43:37 -0500
In-Reply-To: <1453864068.3107.3.camel@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On 01/27/2016 11:07 AM, Alex Williamson wrote:
> On Wed, 2016-01-27 at 09:47 +0800, Jike Song wrote:
>> On 01/27/2016 06:56 AM, Alex Williamson wrote:
>>> On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>> Sent: Wednesday, January 27, 2016 6:27 AM
>>>>>  
>>>>> On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
>>>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>>>> Sent: Wednesday, January 27, 2016 6:08 AM
>>>>>>>  
>>>>>>>>>>>  
>>>>>>>>>>  
>>>>>>>>>> Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
>>>>>>>>>> KVM, so VM MMIO access will be forwarded to KVMGT directly for
>>>>>>>>>> emulation in kernel. If we reuse above R/W flags, the whole emulation
>>>>>>>>>> path would be unnecessarily long with obvious performance impact. We
>>>>>>>>>> either need a new flag here to indicate in-kernel emulation (bias from
>>>>>>>>>> passthrough support), or just hide the region alternatively (let KVMGT
>>>>>>>>>> to handle I/O emulation itself like today).
>>>>>>>>>  
>>>>>>>>> That sounds like a future optimization TBH.  There's very strict
>>>>>>>>> layering between vfio and kvm.  Physical device assignment could make
>>>>>>>>> use of it as well, avoiding a round trip through userspace when an
>>>>>>>>> ioread/write would do.  Userspace also needs to orchestrate those kinds
>>>>>>>>> of accelerators, there might be cases where userspace wants to see those
>>>>>>>>> transactions for debugging or manipulating the device.  We can't simply
>>>>>>>>> take shortcuts to provide such direct access.  Thanks,
>>>>>>>>>  
>>>>>>>>  
>>>>>>>> But we have to balance such debugging flexibility and acceptable performance.
>>>>>>>> To me the latter one is more important otherwise there'd be no real usage
>>>>>>>> around this technique, while for debugging there are other alternative (e.g.
>>>>>>>> ftrace) Consider some extreme case with 100k traps/second and then see
>>>>>>>> how much impact a 2-3x longer emulation path can bring...
>>>>>>>  
>>>>>>> Are you jumping to the conclusion that it cannot be done with proper
>>>>>>> layering in place?  Performance is important, but it's not an excuse to
>>>>>>> abandon designing interfaces between independent components.  Thanks,
>>>>>>>  
>>>>>>  
>>>>>> Two are not controversial. My point is to remove unnecessary long trip
>>>>>> as possible. After another thought, yes we can reuse existing read/write
>>>>>> flags:
>>>>>>  	- KVMGT will expose a private control variable whether in-kernel
>>>>>> delivery is required;
>>>>>  
>>>>> But in-kernel delivery is never *required*.  Wouldn't userspace want to
>>>>> deliver in-kernel any time it possibly could?
>>>>>  
>>>>>>  	- when the variable is true, KVMGT will register in-kernel MMIO
>>>>>> emulation callbacks then VM MMIO request will be delivered to KVMGT
>>>>>> directly;
>>>>>>  	- when the variable is false, KVMGT will not register anything.
>>>>>> VM MMIO request will then be delivered to Qemu and then ioread/write
>>>>>> will be used to finally reach KVMGT emulation logic;
>>>>>  
>>>>> No, that means the interface is entirely dependent on a backdoor through
>>>>> KVM.  Why can't userspace (QEMU) do something like register an MMIO
>>>>> region with KVM handled via a provided file descriptor and offset,
>>>>> couldn't KVM then call the file ops without a kernel exit?  Thanks,
>>>>>  
>>>>  
>>>> Could you elaborate this thought? If it can achieve the purpose w/o
>>>> a kernel exit definitely we can adapt to it. :-)
>>>  
>>> I only thought of it when replying to the last email and have been doing
>>> some research, but we already do quite a bit of synchronization through
>>> file descriptors.  The kvm-vfio pseudo device uses a group file
>>> descriptor to ensure a user has access to a group, allowing some degree
>>> of interaction between modules.  Eventfds and irqfds already make use of
>>> f_ops on file descriptors to poke data.  So, if KVM had information that
>>> an MMIO region was backed by a file descriptor for which it already has
>>> a reference via fdget() (and verified access rights and whatnot), then
>>> it ought to be a simple matter to get to f_ops->read/write knowing the
>>> base offset of that MMIO region.  Perhaps it could even simply use
>>> __vfs_read/write().  Then we've got a proper reference to the file
>>> descriptor for ownership purposes and we've transparently jumped across
>>> modules without any implicit knowledge of the other end.  Could it work?
>>  
>> This is OK for KVMGT, from fops to vgpu device-model would always be simple.
>> The only question is, how is KVM hypervisor supposed to get the fd on VM-exitings?
> 
> Hi Jike,
> 
> Sorry, I don't understand "on VM-exiting".  KVM would hold a reference
> to the fd via fdget(), so the vfio device wouldn't be closed until the
> VM exits and KVM releases that reference.
> 

Sorry for my bad English; I meant a VM exit (VMEXIT), i.e. the transition
from non-root mode into the KVM hypervisor.

>> copy-and-paste the current implementation of vcpu_mmio_write(), seems
>> nothing but GPA and len are provided:
> 
> I presume that an MMIO region is already registered with a GPA and
> length, the additional information necessary would be a file descriptor
> and offset into the file descriptor for the base of the MMIO space.
> 
>>  	static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
>>  				   const void *v)
>>  	{
>>  		int handled = 0;
>>  		int n;
>>  
>>  		do {
>>  			n = min(len, 8);
>>  			if (!(vcpu->arch.apic &&
>>  			      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
>>  			    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
>>  				break;
>>  			handled += n;
>>  			addr += n;
>>  			len -= n;
>>  			v += n;
>>  		} while (len);
>>  
>>  		return handled;
>>  	}
>>  
>> If we back a GPA range with a fd, this will also be a 'backdoor'?
> 
> KVM would simply be able to service the MMIO access using the provided
> fd and offset.  It's not a back door because we will have created an API
> for KVM to have a file descriptor and offset registered (by userspace)
> to handle the access.  Also, KVM does not know the file descriptor is
> handled by a VFIO device and VFIO doesn't know the read/write accesses
> is initiated by KVM.  Seems like the question is whether we can fit
> something like that into the existing KVM MMIO bus/device handlers
> in-kernel.  Thanks,
> 

I had a look at eventfd, and I would say yes, technically we can
achieve the goal: introduce an fd whose fop->{read|write} are defined in
KVM and call into the vgpu device-model, plus an iodev registered for an
MMIO GPA range that invokes fop->{read|write}.  I just don't understand
why userspace can't register an iodev via an API directly.
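
To make sure I understand the fd + offset idea, here is a small userspace
sketch, not kernel code.  Everything in it is hypothetical and only for
illustration: struct fd_mmio_region, fd_mmio_write() and fd_mmio_read() are
names I made up, and pread()/pwrite() on an ordinary file stand in for
calling f_ops->read/write on the vfio device fd.  It keeps the same
8-byte chunking and "bytes handled" return convention as
vcpu_mmio_write() quoted above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: one guest-physical MMIO range backed by (fd, offset). */
struct fd_mmio_region {
	uint64_t gpa;	/* base guest-physical address of the range */
	uint64_t len;	/* length of the range in bytes */
	int fd;		/* backing file descriptor (stand-in for a vfio fd) */
	off_t offset;	/* file offset that corresponds to gpa */
};

/* Nonzero if [addr, addr + n) lies entirely inside the region. */
static int fd_mmio_in_range(const struct fd_mmio_region *r,
			    uint64_t addr, int n)
{
	return addr >= r->gpa && addr - r->gpa + (uint64_t)n <= r->len;
}

/* Write len bytes in chunks of at most 8; return the bytes handled. */
static int fd_mmio_write(const struct fd_mmio_region *r, uint64_t addr,
			 int len, const void *v)
{
	const uint8_t *p = v;
	int handled = 0;

	while (len > 0) {
		int n = len < 8 ? len : 8;

		if (!fd_mmio_in_range(r, addr, n))
			break;
		/* Translate GPA to file offset, then use the fd. */
		if (pwrite(r->fd, p, n, r->offset + (off_t)(addr - r->gpa)) != n)
			break;
		handled += n;
		addr += n;
		len -= n;
		p += n;
	}
	return handled;
}

/* Read len bytes in chunks of at most 8; return the bytes handled. */
static int fd_mmio_read(const struct fd_mmio_region *r, uint64_t addr,
			int len, void *v)
{
	uint8_t *p = v;
	int handled = 0;

	while (len > 0) {
		int n = len < 8 ? len : 8;

		if (!fd_mmio_in_range(r, addr, n))
			break;
		if (pread(r->fd, p, n, r->offset + (off_t)(addr - r->gpa)) != n)
			break;
		handled += n;
		addr += n;
		len -= n;
		p += n;
	}
	return handled;
}
```

The point of the sketch is only that the dispatcher needs nothing beyond
the GPA range, the fd and the base offset; it has no knowledge of what
sits behind the fd.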

Besides, this doesn't necessarily require another thread, right?
I guess it can run within the VCPU thread?

This brings up another question: apart from the vfio bus driver and
iommu backend (and the page_track utility used for guest memory
write-protection), is KVMGT allowed to call into kvm.ko (or modify it)?
Though we are becoming less and less willing to do that with VFIO, it's
still better to know before going wrong.

Thanks!


> Alex
>

--
Thanks,
Jike
