From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alex Williamson
Subject: Re: VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Wed, 27 Jan 2016 09:10:16 -0700
Message-ID: <1453911016.6261.3.camel@redhat.com>
References: <1453143919.32741.169.camel@redhat.com>
 <569F4C86.2070501@intel.com> <56A6083E.10703@intel.com>
 <1453757426.32741.614.camel@redhat.com> <20160126102003.GA14400@nvidia.com>
 <1453838773.15515.1.camel@redhat.com> <20160126222830.GB21927@nvidia.com>
 <1453851038.18049.9.camel@redhat.com> <20160127091432.GA10936@nvidia.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Cc: "Ruan, Shuai", "Tian, Kevin", "kvm@vger.kernel.org",
 "igvt-g@lists.01.org", "Song, Jike", qemu-devel, Kirti Wankhede,
 "Lv, Zhiyuan", Paolo Bonzini, Gerd Hoffmann
To: Neo Jia
In-Reply-To: <20160127091432.GA10936@nvidia.com>
Errors-To: qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org
Sender: qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org
List-Id: kvm.vger.kernel.org

On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote:
> On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > > 1.1 Under per-physical device sysfs:
> > > > > ----------------------------------------------------------------------------------
> > > > >
> > > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its
> > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > > > "vgpu_supported_types".
> > > > >
> > > > > vgpu_create - WO, input syntax , create a virtual
> > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > > >
> > > > > vgpu_destroy - WO, input syntax , destroy a virtual gpu device on a
> > > > > target physical GPU
> > > >
> > > > I've noted in previous discussions that we need to separate user policy
> > > > from kernel policy here, the kernel policy should not require a "VM
> > > > UUID".  A UUID simply represents a set of one or more devices and an
> > > > index picks the device within the set.  Whether that UUID matches a VM
> > > > or is independently used is up to the user policy when creating the
> > > > device.
> > > >
> > > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > > UUID set of devices and instead have each device be independent.  This
> > > > seems to be an imposition on the nvidia implementation into the kernel
> > > > interface design.
> > > >
> > >
> > > Hi Alex,
> > >
> > > I agree with you that we should not put the UUID concept into a kernel API. At
> > > this point (without any prototyping), I am thinking of using a list of virtual
> > > devices instead of UUID.
> >
> > Hi Neo,
> >
> > A UUID is a perfectly fine name, so long as we let it be just a UUID and
> > not the UUID matching some specific use case.
> >
> > > > >
> > > > > int vgpu_map_virtual_bar
> > > > > (
> > > > >     uint64_t virt_bar_addr,
> > > > >     uint64_t phys_bar_addr,
> > > > >     uint32_t len,
> > > > >     uint32_t flags
> > > > > )
> > > > >
> > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > >
> > > > Per the implementation provided, this needs to be implemented in the
> > > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > > of the device and replacing it is wrong.  It should be remapped at the
> > > > vfio device file interface using vm_ops.
> > > >
> > >
> > > So you are basically suggesting that we are going to take a mmap fault and
> > > within that fault handler, we will go into the vendor driver to look up the
> > > "pre-registered" mapping and remap there.
> > >
> > > Is my understanding correct?
> >
> > Essentially, hopefully the vendor driver will have already registered
> > the backing for the mmap prior to the fault, but either way could work.
> > I think the key though is that you want to remap it onto the vma
> > accessing the vfio device file, not scanning it out of an IOVA mapping
> > that might be dynamic and doing a vma lookup based on the point in time
> > mapping of the BAR.  The latter doesn't give me much confidence that
> > mappings couldn't change while the former should be a one time fault.
>
> Hi Alex,
>
> The fact is that the vendor driver can only prevent such mmap fault by looking
> up the mapping table that we have saved from the IOMMU memory listener

Why do we need to prevent the fault?  We need to handle the fault when
it occurs.

> when the guest region gets programmed. Also, like you have mentioned below, such
> mapping between iova and hva shouldn't be changed as long as the SBIOS and
> guest OS are done with their job.

But you don't know they're done with their job.

> Yes, you are right it is a one time fault, but the gpu work is heavily pipelined.

Why does that matter?  We're talking about the first time the VM
accesses the range of the BAR that will be direct mapped to the
physical GPU.  This isn't going to happen in the middle of a benchmark,
it's going to happen during driver initialization in the guest.

> Probably we should just limit this interface to the guest MMIO region and we can have
> some crosscheck between the VFIO driver who has monitored the config space
> access to make sure nothing is getting moved around?

No, the solution for the bar is very clear, map on fault to the vma
accessing the mmap and be done with it for the remainder of this
instance of the VM.

> > In case it's not clear to folks at Intel, the purpose of this is that a
> > vGPU may directly map a segment of the physical GPU MMIO space, but we
> > may not know what segment that is at setup time, when QEMU does an mmap
> > of the vfio device file descriptor.  The thought is that we can create
> > an invalid mapping when QEMU calls mmap(), knowing that it won't be
> > accessed until later, then we can fault in the real mmap on demand.  Do
> > you need anything similar?
> >
> > > >
> > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > > >
> > > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > > >
> > > > > Still a lot to be added and modified, such as supporting multiple VMs and
> > > > > multiple virtual devices, tracking the mapped / pinned region within the VGPU IOMMU
> > > > > kernel driver, error handling, roll-back and locked memory size per user, etc.
> > > >
> > > > Particularly, handling of mapping changes is completely missing.  This
> > > > cannot be a point in time translation, the user is free to remap
> > > > addresses whenever they wish and device translations need to be updated
> > > > accordingly.
> > > >
> > >
> > > When you say "user", do you mean QEMU?
> >
> > vfio is a generic userspace driver interface, QEMU is a very, very
> > important user of the interface, but not the only user.  So for this
> > conversation, we're mostly talking about QEMU as the user, but we should
> > be careful about assuming QEMU is the only user.
> >
>
> Understood. I have to say that our focus at this moment is to support QEMU and
> KVM, but I know the VFIO interface is much more than that, and that is why I think
> it is right to leverage this framework so we can together explore future use
> cases in the userland.
>
>
> > > Here, the DMA that
> > > the guest driver is going to launch will be first pinned within the VM, and then
> > > registered to QEMU, therefore the IOMMU memory listener; eventually the pages
> > > will be pinned by the GPU or DMA engine.
> > >
> > > Since we are keeping the upper level code the same, thinking about the passthru case,
> > > where the GPU has already put the real IOVA into its PTEs, I don't know how QEMU
> > > can change that mapping without causing an IOMMU fault on an active DMA device.
> >
> > For the virtual BAR mapping above, it's easy to imagine that mapping a
> > BAR to a given address is at the guest discretion, it may be mapped and
> > unmapped, it may be mapped to different addresses at different points in
> > time, the guest BIOS may choose to map it at yet another address, etc.
> > So if somehow we were trying to setup a mapping for peer-to-peer, there
> > are lots of ways that IOVA could change.  But even with RAM, we can
> > support memory hotplug in a VM.  What was once a DMA target may be
> > removed or may now be backed by something else.  Chipset configuration
> > on the emulated platform may change how guest physical memory appears
> > and that might change between VM boots.
> >
> > Currently with physical device assignment the memory listener watches
> > for both maps and unmaps and updates the iotlb to match.  Just like real
> > hardware doing these same sorts of things, we rely on the guest to stop
> > using memory that's going to be moved as a DMA target prior to moving
> > it.
>
> Right, you can only do that when the device is quiescent.
>
> As long as this will be notified to the guest, I think we should be able to
> support it, although the real implementation will depend on how the device gets into
> the quiescent state.
>
> This is definitely a very interesting feature we should explore, but I hope we
> can first focus on the most basic functionality.
If we only do a point-in-time translation and assume it never changes,
that's good enough for a proof of concept, but it's not a complete
solution.  I think this is a practical problem, not just an academic
problem.  There needs to be a mechanism for mappings to be invalidated
based on VM memory changes.  Thanks,

Alex