From: Neo Jia <cjia@nvidia.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	"Lv, Zhiyuan" <zhiyuan.lv@intel.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	Yang Zhang <yang.zhang.wz@gmail.com>,
	"igvt-g@lists.01.org" <igvt-g@ml01.01.org>,
	qemu-devel <qemu-devel@nongnu.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Kirti Wankhede <kwankhede@nvidia.com>
Subject: Re: [iGVT-g] <summary> RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Wed, 3 Feb 2016 19:52:22 -0800	[thread overview]
Message-ID: <20160204035222.GA7092@nvidia.com> (raw)
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D15F79D9B9@SHSMSX103.ccr.corp.intel.com>

On Thu, Feb 04, 2016 at 03:01:36AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, February 04, 2016 4:45 AM
> > >
> > > First, Jike told me before his vacation, that we cannot do any change to
> > > KVM module according to community comments. Now I think it's not true.
> > > We can do necessary changes, as long as it is done in a structural/layered
> > > approach, w/o hard assumption on KVMGT as the only user. That's the
> > > guideline we need to obey. :-)
> > 
> > We certainly need to separate the functionality that you're trying to
> > enable from the more pure concept of vfio.  vfio is a userspace driver
> > interface, not a userspace driver interface for KVM-based virtual
> > machines.  Maybe it's more of a gimmick that we can assign PCI devices
> > to QEMU tcg VMs, but that's really just the proof of concept for more
> > useful capabilities, like supporting DPDK applications.  So, I
> > begrudgingly agree that structured/layered interactions are acceptable,
> > but consider what use cases may be excluded by doing so.
> 
> Understood. We shouldn't assume VFIO is always connected to KVM. For
> example, once we have vfio-vgpu ready, it can be used to drive container
> usage too, not necessarily always connecting with KVM/Qemu. Actually, thinking
> more from this angle, there is a new open question which I'll describe at the end...
> 
> > >
> > > 4) Map/unmap guest memory
> > > --
> > > It's there for KVM.
> > 
> > Map and unmap for who?  For the vGPU or for the VM?  It seems like we
> > know how to map guest memory for the vGPU without KVM, but that's
> > covered in 7), so I'm not entirely sure what this is specifying.
> 
> Map guest memory for emulation purposes in the vGPU device model, e.g. to r/w
> the guest GPU page table, command buffer, etc. It's a basic requirement, as
> we see in any device model.
> 
> 7) provides the database (both GPA->IOVA and GPA->HPA), where GPA->HPA
> can be used to implement this interface for KVM. However, for Xen it's
> different, as a special foreign-domain mapping hypercall is involved, which is
> Xen specific and so not appropriate to be in VFIO.
> 
> That's why we list this interface separately as a key requirement (though
> it's obvious for KVM)

Hi Kevin,

It seems you are trying to map guest physical memory into your kernel driver
on the host, right?

If so, I think we already have the required information to achieve that.

The type1 IOMMU vGPU interface already provides <QEMU_VA, iova, qemu_mm>, which is
enough for us to do any lookup.
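
For example, here is a minimal sketch of what a vendor module could do to map
one guest page for CPU access. It is illustrative only: it assumes the caller
has already turned the GPA into an HVA via the <QEMU_VA, iova> mapping, and the
exact get_user_pages_remote() signature (and the mmap locking helpers) vary
across kernel versions.

#include <linux/mm.h>
#include <linux/highmem.h>

/* Illustrative only: pin one guest page via the QEMU mm and map it for
 * CPU access in the host driver.  Caller later does kunmap()/put_page().
 * GUP/locking call names differ by kernel version. */
static void *vgpu_map_guest_page(struct mm_struct *qemu_mm, unsigned long hva)
{
        struct page *page;
        long pinned;

        mmap_read_lock(qemu_mm);
        pinned = get_user_pages_remote(qemu_mm, hva & PAGE_MASK, 1,
                                       FOLL_WRITE, &page, NULL);
        mmap_read_unlock(qemu_mm);
        if (pinned != 1)
                return NULL;

        return kmap(page) + (hva & ~PAGE_MASK);
}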

> 
> > 
> > > 5) Pin/unpin guest memory
> > > --
> > > IGD or any PCI passthru should have the same requirement. So we should be
> > > able to leverage existing code in VFIO. The only tricky thing (Jike may
> > > elaborate after he is back) is that KVMGT requires pinning EPT entries too,
> > > which requires some further change on the KVM side. But I'm not sure whether
> > > it still holds true after some design changes made in this thread. So I'll
> > > leave it to Jike to comment further.
> > 
> > PCI assignment requires pinning all of guest memory, I would think that
> > IGD would only need to pin selective memory, so is this simply stating
> > that both have the need to pin memory, not that they'll do it to the
> > same extent?
> 
> For simplicity let's first pin all memory, while taking selective pinning as a
> future enhancement.
> 
> The tricky thing is that the existing 'pin' action in VFIO doesn't actually pin
> EPT entries too (it only pins host page tables for the Qemu process). There are
> various places where EPT entries might be invalidated while the guest is
> running, whereas KVMGT requires EPT entries to be pinned as well. Let's wait
> for Jike to elaborate on whether this part is still required today.

Sorry, I don't quite follow the logic here. The current VFIO TYPE1 IOMMU (including the API
and the underlying IOMMU implementation) will pin the guest physical memory and
install those pages into the proper device domain. Yes, it is only for the QEMU
process, as that is the process the VM is running in.

Am I missing something here?
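
For reference, this is roughly how userspace (QEMU, or any other type1 user)
asks VFIO to pin and map a chunk of guest RAM today (error handling trimmed,
values illustrative):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map one region of guest RAM: type1 pins the calling process's pages
 * backing 'vaddr' and installs them in the device's IOMMU domain at 'iova'. */
static int map_guest_ram(int container_fd, void *vaddr, __u64 iova, __u64 size)
{
        struct vfio_iommu_type1_dma_map dma_map;

        memset(&dma_map, 0, sizeof(dma_map));
        dma_map.argsz = sizeof(dma_map);
        dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        dma_map.vaddr = (__u64)(unsigned long)vaddr;  /* QEMU VA backing guest RAM */
        dma_map.iova  = iova;                         /* guest physical address */
        dma_map.size  = size;

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
}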

> 
> > 
> > > 6) Write-protect a guest memory page
> > > --
> > > The primary purpose is for GPU page table shadowing. We need to track
> > > modifications to the guest GPU page table, so the shadow part can be synchronized
> > > accordingly. Just think about CPU page table shadowing. An old example,
> > > as Zhiyuan pointed out, is write-protecting the guest cmd buffer, but that is
> > > no longer necessary now.
> > >
> > > So we need KVM to provide an interface so that agents can request such
> > > a write-protection action (not just for KVMGT; it could be for other tracking
> > > usages). Guangrong has been working on a general page-tracking mechanism,
> > > upon which write-protection can easily be built. The review is still in
> > > progress.
> > 
> > I have a hard time believing we don't have the mechanics to do this
> > outside of KVM.  We should be able to write protect user pages from the
> > kernel, this is how copy-on-write generally works.  So it seems like we
> > should be able to apply those same mechanics to our userspace process,
> > which just happens to be a KVM VM.  I'm hoping that Paolo might have
> > some ideas how to make this work or maybe Intel has some virtual memory
> > experts that can point us in the right direction.
> 
> What we want to write-protect is access that happens inside the VM.
> I don't see how any tricks in the host page tables can help here. The only
> way is to tweak the page tables used in non-root mode (either EPT or
> shadow page tables).
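
(Just to picture the shape of the page-tracking interface mentioned above:
every name below is hypothetical and does not refer to the actual series under
review.)

#include <linux/kvm_host.h>

/* Hypothetical write-protect notifier interface, names invented for
 * illustration only. */
struct gfn_write_track_ops {
        /* Called after a write to a tracked guest page has been emulated. */
        void (*track_write)(struct kvm *kvm, gpa_t gpa, const void *new,
                            int bytes, void *opaque);
};

/* A vGPU device model registers its ops once... */
int kvm_gfn_track_register(struct kvm *kvm,
                           const struct gfn_write_track_ops *ops, void *opaque);

/* ...and then write-protects individual guest GPU page-table pages. */
int kvm_gfn_track_add(struct kvm *kvm, gfn_t gfn);
int kvm_gfn_track_remove(struct kvm *kvm, gfn_t gfn);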
> 
> > 
> > Thanks for your summary, Kevin.  It does seem like there are only a few
> > outstanding issues which should be manageable and hopefully the overall
> > approach is cleaner for QEMU, management tools, and provides a more
> > consistent user interface as well.  If we can translate the solution to
> > Xen, that's even better.  Thanks,
> > 
> 
> Here is the main open question in my head, after thinking about the role of VFIO:
> 
> The above 7 services required by the vGPU device model fall into
> two categories:
> 
> a) services to connect the vGPU with the VM, which are essentially what a device
> driver does (so VFIO fits here), including:
> 	1) Selectively pass-through a region to a VM
> 	2) Trap-and-emulate a region
> 	3) Inject a virtual interrupt
> 	5) Pin/unpin guest memory
> 	7) GPA->IOVA/HVA translation (as a side-effect)
> 
> b) services to support device emulation, which are going to be hypervisor
> specific, including:
> 	4) Map/unmap guest memory

I think we already have the ability to support this with VFIO; see my comments
above.

Thanks,
Neo

> 	6) Write-protect a guest memory page
> 
> VFIO can fulfill category a), but not b). A possible abstraction would
> be in the vGPU core driver, allowing a specific hypervisor to register callbacks
> for category b) (which means a KVMGT-specific file, say KVM-vGPU, will
> be added to KVM to connect the two together).
> 
> The likely layered blocks would then look like:
> 
> VFIO-vGPU  <--------->  vGPU Core  <-------------> KVMGT-vGPU
>                         ^       ^
>                         |       |
>                         |       |
>                         v       v
>                       nvidia   intel
>                        vGPU    vGPU
> 
> Xen will register its own vGPU bus driver (not using VFIO today) and
> also its hypervisor services using the same framework. With this design,
> everything is abstracted/registered through the vGPU core driver, instead
> of the components talking to each other directly.
> 
> Thoughts?
> 
> P.S. From the description of the above requirements, the whole framework
> might also be extended to cover any device type using the same mediated
> pass-through approach. Though graphics has some special requirements,
> the majority are actually device agnostic. Maybe it's better not to limit it
> to a vGPU name at all. :-)
> 
> Thanks
> Kevin
> _______________________________________________
> iGVT-g mailing list
> iGVT-g@lists.01.org
> https://lists.01.org/mailman/listinfo/igvt-g
