From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757856AbbKSHXZ (ORCPT <rfc822;w@1wt.eu>);
	Thu, 19 Nov 2015 02:23:25 -0500
Received: from mga11.intel.com ([192.55.52.93]:18288 "EHLO mga11.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753065AbbKSHXX (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 19 Nov 2015 02:23:23 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.20,317,1444719600"; 
   d="scan'208";a="854556431"
Message-ID: <564D78D0.80904@intel.com>
Date: Thu, 19 Nov 2015 15:22:56 +0800
From: Jike Song <jike.song@intel.com>
User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:17.0) Gecko/20130801 Thunderbird/17.0.8
MIME-Version: 1.0
To: Alex Williamson <alex.williamson@redhat.com>
CC: "Tian, Kevin" <kevin.tian@intel.com>,
        "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
        "igvt-g@ml01.01.org" <igvt-g@ml01.01.org>,
        "intel-gfx@lists.freedesktop.org" <intel-gfx@lists.freedesktop.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "White, Michael L" <michael.l.white@intel.com>,
        "Dong, Eddie" <eddie.dong@intel.com>, "Li, Susie" <susie.li@intel.com>,
        "Cowperthwaite, David J" <david.j.cowperthwaite@intel.com>,
        "Reddy, Raghuveer" <raghuveer.reddy@intel.com>,
        "Zhu, Libo" <libo.zhu@intel.com>, "Zhou, Chao" <chao.zhou@intel.com>,
        "Wang, Hongbo" <hongbo.wang@intel.com>,
        "Lv, Zhiyuan" <zhiyuan.lv@intel.com>,
        qemu-devel <qemu-devel@nongnu.org>,
        Paolo Bonzini <pbonzini@redhat.com>, Gerd Hoffmann <kraxel@redhat.com>
Subject: Re: [Intel-gfx] [Announcement] 2015-Q3 release of XenGT - a Mediated
 Graphics Passthrough Solution from Intel
References: <53D215D3.50608@intel.com> <547FCAAD.2060406@intel.com>  <54AF967B.3060503@intel.com> <5527CEC4.9080700@intel.com>  <559B3E38.1080707@intel.com> <562F4311.9@intel.com> <1447870341.4697.92.camel@redhat.com> <AADFC41AFE54684AB9EE6CBC0274A5D15F7152DB@SHSMSX101.ccr.corp.intel.com>
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D15F7152DB@SHSMSX101.ccr.corp.intel.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Alex,
On 11/19/2015 12:06 PM, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Thursday, November 19, 2015 2:12 AM
>>
>> [cc +qemu-devel, +paolo, +gerd]
>>
>> On Tue, 2015-10-27 at 17:25 +0800, Jike Song wrote:
>>> {snip}
>>
>> Hi!
>>
>> At redhat we've been thinking about how to support vGPUs from multiple
>> vendors in a common way within QEMU.  We want to enable code sharing
>> between vendors and give new vendors an easy path to add their own
>> support.  We also have the complication that not all vGPU vendors are as
>> open source friendly as Intel, so being able to abstract the device
>> mediation and access outside of QEMU is a big advantage.
>>
>> The proposal I'd like to make is that a vGPU, whether it is from Intel
>> or another vendor, is predominantly a PCI(e) device.  We have an
>> interface in QEMU already for exposing arbitrary PCI devices, vfio-pci.
>> Currently vfio-pci uses the VFIO API to interact with "physical" devices
>> and system IOMMUs.  I highlight /physical/ there because some of these
>> physical devices are SR-IOV VFs, which is somewhat of a fuzzy concept,
>> somewhere between fixed hardware and a virtual device implemented in
>> software.  That software just happens to be running on the physical
>> endpoint.
>
> Agree.
>
> One clarification for rest discussion, is that we're talking about GVT-g vGPU
> here which is a pure software GPU virtualization technique. GVT-d (note
> some use in the text) refers to passing through the whole GPU or a specific
> VF. GVT-d already falls into existing VFIO APIs nicely (though some on-going
> effort to remove Intel specific platform stickness from gfx driver). :-)
>

Hi Alex, thanks for the discussion.

In addition to Kevin's replies, I have a high-level question: can VFIO
be used by QEMU for both KVM and Xen?

--
Thanks,
Jike

  
>>
>> vGPUs are similar, with the virtual device created at a different point,
>> host software.  They also rely on different IOMMU constructs, making use
>> of the MMU capabilities of the GPU (GTTs and such), but really having
>> similar requirements.
>
> One important difference between system IOMMU and GPU-MMU here.
> System IOMMU is very much about translation from a DMA target
> (IOVA on native, or GPA in virtualization case) to HPA. However GPU
> internal MMUs is to translate from Graphics Memory Address (GMA)
> to DMA target (HPA if system IOMMU is disabled, or IOVA/GPA if system
> IOMMU is enabled). GMA is an internal addr space within GPU, not
> exposed to Qemu and fully managed by GVT-g device model. Since it's
> not a standard PCI defined resource, we don't need abstract this capability
> in VFIO interface.
>
>>
>> The proposal is therefore that GPU vendors can expose vGPUs to
>> userspace, and thus to QEMU, using the VFIO API.  For instance, vfio
>> supports modular bus drivers and IOMMU drivers.  An intel-vfio-gvt-d
>> module (or extension of i915) can register as a vfio bus driver, create
>> a struct device per vGPU, create an IOMMU group for that device, and
>> register that device with the vfio-core.  Since we don't rely on the
>> system IOMMU for GVT-d vGPU assignment, another vGPU vendor driver (or
>> extension of the same module) can register a "type1" compliant IOMMU
>> driver into vfio-core.  From the perspective of QEMU then, all of the
>> existing vfio-pci code is re-used, QEMU remains largely unaware of any
>> specifics of the vGPU being assigned, and the only necessary change so
>> far is how QEMU traverses sysfs to find the device and thus the IOMMU
>> group leading to the vfio group.
>
> GVT-g requires to pin guest memory and query GPA->HPA information,
> upon which shadow GTTs will be updated accordingly from (GMA->GPA)
> to (GMA->HPA). So yes, here a dummy or simple "type1" compliant IOMMU
> can be introduced just for this requirement.
>
> However there's one tricky point which I'm not sure whether overall
> VFIO concept will be violated. GVT-g doesn't require system IOMMU
> to function, however host system may enable system IOMMU just for
> hardening purpose. This means two-level translations existing (GMA->
> IOVA->HPA), so the dummy IOMMU driver has to request system IOMMU
> driver to allocate IOVA for VMs and then setup IOVA->HPA mapping
> in IOMMU page table. In this case, multiple VM's translations are
> multiplexed in one IOMMU page table.
>
> We might need create some group/sub-group or parent/child concepts
> among those IOMMUs for thorough permission control.
>
>>
>> There are a few areas where we know we'll need to extend the VFIO API to
>> make this work, but it seems like they can all be done generically.  One
>> is that PCI BARs are described through the VFIO API as regions and each
>> region has a single flag describing whether mmap (ie. direct mapping) of
>> that region is possible.  We expect that vGPUs likely need finer
>> granularity, enabling some areas within a BAR to be trapped and fowarded
>> as a read or write access for the vGPU-vfio-device module to emulate,
>> while other regions, like framebuffers or texture regions, are directly
>> mapped.  I have prototype code to enable this already.
>
> Yes in GVT-g one BAR resource might be partitioned among multiple vGPUs.
> If VFIO can support such partial resource assignment, it'd be great. Similar
> parent/child concept might also be required here, so any resource enumerated
> on a vGPU shouldn't break limitations enforced on the physical device.
>
> One unique requirement for GVT-g here, though, is that vGPU device model
> need to know guest BAR configuration for proper emulation (e.g. register
> IO emulation handler to KVM). Similar is about guest MSI vector for virtual
> interrupt injection. Not sure how this can be fit into common VFIO model.
> Does VFIO allow vendor specific extension today?
>
>>
>> Another area is that we really don't want to proliferate each vGPU
>> needing a new IOMMU type within vfio.  The existing type1 IOMMU provides
>> potentially the most simple mapping and unmapping interface possible.
>> We'd therefore need to allow multiple "type1" IOMMU drivers for vfio,
>> making type1 be more of an interface specification rather than a single
>> implementation.  This is a trivial change to make within vfio and one
>> that I believe is compatible with the existing API.  Note that
>> implementing a type1-compliant vfio IOMMU does not imply pinning an
>> mapping every registered page.  A vGPU, with mediated device access, may
>> use this only to track the current HVA to GPA mappings for a VM.  Only
>> when a DMA is enabled for the vGPU instance is that HVA pinned and an
>> HPA to GPA translation programmed into the GPU MMU.
>>
>> Another area of extension is how to expose a framebuffer to QEMU for
>> seamless integration into a SPICE/VNC channel.  For this I believe we
>> could use a new region, much like we've done to expose VGA access
>> through a vfio device file descriptor.  An area within this new
>> framebuffer region could be directly mappable in QEMU while a
>> non-mappable page, at a standard location with standardized format,
>> provides a description of framebuffer and potentially even a
>> communication channel to synchronize framebuffer captures.  This would
>> be new code for QEMU, but something we could share among all vGPU
>> implementations.
>
> Now GVT-g already provides an interface to decode framebuffer information,
> w/ an assumption that the framebuffer will be further composited into
> OpenGL APIs. So the format is defined according to OpenGL definition.
> Does that meet SPICE requirement?
>
> Another thing to be added. Framebuffers are frequently switched in
> reality. So either Qemu needs to poll or a notification mechanism is required.
> And since it's dynamic, having framebuffer page directly exposed in the
> new region might be tricky. We can just expose framebuffer information
> (including base, format, etc.) and let Qemu to map separately out of VFIO
> interface.
>
> And... this works fine with vGPU model since software knows all the
> detail about framebuffer. However in pass-through case, who do you expect
> to provide that information? Is it OK to introduce vGPU specific APIs in
> VFIO?
>
>>
>> Another obvious area to be standardized would be how to discover,
>> create, and destroy vGPU instances.  SR-IOV has a standard mechanism to
>> create VFs in sysfs and I would propose that vGPU vendors try to
>> standardize on similar interfaces to enable libvirt to easily discover
>> the vGPU capabilities of a given GPU and manage the lifecycle of a vGPU
>> instance.
>
> Now there is no standard. We expose vGPU life-cycle mgmt. APIs through
> sysfs (under i915 node), which is very Intel specific. In reality different
> vendors have quite different capabilities for their own vGPUs, so not sure
> how standard we can define such a mechanism. But this code should be
> minor to be maintained in libvirt.
>
>>
>> This is obviously a lot to digest, but I'd certainly be interested in
>> hearing feedback on this proposal as well as try to clarify anything
>> I've left out or misrepresented above.  Another benefit to this
>> mechanism is that direct GPU assignment and vGPU assignment use the same
>> code within QEMU and same API to the kernel, which should make debugging
>> and code support between the two easier.  I'd really like to start a
>> discussion around this proposal, and of course the first open source
>> implementation of this sort of model will really help to drive the
>> direction it takes.  Thanks!
>>
>
> Thanks for starting this discussion. Intel will definitely work with
> community on this work. Based on earlier comments, I'm not sure
> whether we can exactly same code for direct GPU assignment and
> vGPU assignment, since even we extend VFIO some interfaces might
> be vGPU specific. Does this way still achieve your end goal?
>
> Thanks
> Kevin
>