* VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
From: Jike Song @ 2016-01-18  2:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ruan, Shuai, Tian, Kevin, kvm, qemu-devel, igvt-g, Gerd Hoffmann,
	Paolo Bonzini, Zhiyuan Lv

Hi Alex, let's continue with a new thread :)

Basically we agree with you: exposing vGPU via VFIO can make
QEMU share as much code as possible with pcidev (PF or VF) assignment.
And yes, different vGPU vendors can share quite a lot of the
QEMU part, which will benefit upper layers such as libvirt.


To achieve this, there is quite a lot to do; I'll summarize
it below. I have been diving into VFIO for a while but may still
have misunderstood things, so please correct me :)



First, let me illustrate my understanding of the current VFIO
framework used to pass a pcidev through to a guest:


                 +----------------------------------+
                 |            vfio qemu             |
                 +-----+------------------------+---+
                       |DMA                  ^  |CFG
QEMU                   |map               IRQ|  |
-----------------------|---------------------|--|-----------
KERNEL    +------------|---------------------|--|----------+
          | VFIO       |                     |  |          |
          |            v                     |  v          |
          |  +-------------------+     +-----+-----------+ |
IOMMU     |  | vfio iommu driver |     | vfio bus driver | |
API  <-------+                   |     |                 | |
Layer     |  | e.g. type1        |     | e.g. vfio_pci   | |
          |  +-------------------+     +-----------------+ |
          +------------------------------------------------+


Here, when a particular pcidev is passed through to a KVM guest,
it is attached to the vfio_pci driver in the host, and guest memory
is mapped into the IOMMU via the type1 iommu driver.
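
For reference, the userspace half of this flow looks roughly like
the following (a minimal sketch, error handling omitted; the group
number is just an example):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static void map_guest_ram(void *guest_ram, uint64_t ram_size)
    {
            int container = open("/dev/vfio/vfio", O_RDWR);
            int group = open("/dev/vfio/26", O_RDWR);   /* example group */
            struct vfio_iommu_type1_dma_map map = {
                    .argsz = sizeof(map),
                    .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                    .vaddr = (uint64_t)(uintptr_t)guest_ram, /* QEMU vaddr */
                    .iova  = 0,          /* guest physical address */
                    .size  = ram_size,
            };

            /* attach the group to a container, pick type1, map guest RAM */
            ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
            ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
            ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }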


Then, the draft infrastructure of a future VFIO-based vgpu:



                 +-------------------------------------+
                 |              vfio qemu              |
                 +----+-------------------------+------+
                      |DMA                   ^  |CFG
QEMU                  |map                IRQ|  |
----------------------|----------------------|--|-----------
KERNEL                |                      |  |
         +------------|----------------------|--|----------+
         |VFIO        |                      |  |          |
         |            v                      |  v          |
         | +--------------------+      +-----+-----------+ |
DMA      | | vfio iommu driver  |      | vfio bus driver | |
API <------+                    |      |                 | |
Layer    | |  e.g. vfio_type2   |      |  e.g. vfio_vgpu | |
         | +--------------------+      +-----------------+ |
         |         |  ^                      |  ^          |
         +---------|--|----------------------|--|----------+
                   |  |                      |  |
                   |  |                      v  |
         +---------|--|----------+   +---------------------+
         | +-------v-----------+ |   |                     |
         | |                   | |   |                     |
         | |      KVMGT        | |   |                     |
         | |                   | |   |   host gfx driver   |
         | +-------------------+ |   |                     |
         |                       |   |                     |
         |    KVM hypervisor     |   |                     |
         +-----------------------+   +---------------------+

        NOTE    vfio_type2 and vfio_vgpu are only *logically* parts
                of VFIO; they may be implemented in the KVM hypervisor
                or the host gfx driver.



Here we need to implement a new vfio IOMMU driver instead of type1;
let's call it vfio_type2 temporarily. The main difference from pcidev
assignment is that a vGPU doesn't have its own DMA requester id, so it
has to share mappings with the host and other vGPUs.

        - the type1 iommu driver maps gpa to hpa for pass-through,
          whereas type2 maps iova to hpa;

        - a hardware iommu is always needed by type1, whereas for
          type2 a hardware iommu is optional;

        - type1 invokes the low-level IOMMU API (iommu_map et al.) to
          set up IOMMU page tables directly, whereas type2 doesn't; it
          only needs to invoke a higher-level DMA API like dma_map_page
          (see the sketch below);
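
To make the last point concrete, here is how the two map paths might
contrast in the kernel (a sketch only; the function names are
illustrative, not existing code):

    /* type1: pin the page, then program the hardware IOMMU directly */
    static int type1_map_one(struct iommu_domain *domain,
                             unsigned long iova, struct page *page)
    {
            return iommu_map(domain, iova, page_to_phys(page),
                             PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
    }

    /* type2: no per-vGPU IOMMU domain; go through the DMA API, which
     * works the same with or without a hardware iommu behind it */
    static dma_addr_t type2_map_one(struct device *dev, struct page *page)
    {
            return dma_map_page(dev, page, 0, PAGE_SIZE,
                                DMA_BIDIRECTIONAL);
    }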


We also need to implement a new 'bus' driver instead of vfio_pci;
let's call it vfio_vgpu temporarily (a registration sketch follows
the list):

        - vfio_pci is a real pci driver with a probe method called
          during device attach, whereas vfio_vgpu is a pseudo
          driver: it won't attach to any device - the GPU is always
          owned by the host gfx driver. It has to do its 'probing'
          elsewhere, but still in the host gfx driver attached to
          the device;

        - a pcidev (PF or VF) attached to vfio_pci has a natural path
          in sysfs, whereas a vgpu is purely a software concept:
          vfio_vgpu needs to create/destroy vgpu instances,
          maintain their paths in sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"),
          etc. Something should be added in a higher layer
          to do this (VFIO or DRM);

        - vfio_pci in most cases will allow QEMU to access pcidev
          hardware, whereas vfio_vgpu provides access to virtual
          resources emulated by another device model;

        - vfio_pci will inject an IRQ to the guest only when a
          physical IRQ is generated, whereas vfio_vgpu may inject an
          IRQ for emulation purposes. Either way, they can share the
          same injection interface;
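
A rough sketch of the registration side of such a pseudo driver,
using the existing vfio_add_group_dev() interface (the vgpu_* names
are placeholders, not existing code):

    #include <linux/vfio.h>

    static const struct vfio_device_ops vfio_vgpu_ops = {
            .name    = "vfio-vgpu",
            .open    = vgpu_open,      /* QEMU opened the device fd */
            .release = vgpu_release,
            .read    = vgpu_read,      /* routed to the device model */
            .write   = vgpu_write,
            .ioctl   = vgpu_ioctl,     /* REGION_INFO, IRQ_INFO, ... */
            .mmap    = vgpu_mmap,
    };

    /* called from the host gfx driver when a vgpu instance is
     * created, instead of from a PCI probe routine */
    static int vgpu_register(struct vgpu *vgpu)
    {
            return vfio_add_group_dev(&vgpu->dev, &vfio_vgpu_ops, vgpu);
    }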


Questions:

        [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
            upstream in ae5515d66362 (Revert: "vfio: Include No-IOMMU mode").
            In my opinion, vfio_type2 doesn't rely on it to support the
            No-IOMMU case; instead it needs a new implementation which fits
            both w/ and w/o an IOMMU. Is this correct?


For things not mentioned above, we can discuss them in other
threads, or temporarily keep them in a TODO list (we might get
back to them after the big picture gets agreed):


        - How to expose guest framebuffer via VFIO for SPICE;

        - How to avoid double translation with the two stages, GTT +
          IOMMU: whether an identity map is possible, and if yes, how
          to make it more effective;

        - Application acceleration
          You mentioned that with VFIO, a vGPU may be used by
          applications to get GPU acceleration. It's a potential
          opportunity to use vGPU for container usage, worthy of
          further investigation.





--
Thanks,
Jike


* Re: VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
From: Alex Williamson @ 2016-01-18  4:47 UTC (permalink / raw)
  To: Jike Song
  Cc: Ruan, Shuai, Tian, Kevin, kvm, igvt-g@lists.01.org, qemu-devel,
	Gerd Hoffmann, Paolo Bonzini, Zhiyuan Lv

Hi Jike,

On Mon, 2016-01-18 at 10:39 +0800, Jike Song wrote:
> Hi Alex, let's continue with a new thread :)
> 
> Basically we agree with you: exposing vGPU via VFIO can make
> QEMU share as much code as possible with pcidev (PF or VF) assignment.
> And yes, different vGPU vendors can share quite a lot of the
> QEMU part, which will benefit upper layers such as libvirt.
> 
> 
> To achieve this, there is quite a lot to do; I'll summarize
> it below. I have been diving into VFIO for a while but may still
> have misunderstood things, so please correct me :)
> 
> 
> 
> First, let me illustrate my understanding of the current VFIO
> framework used to pass a pcidev through to a guest:
> 
> 
>                  +----------------------------------+
>                  |            vfio qemu             |
>                  +-----+------------------------+---+
>                        |DMA                  ^  |CFG
> QEMU                   |map               IRQ|  |
> -----------------------|---------------------|--|-----------
> KERNEL    +------------|---------------------|--|----------+
>           | VFIO       |                     |  |          |
>           |            v                     |  v          |
>           |  +-------------------+     +-----+-----------+ |
> IOMMU     |  | vfio iommu driver |     | vfio bus driver | |
> API  <-------+                   |     |                 | |
> Layer     |  | e.g. type1        |     | e.g. vfio_pci   | |
>           |  +-------------------+     +-----------------+ |
>           +------------------------------------------------+
> 
> 
> Here, when a particular pcidev is passed through to a KVM guest,
> it is attached to the vfio_pci driver in the host, and guest memory
> is mapped into the IOMMU via the type1 iommu driver.
> 
> 
> Then, the draft infrastructure of a future VFIO-based vgpu:
> 
> 
> 
>                  +-------------------------------------+
>                  |              vfio qemu              |
>                  +----+-------------------------+------+
>                       |DMA                   ^  |CFG
> QEMU                  |map                IRQ|  |
> ----------------------|----------------------|--|-----------
> KERNEL                |                      |  |
>          +------------|----------------------|--|----------+
>          |VFIO        |                      |  |          |
>          |            v                      |  v          |
>          | +--------------------+      +-----+-----------+ |
> DMA      | | vfio iommu driver  |      | vfio bus driver | |
> API <------+                    |      |                 | |
> Layer    | |  e.g. vfio_type2   |      |  e.g. vfio_vgpu | |
>          | +--------------------+      +-----------------+ |
>          |         |  ^                      |  ^          |
>          +---------|--|----------------------|--|----------+
>                    |  |                      |  |
>                    |  |                      v  |
>          +---------|--|----------+   +---------------------+
>          | +-------v-----------+ |   |                     |
>          | |                   | |   |                     |
>          | |      KVMGT        | |   |                     |
>          | |                   | |   |   host gfx driver   |
>          | +-------------------+ |   |                     |
>          |                       |   |                     |
>          |    KVM hypervisor     |   |                     |
>          +-----------------------+   +---------------------+
> 
>         NOTE    vfio_type2 and vfio_vgpu are only *logically* parts
>                 of VFIO; they may be implemented in the KVM hypervisor
>                 or the host gfx driver.
> 
> 
> 
> Here we need to implement a new vfio IOMMU driver instead of type1;
> let's call it vfio_type2 temporarily. The main difference from pcidev
> assignment is that a vGPU doesn't have its own DMA requester id, so it
> has to share mappings with the host and other vGPUs.
> 
>         - the type1 iommu driver maps gpa to hpa for pass-through,
>           whereas type2 maps iova to hpa;
>
>         - a hardware iommu is always needed by type1, whereas for
>           type2 a hardware iommu is optional;
>
>         - type1 invokes the low-level IOMMU API (iommu_map et al.) to
>           set up IOMMU page tables directly, whereas type2 doesn't; it
>           only needs to invoke a higher-level DMA API like dma_map_page;

Yes, the current type1 implementation is not compatible with vgpu since
there are no separate requester IDs on the bus and you probably don't
want or need to pin all of guest memory like we do for direct
assignment.  However, let's separate the type1 user API from the
current implementation.  It's quite easy within the vfio code to
consider "type1" to be an API specification that may have multiple
implementations.  A minor code change would allow us to continue
looking for compatible iommu backends if the group we're trying to
attach is rejected.  The benefit here is that QEMU could work
unmodified, using the type1 vfio-iommu API regardless of whether a
device is directly assigned or virtual.
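
To be concrete, the vfio core already accepts iommu backends through a
common ops structure, so a vgpu backend could register alongside type1
and answer the same user API. A sketch, assuming a hypothetical
vfio_iommu_vgpu implementation:

    static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_ops = {
            .name         = "vfio-iommu-vgpu",
            .owner        = THIS_MODULE,
            .open         = vfio_iommu_vgpu_open,
            .release      = vfio_iommu_vgpu_release,
            .ioctl        = vfio_iommu_vgpu_ioctl,  /* same type1 ioctls */
            .attach_group = vfio_iommu_vgpu_attach_group,
            .detach_group = vfio_iommu_vgpu_detach_group,
    };

    static int __init vfio_iommu_vgpu_init(void)
    {
            return vfio_register_iommu_driver(&vfio_iommu_vgpu_ops);
    }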

Let's look at the type1 interface; we have simple map and unmap
interfaces which map and unmap process virtual address space (vaddr) to
the device address space (iova).  The host physical address is obtained
by pinning the vaddr.  In the current implementation, a map operation
pins pages and populates the hardware iommu.  A vgpu compatible
implementation might simply register the translation into a kernel-
based database to be called upon later.  When the host graphics driver
needs to enable dma for the vgpu, it doesn't need to go to QEMU for the
translation, it already possesses the iova to vaddr mapping, which
becomes iova to hpa after a pinning operation.
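
Something along these lines, where a map merely records the
translation and pinning is deferred until the graphics driver
actually needs it (all names here are hypothetical):

    struct vgpu_dma {
            struct rb_node  node;   /* keyed by iova */
            dma_addr_t      iova;
            unsigned long   vaddr;
            size_t          size;
    };

    /* called by the host gfx driver to translate and pin one page;
     * assumes we can operate on the QEMU mm that created the mapping */
    static int vgpu_pin_page(struct vgpu_iommu *iommu, dma_addr_t iova,
                             struct page **page)
    {
            struct vgpu_dma *dma = vgpu_dma_find(iommu, iova); /* rb lookup */
            unsigned long vaddr;

            if (!dma)
                    return -EINVAL;

            vaddr = dma->vaddr + (iova - dma->iova);
            return get_user_pages_fast(vaddr, 1, 1 /* write */, page);
    }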

So, I would encourage you to look at creating a vgpu vfio iommu
backend that makes use of the type1 API since it will reduce the
changes necessary for userspace.

> We also need to implement a new 'bus' driver instead of vfio_pci;
> let's call it vfio_vgpu temporarily:
> 
>         - vfio_pci is a real pci driver with a probe method called
>           during device attach, whereas vfio_vgpu is a pseudo
>           driver: it won't attach to any device - the GPU is always
>           owned by the host gfx driver. It has to do its 'probing'
>           elsewhere, but still in the host gfx driver attached to
>           the device;
> 
>         - a pcidev (PF or VF) attached to vfio_pci has a natural path
>           in sysfs, whereas a vgpu is purely a software concept:
>           vfio_vgpu needs to create/destroy vgpu instances,
>           maintain their paths in sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"),
>           etc. Something should be added in a higher layer
>           to do this (VFIO or DRM);
> 
>         - vfio_pci in most cases will allow QEMU to access pcidev
>           hardware, whereas vfio_vgpu provides access to virtual
>           resources emulated by another device model;
> 
>         - vfio_pci will inject an IRQ to the guest only when a
>           physical IRQ is generated, whereas vfio_vgpu may inject an
>           IRQ for emulation purposes. Either way, they can share the
>           same injection interface;

Here too, I think you're making assumptions based on an implementation
path.  Personally, I think each vgpu should be a struct device and that
an iommu group should be created for each.  I think this is a valid
abstraction; dma isolation is provided through something other than a
system-level iommu, but it's still provided.  Without this, the entire
vfio core would need to be aware of vgpu, since the core operates on
devices and groups.  I believe creating a struct device also gives you
basic probe and release support for a driver.
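
For example, creating the device and its group might look like this
(illustrative only):

    /* give each vgpu a struct device and a fresh iommu group, even
     * though isolation comes from the gfx driver, not a system iommu */
    static int vgpu_create_device(struct vgpu *vgpu, struct device *parent)
    {
            struct iommu_group *group;
            int ret;

            vgpu->dev.parent = parent;              /* the physical GPU */
            dev_set_name(&vgpu->dev, "vgpu%d", vgpu->id);
            ret = device_register(&vgpu->dev);
            if (ret)
                    return ret;

            group = iommu_group_alloc();            /* one group per vgpu */
            if (IS_ERR(group))
                    return PTR_ERR(group);

            ret = iommu_group_add_device(group, &vgpu->dev);
            iommu_group_put(group);
            return ret;
    }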

There will be a need for some sort of lifecycle management of a vgpu:
how is it created?  Destroyed?  Can it be given more or fewer resources
than other vgpus, etc.  This could be implemented in sysfs for each
physical gpu with vgpu support, sort of like how we support sr-iov now,
where the PF exports controls for creating VFs.  The more commonality we
can get for lifecycle and device access for userspace, the better.
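
For instance, a sysfs attribute on the physical GPU, analogous to
sriov_numvfs on an SR-IOV PF (the attribute name is hypothetical):

    /* usage: echo <vgpu spec> > /sys/bus/pci/devices/<gpu>/vgpu_create */
    static ssize_t vgpu_create_store(struct device *dev,
                                     struct device_attribute *attr,
                                     const char *buf, size_t count)
    {
            /* parse a vgpu type / resource spec and instantiate one,
             * much as writing sriov_numvfs triggers VF creation */
            return count;
    }
    static DEVICE_ATTR_WO(vgpu_create);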

As for virtual vs physical resources and interrupts, part of the
purpose of vfio is to abstract a device into basic components.  It's up
to the bus driver how accesses to each space map to the physical
device.  Take PCI config space, for instance: the existing vfio-pci
driver emulates some portions of config space for the user.
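
In other words, the bus driver's handlers are free to dispatch per
region; a sketch (the vgpu_* helpers are hypothetical; the
offset-to-index encoding mirrors what vfio-pci uses internally):

    static ssize_t vgpu_read(void *device_data, char __user *buf,
                             size_t count, loff_t *ppos)
    {
            struct vgpu *vgpu = device_data;

            switch (VFIO_PCI_OFFSET_TO_INDEX(*ppos)) {
            case VFIO_PCI_CONFIG_REGION_INDEX:
                    /* fully emulated, never touches the hardware */
                    return vgpu_cfg_read(vgpu, buf, count, ppos);
            case VFIO_PCI_BAR0_REGION_INDEX:
                    /* MMIO trapped and emulated by the device model */
                    return vgpu_mmio_read(vgpu, buf, count, ppos);
            default:
                    return -EINVAL;
            }
    }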

> Questions:
> 
>         [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
>             upstream in ae5515d66362 (Revert: "vfio: Include No-IOMMU mode").
>             In my opinion, vfio_type2 doesn't rely on it to support the
>             No-IOMMU case; instead it needs a new implementation which fits
>             both w/ and w/o an IOMMU. Is this correct?
> 

vfio no-iommu has also been re-added for v4.5 (03a76b60f8ba); this was
simply a case where the kernel development outpaced the intended user
and I didn't want to commit to the user API changes until it had been
completely vetted.  In any case, vgpu should have no dependency
whatsoever on no-iommu.  As above, I think vgpu should create virtual
devices and add them to an iommu group, similar to how no-iommu does,
but without the kernel tainting because you are actually providing
isolation through other means than a system iommu.

> For things not mentioned above, we can discuss them in other
> threads, or temporarily keep them in a TODO list (we might get
> back to them after the big picture gets agreed):
> 
> 
>         - How to expose guest framebuffer via VFIO for SPICE;

Potentially through a new, device specific region, which I think can be
done within the existing vfio API.  The API can already expose an
arbitrary number of regions to the user, it's just a matter of how we
tell the user the purpose of a region index beyond the fixed set we map
to PCI resources.
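
From the user's side, discovery could stay exactly as it is today;
a sketch:

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static void enumerate_regions(int device_fd)
    {
            struct vfio_device_info info = { .argsz = sizeof(info) };
            unsigned int i;

            ioctl(device_fd, VFIO_DEVICE_GET_INFO, &info);

            /* indexes below VFIO_PCI_NUM_REGIONS are the fixed PCI set;
             * anything beyond could be device specific, e.g. a guest
             * framebuffer region for SPICE; the open question is how we
             * describe the purpose of such an index to the user */
            for (i = VFIO_PCI_NUM_REGIONS; i < info.num_regions; i++) {
                    struct vfio_region_info reg = {
                            .argsz = sizeof(reg),
                            .index = i,
                    };

                    ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
            }
    }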

>         - How to avoid double translation with the two stages, GTT +
>           IOMMU: whether an identity map is possible, and if yes, how
>           to make it more effective;
> 
>         - Application acceleration
>           You mentioned that with VFIO, a vGPU may be used by
>           applications to get GPU acceleration. It's a potential
>           opportunity to use vGPU for container usage, worthy of
>           further investigation.

Yes, interesting topics.  Thanks,

Alex


* Re: VFIO based vGPU (was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
From: Jike Song @ 2016-01-18  8:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Gerd Hoffmann, Paolo Bonzini, Tian, Kevin, Zhiyuan Lv, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org

On 01/18/2016 12:47 PM, Alex Williamson wrote:
> Hi Jike,
> 
> On Mon, 2016-01-18 at 10:39 +0800, Jike Song wrote:
>> Hi Alex, let's continue with a new thread :)
>>
>> Basically we agree with you: exposing vGPU via VFIO can make
>> QEMU share as much code as possible with pcidev (PF or VF) assignment.
>> And yes, different vGPU vendors can share quite a lot of the
>> QEMU part, which will benefit upper layers such as libvirt.
>>
>>
>> To achieve this, there is quite a lot to do; I'll summarize
>> it below. I have been diving into VFIO for a while but may still
>> have misunderstood things, so please correct me :)
>>
>>
>>
>> First, let me illustrate my understanding of the current VFIO
>> framework used to pass a pcidev through to a guest:
>>
>>
>>                  +----------------------------------+
>>                  |            vfio qemu             |
>>                  +-----+------------------------+---+
>>                        |DMA                  ^  |CFG
>> QEMU                   |map               IRQ|  |
>> -----------------------|---------------------|--|-----------
>> KERNEL    +------------|---------------------|--|----------+
>>           | VFIO       |                     |  |          |
>>           |            v                     |  v          |
>>           |  +-------------------+     +-----+-----------+ |
>> IOMMU     |  | vfio iommu driver |     | vfio bus driver | |
>> API  <-------+                   |     |                 | |
>> Layer     |  | e.g. type1        |     | e.g. vfio_pci   | |
>>           |  +-------------------+     +-----------------+ |
>>           +------------------------------------------------+
>>
>>
>> Here, when a particular pcidev is passed through to a KVM guest,
>> it is attached to the vfio_pci driver in the host, and guest memory
>> is mapped into the IOMMU via the type1 iommu driver.
>>
>>
>> Then, the draft infrastructure of a future VFIO-based vgpu:
>>
>>
>>
>>                  +-------------------------------------+
>>                  |              vfio qemu              |
>>                  +----+-------------------------+------+
>>                       |DMA                   ^  |CFG
>> QEMU                  |map                IRQ|  |
>> ----------------------|----------------------|--|-----------
>> KERNEL                |                      |  |
>>          +------------|----------------------|--|----------+
>>          |VFIO        |                      |  |          |
>>          |            v                      |  v          |
>>          | +--------------------+      +-----+-----------+ |
>> DMA      | | vfio iommu driver  |      | vfio bus driver | |
>> API <------+                    |      |                 | |
>> Layer    | |  e.g. vfio_type2   |      |  e.g. vfio_vgpu | |
>>          | +--------------------+      +-----------------+ |
>>          |         |  ^                      |  ^          |
>>          +---------|--|----------------------|--|----------+
>>                    |  |                      |  |
>>                    |  |                      v  |
>>          +---------|--|----------+   +---------------------+
>>          | +-------v-----------+ |   |                     |
>>          | |                   | |   |                     |
>>          | |      KVMGT        | |   |                     |
>>          | |                   | |   |   host gfx driver   |
>>          | +-------------------+ |   |                     |
>>          |                       |   |                     |
>>          |    KVM hypervisor     |   |                     |
>>          +-----------------------+   +---------------------+
>>
>>         NOTE    vfio_type2 and vfio_vgpu are only *logically* parts
>>                 of VFIO; they may be implemented in the KVM hypervisor
>>                 or the host gfx driver.
>>
>>
>>
>> Here we need to implement a new vfio IOMMU driver instead of type1;
>> let's call it vfio_type2 temporarily. The main difference from pcidev
>> assignment is that a vGPU doesn't have its own DMA requester id, so it
>> has to share mappings with the host and other vGPUs.
>>
>>         - the type1 iommu driver maps gpa to hpa for pass-through,
>>           whereas type2 maps iova to hpa;
>>
>>         - a hardware iommu is always needed by type1, whereas for
>>           type2 a hardware iommu is optional;
>>
>>         - type1 invokes the low-level IOMMU API (iommu_map et al.) to
>>           set up IOMMU page tables directly, whereas type2 doesn't; it
>>           only needs to invoke a higher-level DMA API like dma_map_page;
> 
> Yes, the current type1 implementation is not compatible with vgpu since
> there are no separate requester IDs on the bus and you probably don't
> want or need to pin all of guest memory like we do for direct
> assignment.  However, let's separate the type1 user API from the
> current implementation.  It's quite easy within the vfio code to
> consider "type1" to be an API specification that may have multiple
> implementations.  A minor code change would allow us to continue
> looking for compatible iommu backends if the group we're trying to
> attach is rejected.

Would you elaborate a bit on 'iommu backends' here? Previously
I thought the entire type1 would be duplicated. If not, what is
supposed to be added, a new vfio_dma_do_map?

> The benefit here is that QEMU could work
> unmodified, using the type1 vfio-iommu API regardless of whether a
> device is directly assigned or virtual.
> 
> Let's look at the type1 interface; we have simple map and unmap
> interfaces which map and unmap process virtual address space (vaddr) to
> the device address space (iova).  The host physical address is obtained
> by pinning the vaddr.  In the current implementation, a map operation
> pins pages and populates the hardware iommu.  A vgpu compatible
> implementation might simply register the translation into a kernel-
> based database to be called upon later.  When the host graphics driver
> needs to enable dma for the vgpu, it doesn't need to go to QEMU for the
> translation, it already possesses the iova to vaddr mapping, which
> becomes iova to hpa after a pinning operation.
> 
> So, I would encourage you to look at creating a vgpu vfio iommu
> backend that makes use of the type1 API since it will reduce the
> changes necessary for userspace.
> 

Yes, keeping the type1 API sounds like a great idea.

>> We also need to implement a new 'bus' driver instead of vfio_pci;
>> let's call it vfio_vgpu temporarily:
>>
>>         - vfio_pci is a real pci driver with a probe method called
>>           during device attach, whereas vfio_vgpu is a pseudo
>>           driver: it won't attach to any device - the GPU is always
>>           owned by the host gfx driver. It has to do its 'probing'
>>           elsewhere, but still in the host gfx driver attached to
>>           the device;
>>
>>         - a pcidev (PF or VF) attached to vfio_pci has a natural path
>>           in sysfs, whereas a vgpu is purely a software concept:
>>           vfio_vgpu needs to create/destroy vgpu instances,
>>           maintain their paths in sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"),
>>           etc. Something should be added in a higher layer
>>           to do this (VFIO or DRM);
>>
>>         - vfio_pci in most cases will allow QEMU to access pcidev
>>           hardware, whereas vfio_vgpu provides access to virtual
>>           resources emulated by another device model;
>>
>>         - vfio_pci will inject an IRQ to the guest only when a
>>           physical IRQ is generated, whereas vfio_vgpu may inject an
>>           IRQ for emulation purposes. Either way, they can share the
>>           same injection interface;
> 
> Here too, I think you're making assumptions based on an implementation
> path.  Personally, I think each vgpu should be a struct device and that
> an iommu group should be created for each.  I think this is a valid
> abstraction; dma isolation is provided through something other than a
> system-level iommu, but it's still provided.  Without this, the entire
> vfio core would need to be aware of vgpu, since the core operates on
> devices and groups.  I believe creating a struct device also gives you
> basic probe and release support for a driver.
> 

Indeed.
BTW, that should be done in the 'bus' driver, right?

> There will be a need for some sort of lifecycle management of a vgpu:
> how is it created?  Destroyed?  Can it be given more or fewer resources
> than other vgpus, etc.  This could be implemented in sysfs for each
> physical gpu with vgpu support, sort of like how we support sr-iov now,
> where the PF exports controls for creating VFs.  The more commonality we
> can get for lifecycle and device access for userspace, the better.
> 

Will have a look at the VF management, thanks for the info.

> As for virtual vs physical resources and interrupts, part of the
> purpose of vfio is to abstract a device into basic components.  It's up
> to the bus driver how accesses to each space map to the physical
> device.  Take PCI config space, for instance: the existing vfio-pci
> driver emulates some portions of config space for the user.
> 
>> Questions:
>>
>>         [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
>>             upstream in ae5515d66362 (Revert: "vfio: Include No-IOMMU mode").
>>             In my opinion, vfio_type2 doesn't rely on it to support the
>>             No-IOMMU case; instead it needs a new implementation which fits
>>             both w/ and w/o an IOMMU. Is this correct?
>>
> 
> vfio no-iommu has also been re-added for v4.5 (03a76b60f8ba); this was
> simply a case where the kernel development outpaced the intended user
> and I didn't want to commit to the user API changes until it had been
> completely vetted.  In any case, vgpu should have no dependency
> whatsoever on no-iommu.  As above, I think vgpu should create virtual
> devices and add them to an iommu group, similar to how no-iommu does,
> but without the kernel tainting because you are actually providing
> isolation through other means than a system iommu.
> 

Thanks for the confirmation.

>> For things not mentioned above, we can discuss them in other
>> threads, or temporarily keep them in a TODO list (we might get
>> back to them after the big picture gets agreed):
>>
>>
>>         - How to expose guest framebuffer via VFIO for SPICE;
> 
> Potentially through a new, device specific region, which I think can be
> done within the existing vfio API.  The API can already expose an
> arbitrary number of regions to the user, it's just a matter of how we
> tell the user the purpose of a region index beyond the fixed set we map
> to PCI resources.
> 
>>         - How to avoid double translation with the two stages, GTT +
>>           IOMMU: whether an identity map is possible, and if yes, how
>>           to make it more effective;
>>
>>         - Application acceleration
>>           You mentioned that with VFIO, a vGPU may be used by
>>           applications to get GPU acceleration. It's a potential
>>           opportunity to use vGPU for container usage, worthy of
>>           further investigation.
> 
> Yes, interesting topics.  Thanks,
> 

Looks like things are getting clearer overall, with small exceptions.
Thanks for the advice :)


> Alex
> 

--
Thanks,
Jike


* Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
@ 2016-01-18  8:56     ` Jike Song
  0 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-18  8:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ruan, Shuai, Tian, Kevin, kvm, igvt-g@lists.01.org, qemu-devel,
	Gerd Hoffmann, Paolo Bonzini, Zhiyuan Lv

On 01/18/2016 12:47 PM, Alex Williamson wrote:
> Hi Jike,
> 
> On Mon, 2016-01-18 at 10:39 +0800, Jike Song wrote:
>> Hi Alex, let's continue with a new thread :)
>>
>> Basically we agree with you: exposing vGPU via VFIO can make
>> QEMU share as much code as possible with pcidev(PF or VF) assignment.
>> And yes, different vGPU vendors can share quite a lot of the
>> QEMU part, which will do good for upper layers such as libvirt.
>>
>>
>> To achieve this, there are quite a lot to do, I'll summarize
>> it below. I dived into VFIO for a while but still may have
>> things misunderstood, so please correct me :)
>>
>>
>>
>> First, let me illustrate my understanding of current VFIO
>> framework used to pass through a pcidev to guest:
>>
>>
>>                  +----------------------------------+
>>                  |            vfio qemu             |
>>                  +-----+------------------------+---+
>>                        |DMA                  ^  |CFG
>> QEMU                   |map               IRQ|  |
>> -----------------------|---------------------|--|-----------
>> KERNEL    +------------|---------------------|--|----------+
>>           | VFIO       |                     |  |          |
>>           |            v                     |  v          |
>>           |  +-------------------+     +-----+-----------+ |
>> IOMMU     |  | vfio iommu driver |     | vfio bus driver | |
>> API  <-------+                   |     |                 | |
>> Layer     |  | e.g. type1        |     | e.g. vfio_pci   | |
>>           |  +-------------------+     +-----------------+ |
>>           +------------------------------------------------+
>>
>>
>> Here when a particular pcidev is passed-through to a KVM guest,
>> it is attached to vfio_pci driver in host, and guest memory
>> is mapped into IOMMU via the type1 iommu driver.
>>
>>
>> Then, the draft infrastructure of future VFIO-based vgpu:
>>
>>
>>
>>                  +-------------------------------------+
>>                  |              vfio qemu              |
>>                  +----+-------------------------+------+
>>                       |DMA                   ^  |CFG
>> QEMU                  |map                IRQ|  |
>> ----------------------|----------------------|--|-----------
>> KERNEL                |                      |  |
>>          +------------|----------------------|--|----------+
>>          |VFIO        |                      |  |          |
>>          |            v                      |  v          |
>>          | +--------------------+      +-----+-----------+ |
>> DMA      | | vfio iommu driver  |      | vfio bus driver | |
>> API <------+                    |      |                 | |
>> Layer    | |  e.g. vfio_type2   |      |  e.g. vfio_vgpu | |
>>          | +--------------------+      +-----------------+ |
>>          |         |  ^                      |  ^          |
>>          +---------|--|----------------------|--|----------+
>>                    |  |                      |  |
>>                    |  |                      v  |
>>          +---------|--|----------+   +---------------------+
>>          | +-------v-----------+ |   |                     |
>>          | |                   | |   |                     |
>>          | |      KVMGT        | |   |                     |
>>          | |                   | |   |   host gfx driver   |
>>          | +-------------------+ |   |                     |
>>          |                       |   |                     |
>>          |    KVM hypervisor     |   |                     |
>>          +-----------------------+   +---------------------+
>>
>>         NOTE    vfio_type2 and vfio_vgpu are only *logically* parts
>>                 of VFIO, they may be implemented in KVM hypervisor
>>                 or host gfx driver.
>>
>>
>>
>> Here we need to implement a new vfio IOMMU driver instead of type1,
>> let's call it vfio_type2 temporarily. The main difference from pcidev
>> assignment is, vGPU doesn't have its own DMA requester id, so it has
>> to share mappings with host and other vGPUs.
>>
>>         - type1 iommu driver maps gpa to hpa for passing through;
>>           whereas type2 maps iova to hpa;
>>
>>         - hardware iommu is always needed by type1, whereas for
>>           type2, hardware iommu is optional;
>>
>>         - type1 will invoke low-level IOMMU API (iommu_map et al) to
>>           setup IOMMU page table directly, whereas type2 dosen't (only
>>           need to invoke higher level DMA API like dma_map_page);
> 
> Yes, the current type1 implementation is not compatible with vgpu since
> there are not separate requester IDs on the bus and you probably don't
> want or need to pin all of guest memory like we do for direct
> assignment.  However, let's separate the type1 user API from the
> current implementation.  It's quite easy within the vfio code to
> consider "type1" to be an API specification that may have multiple
> implementations.  A minor code change would allow us to continue
> looking for compatible iommu backends if the group we're trying to
> attach is rejected.

Would you elaborate a bit about 'iommu backends' here? Previously
I thought that entire type1 will be duplicated. If not, what is supposed
to add, a new vfio_dma_do_map?

> The benefit here is that QEMU could work
> unmodified, using the type1 vfio-iommu API regardless of whether a
> device is directly assigned or virtual.
> 
> Let's look at the type1 interface; we have simple map and unmap
> interfaces which map and unmap process virtual address space (vaddr) to
> the device address space (iova).  The host physical address is obtained
> by pinning the vaddr.  In the current implementation, a map operation
> pins pages and populates the hardware iommu.  A vgpu compatible
> implementation might simply register the translation into a kernel-
> based database to be called upon later.  When the host graphics driver
> needs to enable dma for the vgpu, it doesn't need to go to QEMU for the
> translation, it already possesses the iova to vaddr mapping, which
> becomes iova to hpa after a pinning operation.
> 
> So, I would encourage you to look at creating a vgpu vfio iommu
> backened that makes use of the type1 api since it will reduce the
> changes necessary for userspace.
> 

Yes, keeping type1 API sounds a great idea.

>> We also need to implement a new 'bus' driver instead of vfio_pci,
>> let's call it vfio_vgpu temporarily:
>>
>>         - vfio_pci is a real pci driver, it has a probe method called
>>           during dev attaching; whereas the vfio_vgpu is a pseudo
>>           driver, it won't attach any devivce - the GPU is always owned by
>>           host gfx driver. It has to do 'probing' elsewhere, but
>>           still in host gfx driver attached to the device;
>>
>>         - pcidev(PF or VF) attached to vfio_pci has a natural path
>>           in sysfs; whereas vgpu is purely a software concept:
>>           vfio_vgpu needs to create create/destory vgpu instances,
>>           maintain their paths in sysfs (e.g. "/sys/class/vgpu/intel/vgpu0")
>>           etc. There should be something added in a higher layer
>>           to do this (VFIO or DRM).
>>
>>         - vfio_pci in most case will allow QEMU to access pcidev
>>           hardware; whereas vfio_vgpu is to access virtual resource
>>           emulated by another device model;
>>
>>         - vfio_pci will inject an IRQ to guest only when physical IRQ
>>           generated; whereas vfio_vgpu may inject an IRQ for emulation
>>           purpose. Anyway they can share the same injection interface;
> 
> Here too, I think you're making assumptions based on an implementation
> path.  Personally, I think each vgpu should be a struct device and that
> an iommu group should be created for each.  I think this is a valid
> abstraction; dma isolation is provided through something other than a
> system-level iommu, but it's still provided.  Without this, the entire
> vfio core would need to be aware of vgpu, since the core operates on
> devices and groups.  I believe creating a struct device also gives you
> basic probe and release support for a driver.
> 

Indeed.
BTW, that should be done in the 'bus' driver, right?

> There will be a need for some sort of lifecycle management of a vgpu.
>  How is it created?  Destroyed?  Can it be given more or less resources
> than other vgpus, etc.  This could be implemented in sysfs for each
> physical gpu with vgpu support, sort of like how we support sr-iov now,
> the PF exports controls for creating VFs.  The more commonality we can
> get for lifecycle and device access for userspace, the better.
> 

Will have a look at the VF managements, thanks for the info.

> As for virtual vs physical resources and interrupts, part of the
> purpose of vfio is to abstract a device into basic components.  It's up
> to the bus driver how accesses to each space map to the physical
> device.  Take for instance PCI config space, the existing vfio-pci
> driver emulates some portions of config space for the user.
> 
>> Questions:
>>
>>         [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
>>             in upstream ae5515d66362 (Revert: "vfio: Include No-IOMMU mode").
>>             In my opinion, vfio_type2 doesn't rely on it to support No-IOMMU
>>             case, instead it needs a new implementation which fits both
>>             w/ and w/o IOMMU. Is this correct?
>>
> 
> vfio no-iommu has also been re-added for v4.5 (03a76b60f8ba), this was
> simply a case that the kernel development outpaced the intended user
> and I didn't want to commit to the user api changes until it had been
> completely vetted.  In any case, vgpu should have no dependency
> whatsoever on no-iommu.  As above, I think vgpu should create virtual
> devices and add them to an iommu group, similar to how no-iommu does,
> but without the kernel tainting because you are actually providing
> isolation through other means than a system iommu.
> 

Thanks for confirmation.

>> For things not mentioned above, we might have them discussed in
>> other threads, or temporarily maintained in a TODO list (we might get
>> back to them after the big picture gets agreed):
>>
>>
>>         - How to expose guest framebuffer via VFIO for SPICE;
> 
> Potentially through a new, device specific region, which I think can be
> done within the existing vfio API.  The API can already expose an
> arbitrary number of regions to the user, it's just a matter of how we
> tell the user the purpose of a region index beyond the fixed set we map
> to PCI resources.
> 
>>         - How to avoid double translation with two-stage: GTT + IOMMU,
>>           whether identity map is possible, and if yes, how to make it
>>           more effective;
>>
>>         - Application acceleration
>>           You mentioned that with VFIO, a vGPU may be used by
>>           applications to get GPU acceleration. It's a potential
>>           opportunity to use vGPU for container usage, worthy of
>>           further investigation.
> 
> Yes, interesting topics.  Thanks,
> 

Looks like things are getting clearer overall, with small exceptions.
Thanks for the advice :)


> Alex
> 

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-18  8:56     ` [Qemu-devel] " Jike Song
@ 2016-01-18 19:05       ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-18 19:05 UTC (permalink / raw)
  To: Jike Song
  Cc: Gerd Hoffmann, Paolo Bonzini, Tian, Kevin, Zhiyuan Lv, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org

On Mon, 2016-01-18 at 16:56 +0800, Jike Song wrote:
> On 01/18/2016 12:47 PM, Alex Williamson wrote:
> > Hi Jike,
> > 
> > On Mon, 2016-01-18 at 10:39 +0800, Jike Song wrote:
> > > Hi Alex, let's continue with a new thread :)
> > > 
> > > Basically we agree with you: exposing vGPU via VFIO can make
> > > QEMU share as much code as possible with pcidev(PF or VF) assignment.
> > > And yes, different vGPU vendors can share quite a lot of the
> > > QEMU part, which will do good for upper layers such as libvirt.
> > > 
> > > 
> > > To achieve this, there are quite a lot to do, I'll summarize
> > > it below. I dived into VFIO for a while but still may have
> > > things misunderstood, so please correct me :)
> > > 
> > > 
> > > 
> > > First, let me illustrate my understanding of current VFIO
> > > framework used to pass through a pcidev to guest:
> > > 
> > > 
> > >                  +----------------------------------+
> > >                  |            vfio qemu             |
> > >                  +-----+------------------------+---+
> > >                        |DMA                  ^  |CFG
> > > QEMU                   |map               IRQ|  |
> > > -----------------------|---------------------|--|-----------
> > > KERNEL    +------------|---------------------|--|----------+
> > >           | VFIO       |                     |  |          |
> > >           |            v                     |  v          |
> > >           |  +-------------------+     +-----+-----------+ |
> > > IOMMU     |  | vfio iommu driver |     | vfio bus driver | |
> > > API  <-------+                   |     |                 | |
> > > Layer     |  | e.g. type1        |     | e.g. vfio_pci   | |
> > >           |  +-------------------+     +-----------------+ |
> > >           +------------------------------------------------+
> > > 
> > > 
> > > Here when a particular pcidev is passed-through to a KVM guest,
> > > it is attached to vfio_pci driver in host, and guest memory
> > > is mapped into IOMMU via the type1 iommu driver.
> > > 
> > > 
> > > Then, the draft infrastructure of future VFIO-based vgpu:
> > > 
> > > 
> > > 
> > >                  +-------------------------------------+
> > >                  |              vfio qemu              |
> > >                  +----+-------------------------+------+
> > >                       |DMA                   ^  |CFG
> > > QEMU                  |map                IRQ|  |
> > > ----------------------|----------------------|--|-----------
> > > KERNEL                |                      |  |
> > >          +------------|----------------------|--|----------+
> > >          |VFIO        |                      |  |          |
> > >          |            v                      |  v          |
> > >          | +--------------------+      +-----+-----------+ |
> > > DMA      | | vfio iommu driver  |      | vfio bus driver | |
> > > API <------+                    |      |                 | |
> > > Layer    | |  e.g. vfio_type2   |      |  e.g. vfio_vgpu | |
> > >          | +--------------------+      +-----------------+ |
> > >          |         |  ^                      |  ^          |
> > >          +---------|--|----------------------|--|----------+
> > >                    |  |                      |  |
> > >                    |  |                      v  |
> > >          +---------|--|----------+   +---------------------+
> > >          | +-------v-----------+ |   |                     |
> > >          | |                   | |   |                     |
> > >          | |      KVMGT        | |   |                     |
> > >          | |                   | |   |   host gfx driver   |
> > >          | +-------------------+ |   |                     |
> > >          |                       |   |                     |
> > >          |    KVM hypervisor     |   |                     |
> > >          +-----------------------+   +---------------------+
> > > 
> > >         NOTE    vfio_type2 and vfio_vgpu are only *logically* parts
> > >                 of VFIO, they may be implemented in KVM hypervisor
> > >                 or host gfx driver.
> > > 
> > > 
> > > 
> > > Here we need to implement a new vfio IOMMU driver instead of type1,
> > > let's call it vfio_type2 temporarily. The main difference from pcidev
> > > assignment is, vGPU doesn't have its own DMA requester id, so it has
> > > to share mappings with host and other vGPUs.
> > > 
> > >         - type1 iommu driver maps gpa to hpa for passing through;
> > >           whereas type2 maps iova to hpa;
> > > 
> > >         - hardware iommu is always needed by type1, whereas for
> > >           type2, hardware iommu is optional;
> > > 
> > >         - type1 will invoke low-level IOMMU API (iommu_map et al) to
> > >           setup IOMMU page table directly, whereas type2 doesn't (only
> > >           need to invoke higher level DMA API like dma_map_page);
> > 
> > Yes, the current type1 implementation is not compatible with vgpu since
> > there are not separate requester IDs on the bus and you probably don't
> > want or need to pin all of guest memory like we do for direct
> > assignment.  However, let's separate the type1 user API from the
> > current implementation.  It's quite easy within the vfio code to
> > consider "type1" to be an API specification that may have multiple
> > implementations.  A minor code change would allow us to continue
> > looking for compatible iommu backends if the group we're trying to
> > attach is rejected.
> 
> Would you elaborate a bit on 'iommu backends' here? Previously
> I thought that the entire type1 would be duplicated. If not, what is supposed
> to be added, a new vfio_dma_do_map?

I don't know that you necessarily want to re-use any of the
vfio_iommu_type1.c code as-is, it's just the API that we'll want to
keep consistent so QEMU doesn't need to learn about a new iommu
backend.  Opportunities for sharing certainly may arise, you may want
to use a similar red-black tree for storing current mappings, the
pinning code may be similar, etc.  We can evaluate on a case by case
basis whether it makes sense to pull out common code for each of those.

As for an iommu backend in general, if you look at the code flow
example in Documentation/vfio.txt, the user opens a container
(/dev/vfio/vfio) and a group (/dev/vfio/$GROUPNUM).  The group is set
to associate with a container instance via VFIO_GROUP_SET_CONTAINER and
then an iommu model is set for the container with VFIO_SET_IOMMU.
 Looking at drivers/vfio/vfio.c:vfio_ioctl_set_iommu(), we look for an
iommu backend that supports the requested extension (VFIO_TYPE1_IOMMU),
call the open() callback on it and then attempt to attach the group via
the attach_group() callback.  At this latter callback, the iommu
backend can compare the device to those that it actually supports.  For
instance the existing vfio_iommu_type1 will attempt to use the IOMMU
API and should fail if the device cannot be supported with that.  The
current loop in vfio_ioctl_set_iommu() will exit in this case, but as
you can see in the code, it's easy to make it continue and look for
another iommu backend that supports the requested extension.
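
In other words, a vgpu backend would just be a second registration against
the same extension; a rough sketch (the ops struct and
vfio_register_iommu_driver() are the existing vfio core interface, the vgpu
callbacks themselves are hypothetical names):

	static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_ops = {
		.name		= "vfio-iommu-vgpu",
		.owner		= THIS_MODULE,
		.open		= vfio_iommu_vgpu_open,	/* claims VFIO_TYPE1_IOMMU */
		.release	= vfio_iommu_vgpu_release,
		.ioctl		= vfio_iommu_vgpu_ioctl, /* same type1 map/unmap uapi */
		.attach_group	= vfio_iommu_vgpu_attach_group,	/* accept vgpu devs only */
		.detach_group	= vfio_iommu_vgpu_detach_group,
	};

	static int __init vfio_iommu_vgpu_init(void)
	{
		return vfio_register_iommu_driver(&vfio_iommu_vgpu_ops);
	}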

> > The benefit here is that QEMU could work
> > unmodified, using the type1 vfio-iommu API regardless of whether a
> > device is directly assigned or virtual.
> > 
> > Let's look at the type1 interface; we have simple map and unmap
> > interfaces which map and unmap process virtual address space (vaddr) to
> > the device address space (iova).  The host physical address is obtained
> > by pinning the vaddr.  In the current implementation, a map operation
> > pins pages and populates the hardware iommu.  A vgpu compatible
> > implementation might simply register the translation into a kernel-
> > based database to be called upon later.  When the host graphics driver
> > needs to enable dma for the vgpu, it doesn't need to go to QEMU for the
> > translation, it already possesses the iova to vaddr mapping, which
> > becomes iova to hpa after a pinning operation.
> > 
> > So, I would encourage you to look at creating a vgpu vfio iommu
> > backend that makes use of the type1 api since it will reduce the
> > changes necessary for userspace.
> > 
> 
> Yes, keeping the type1 API sounds like a great idea.
> 
> > > We also need to implement a new 'bus' driver instead of vfio_pci,
> > > let's call it vfio_vgpu temporarily:
> > > 
> > >         - vfio_pci is a real pci driver, it has a probe method called
> > >           during dev attaching; whereas the vfio_vgpu is a pseudo
> > >           driver, it won't attach any device - the GPU is always owned by
> > >           the host gfx driver. It has to do 'probing' elsewhere, but
> > >           still in the host gfx driver attached to the device;
> > > 
> > >         - pcidev(PF or VF) attached to vfio_pci has a natural path
> > >           in sysfs; whereas vgpu is purely a software concept:
> > >           vfio_vgpu needs to create/destroy vgpu instances,
> > >           maintain their paths in sysfs (e.g. "/sys/class/vgpu/intel/vgpu0")
> > >           etc. There should be something added in a higher layer
> > >           to do this (VFIO or DRM).
> > > 
> > >         - vfio_pci in most cases will allow QEMU to access pcidev
> > >           hardware; whereas vfio_vgpu is to access virtual resource
> > >           emulated by another device model;
> > > 
> > >         - vfio_pci will inject an IRQ to guest only when a physical IRQ
> > >           is generated; whereas vfio_vgpu may inject an IRQ for emulation
> > >           purposes. Anyway, they can share the same injection interface;
> > 
> > Here too, I think you're making assumptions based on an implementation
> > path.  Personally, I think each vgpu should be a struct device and that
> > an iommu group should be created for each.  I think this is a valid
> > abstraction; dma isolation is provided through something other than a
> > system-level iommu, but it's still provided.  Without this, the entire
> > vfio core would need to be aware of vgpu, since the core operates on
> > devices and groups.  I believe creating a struct device also gives you
> > basic probe and release support for a driver.
> > 
> 
> Indeed.
> BTW, that should be done in the 'bus' driver, right?

I think you have some flexibility between the graphics driver and the
vfio-vgpu driver in where this is done.  If we want vfio-vgpu to be
more generic, then vgpu device creation and management should probably
be done in the graphics driver and vfio-vgpu should be able to probe
that device and call back into the graphics driver to handle requests.
If it turns out there's not much for vfio-vgpu to share, ie. it's just
a passthrough for device specific emulation, then maybe we want a vfio-
intel-vgpu instead.
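
Just to illustrate the kind of split I have in mind (all of these names are
made up for discussion, nothing like this exists today): the graphics driver
would create the vgpu device and hand vfio-vgpu a set of callbacks to emulate
the device spaces:

	/* hypothetical interface between vfio-vgpu and the graphics driver */
	struct vgpu_device_ops {
		/* config space / MMIO emulation behind region read/write */
		ssize_t (*rw)(struct device *vgpu, char *buf, size_t count,
			      loff_t pos, bool is_write);
		/* virtual interrupt setup, backed by eventfds */
		int (*set_irqs)(struct device *vgpu, unsigned int index,
				unsigned int count, int *eventfds);
		int (*reset)(struct device *vgpu);
	};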

> > There will be a need for some sort of lifecycle management of a vgpu.
> >  How is it created?  Destroyed?  Can it be given more or less resources
> > than other vgpus, etc.  This could be implemented in sysfs for each
> > physical gpu with vgpu support, sort of like how we support sr-iov now,
> > the PF exports controls for creating VFs.  The more commonality we can
> > get for lifecycle and device access for userspace, the better.
> > 
> 
> Will have a look at the VF management, thanks for the info.
> 
> > As for virtual vs physical resources and interrupts, part of the
> > purpose of vfio is to abstract a device into basic components.  It's up
> > to the bus driver how accesses to each space map to the physical
> > device.  Take for instance PCI config space, the existing vfio-pci
> > driver emulates some portions of config space for the user.
> > 
> > > Questions:
> > > 
> > >         [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
> > >             in upstream ae5515d66362 (Revert: "vfio: Include No-IOMMU mode").
> > >             In my opinion, vfio_type2 doesn't rely on it to support No-IOMMU
> > >             case, instead it needs a new implementation which fits both
> > >             w/ and w/o IOMMU. Is this correct?
> > > 
> > 
> > vfio no-iommu has also been re-added for v4.5 (03a76b60f8ba), this was
> > simply a case that the kernel development outpaced the intended user
> > and I didn't want to commit to the user api changes until it had been
> > completely vetted.  In any case, vgpu should have no dependency
> > whatsoever on no-iommu.  As above, I think vgpu should create virtual
> > devices and add them to an iommu group, similar to how no-iommu does,
> > but without the kernel tainting because you are actually providing
> > isolation through other means than a system iommu.
> > 
> 
> Thanks for confirmation.
> 
> > > For things not mentioned above, we might have them discussed in
> > > other threads, or temporarily maintained in a TODO list (we might get
> > > back to them after the big picture gets agreed):
> > > 
> > > 
> > >         - How to expose guest framebuffer via VFIO for SPICE;
> > 
> > Potentially through a new, device specific region, which I think can be
> > done within the existing vfio API.  The API can already expose an
> > arbitrary number of regions to the user, it's just a matter of how we
> > tell the user the purpose of a region index beyond the fixed set we map
> > to PCI resources.
> > 
> > >         - How to avoid double translation with two-stage: GTT + IOMMU,
> > >           whether identity map is possible, and if yes, how to make it
> > >           more effective;
> > > 
> > >         - Application acceleration
> > >           You mentioned that with VFIO, a vGPU may be used by
> > >           applications to get GPU acceleration. It's a potential
> > >           opportunity to use vGPU for container usage, worthy of
> > >           further investigation.
> > 
> > Yes, interesting topics.  Thanks,
> > 
> 
> Looks like things are getting clearer overall, with small exceptions.
> Thanks for the advice :)

Yes, please let me know how I can help.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-18 19:05       ` [Qemu-devel] " Alex Williamson
@ 2016-01-20  8:59         ` Jike Song
  -1 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-20  8:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ruan, Shuai, Tian, Kevin, kvm, igvt-g@lists.01.org, qemu-devel,
	Gerd Hoffmann, Paolo Bonzini, Zhiyuan Lv

On 01/19/2016 03:05 AM, Alex Williamson wrote:
> On Mon, 2016-01-18 at 16:56 +0800, Jike Song wrote:
>>
>> Would you elaborate a bit on 'iommu backends' here? Previously
>> I thought that the entire type1 would be duplicated. If not, what is supposed
>> to be added, a new vfio_dma_do_map?
> 
> I don't know that you necessarily want to re-use any of the
> vfio_iommu_type1.c code as-is, it's just the API that we'll want to
> keep consistent so QEMU doesn't need to learn about a new iommu
> backend.  Opportunities for sharing certainly may arise, you may want
> to use a similar red-black tree for storing current mappings, the
> pinning code may be similar, etc.  We can evaluate on a case by case
> basis whether it makes sense to pull out common code for each of those.

It would be great if you could help abstract it :)

> 
> As for an iommu backend in general, if you look at the code flow
> example in Documentation/vfio.txt, the user opens a container
> (/dev/vfio/vfio) and a group (/dev/vfio/$GROUPNUM).  The group is set
> to associate with a container instance via VFIO_GROUP_SET_CONTAINER and
> then an iommu model is set for the container with VFIO_SET_IOMMU.
>  Looking at drivers/vfio/vfio.c:vfio_ioctl_set_iommu(), we look for an
> iommu backend that supports the requested extension (VFIO_TYPE1_IOMMU),
> call the open() callback on it and then attempt to attach the group via
> the attach_group() callback.  At this latter callback, the iommu
> backend can compare the device to those that it actually supports.  For
> instance the existing vfio_iommu_type1 will attempt to use the IOMMU
> API and should fail if the device cannot be supported with that.  The
> current loop in vfio_ioctl_set_iommu() will exit in this case, but as
> you can see in the code, it's easy to make it continue and look for
> another iommu backend that supports the requested extension.
> 

Got it - sure, the type1 API w/ userspace should be kept, with a new
backend used for vgpu.

>>> The benefit here is that QEMU could work
>>> unmodified, using the type1 vfio-iommu API regardless of whether a
>>> device is directly assigned or virtual.
>>>
>>> Let's look at the type1 interface; we have simple map and unmap
>>> interfaces which map and unmap process virtual address space (vaddr) to
>>> the device address space (iova).  The host physical address is obtained
>>> by pinning the vaddr.  In the current implementation, a map operation
>>> pins pages and populates the hardware iommu.  A vgpu compatible
>>> implementation might simply register the translation into a kernel-
>>> based database to be called upon later.  When the host graphics driver
>>> needs to enable dma for the vgpu, it doesn't need to go to QEMU for the
>>> translation, it already possesses the iova to vaddr mapping, which
>>> becomes iova to hpa after a pinning operation.
>>>
>>> So, I would encourage you to look at creating a vgpu vfio iommu
>>> backend that makes use of the type1 api since it will reduce the
>>> changes necessary for userspace.
>>>
>>
>> BTW, that should be done in the 'bus' driver, right?
> 
> I think you have some flexibility between the graphics driver and the
> vfio-vgpu driver in where this is done.  If we want vfio-vgpu to be
> more generic, then vgpu device creation and management should probably
> be done in the graphics driver and vfio-vgpu should be able to probe
> that device and call back into the graphics driver to handle requests.
> If it turns out there's not much for vfio-vgpu to share, ie. it's just
> a passthrough for device specific emulation, then maybe we want a vfio-
> intel-vgpu instead.
>

Good to know that.

>>
>> Looks like things are getting clearer overall, with small exceptions.
>> Thanks for the advice :)
> 
> Yes, please let me know how I can help.  Thanks,
> 
> Alex
> 

I will start the coding soon, will do :)

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-20  8:59         ` [Qemu-devel] " Jike Song
@ 2016-01-20  9:05           ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-20  9:05 UTC (permalink / raw)
  To: Song, Jike, Alex Williamson
  Cc: Ruan, Shuai, kvm, igvt-g@lists.01.org, qemu-devel, Gerd Hoffmann,
	Paolo Bonzini, Lv, Zhiyuan

> From: Song, Jike
> Sent: Wednesday, January 20, 2016 5:00 PM
> >> BTW, that should be done in the 'bus' driver, right?
> >
> > I think you have some flexibility between the graphics driver and the
> > vfio-vgpu driver in where this is done.  If we want vfio-vgpu to be
> > more generic, then vgpu device creation and management should probably
> > be done in the graphics driver and vfio-vgpu should be able to probe
> > that device and call back into the graphics driver to handle requests.
> > If it turns out there's not much for vfio-vgpu to share, ie. it's just
> > a passthrough for device specific emulation, then maybe we want a vfio-
> > intel-vgpu instead.
> >
> 
> Good to know that.

Possibly let's first implement a vfio-intel-vgpu, since KVMGT is the only
customer of this design change. We can see how to better abstract it
later when other vendors' vgpu support comes along.

> 
> >>
> >> Looks like things are getting clearer overall, with small exceptions.
> >> Thanks for the advice :)
> >
> > Yes, please let me know how I can help.  Thanks,
> >
> > Alex
> >
> 
> I will start the coding soon, will do :)
> 

I would expect we can spell out next level tasks toward the above
direction, upon which Alex can easily judge whether there are
some common VFIO framework changes that he can help with :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-20  9:05           ` [Qemu-devel] " Tian, Kevin
@ 2016-01-25 11:34             ` Jike Song
  -1 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-25 11:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org

On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> I would expect we can spell out next level tasks toward the above
> direction, upon which Alex can easily judge whether there are
> some common VFIO framework changes that he can help with :-)

Hi Alex,

Here is a draft task list after a short discussion w/ Kevin,
would you please have a look?

	Bus Driver

		{ in i915/vgt/xxx.c }

		- define a subset of vfio_pci interfaces
		- selective pass-through (say aperture)
		- trap MMIO: interface w/ QEMU

	IOMMU

		{ in a new vfio_xxx.c }

		- allocate: struct device & IOMMU group
		- map/unmap functions for vgpu
		- rb-tree to maintain iova/hpa mappings
		- interacts with kvmgt.c


	vgpu instance management

		{ in i915 }

		- path, create/destroy


--
Thanks,
Jike

 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-25 11:34             ` [Qemu-devel] " Jike Song
@ 2016-01-25 21:30               ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-25 21:30 UTC (permalink / raw)
  To: Jike Song
  Cc: Ruan, Shuai, Tian, Kevin, Neo Jia, kvm, igvt-g@lists.01.org,
	qemu-devel, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan

[cc +Neo @Nvidia]

Hi Jike,

On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > I would expect we can spell out next level tasks toward the above
> > direction, upon which Alex can easily judge whether there are
> > some common VFIO framework changes that he can help with :-)
> 
> Hi Alex,
> 
> Here is a draft task list after a short discussion w/ Kevin,
> would you please have a look?
> 
> 	Bus Driver
> 
> 		{ in i915/vgt/xxx.c }
> 
> 		- define a subset of vfio_pci interfaces
> 		- selective pass-through (say aperture)
> 		- trap MMIO: interface w/ QEMU

What's included in the subset?  Certainly the bus reset ioctls really
don't apply, but you'll need to support the full device interface,
right?  That includes the region info ioctl and access through the vfio
device file descriptor as well as the interrupt info and setup ioctls.
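
For example, the region info ioctl is how QEMU discovers every BAR today, and
a vgpu bus driver would answer the same query with emulated values (sketch of
the existing uapi, error handling omitted):

	struct vfio_region_info reg = {
		.argsz = sizeof(reg),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};

	ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
	/* reg.size, reg.offset and reg.flags describe how to reach the
	 * region through the device fd, regardless of whether reads and
	 * writes land on hardware or on the vgpu device model */
	pread(device_fd, buf, reg.size, reg.offset);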

> 	IOMMU
> 
> 		{ in a new vfio_xxx.c }
> 
> 		- allocate: struct device & IOMMU group

It seems like the vgpu instance management would do this.

> 		- map/unmap functions for vgpu
> 		- rb-tree to maintain iova/hpa mappings

Yep, pretty much what type1 does now, but without mapping through the
IOMMU API.  Essentially just a database of the current userspace
mappings that can be accessed for page pinning and IOVA->HPA
translation.
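
Roughly the same shape as type1's struct vfio_dma, minus the iommu_map()
call (the field names here are only illustrative):

	struct vgpu_dma {
		struct rb_node	node;	/* rb-tree keyed by iova */
		dma_addr_t	iova;	/* device address mapped by userspace */
		unsigned long	vaddr;	/* process virtual address */
		size_t		size;
		int		prot;
	};

	/* map:  insert {iova, vaddr, size} into the tree, nothing pinned yet
	 * pin:  get_user_pages() on vaddr when the GTT needs a real hpa
	 * find: rb-tree lookup by iova gives vaddr, hence hpa once pinned */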

> 		- interacts with kvmgt.c
> 
> 
> 	vgpu instance management
> 
> 		{ in i915 }
> 
> 		- path, create/destroy
> 

Yes, and since you're creating and destroying the vgpu here, this is
where I'd expect a struct device to be created and added to an IOMMU
group.  The lifecycle management should really include links between
the vGPU and physical GPU, which would be much, much easier to do with
struct devices created here rather than at the point where we start
doing vfio "stuff".
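
A minimal sketch of that creation path, assuming it lives in the graphics
driver (device_register(), iommu_group_alloc() and iommu_group_add_device()
are existing kernel APIs; the rest, including the omitted error handling, is
hypothetical):

	struct device *vgpu_create(struct device *gpu, int id)
	{
		struct device *vgpu = kzalloc(sizeof(*vgpu), GFP_KERNEL);
		struct iommu_group *group;

		vgpu->parent = gpu;	/* the vGPU <-> physical GPU link */
		dev_set_name(vgpu, "vgpu%d", id);
		device_register(vgpu);

		group = iommu_group_alloc();	/* isolation comes from the GPU, */
		iommu_group_add_device(group, vgpu);	/* not a system iommu */
		return vgpu;
	}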

Nvidia has also been looking at this and has some ideas how we might
standardize on some of the interfaces and create a vgpu framework to
help share code between vendors and hopefully make a more consistent
userspace interface for libvirt as well.  I'll let Neo provide some
details.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-25 21:30               ` [Qemu-devel] " Alex Williamson
@ 2016-01-25 21:45                 ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-25 21:45 UTC (permalink / raw)
  To: Alex Williamson, Song, Jike
  Cc: Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm,
	qemu-devel, igvt-g@lists.01.org, Neo Jia

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, January 26, 2016 5:30 AM
> 
> [cc +Neo @Nvidia]
> 
> Hi Jike,
> 
> On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > I would expect we can spell out next level tasks toward the above
> > > direction, upon which Alex can easily judge whether there are
> > > some common VFIO framework changes that he can help with :-)
> >
> > Hi Alex,
> >
> > Here is a draft task list after a short discussion w/ Kevin,
> > would you please have a look?
> >
> > 	Bus Driver
> >
> > 		{ in i915/vgt/xxx.c }
> >
> > 		- define a subset of vfio_pci interfaces
> > 		- selective pass-through (say aperture)
> > 		- trap MMIO: interface w/ QEMU
> 
> What's included in the subset?  Certainly the bus reset ioctls really
> don't apply, but you'll need to support the full device interface,
> right?  That includes the region info ioctl and access through the vfio
> device file descriptor as well as the interrupt info and setup ioctls.

That is the next level detail Jike will figure out and discuss soon.

yes, basic region info/access should be necessary. For interrupts, could
you elaborate a bit on what the current interface is doing? If it's just about
creating an eventfd for virtual interrupt injection, it applies to vgpu too.
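
My current reading of that interface is roughly the below (sketch of the
existing uapi, one MSI vector); if that's all, the same flow should work for
vgpu, with the device model signaling the eventfd instead of a physical IRQ:

	struct {
		struct vfio_irq_set set;
		int eventfd;
	} irq = {
		.set = {
			.argsz = sizeof(irq),
			.flags = VFIO_IRQ_SET_DATA_EVENTFD |
				 VFIO_IRQ_SET_ACTION_TRIGGER,
			.index = VFIO_PCI_MSI_IRQ_INDEX,
			.start = 0,
			.count = 1,
		},
		.eventfd = efd,	/* wired to a KVM irqfd for injection */
	};

	ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &irq.set);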

> 
> > 	IOMMU
> >
> > 		{ in a new vfio_xxx.c }
> >
> > 		- allocate: struct device & IOMMU group
> 
> It seems like the vgpu instance management would do this.
> 
> > 		- map/unmap functions for vgpu
> > 		- rb-tree to maintain iova/hpa mappings
> 
> Yep, pretty much what type1 does now, but without mapping through the
> IOMMU API.  Essentially just a database of the current userspace
> mappings that can be accessed for page pinning and IOVA->HPA
> translation.

The thought is to reuse iommu_type1.c by abstracting several underlying
operations and then putting the vgpu specific implementation in a vfio_vgpu.c
(e.g. for map/unmap, instead of using the IOMMU API, an iova/hpa mapping is
updated accordingly), etc.

This file will also connect VFIO with the vendor specific vgpu driver,
e.g. exposing interfaces to allow the latter to query iova<->hpa mappings and
also create the necessary VFIO structures like the aforementioned
device/IOMMU group...
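
Something like the below is what we have in mind for vfio_vgpu.c to export
(all names are tentative, just to make the discussion concrete):

	/* called by the vendor driver (i915) at vgpu create/destroy */
	struct device *vfio_vgpu_register(struct device *gpu, int id);
	void vfio_vgpu_unregister(struct device *vgpu);

	/* query the iova->hpa database built from type1 map/unmap calls,
	 * pinning the backing pages on first use */
	int vfio_vgpu_pin_pages(struct device *vgpu, dma_addr_t iova,
				long npage, unsigned long *pfns);
	void vfio_vgpu_unpin_pages(struct device *vgpu, dma_addr_t iova,
				   long npage);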

> 
> > 		- interacts with kvmgt.c
> >
> >
> > 	vgpu instance management
> >
> > 		{ in i915 }
> >
> > 		- path, create/destroy
> >
> 
> Yes, and since you're creating and destroying the vgpu here, this is
> where I'd expect a struct device to be created and added to an IOMMU
> group.  The lifecycle management should really include links between
> the vGPU and physical GPU, which would be much, much easier to do with
> struct devices created here rather than at the point where we start
> doing vfio "stuff".

It's invoked here, but using functions exposed by vfio_vgpu.c. It's
not good to touch vfio internal structures from another module (such as
i915.ko).

> 
> Nvidia has also been looking at this and has some ideas how we might
> standardize on some of the interfaces and create a vgpu framework to
> help share code between vendors and hopefully make a more consistent
> userspace interface for libvirt as well.  I'll let Neo provide some
> details.  Thanks,
> 

Nice to know that. Neo, please share your thoughts here.

Jike will provide next level API definitions based on KVMGT requirements.
We can further refine them to match the requirements of multiple vendors.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-25 21:45                 ` [Qemu-devel] " Tian, Kevin
@ 2016-01-25 21:48                   ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-25 21:48 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson, Song, Jike
  Cc: Neo Jia, kvm, igvt-g@lists.01.org, qemu-devel, Paolo Bonzini

> From: Tian, Kevin
> Sent: Tuesday, January 26, 2016 5:45 AM
> >
> > > 		- interacts with kvmgt.c
> > >
> > >
> > > 	vgpu instance management
> > >
> > > 		{ in i915 }
> > >
> > > 		- path, create/destroy
> > >
> >
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices created here rather than at the point where we start
> > doing vfio "stuff".
> 
> It's invoked here, but using functions exposed by vfio_vgpu.c. It's not
> good to touch VFIO-internal structures from another module (such as
> i915.ko).
> 

Sorry, I misunderstood your point. You're correct that the struct device for
each vgpu should be managed here.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-25 21:30               ` [Qemu-devel] " Alex Williamson
@ 2016-01-26  7:41                 ` Jike Song
  -1 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-26  7:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org, Neo Jia

On 01/26/2016 05:30 AM, Alex Williamson wrote:
> [cc +Neo @Nvidia]
> 
> Hi Jike,
> 
> On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
>> On 01/20/2016 05:05 PM, Tian, Kevin wrote:
>>> I would expect we can spell out next level tasks toward above
>>> direction, upon which Alex can easily judge whether there are
>>> some common VFIO framework changes that he can help :-)
>>
>> Hi Alex,
>>
>> Here is a draft task list after a short discussion w/ Kevin,
>> would you please have a look?
>>
>> 	Bus Driver
>>
>> 		{ in i915/vgt/xxx.c }
>>
>> 		- define a subset of vfio_pci interfaces
>> 		- selective pass-through (say aperture)
>> 		- trap MMIO: interface w/ QEMU
> 
> What's included in the subset?  Certainly the bus reset ioctls really
> don't apply, but you'll need to support the full device interface,
> right?  That includes the region info ioctl and access through the vfio
> device file descriptor as well as the interrupt info and setup ioctls.
> 

[I thought all interfaces were via ioctl :)  For other stuff like the file
descriptor, we'll definitely keep it.]

The list of ioctl commands provided by vfio_pci:

	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
	- VFIO_DEVICE_PCI_HOT_RESET

As you said, the above 2 don't apply. But as for this one:

	- VFIO_DEVICE_RESET

In my opinion it should be kept, no matter what the bus driver
provides.

	- VFIO_PCI_ROM_REGION_INDEX
	- VFIO_PCI_VGA_REGION_INDEX

I suppose the above 2 don't apply either? For a vgpu we don't provide a
ROM BAR or VGA region.

	- VFIO_DEVICE_GET_INFO
	- VFIO_DEVICE_GET_REGION_INFO
	- VFIO_DEVICE_GET_IRQ_INFO
	- VFIO_DEVICE_SET_IRQS

The above 4 are needed, of course.
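
For reference, the userspace side of those calls would stay exactly as it
is for physical devices. A sketch using <linux/vfio.h>, where device_fd
stands in for an opened vfio device file descriptor:

	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};

	/* works the same whether the device is physical or a vgpu */
	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info) == 0)
		printf("BAR0: size 0x%llx, flags 0x%x\n",
		       (unsigned long long)info.size, info.flags);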

We will need to extend:

	- VFIO_DEVICE_GET_REGION_INFO


a) adding a flag: DONT_MAP. For example, the MMIO of a vgpu should be
trapped instead of being mmap-ed.

b) adding other information. For example, for the OpRegion, QEMU needs
to do more than mmap a region; it has to:

	- allocate a region
	- copy contents from somewhere in the host to that region
	- mmap it into the guest
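
In userspace those three steps could be as simple as the following sketch
(region_size and region_offset stand in for values QEMU would obtain from
VFIO_DEVICE_GET_REGION_INFO):

	/* 1) allocate a host buffer to hold the shadow copy */
	void *shadow = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* 2) copy the contents out of the device region */
	pread(device_fd, shadow, region_size, region_offset);

	/* 3) expose 'shadow' to the guest, e.g. as a QEMU memory region */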


I remember you already have a prototype for this?


>> 	IOMMU
>>
>> 		{ in a new vfio_xxx.c }
>>
>> 		- allocate: struct device & IOMMU group
> 
> It seems like the vgpu instance management would do this.
>

Yes, it can be removed from here.

>> 		- map/unmap functions for vgpu
>> 		- rb-tree to maintain iova/hpa mappings
> 
> Yep, pretty much what type1 does now, but without mapping through the
> IOMMU API.  Essentially just a database of the current userspace
> mappings that can be accessed for page pinning and IOVA->HPA
> translation.
> 

Yes.

>> 		- interacts with kvmgt.c
>>
>>
>> 	vgpu instance management
>>
>> 		{ in i915 }
>>
>> 		- path, create/destroy
>>
> 
> Yes, and since you're creating and destroying the vgpu here, this is
> where I'd expect a struct device to be created and added to an IOMMU
> group.  The lifecycle management should really include links between
> the vGPU and physical GPU, which would be much, much easier to do with
> struct devices created here rather than at the point where we start
> doing vfio "stuff".
> 

Yes, just like SR-IOV does.
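
Roughly, at vgpu creation time it could look like this (a sketch with
hypothetical vdev / parent_pdev structures, error handling omitted):

	struct iommu_group *group = iommu_group_alloc();

	device_initialize(&vdev->dev);
	vdev->dev.parent = &parent_pdev->dev;	/* link vGPU to physical GPU */
	dev_set_name(&vdev->dev, "vgpu%d", vdev->minor);
	device_add(&vdev->dev);
	iommu_group_add_device(group, &vdev->dev);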


> Nvidia has also been looking at this and has some ideas how we might
> standardize on some of the interfaces and create a vgpu framework to
> help share code between vendors and hopefully make a more consistent
> userspace interface for libvirt as well.  I'll let Neo provide some
> details.  Thanks,

Good to know that, so we can possibly cooperate on some common parts,
e.g. the instance management :)

> 
> Alex
> 

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-25 21:45                 ` [Qemu-devel] " Tian, Kevin
@ 2016-01-26  9:48                   ` Neo Jia
  -1 siblings, 0 replies; 118+ messages in thread
From: Neo Jia @ 2016-01-26  9:48 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, kvm, igvt-g@lists.01.org, qemu-devel,
	Kirti Wankhede, Alex Williamson, Lv, Zhiyuan, Paolo Bonzini,
	Gerd Hoffmann

[-- Attachment #1: Type: text/plain, Size: 24251 bytes --]

On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, January 26, 2016 5:30 AM
> > 
> > [cc +Neo @Nvidia]
> > 
> > Hi Jike,
> > 
> > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > I would expect we can spell out next level tasks toward above
> > > > direction, upon which Alex can easily judge whether there are
> > > > some common VFIO framework changes that he can help :-)
> > >
> > > Hi Alex,
> > >
> > > Here is a draft task list after a short discussion w/ Kevin,
> > > would you please have a look?
> > >
> > > 	Bus Driver
> > >
> > > 		{ in i915/vgt/xxx.c }
> > >
> > > 		- define a subset of vfio_pci interfaces
> > > 		- selective pass-through (say aperture)
> > > 		- trap MMIO: interface w/ QEMU
> > 
> > What's included in the subset?  Certainly the bus reset ioctls really
> > don't apply, but you'll need to support the full device interface,
> > right?  That includes the region info ioctl and access through the vfio
> > device file descriptor as well as the interrupt info and setup ioctls.
> 
> That is the next level detail Jike will figure out and discuss soon.
> 
> yes, basic region info/access should be necessary. For interrupt, could
> you elaborate a bit what current interface is doing? If just about creating
> an eventfd for virtual interrupt injection, it applies to vgpu too.
> 
> > 
> > > 	IOMMU
> > >
> > > 		{ in a new vfio_xxx.c }
> > >
> > > 		- allocate: struct device & IOMMU group
> > 
> > It seems like the vgpu instance management would do this.
> > 
> > > 		- map/unmap functions for vgpu
> > > 		- rb-tree to maintain iova/hpa mappings
> > 
> > Yep, pretty much what type1 does now, but without mapping through the
> > IOMMU API.  Essentially just a database of the current userspace
> > mappings that can be accessed for page pinning and IOVA->HPA
> > translation.
> 
> The thought is to reuse iommu_type1.c, by abstracting several underlying
> operations, and then putting the vgpu-specific implementation in a
> vfio_vgpu.c (e.g. for map/unmap: instead of using the IOMMU API, an
> iova/hpa mapping is updated accordingly), etc.
> 
> This file will also connect VFIO with the vendor-specific vgpu driver,
> e.g. exposing interfaces that allow the latter to query iova<->hpa
> translations and to create the necessary VFIO structures like the
> aforementioned device/IOMMU...
> 
> > 
> > > 		- interacts with kvmgt.c
> > >
> > >
> > > 	vgpu instance management
> > >
> > > 		{ in i915 }
> > >
> > > 		- path, create/destroy
> > >
> > 
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices created here rather than at the point where we start
> > doing vfio "stuff".
> 
> It's invoked here, but using functions exposed by vfio_vgpu.c. It's not
> good to touch VFIO-internal structures from another module (such as
> i915.ko).
> 
> > 
> > Nvidia has also been looking at this and has some ideas how we might
> > standardize on some of the interfaces and create a vgpu framework to
> > help share code between vendors and hopefully make a more consistent
> > userspace interface for libvirt as well.  I'll let Neo provide some
> > details.  Thanks,
> > 
> 
> Nice to know that. Neo, please share your thoughts here.

Hi Alex, Kevin and Jike,

Thanks for adding me to this technical discussion; it is a great opportunity
for us to design together and bring both the Intel and NVIDIA vGPU solutions
to the KVM platform.

Instead of directly jumping to the proposal that we have been working on
recently for NVIDIA vGPU on KVM, I think it is better for me to put out a
couple of quick comments / thoughts regarding the existing discussion on this
thread, as fundamentally I think we are solving the same problems: DMA,
interrupts and MMIO.

Then we can look at what we have, and hopefully we can reach some consensus soon.

> Yes, and since you're creating and destroying the vgpu here, this is
> where I'd expect a struct device to be created and added to an IOMMU
> group.  The lifecycle management should really include links between
> the vGPU and physical GPU, which would be much, much easier to do with
> struct devices created here rather than at the point where we start
> doing vfio "stuff".

In fact, to keep vfio-vgpu more generic, vgpu device creation and management
can be centralized and done in vfio-vgpu. That also includes adding the
device to the IOMMU group and the VFIO group.

The graphics driver can register with vfio-vgpu to get management and
emulation callbacks.

We already have a struct vgpu_device in our proposal that keeps a pointer to
the physical device.

> - vfio_pci will inject an IRQ to guest only when physical IRQ
> generated; whereas vfio_vgpu may inject an IRQ for emulation
> purpose. Anyway they can share the same injection interface;

The eventfd used to inject the interrupt is known to vfio-vgpu; that fd
should be made available to the graphics driver so that the graphics driver
can inject interrupts directly when the physical device triggers one.
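
A sketch of that path, using the kernel's eventfd API (the trigger context
here would be the one userspace registered via VFIO_DEVICE_SET_IRQS):

	/* called from the vendor driver's physical interrupt handler */
	static void vgpu_deliver_irq(struct eventfd_ctx *trigger)
	{
		if (trigger)
			eventfd_signal(trigger, 1);	/* KVM's irqfd injects it */
	}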

Here is the proposal we have, please review.

Please note that the patches we have put out here are mainly for POC
purposes, to verify our understanding and also to reduce confusion and speed
up our design, although we are very happy to refine them into something that
can eventually be used by both parties and upstreamed.

Linux vGPU kernel design
==================================================================================

Here we are proposing a generic Linux kernel module based on the VFIO
framework which allows different GPU vendors to plug in and provide their
GPU virtualization solutions on KVM. The benefits of having such a generic
kernel module are:

1) Reuse QEMU VFIO driver, supporting VFIO UAPI

2) GPU HW agnostic management API for upper layer software such as libvirt

3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendors

0. High level overview
==================================================================================

 
  user space:
                                +-----------+  VFIO IOMMU IOCTLs
                      +---------| QEMU VFIO |-------------------------+
        VFIO IOCTLs   |         +-----------+                         |
                      |                                               | 
 ---------------------|-----------------------------------------------|---------
                      |                                               |
  kernel space:       |  +--->----------->---+  (callback)            V
                      |  |                   v                 +------V-----+
  +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
  |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
  | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
  |          |   |          |     | (register)           ^         ||
  +----------+   +-------+--+     |    +-----------+     |         ||
                         V        +----| i915.ko   +-----+     +---VV-------+ 
                         |             +-----^-----+           | TYPE1      |
                         |  (callback)       |                 | IOMMU      |
                         +-->------------>---+                 +------------+
 access flow:

  Guest MMIO / PCI config access
  |
  -------------------------------------------------
  |
  +-----> KVM VM_EXITs  (kernel)
          |
  -------------------------------------------------
          |
          +-----> QEMU VFIO driver (user)
                  | 
  -------------------------------------------------
                  |
                  +---->  VGPU kernel driver (kernel)
                          |  
                          | 
                          +----> vendor driver callback


1. VGPU management interface
==================================================================================

This is the interface that allows upper-layer software (mostly libvirt) to
query and configure virtual GPU devices in a HW-agnostic fashion. This
management interface also gives the underlying GPU vendor the flexibility to
support virtual device hotplug, multiple virtual devices per VM, multiple
virtual devices from different physical devices, etc.

1.1 Under per-physical device sysfs:
----------------------------------------------------------------------------------

vgpu_supported_types - RO, lists the currently supported virtual GPU types
and their VGPU_IDs. A VGPU_ID is a vGPU type identifier returned from reads
of "vgpu_supported_types".

vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
GPU device on the target physical GPU. idx: virtual device index inside a VM

vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual GPU device
on the target physical GPU

1.2 Under vgpu class sysfs:
----------------------------------------------------------------------------------

vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to commit virtual GPU resources
for the target VM.

Also, vgpu_start is a synchronous call; a successful return indicates that
all the requested vGPU resources have been fully committed, and the VMM
should continue.

vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the
registration interface to notify the GPU vendor driver to release the
virtual GPU resources of the target VM.

1.3 Virtual device hotplug
----------------------------------------------------------------------------------

To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
accessed during VM runtime, and the corresponding registration callback will
be invoked to allow the GPU vendor to support hotplug.

To support hotplug, the vendor driver needs to take the necessary action to
handle the situation where vgpu_create is called on a VM_UUID after
vgpu_start; that implies both create and start for that vgpu device.

Likewise, vgpu_destroy implies a vgpu_shutdown on a running VM, but only if
the vendor driver supports vgpu hotplug.

If hotplug is not supported and the VM is still running, the vendor driver
can return an error code to indicate that it is not supported.

Separating create from start gives the flexibility to have:

- multiple vgpu instances for single VM and
- hotplug feature.

2. GPU driver vendor registration interface
==================================================================================

2.1 Registration interface definition (include/linux/vgpu.h)
----------------------------------------------------------------------------------

extern int vgpu_register_device(struct pci_dev *dev, 
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to the vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu
 *                              types.
 *                              @dev: pci device structure of physical GPU.
 *                              @config: should return a string listing the
 *                              supported configs.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resources in the
 *                              graphics driver for a particular vgpu.
 *                              @dev: physical pci device structure on which
 *                              the vgpu should be created
 *                              @vm_uuid: uuid of the VM for which it is
 *                              intended
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to
 *                              be created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in the graphics
 *                              driver for a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: uuid of the VM to which the vgpu
 *                              belongs
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If the VM is running and vgpu_destroy is
 *                              called, that means the vGPU is being
 *                              hot-unplugged. Return an error if the VM is
 *                              running and the graphics driver doesn't
 *                              support vgpu hotplug.
 * @vgpu_start:                 Called to initiate the vGPU initialization
 *                              process in the graphics driver when the VM
 *                              boots, before qemu starts.
 *                              @vm_uuid: UUID of the VM which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to tear down vGPU-related resources for
 *                              the VM.
 *                              @vm_uuid: UUID of the VM which is shutting
 *                              down.
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback.
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number of bytes to read
 *                              @address_space: specifies for which address
 *                              space the request is: pci_config_space, IO
 *                              register space or MMIO space.
 *                              Returns number of bytes read on success, or an
 *                              error.
 * @write:                      Write emulation callback.
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number of bytes to be written
 *                              @address_space: specifies for which address
 *                              space the request is: pci_config_space, IO
 *                              register space or MMIO space.
 *                              Returns number of bytes written on success, or
 *                              an error.
 * @vgpu_set_irqs:              Called to pass along the interrupt
 *                              configuration information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data: same as
 *                              for struct vfio_irq_set of the
 *                              VFIO_DEVICE_SET_IRQS API.
 *
 * A physical GPU that supports vGPU should be registered with the vgpu module
 * with a gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

2.2 Details for callbacks we haven't mentioned above.
---------------------------------------------------------------------------------

vgpu_supported_config: allows the vendor driver to specify the supported vGPU
                       types/configurations

vgpu_create          : creates a virtual GPU device; can be used for device hotplug.

vgpu_destroy         : destroys a virtual GPU device; can be used for device hotplug.

vgpu_start           : callback function to notify the vendor driver that the
                       vgpu devices for a given virtual machine are coming live.

vgpu_shutdown        : callback function to notify the vendor driver to tear
                       down the vGPU resources for the VM.

read                 : callback to the vendor driver to handle virtual device
                       config space or MMIO read accesses

write                : callback to the vendor driver to handle virtual device
                       config space or MMIO write accesses

vgpu_set_irqs        : callback to the vendor driver to pass along the
                       interrupt information for the target virtual device;
                       the vendor driver can then inject interrupts into the
                       virtual machine for this device.
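
Putting the registration interface together, a vendor driver would plug in
roughly as follows (a sketch; the my_* names are illustrative stubs, not
part of the proposal):

#include <linux/vgpu.h>

static int my_vgpu_create(struct pci_dev *dev, uuid_le vm_uuid,
                          uint32_t instance, uint32_t vgpu_id)
{
        /* allocate vendor-side resources for this vgpu instance */
        return 0;
}

static const struct gpu_device_ops my_gpu_ops = {
        .owner       = THIS_MODULE,
        .vgpu_create = my_vgpu_create,
        /* .vgpu_destroy, .vgpu_start, .read, .write, etc. as above */
};

/* called from the physical GPU driver's probe path */
static int my_register(struct pci_dev *pdev)
{
        return vgpu_register_device(pdev, &my_gpu_ops);
}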

2.3 Potential additional virtual device configuration registration interface:
---------------------------------------------------------------------------------

A callback function to describe the MMAP behavior of the virtual GPU.

A callback function to allow the GPU vendor driver to provide PCI config
space backing memory.

3. VGPU TYPE1 IOMMU
==================================================================================

Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track
of the <iova, hva, size, flag> tuples and save the QEMU mm for later
reference.

You can find the quick/ugly implementation in the attached patch file, which
is actually just a simple version of Alex's type1 IOMMU without the actual
real mapping when VFIO_IOMMU_MAP_DMA / VFIO_IOMMU_UNMAP_DMA is called.

We have thought about providing another vendor driver registration interface
so that such tracking information would be sent to the vendor driver, which
would then use the QEMU mm to do the get_user_pages / remap_pfn_range when
required. After doing a quick implementation within our driver, I noticed
the following issues:

1) It pulls OS/VFIO logic into the vendor driver, which will be a
maintenance issue.

2) Every driver vendor would have to implement their own rb-tree instead of
reusing the existing common VFIO code (vfio_find/link/unlink_dma).

3) VFIO_IOMMU_UNMAP_DMA is expected to return the number of "unmapped bytes"
to the caller/QEMU; it is better not to have anything inside a vendor driver
that the VFIO caller immediately depends on.

Based on the above considerations, we decided to implement the DMA tracking
logic within the VGPU TYPE1 IOMMU code (ideally, this should be merged into
the current TYPE1 IOMMU code) and to expose two symbols for MMIO mapping and
for page translation and pinning.

Also, with an mmap'd MMIO interface between the virtual and physical device,
a para-virtualized guest driver can access its virtual MMIO without taking a
mmap fault hit, and we can support different MMIO sizes between the virtual
and physical device.

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

EXPORT_SYMBOL(vgpu_map_virtual_bar);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

EXPORT_SYMBOL(vgpu_dma_do_translate);
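
A usage sketch: the vendor driver hands in an array of guest pfns and gets
back pinned host pfns in place (a hypothetical caller, error handling
omitted):

	dma_addr_t pfns[2] = { 0x1000, 0x1001 };	/* guest page frame numbers */

	if (vgpu_dma_do_translate(pfns, 2) == 0) {
		/* pfns[] now holds host pfns, pinned via get_user_pages,
		 * ready to be programmed into the device's DMA engine */
	}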

There is still a lot to be added and modified, such as supporting multiple
VMs and multiple virtual devices, tracking the mapped / pinned regions
within the VGPU IOMMU kernel driver, error handling, roll-back, per-user
locked memory size accounting, etc.

4. Modules
==================================================================================

Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
                           TYPE1 v1 and v2 interface. 

vgpu.ko                  - provides the registration interface and virtual
                           device VFIO access.

5. QEMU note
==================================================================================

To allow us to focus on prototyping the VGPU kernel driver, we have
introduced a new VFIO class - vgpu - inside QEMU, so we don't have to change
the existing vfio/pci.c file and can use it as a reference for our
implementation. It is basically just a quick copy & paste from vfio/pci.c to
quickly meet our needs.

Once this proposal is finalized, we will move to vfio/pci.c instead of a new
class, and probably the only thing required will be a new way to discover
the device.

6. Examples
==================================================================================

On this server, we have two NVIDIA M60 GPUs.

[root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

After nvidia.ko gets initialized, we can query the supported vGPU types by
reading "vgpu_supported_types", as follows:

[root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
11:GRID M60-0B
12:GRID M60-0Q
13:GRID M60-1B
14:GRID M60-1Q
15:GRID M60-2B
16:GRID M60-2Q
17:GRID M60-4Q
18:GRID M60-8Q

For example, if the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818 and we
would like to create a "GRID M60-4Q" vGPU for that VM:

echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create

Note: the number 0 here is the vGPU device index. So far the change has not
been tested with multiple vgpu devices yet, but we will support that.

At this moment, if you query "vgpu_supported_types" it will still show all
supported virtual GPU types, as no virtual GPU resources have been committed
yet.

Starting VM:

echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

Then, the supported vGPU type query will return:

[root@cjia-vgx-kvm /home/cjia]$
> cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
17:GRID M60-4Q

So vgpu_supported_config needs to be called whenever a new virtual device
gets created, as the underlying HW might limit the supported types if there
are any existing VMs running.

Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown
will inform the GPU vendor driver to clean up resources.

Eventually, those virtual GPUs can be removed by writing to vgpu_destroy
under the device sysfs.

7. What is not covered:
==================================================================================

7.1 QEMU console VNC

QEMU console VNC is not covered in this RFC, as it is a pretty isolated
module and does not impact the basic vGPU functionality; also, we already
had a good discussion about the new VFIO interface that Alex is going to
introduce to allow us to describe a region for the VM surface.

8. Patches
==================================================================================

0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch - against QEMU 2.5.0

0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch  - against 4.4.0-rc5

Thanks,
Kirti and Neo


> 
> Jike will provide next-level API definitions based on the KVMGT requirements.
> We can further refine them to match the requirements of multiple vendors.
> 
> Thanks
> Kevin

[-- Attachment #2: 0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch --]
[-- Type: text/plain, Size: 64107 bytes --]

>From dc8ca387f7b06c6dfc85fb4bd79a760dca76e831 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Tue, 26 Jan 2016 01:21:11 -0800
Subject: [PATCH] Add VGPU and its TYPE1 IOMMU kernel module support

This is just a quick POC implementation to allow GPU driver vendors to plug
into the VFIO framework to provide their virtual GPU support. This kernel
module provides a registration interface for GPU vendors and generic DMA
tracking APIs.

extern int vgpu_register_device(struct pci_dev *dev,
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to the vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu types.
 *                              @dev : pci device structure of physical GPU.
 *                              @config: should return a string listing the supported configs
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resources in the graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which the vgpu
 *                                    should be created
 *                              @vm_uuid: uuid of the VM for which it is intended
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in the graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: uuid of the VM to which the vgpu belongs.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If the VM is running and vgpu_destroy is called, that
 *                              means the vGPU is being hot-unplugged. Return an error
 *                              if the VM is running and the graphics driver doesn't
 *                              support vgpu hotplug.
 * @vgpu_start:                 Called to initiate the vGPU initialization
 *                              process in the graphics driver when the VM boots,
 *                              before qemu starts.
 *                              @vm_uuid: UUID of the VM which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to tear down vGPU-related resources for
 *                              the VM.
 *                              @vm_uuid: UUID of the VM which is shutting down.
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback.
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number of bytes to read
 *                              @address_space: specifies for which address space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Returns number of bytes read on success, or an error.
 * @write:                      Write emulation callback.
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number of bytes to be written
 *                              @address_space: specifies for which address space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Returns number of bytes written on success, or an error.
 * @vgpu_set_irqs:              Called to pass along the interrupt configuration
 *                              information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data : same as
 *                              for struct vfio_irq_set of the
 *                              VFIO_DEVICE_SET_IRQS API.
 *
 * A physical GPU that supports vGPU should be registered with the vgpu module
 * with a gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

Change-Id: Ib70304d9a600c311d5107a94b3fffa938926275b
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/Kconfig                      |   2 +
 drivers/Makefile                     |   1 +
 drivers/vfio/vfio.c                  |   5 +-
 drivers/vgpu/Kconfig                 |  26 ++
 drivers/vgpu/Makefile                |   5 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c | 511 ++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_dev.c              | 550 +++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h          |  47 +++
 drivers/vgpu/vgpu_sysfs.c            | 322 ++++++++++++++++++++
 drivers/vgpu/vgpu_vfio.c             | 521 +++++++++++++++++++++++++++++++++
 include/linux/vgpu.h                 | 157 ++++++++++
 11 files changed, 2144 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c
 create mode 100644 drivers/vgpu/vgpu_dev.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 drivers/vgpu/vgpu_sysfs.c
 create mode 100644 drivers/vgpu/vgpu_vfio.c
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339de85f..5fd9eae79914 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca714bf..142256b4358b 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VGPU)              += vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b793cbcb..af3ab413e119 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -947,19 +947,18 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container,
 		if (IS_ERR(data)) {
 			ret = PTR_ERR(data);
 			module_put(driver->ops->owner);
-			goto skip_drivers_unlock;
+			continue;
 		}
 
 		ret = __vfio_container_attach_groups(container, driver, data);
 		if (!ret) {
 			container->iommu_driver = driver;
 			container->iommu_data = data;
+			goto skip_drivers_unlock;
 		} else {
 			driver->ops->release(data);
 			module_put(driver->ops->owner);
 		}
-
-		goto skip_drivers_unlock;
 	}
 
 	mutex_unlock(&vfio.iommu_drivers_lock);
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 000000000000..698ddf907a16
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,26 @@
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    select VFIO_IOMMU_TYPE1_VGPU
+    help
+        VGPU provides a framework to virtualize GPU without SR-IOV cap
+        See Documentation/vgpu.txt for more details.
+
+        If you don't know what to do here, say N.
+
+config VGPU
+    tristate
+    depends on VFIO
+    default n
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU 
+    default n
+
+config VFIO_IOMMU_TYPE1_VGPU
+    tristate
+    depends on VGPU_VFIO
+    default n
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 000000000000..098a3591a535
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,5 @@
+
+vgpu-y := vgpu_sysfs.o vgpu_dev.o vgpu_vfio.o
+
+obj-$(CONFIG_VGPU)	+= vgpu.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU) += vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 000000000000..6b20f1374b3b
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,511 @@
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
+
+// VFIO structures
+
+struct vfio_iommu_vgpu {
+	struct mutex lock;
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	struct rb_root dma_list;
+	struct mm_struct * vm_mm;
+};
+
+struct vgpu_vfio_dma {
+	struct rb_node node;
+	dma_addr_t iova;
+	unsigned long vaddr;
+	size_t size;
+	int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ *
+ */
+
+/*
+ * Duplicated from vfio_link_dma, just quick hack ... should
+ * reuse code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+			  struct vgpu_vfio_dma *new)
+{
+	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+	struct vgpu_vfio_dma *dma;
+
+	while (*link) {
+		parent = *link;
+		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+		if (new->iova + new->size <= dma->iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+					   dma_addr_t start, size_t size)
+{
+	struct rb_node *node = iommu->dma_list.rb_node;
+
+	while (node) {
+		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
+
+		if (start + size <= dma->iova)
+			node = node->rb_left;
+		else if (start >= dma->iova + dma->size)
+			node = node->rb_right;
+		else
+			return dma;
+	}
+
+	return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
+{
+	rb_erase(&old->node, &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+	struct vgpu_vfio_dma *c, *n;
+	uint32_t i = 0;
+
+	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
+		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_map *map)
+{
+	dma_addr_t iova = map->iova;
+	unsigned long vaddr = map->vaddr;
+	int ret = 0, prot = 0;
+	struct vgpu_vfio_dma *vgpu_dma;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -EEXIST;
+	}
+
+	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
+
+	if (!vgpu_dma) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -ENOMEM;
+	}
+
+	vgpu_dma->iova = iova;
+	vgpu_dma->vaddr = vaddr;
+	vgpu_dma->prot = prot;
+	vgpu_dma->size = map->size;
+
+	vgpu_link_dma(vgpu_iommu, vgpu_dma);
+
+	mutex_unlock(&vgpu_iommu->lock);
+	return ret;
+}
+
+static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_unmap *unmap)
+{
+	struct vgpu_vfio_dma *vgpu_dma;
+	size_t unmapped = 0;
+	int ret = 0;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
+	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
+	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
+		unmapped += vgpu_dma->size;
+		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	unmap->size = unmapped;
+
+	return ret;
+}
+
+/* Ugly hack to quickly test a single device ... */
+
+static struct vfio_iommu_vgpu *_local_iommu = NULL;
+
+int vgpu_map_virtual_bar
+(
+	uint64_t virt_bar_addr,
+        uint64_t phys_bar_addr,
+	uint32_t len,
+	uint32_t flags
+)
+{
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	unsigned long remote_vaddr = 0;
+	struct vgpu_vfio_dma *vgpu_dma = NULL;
+	struct vm_area_struct *remote_vma = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+	int ret = 0;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	down_write(&mm->mmap_sem);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, virt_bar_addr, len /*  size */);
+	if (!vgpu_dma) {
+		printk(KERN_INFO "%s: fail locate guest physical:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	remote_vaddr = vgpu_dma->vaddr + virt_bar_addr - vgpu_dma->iova;
+
+        remote_vma = find_vma(mm, remote_vaddr);
+
+	if (remote_vma == NULL) {
+		printk(KERN_INFO "%s: fail locate vma, physical addr:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	}
+	else {
+		printk(KERN_INFO "%s: locate vma, addr:0x%lx\n",
+		       __FUNCTION__, remote_vma->vm_start);
+	}
+
+	remote_vma->vm_page_prot = pgprot_noncached(remote_vma->vm_page_prot);
+
+	remote_vma->vm_pgoff = phys_bar_addr >> PAGE_SHIFT;
+
+	ret = remap_pfn_range(remote_vma, virt_bar_addr, remote_vma->vm_pgoff,
+			len, remote_vma->vm_page_prot);
+
+	if (ret) {
+		printk(KERN_INFO "%s: fail to remap vma:%d\n", __FUNCTION__, ret);
+		goto unlock;
+	}
+
+unlock:
+
+	up_write(&mm->mmap_sem);
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_map_virtual_bar);
+
+int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
+{
+	int i = 0, ret = 0, prot = 0;
+	unsigned long remote_vaddr = 0, pfn = 0;
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	struct vgpu_vfio_dma *vgpu_dma;
+	struct page *page[1];
+	// unsigned long * addr = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+
+	prot = IOMMU_READ | IOMMU_WRITE;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	for (i = 0; i < count; i++) {
+		dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
+		vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
+
+		if (!vgpu_dma) {
+			printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova);
+			ret = -EINVAL;
+			goto unlock;
+		}
+
+		remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
+		printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n",
+			__FUNCTION__, i, vgpu_dma->iova,
+			vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
+
+		if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
+			pfn = page_to_pfn(page[0]);
+			printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn);
+			// addr = vmap(page, 1, VM_MAP, PAGE_KERNEL);
+		}
+		else {
+			printk(KERN_INFO "%s: fail to pin pfn[%d]\n", __FUNCTION__, i);
+			ret = -ENOMEM;
+			goto unlock;
+		}
+
+		gfn_buffer[i] = pfn;
+		// vunmap(addr);
+
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_dma_do_translate);
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+static void *vfio_iommu_vgpu_open(unsigned long arg)
+{
+	struct vfio_iommu_vgpu *iommu;
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&iommu->lock);
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	/* TODO: Keep track of v2 vs. v1; for now just assume
+	 * we are v2 due to QEMU code */
+	_local_iommu = iommu;
+	return iommu;
+}
+
+static void vfio_iommu_vgpu_release(void *iommu_data)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	kfree(iommu);
+	printk(KERN_INFO "%s", __FUNCTION__);
+}
+
+static long vfio_iommu_vgpu_ioctl(void *iommu_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
+
+	switch (cmd) {
+	case VFIO_CHECK_EXTENSION:
+	{
+		if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
+			return 1;
+		else
+			return 0;
+	}
+
+	case VFIO_IOMMU_GET_INFO:
+	{
+		struct vfio_iommu_type1_info info;
+		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_IOMMU_MAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_map map;
+		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n",
+			map.flags, map.vaddr, map.iova, map.size);
+
+		/*
+		 * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally,
+		 * this should be merged into one single file and reuse data
+		 * structure
+		 *
+		 */
+		ret = vgpu_dma_do_track(vgpu_iommu, &map);
+		break;
+	}
+	case VFIO_IOMMU_UNMAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz)
+			return -EINVAL;
+
+		ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
+		break;
+	}
+	default:
+	{
+		printk(KERN_INFO "%s cmd default ", __FUNCTION__);
+		ret = -ENOTTY;
+		break;
+	}
+	}
+
+	return ret;
+}
+
+
+static int vfio_iommu_vgpu_attach_group(void *iommu_data,
+		                        struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	vgpu_dev = get_vgpu_device_from_group(iommu_group);
+	if (vgpu_dev) {
+		iommu->vgpu_dev = vgpu_dev;
+		iommu->group = iommu_group;
+
+		/* IOMMU shares the same life cycle as the VM's MM */
+		iommu->vm_mm = current->mm;
+
+		printk(KERN_INFO "%s index %d", __FUNCTION__, vgpu_dev->minor);
+		return 0;
+	}
+	/* no vgpu device found for this group */
+	return -EINVAL;
+}
+
+static void vfio_iommu_vgpu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+	iommu->vm_mm = NULL;
+	iommu->group = NULL;
+
+	return;
+}
+
+
+static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
+	.name           = "vgpu_vfio",
+	.owner          = THIS_MODULE,
+	.open           = vfio_iommu_vgpu_open,
+	.release        = vfio_iommu_vgpu_release,
+	.ioctl          = vfio_iommu_vgpu_ioctl,
+	.attach_group   = vfio_iommu_vgpu_attach_group,
+	.detach_group   = vfio_iommu_vgpu_detach_group,
+};
+
+
+int vgpu_vfio_iommu_init(void)
+{
+	int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc);
+	}
+
+	return rc;
+}
+
+void vgpu_vfio_iommu_exit(void)
+{
+	// unregister vgpu_vfio driver
+	vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+}
+
+
+module_init(vgpu_vfio_iommu_init);
+module_exit(vgpu_vfio_iommu_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
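
For context, vgpu_dma_do_translate() exported above is the hook a vendor
driver would call to turn guest page frame numbers into pinned host pfns
before programming its DMA engine. A minimal sketch of a caller follows;
the translate_guest_pages() helper is hypothetical and not part of this
patch:

    #include <linux/vgpu.h>

    /*
     * Hypothetical vendor-side helper: on entry each element of gfns[]
     * holds a guest pfn; on successful return it holds the pinned host
     * pfn for the same page.
     */
    static int translate_guest_pages(dma_addr_t *gfns, uint32_t count)
    {
        /* pins the backing pages and rewrites gfns[] in place */
        int ret = vgpu_dma_do_translate(gfns, count);

        if (ret)
            return ret;

        /* gfns[] now holds host pfns suitable for DMA programming */
        return 0;
    }
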
diff --git a/drivers/vgpu/vgpu_dev.c b/drivers/vgpu/vgpu_dev.c
new file mode 100644
index 000000000000..1d4eb235122c
--- /dev/null
+++ b/drivers/vgpu/vgpu_dev.c
@@ -0,0 +1,550 @@
+/*
+ * VGPU core
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME		"vgpu"
+
+#define VGPU_DEV_NAME		"vgpu"
+
+// TODO remove these defines
+// minor number reserved for control device
+#define VGPU_CONTROL_DEVICE       0
+
+#define VGPU_CONTROL_DEVICE_NAME  "vgpuctl"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+	dev_t               vgpu_devt;
+	struct class        *class;
+	struct cdev         vgpu_cdev;
+	struct list_head    vgpu_devices_list;  // Head entry for the doubly linked vgpu_device list
+	struct mutex        vgpu_devices_lock;
+	struct idr          vgpu_idr;
+	struct list_head    gpu_devices_list;
+	struct mutex        gpu_devices_lock;
+} vgpu;
+
+
+/*
+ * Function prototypes
+ */
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev);
+
+unsigned int vgpu_poll(struct file *file, poll_table *wait);
+long vgpu_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long i_arg);
+int vgpu_mmap(struct file *file, struct vm_area_struct *vma);
+
+int vgpu_open(struct inode *inode, struct file *file);
+int vgpu_close(struct inode *inode, struct file *file);
+ssize_t vgpu_read(struct file *file, char __user * buf,
+		      size_t len, loff_t * ppos);
+ssize_t vgpu_write(struct file *file, const char __user *data,
+		       size_t len, loff_t *ppos);
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->group) {
+			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
+				mutex_unlock(&vgpu.vgpu_devices_lock);
+				return vdev;
+			}
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev, *tmp;
+
+	if (!dev)
+		return -EINVAL;
+
+	gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+	if (!gpu_dev)
+		return -ENOMEM;
+
+	gpu_dev->dev = dev;
+	gpu_dev->ops = ops;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+
+	/* Check for duplicates */
+	list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+		if (tmp->dev == dev) {
+			mutex_unlock(&vgpu.gpu_devices_lock);
+			kfree(gpu_dev);
+			return -EINVAL;
+		}
+	}
+
+	ret = vgpu_create_pci_device_files(dev);
+	if (ret) {
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		kfree(gpu_dev);
+		return ret;
+	}
+	list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev == dev) {
+			printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+			vgpu_remove_pci_device_files(dev);
+			list_del(&gpu_dev->gpu_next);
+			mutex_unlock(&vgpu.gpu_devices_lock);
+			kfree(gpu_dev);
+			return;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+
+/*
+ *  Static functions
+ */
+
+static struct file_operations vgpu_fops = {
+	.owner          = THIS_MODULE,
+};
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev)
+{
+	if (vgpu_dev->dev) {
+		device_destroy(vgpu.class, vgpu_dev->dev->devt);
+		vgpu_dev->dev = NULL;
+	}
+}
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
+{
+	struct vgpu_device *vgpu_dev = NULL;
+
+	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+	if (!vgpu_dev)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vgpu_dev->kref);
+	memcpy(&vgpu_dev->vm_uuid, &uuid, sizeof(uuid_le));
+	vgpu_dev->vgpu_instance = instance;
+	strlcpy(vgpu_dev->dev_name, name, DEVICE_NAME_LEN);
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_del(&vgpu_dev->list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	kfree(vgpu_dev);
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+struct vgpu_device *find_vgpu_device(struct device *dev)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->dev == dev) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id)
+{
+	int minor;
+	char name[64];
+	int retval = 0;
+
+	struct iommu_group *group = NULL;
+	struct device *dev = NULL;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	struct gpu_device *gpu_dev;
+
+	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+	sprintf(name, "%pUb-%d", vm_uuid.b, instance);
+
+	vgpu_dev = vgpu_device_alloc(vm_uuid, instance, name);
+	if (IS_ERR(vgpu_dev)) {
+		return PTR_ERR(vgpu_dev);
+	}
+
+	// check if VM device is present
+	// if not present, create with devt=0 and parent=NULL
+	// create device for instance with devt= MKDEV(vgpu.major, minor)
+	// and parent=VM device
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_dev->vgpu_id = vgpu_id;
+
+	// TODO on removing control device change the 3rd parameter to 0
+	minor = idr_alloc(&vgpu.vgpu_idr, vgpu_dev, 1, MINORMASK + 1, GFP_KERNEL);
+	if (minor < 0) {
+		retval = minor;
+		goto create_failed;
+	}
+
+	dev = device_create(vgpu.class, NULL, MKDEV(MAJOR(vgpu.vgpu_devt), minor), NULL, "%s", name);
+	if (IS_ERR(dev)) {
+		retval = PTR_ERR(dev);
+		goto create_failed1;
+	}
+
+	vgpu_dev->dev = dev;
+	vgpu_dev->minor = minor;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev == pdev) {
+			vgpu_dev->gpu_dev = gpu_dev;
+			if (gpu_dev->ops->vgpu_create) {
+				retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->vm_uuid,
+								   instance, vgpu_id);
+				if (retval)
+				{
+					mutex_unlock(&vgpu.gpu_devices_lock);
+					goto create_failed2;
+				}
+			}
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	if (!vgpu_dev->gpu_dev) {
+		retval = -EINVAL;
+		goto create_failed2;
+	}
+
+	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->vm_uuid.b);
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		printk(KERN_ERR "VGPU: failed to allocate group!\n");
+		retval = PTR_ERR(group);
+		goto create_failed2;
+	}
+
+	retval = iommu_group_add_device(group, dev);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+		iommu_group_put(group);
+		goto create_failed2;
+	}
+
+	retval = vgpu_group_init(vgpu_dev, group);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed vgpu_group_init \n");
+		iommu_group_put(group);
+		iommu_group_remove_device(dev);
+		goto create_failed2;
+	}
+
+	vgpu_dev->group = group;
+	printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return retval;
+
+create_failed2:
+	vgpu_device_destroy(vgpu_dev);
+
+create_failed1:
+	idr_remove(&vgpu.vgpu_idr, minor);
+
+create_failed:
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+
+	return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+	struct device *dev = vgpu_dev->dev;
+
+	if (!dev) {
+		return;
+	}
+
+	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+	if (vgpu_dev->gpu_dev->ops->vgpu_destroy) {
+		int retval = 0;
+		retval = vgpu_dev->gpu_dev->ops->vgpu_destroy(vgpu_dev->gpu_dev->dev,
+							      vgpu_dev->vm_uuid,
+							      vgpu_dev->vgpu_instance);
+		/*
+		 * If the vendor driver doesn't return success, it doesn't
+		 * support hot-unplug.
+		 */
+		if (retval)
+			return;
+	}
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_group_free(vgpu_dev);
+	iommu_group_put(dev->iommu_group);
+	iommu_group_remove_device(dev);
+	vgpu_device_destroy(vgpu_dev);
+	idr_remove(&vgpu.vgpu_idr, vgpu_dev->minor);
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+}
+
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev, *vgpu_dev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	// search VGPU device
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			vgpu_dev = vdev;
+			break;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	if (vgpu_dev)
+		destroy_vgpu_device(vgpu_dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (&gpu_dev->dev->dev == dev) {
+			if (gpu_dev->ops->vgpu_supported_config)
+				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_start)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_start(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_shutdown)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_shutdown(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_set_irqs)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_set_irqs(vgpu_dev, flags,
+							    index, start, count, data);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static struct class vgpu_class = {
+	.name		= VGPU_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= vgpu_class_attrs,
+	.dev_groups	= vgpu_dev_groups,
+	.devnode	= vgpu_devnode,
+};
+
+static int __init vgpu_init(void)
+{
+	int rc = 0;
+
+	memset(&vgpu, 0 , sizeof(vgpu));
+
+	idr_init(&vgpu.vgpu_idr);
+	mutex_init(&vgpu.vgpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	// get major number from kernel
+	rc = alloc_chrdev_region(&vgpu.vgpu_devt, 0, MINORMASK, VGPU_DEV_NAME);
+
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu drv, err:%d\n", rc);
+		return rc;
+	}
+
+	cdev_init(&vgpu.vgpu_cdev, &vgpu_fops);
+	cdev_add(&vgpu.vgpu_cdev, vgpu.vgpu_devt, MINORMASK);
+
+	printk(KERN_ALERT "major_number:%d is allocated for vgpu\n", MAJOR(vgpu.vgpu_devt));
+
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	vgpu.class = &vgpu_class;
+
+	return rc;
+
+failed1:
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	// TODO: Release all unclosed fd
+	struct vgpu_device *vdev = NULL, *tmp;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry_safe(vdev, tmp, &vgpu.vgpu_devices_list, list) {
+		printk(KERN_INFO "VGPU: exit destroying device %s ", vdev->dev_name);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		destroy_vgpu_device(vdev);
+		mutex_lock(&vgpu.vgpu_devices_lock);
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	idr_destroy(&vgpu.vgpu_idr);
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+	class_destroy(vgpu.class);
+	vgpu.class = NULL;
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 000000000000..7e3c400d29f7
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,47 @@
+/*
+ * VGPU internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group);
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev);
+
+struct vgpu_device *find_vgpu_device(struct device *dev);
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int vgpu_create_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_notify_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_remove_status_file(struct vgpu_device *vgpu_dev);
+
+int vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/drivers/vgpu/vgpu_sysfs.c b/drivers/vgpu/vgpu_sysfs.c
new file mode 100644
index 000000000000..e48cbcd6948d
--- /dev/null
+++ b/drivers/vgpu/vgpu_sysfs.c
@@ -0,0 +1,322 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < 36)
+		return -EINVAL;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			printk(KERN_ERR "%s err", __FUNCTION__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	char *str;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	get_vgpu_supported_types(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *instance_str, *str, *str_begin;
+	uuid_le vm_uuid;
+	uint32_t instance, vgpu_id;
+	struct pci_dev *pdev;
+	ssize_t ret = count;
+
+	/* strsep() advances str, so remember the allocation for kfree() */
+	str_begin = str = kstrndup(buf, count, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s vgpu type and instance not specified %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if ((instance_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty instance or string %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s vgpu type not specified %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+	vgpu_id = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+
+		if (create_vgpu_device(pdev, vm_uuid, instance, vgpu_id) < 0) {
+			printk(KERN_ERR "%s vgpu create error \n", __FUNCTION__);
+			ret = -EINVAL;
+		}
+	}
+
+out:
+	kfree(str_begin);
+	return ret;
+}
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *str, *str_begin;
+	uuid_le vm_uuid;
+	unsigned int instance;
+	ssize_t ret = count;
+
+	/* strsep() advances str, so remember the allocation for kfree() */
+	str_begin = str = kstrndup(buf, count, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s instance not specified %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	instance = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, vm_uuid.b, instance);
+
+	destroy_vgpu_device_by_uuid(vm_uuid, instance);
+
+out:
+	kfree(str_begin);
+	return ret;
+}
+
+static ssize_t
+vgpu_vm_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv)
+		return sprintf(buf, "%pUb \n", drv->vm_uuid.b);
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_vm_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv && drv->group)
+		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+	&dev_attr_vgpu_vm_uuid.attr,
+	&dev_attr_vgpu_group_id.attr,
+	NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+	.attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+	&vgpu_dev_group,
+	NULL,
+};
+
+
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_parse(vm_uuid_str, &vm_uuid);
+	kfree(vm_uuid_str);
+	if (ret < 0) {
+		printk(KERN_ERR "%s UUID parse error %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_ONLINE);
+
+		ret = vgpu_start_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_start callback failed  %d \n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_parse(vm_uuid_str, &vm_uuid);
+	kfree(vm_uuid_str);
+	if (ret < 0) {
+		printk(KERN_ERR "%s UUID parse error %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_OFFLINE);
+
+		ret = vgpu_shutdown_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_shutdown callback failed  %d \n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+	__ATTR_WO(vgpu_start),
+	__ATTR_WO(vgpu_shutdown),
+	__ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
+
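
For reference, the vgpu_create attribute above expects a write of the form
"<VM UUID>:<instance>:<vgpu type id>" on the physical GPU's sysfs node, and
vgpu_destroy expects "<VM UUID>:<instance>". A hypothetical user-space
sketch, assuming a GPU at 0000:01:00.0 and a made-up type id of 11:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* "<VM UUID>:<instance>:<vgpu type id>" */
        const char *cmd = "d293c1f1-4eb1-4d9d-b822-7d4d4a78e43c:0:11";
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/vgpu_create",
                      O_WRONLY);

        if (fd < 0)
            return 1;
        if (write(fd, cmd, strlen(cmd)) < 0) {
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }
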
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 000000000000..ef0833140d84
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,521 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
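+
+/*
+ * As in vfio-pci, the region index lives in the high bits of the file
+ * offset; e.g. an access to BAR1 at offset 0x10 arrives at
+ * ((u64)VFIO_PCI_BAR1_REGION_INDEX << 40) + 0x10.
+ */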
+
+struct vfio_vgpu_device {
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+};
+
+static int vgpu_dev_open(void *device_data)
+{
+	printk(KERN_INFO "%s ", __FUNCTION__);
+	return 0;
+}
+
+static void vgpu_dev_close(void *device_data)
+{
+
+}
+
+static uint64_t resource_len(struct vgpu_device *vgpu_dev, int bar_index)
+{
+	uint64_t size = 0;
+
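+	/* POC placeholder: BAR sizes are hard-coded here instead of being
+	 * queried from the vendor graphics driver */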
+	switch (bar_index) {
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		size = 16 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		size = 256 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR2_REGION_INDEX:
+		size = 32 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR5_REGION_INDEX:
+		size = 128;
+		break;
+	default:
+		size = 0;
+		break;
+	}
+	return size;
+}
+
+static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
+{
+       return 1;
+}
+
+static long vgpu_dev_unlocked_ioctl(void *device_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd)
+	{
+		case VFIO_DEVICE_GET_INFO:
+		{
+			struct vfio_device_info info;
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index = %d", __FUNCTION__, vdev->vgpu_dev->minor);
+			minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			info.flags = VFIO_DEVICE_FLAGS_PCI;
+			info.num_regions = VFIO_PCI_NUM_REGIONS;
+			info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+		}
+
+		case VFIO_DEVICE_GET_REGION_INFO:
+		{
+			struct vfio_region_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd", __FUNCTION__);
+
+			minsz = offsetofend(struct vfio_region_info, offset);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_CONFIG_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0x100;	/* 256-byte PCI config space */
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+							VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+				case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = resource_len(vdev->vgpu_dev, info.index);
+					if (!info.size) {
+						info.flags = 0;
+						break;
+					}
+
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+
+					if ((info.index == VFIO_PCI_BAR1_REGION_INDEX) ||
+					     (info.index == VFIO_PCI_BAR2_REGION_INDEX)) {
+						info.flags |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+					}
+
+					/* TODO: provides configurable setups to
+					 * GPU vendor
+					 */
+
+					if (info.index == VFIO_PCI_BAR1_REGION_INDEX)
+						info.flags = VFIO_REGION_INFO_FLAG_MMAP;
+
+					break;
+				case VFIO_PCI_VGA_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0xc0000;
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+
+				case VFIO_PCI_ROM_REGION_INDEX:
+				default:
+					return -EINVAL;
+			}
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+
+		}
+		case VFIO_DEVICE_GET_IRQ_INFO:
+		{
+			struct vfio_irq_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", __FUNCTION__);
+			minsz = offsetofend(struct vfio_irq_info, count);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+				case VFIO_PCI_REQ_IRQ_INDEX:
+					break;
+				/* any other index is an error */
+				default:
+					return -EINVAL;
+			}
+
+			info.flags = VFIO_IRQ_INFO_EVENTFD;
+			info.count = vgpu_get_irq_count(vdev, info.index);
+
+			if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+				info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+						VFIO_IRQ_INFO_AUTOMASKED);
+			else
+				info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+		}
+
+		case VFIO_DEVICE_SET_IRQS:
+		{
+			struct vfio_irq_set hdr;
+			u8 *data = NULL;
+			int ret = 0;
+
+			minsz = offsetofend(struct vfio_irq_set, count);
+
+			if (copy_from_user(&hdr, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+					hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+						VFIO_IRQ_SET_ACTION_TYPE_MASK))
+				return -EINVAL;
+
+			if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+				size_t size;
+				int max = vgpu_get_irq_count(vdev, hdr.index);
+
+				if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+					size = sizeof(uint8_t);
+				else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+					size = sizeof(int32_t);
+				else
+					return -EINVAL;
+
+				if (hdr.argsz - minsz < hdr.count * size ||
+				    hdr.start >= max || hdr.start + hdr.count > max)
+					return -EINVAL;
+
+				data = memdup_user((void __user *)(arg + minsz),
+						hdr.count * size);
+				if (IS_ERR(data))
+					return PTR_ERR(data);
+
+			}
+			ret = vgpu_set_irqs_callback(vdev->vgpu_dev, hdr.flags, hdr.index,
+					hdr.start, hdr.count, data);
+			kfree(data);
+
+
+			return ret;
+		}
+
+		default:
+			return -EINVAL;
+	}
+	return ret;
+}
+
+
+ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	int cfg_size = sizeof(vgpu_dev->config_space);
+	int ret = 0;
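+	/* mask off the region-index bits, keeping the offset within the region */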
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= cfg_size || pos + count > cfg_size) {
+		printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto config_rw_exit;
+		}
+
+		/* FIXME: Need to save the BAR value properly */
+		switch (pos) {
+		case PCI_BASE_ADDRESS_0:
+			vgpu_dev->bar[0].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_1:
+			vgpu_dev->bar[1].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_2:
+			vgpu_dev->bar[2].start = *((uint32_t *)user_data);
+			break;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_config,
+							    pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_config,
+							   pos);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+		}
+		kfree(ret_data);
+	}
+
+config_rw_exit:
+
+	return ret;
+}
+
+ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	uint64_t end;
+	int ret = 0;
+
+	if (!vgpu_dev->bar[bar_index].start) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	end = resource_len(vgpu_dev, bar_index);
+
+	if (offset >= end) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vgpu_dev->bar[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_mmio,
+							    pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_mmio,
+							   pos);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+			}
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		case VFIO_PCI_VGA_REGION_INDEX:
+			break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, (char *)buf, count, ppos, true);
+
+	return ret;
+}
+
+/* Just create an invalid mapping without providing a fault handler */
+
+static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	printk(KERN_INFO "%s ", __FUNCTION__);
+	return 0;
+}
+
+static const struct vfio_device_ops vgpu_vfio_dev_ops = {
+	.name		= "vfio-vgpu-grp",
+	.open		= vgpu_dev_open,
+	.release	= vgpu_dev_close,
+	.ioctl		= vgpu_dev_unlocked_ioctl,
+	.read		= vgpu_dev_read,
+	.write		= vgpu_dev_write,
+	.mmap		= vgpu_dev_mmap,
+};
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group)
+{
+	struct vfio_vgpu_device *vdev;
+	int ret = 0;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		return -ENOMEM;
+	}
+
+	vdev->group = group;
+	vdev->vgpu_dev = vgpu_dev;
+
+	ret = vfio_add_group_dev(vgpu_dev->dev, &vgpu_vfio_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	return ret;
+}
+
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev)
+{
+	struct vfio_vgpu_device *vdev;
+
+	vdev = vfio_del_group_dev(vgpu_dev->dev);
+	if (!vdev)
+		return -1;
+
+	kfree(vdev);
+	return 0;
+}
+
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 000000000000..a2861c3f42e5
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,157 @@
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t end;
+	int flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0, /*!< PCI configuration space */
+	vgpu_emul_space_io = 1,     /*!< I/O register space */
+	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		*dev;
+	int minor;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			vm_uuid;
+	uint32_t		vgpu_instance;
+	uint32_t		vgpu_id;
+	atomic_t		usage_count;
+	char			config_space[0x100];          /* 256-byte PCI config space */
+	struct pci_bar_info	bar[VFIO_PCI_NUM_REGIONS];
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:			The module owner.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev : pci device structure of physical GPU.
+ *				@config: should return string listing supported config
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resources in graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which vgpu
+ *				      should be created
+ *				@vm_uuid: uuid of the VM for which this vgpu is intended
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_id: This represents the type of vgpu to be
+ *					  created
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure to which
+ *				this vgpu points to.
+ *				@vm_uuid: uuid of the VM to which the vgpu belongs.
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If the VM is running when vgpu_destroy is called,
+ *				the vGPU is being hot-unplugged. Return an error
+ *				if the VM is running and the graphics driver
+ *				doesn't support vgpu hot-unplug.
+ * @vgpu_start:			Called to initiate the vGPU initialization
+ *				process in the graphics driver when the VM boots,
+ *				before qemu starts.
+ *				@vm_uuid: VM's UUID which is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to tear down vGPU-related resources for
+ *				the VM.
+ *				@vm_uuid: UUID of the VM which is shutting down.
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number of bytes to read
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Returns number of bytes read on success, or an error.
+ * @write:			Write emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number of bytes to be written
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Returns number of bytes written on success, or an error.
+ * @vgpu_set_irqs:		Called to convey the interrupt configuration
+ *				that qemu has set.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data : same as
+ *				that of struct vfio_irq_set of
+ *				VFIO_DEVICE_SET_IRQS API.
+ *
+ * A physical GPU that supports vGPU should be registered with the vgpu module
+ * via a gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module   *owner;
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
+			       uint32_t instance, uint32_t vgpu_id);
+	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
+			        uint32_t instance);
+	int     (*vgpu_start)(uuid_le vm_uuid);
+	int     (*vgpu_shutdown)(uuid_le vm_uuid);
+	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev                  *dev;
+	const struct gpu_device_ops     *ops;
+	struct list_head                gpu_next;
+};
+
+extern int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr, uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
+
-- 
1.8.1.4

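To show how a vendor graphics driver would plug into the API above, here is
a minimal registration sketch; the my_* names are hypothetical, and only a
subset of the gpu_device_ops callbacks is filled in:

    #include <linux/module.h>
    #include <linux/pci.h>
    #include <linux/uuid.h>
    #include <linux/vgpu.h>

    static int my_vgpu_create(struct pci_dev *dev, uuid_le vm_uuid,
                              uint32_t instance, uint32_t vgpu_id)
    {
        /* allocate per-vgpu state in the vendor driver */
        return 0;
    }

    static int my_vgpu_destroy(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance)
    {
        /* free per-vgpu state; return an error while the VM is
         * running if hot-unplug is not supported */
        return 0;
    }

    static const struct gpu_device_ops my_gpu_ops = {
        .owner        = THIS_MODULE,
        .vgpu_create  = my_vgpu_create,
        .vgpu_destroy = my_vgpu_destroy,
        /* .read, .write, .vgpu_set_irqs etc. omitted in this sketch */
    };

    /* called from the vendor driver's PCI probe/remove paths */
    static int my_gpu_register(struct pci_dev *pdev)
    {
        return vgpu_register_device(pdev, &my_gpu_ops);
    }

    static void my_gpu_unregister(struct pci_dev *pdev)
    {
        vgpu_unregister_device(pdev);
    }

Registration makes the vgpu_create/vgpu_destroy sysfs attributes appear on
the physical device, after which the flow in the sysfs code above applies.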

[-- Attachment #3: 0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch --]
[-- Type: text/plain, Size: 30722 bytes --]

>From 380156ade7053664bdb318af0659708357f40050 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Sun, 24 Jan 2016 11:24:13 -0800
Subject: [PATCH] Add VGPU VFIO driver class support in QEMU

This is just a quick POC change to let us experiment with the VGPU VFIO support;
the next step is to merge this into the current vfio/pci.c, which currently
assumes a physical backing device.

In the current POC implementation we have copied and pasted many functions directly
from the vfio/pci.c code; we should merge them together later.

    - Basic MMIO and PCI config access are supported

    - MMAP'ed GPU bar is supported

    - INTx and MSI using eventfd are supported; we don't think we should support
      interrupts when vector->kvm_interrupt is not enabled.

Change-Id: I99c34ac44524cd4d7d2abbcc4d43634297b96e80

Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/Makefile.objs |   1 +
 hw/vfio/vgpu.c        | 991 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci.h  |   3 +
 3 files changed, 995 insertions(+)
 create mode 100644 hw/vfio/vgpu.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d324863..17f2ef1 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,7 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o pci-quirks.o
+obj-$(CONFIG_PCI) += vgpu.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/vgpu.c b/hw/vfio/vgpu.c
new file mode 100644
index 0000000..56ebce0
--- /dev/null
+++ b/hw/vfio/vgpu.c
@@ -0,0 +1,991 @@
+/*
+ * vGPU VFIO device
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <dirent.h>
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "config.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/pci.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
+#include "trace.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio-common.h"
+#include "qmp-commands.h"
+
+#define TYPE_VFIO_VGPU "vfio-vgpu"
+
+typedef struct VFIOvGPUDevice {
+    PCIDevice pdev;
+    VFIODevice vbasedev;
+    VFIOINTx intx;
+    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
+    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
+    unsigned int config_size;
+    char  *vgpu_type;
+    char *vm_uuid;
+    off_t config_offset; /* Offset of config space region within device fd */
+    int msi_cap_size;
+    EventNotifier req_notifier;
+    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
+    int interrupt; /* Current interrupt type */
+    VFIOMSIVector *msi_vectors;
+} VFIOvGPUDevice;
+
+/*
+ * Local functions
+ */
+
+// function prototypes
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev);
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len);
+
+
+// INTx functions
+
+static void vfio_vgpu_intx_interrupt(void *opaque)
+{
+    VFIOvGPUDevice *vdev = opaque;
+
+    if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) {
+        return;
+    }
+
+    vdev->intx.pending = true;
+    pci_irq_assert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, false);
+
+}
+
+static void vfio_vgpu_intx_eoi(VFIODevice *vbasedev)
+{
+    VFIOvGPUDevice *vdev = container_of(vbasedev, VFIOvGPUDevice, vbasedev);
+
+    if (!vdev->intx.pending) {
+        return;
+    }
+
+    trace_vfio_intx_eoi(vbasedev->name);
+
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+    vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+}
+
+static void vfio_vgpu_intx_enable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_RESAMPLE,
+    };
+    struct vfio_irq_set *irq_set;
+    int ret, argsz;
+    int32_t *pfd;
+
+    if (!kvm_irqfds_enabled() ||
+        vdev->intx.route.mode != PCI_INTX_ENABLED ||
+        !kvm_resamplefds_enabled()) {
+        return;
+    }
+
+    /* Get to a known interrupt state */
+    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Get an eventfd for resample/unmask */
+    if (event_notifier_init(&vdev->intx.unmask, 0)) {
+        error_report("vfio: Error: event_notifier_init failed eoi");
+        goto fail;
+    }
+
+    /* KVM triggers it, VFIO listens for it */
+    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
+
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to setup resample irqfd: %m");
+        goto fail_irqfd;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = irqfd.resamplefd;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx unmask fd: %m");
+        goto fail_vfio;
+    }
+
+    /* Let'em rip */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    vdev->intx.kvm_accel = true;
+
+    trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
+
+    return;
+
+fail_vfio:
+    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
+    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+fail_irqfd:
+    event_notifier_cleanup(&vdev->intx.unmask);
+fail:
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+#endif
+}
+
+static void vfio_vgpu_intx_disable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_DEASSIGN,
+    };
+
+    if (!vdev->intx.kvm_accel) {
+        return;
+    }
+
+    /*
+     * Get to a known state, hardware masked, QEMU ready to accept new
+     * interrupts, QEMU IRQ de-asserted.
+     */
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Tell KVM to stop listening for an INTx irqfd */
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to disable INTx irqfd: %m");
+    }
+
+    /* We only need to close the eventfd for VFIO to cleanup the kernel side */
+    event_notifier_cleanup(&vdev->intx.unmask);
+
+    /* QEMU starts listening for interrupt events. */
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    vdev->intx.kvm_accel = false;
+
+    /* If we've missed an event, let it re-fire through QEMU */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    trace_vfio_intx_disable_kvm(vdev->vbasedev.name);
+#endif
+}
+
+static void vfio_vgpu_intx_update(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    PCIINTxRoute route;
+
+    if (vdev->interrupt != VFIO_INT_INTx) {
+        return;
+    }
+
+    route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin);
+
+    if (!pci_intx_route_changed(&vdev->intx.route, &route)) {
+        return; /* Nothing changed */
+    }
+
+    trace_vfio_intx_update(vdev->vbasedev.name,
+                           vdev->intx.route.irq, route.irq);
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+
+    vdev->intx.route = route;
+
+    if (route.mode != PCI_INTX_ENABLED) {
+        return;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    /* Re-enable the interrupt in case we missed an EOI */
+    vfio_vgpu_intx_eoi(&vdev->vbasedev);
+}
+
+static int vfio_vgpu_intx_enable(VFIOvGPUDevice *vdev)
+{
+    uint8_t pin = vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
+    int ret, argsz;
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+
+    if (!pin) {
+        return 0;
+    }
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
+    pci_config_set_interrupt_pin(vdev->pdev.config, pin);
+
+#ifdef CONFIG_KVM
+    /*
+     * Only conditional to avoid generating error messages on platforms
+     * where we won't actually use the result anyway.
+     */
+    if (kvm_irqfds_enabled() && kvm_resamplefds_enabled()) {
+        vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,
+                                                        vdev->intx.pin);
+    }
+#endif
+
+    ret = event_notifier_init(&vdev->intx.interrupt, 0);
+    if (ret) {
+        error_report("vfio: Error: event_notifier_init failed");
+        return ret;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(*pfd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx fd: %m");
+        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->intx.interrupt);
+        return -errno;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    vdev->interrupt = VFIO_INT_INTx;
+
+    trace_vfio_intx_enable(vdev->vbasedev.name);
+
+    return 0;
+}
+
+static void vfio_vgpu_intx_disable(VFIOvGPUDevice *vdev)
+{
+    int fd;
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, true);
+
+    fd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(fd, NULL, NULL, vdev);
+    event_notifier_cleanup(&vdev->intx.interrupt);
+
+    vdev->interrupt = VFIO_INT_NONE;
+
+    trace_vfio_intx_disable(vdev->vbasedev.name);
+}
+
+//MSI functions
+static void vfio_vgpu_remove_kvm_msi_virq(VFIOMSIVector *vector)
+{
+    kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                          vector->virq);
+    kvm_irqchip_release_virq(kvm_state, vector->virq);
+    vector->virq = -1;
+    event_notifier_cleanup(&vector->kvm_interrupt);
+}
+
+static void vfio_vgpu_msi_disable_common(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        if (vdev->msi_vectors[i].use) {
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+    }
+
+    g_free(vdev->msi_vectors);
+    vdev->msi_vectors = NULL;
+    vdev->nr_vectors = 0;
+    vdev->interrupt = VFIO_INT_NONE;
+
+    vfio_vgpu_intx_enable(vdev);
+}
+
+static void vfio_vgpu_msi_disable(VFIOvGPUDevice *vdev)
+{
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSI_IRQ_INDEX);
+    vfio_vgpu_msi_disable_common(vdev);
+}
+
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev)
+{
+
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        vfio_vgpu_msi_disable(vdev);
+    }
+
+    if (vdev->interrupt == VFIO_INT_INTx) {
+        vfio_vgpu_intx_disable(vdev);
+    }
+}
+
+
+static void vfio_vgpu_msi_interrupt(void *opaque)
+{
+    VFIOMSIVector *vector = opaque;
+    VFIOvGPUDevice *vdev = (VFIOvGPUDevice *)vector->vdev;
+    MSIMessage (*get_msg)(PCIDevice *dev, unsigned vector);
+    void (*notify)(PCIDevice *dev, unsigned vector);
+    MSIMessage msg;
+    int nr = vector - vdev->msi_vectors;
+
+    if (!event_notifier_test_and_clear(&vector->interrupt)) {
+        return;
+    }
+
+    if (vdev->interrupt == VFIO_INT_MSIX) {
+        get_msg = msix_get_message;
+        notify = msix_notify;
+    } else if (vdev->interrupt == VFIO_INT_MSI) {
+        get_msg = msi_get_message;
+        notify = msi_notify;
+    } else {
+        abort();
+    }
+
+    msg = get_msg(&vdev->pdev, nr);
+    trace_vfio_msi_interrupt(vdev->vbasedev.name, nr, msg.address, msg.data);
+    notify(&vdev->pdev, nr);
+}
+
+static int vfio_vgpu_enable_vectors(VFIOvGPUDevice *vdev, bool msix)
+{
+    struct vfio_irq_set *irq_set;
+    int ret = 0, i, argsz;
+    int32_t *fds;
+
+    argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds));
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = msix ? VFIO_PCI_MSIX_IRQ_INDEX : VFIO_PCI_MSI_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = vdev->nr_vectors;
+    fds = (int32_t *)&irq_set->data;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        int fd = -1;
+
+        /*
+         * MSI vs MSI-X - The guest has direct access to MSI mask and pending
+         * bits, therefore we always use the KVM signaling path when setup.
+         * MSI-X mask and pending bits are emulated, so we want to use the
+         * KVM signaling path only when configured and unmasked.
+         */
+        if (vdev->msi_vectors[i].use) {
+            if (vdev->msi_vectors[i].virq < 0 ||
+                (msix && msix_is_masked(&vdev->pdev, i))) {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
+            } else {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].kvm_interrupt);
+            }
+        }
+
+        fds[i] = fd;
+    }
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+
+    g_free(irq_set);
+
+    return ret;
+}
+
+static void vfio_vgpu_add_kvm_msi_virq(VFIOvGPUDevice *vdev, VFIOMSIVector *vector,
+                                  MSIMessage *msg, bool msix)
+{
+    int virq;
+
+    if (!msg) {
+        return;
+    }
+
+    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+        return;
+    }
+
+    virq = kvm_irqchip_add_msi_route(kvm_state, *msg, &vdev->pdev);
+    if (virq < 0) {
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                       NULL, virq) < 0) {
+        kvm_irqchip_release_virq(kvm_state, virq);
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    vector->virq = virq;
+}
+
+static void vfio_vgpu_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
+                                     PCIDevice *pdev)
+{
+    kvm_irqchip_update_msi_route(kvm_state, vector->virq, msg, pdev);
+}
+
+static void vfio_vgpu_msi_enable(VFIOvGPUDevice *vdev)
+{
+    int ret, i;
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->nr_vectors = msi_nr_vectors_allocated(&vdev->pdev);
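+    /*
+     * VFIO_DEVICE_SET_IRQS may enable fewer vectors than requested; in that
+     * case the ioctl returns the number actually available and we retry
+     * below with that count.
+     */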
+retry:
+    vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->nr_vectors);
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg = msi_get_message(&vdev->pdev, i);
+
+        vector->vdev = (VFIOPCIDevice *)vdev;
+        vector->virq = -1;
+        vector->use = true;
+
+        if (event_notifier_init(&vector->interrupt, 0)) {
+            error_report("vfio: Error: event_notifier_init failed");
+        }
+        qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                            vfio_vgpu_msi_interrupt, NULL, vector);
+
+        /*
+         * Attempt to enable route through KVM irqchip,
+         * default to userspace handling if unavailable.
+         */
+        vfio_vgpu_add_kvm_msi_virq(vdev, vector, &msg, false);
+    }
+
+    /* Set interrupt type prior to possible interrupts */
+    vdev->interrupt = VFIO_INT_MSI;
+
+    ret = vfio_vgpu_enable_vectors(vdev, false);
+    if (ret) {
+        if (ret < 0) {
+            error_report("vfio: Error: Failed to setup MSI fds: %m");
+        } else if (ret != vdev->nr_vectors) {
+            error_report("vfio: Error: Failed to enable %d "
+                         "MSI vectors, retry with %d", vdev->nr_vectors, ret);
+        }
+
+        for (i = 0; i < vdev->nr_vectors; i++) {
+            VFIOMSIVector *vector = &vdev->msi_vectors[i];
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+
+        g_free(vdev->msi_vectors);
+
+        if (ret > 0 && ret != vdev->nr_vectors) {
+            vdev->nr_vectors = ret;
+            goto retry;
+        }
+        vdev->nr_vectors = 0;
+
+        /*
+         * Failing to setup MSI doesn't really fall within any specification.
+         * Let's try leaving interrupts disabled and hope the guest figures
+         * out to fall back to INTx for this device.
+         */
+        error_report("vfio: Error: Failed to enable MSI");
+        vdev->interrupt = VFIO_INT_NONE;
+
+        return;
+    }
+}
+
+static void vfio_vgpu_update_msi(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg;
+
+        if (!vector->use || vector->virq < 0) {
+            continue;
+        }
+
+        msg = msi_get_message(&vdev->pdev, i);
+        vfio_vgpu_update_kvm_msi_virq(vector, msg, &vdev->pdev);
+    }
+}
+
+static int vfio_vgpu_msi_setup(VFIOvGPUDevice *vdev, int pos)
+{
+    uint16_t ctrl;
+    bool msi_64bit, msi_maskbit;
+    int ret, entries;
+
+    if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
+              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
+        return -errno;
+    }
+    ctrl = le16_to_cpu(ctrl);
+
+    msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
+    msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
+    entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
+
+    ret = msi_init(&vdev->pdev, pos, entries, msi_64bit, msi_maskbit);
+    if (ret < 0) {
+        if (ret == -ENOTSUP) {
+            return 0;
+        }
+        error_report("vfio: msi_init failed");
+        return ret;
+    }
+    vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 0);
+
+    return 0;
+}
+
+static int vfio_vgpu_msi_init(VFIOvGPUDevice *vdev)
+{
+    uint8_t pos;
+    int ret;
+
+    pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSI);
+    if (!pos) {
+        return 0;
+    }
+
+    ret = vfio_vgpu_msi_setup(vdev, pos);
+    if (ret < 0) {
+        error_report("vgpu: Error setting MSI@0x%x: %d", pos, ret);
+        return ret;
+    }
+
+    return 0;
+}
+
+/*
+ * VGPU device class functions
+ */
+
+static void vfio_vgpu_reset(DeviceState *dev)
+{
+    /* TODO: no device reset implemented in this PoC yet */
+}
+
+static void vfio_vgpu_eoi(VFIODevice *vbasedev)
+{
+    /* Nothing to do on EOI for the PoC */
+}
+
+static int vfio_vgpu_hot_reset_multi(VFIODevice *vbasedev)
+{
+    /* Nothing to reset */
+    return 0;
+}
+
+static void vfio_vgpu_compute_needs_reset(VFIODevice *vbasedev)
+{
+    vbasedev->needs_reset = false;
+}
+
+static VFIODeviceOps vfio_vgpu_ops = {
+    .vfio_compute_needs_reset = vfio_vgpu_compute_needs_reset,
+    .vfio_hot_reset_multi = vfio_vgpu_hot_reset_multi,
+    .vfio_eoi = vfio_vgpu_eoi,
+};
+
+static int vfio_vgpu_populate_device(VFIOvGPUDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+    int i, ret = -1;
+
+    for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
+        reg_info.index = i;
+
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        if (ret) {
+            error_report("vfio: Error getting region %d info: %m", i);
+            return ret;
+        }
+
+        trace_vfio_populate_device_region(vbasedev->name, i,
+                                          (unsigned long)reg_info.size,
+                                          (unsigned long)reg_info.offset,
+                                          (unsigned long)reg_info.flags);
+
+        vdev->bars[i].region.vbasedev = vbasedev;
+        vdev->bars[i].region.flags = reg_info.flags;
+        vdev->bars[i].region.size = reg_info.size;
+        vdev->bars[i].region.fd_offset = reg_info.offset;
+        vdev->bars[i].region.nr = i;
+        QLIST_INIT(&vdev->bars[i].quirks);
+    }
+
+    reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+    if (ret) {
+        error_report("vfio: Error getting config info: %m");
+        return ret;
+    }
+
+    vdev->config_size = reg_info.size;
+    if (vdev->config_size == PCI_CONFIG_SPACE_SIZE) {
+        vdev->pdev.cap_present &= ~QEMU_PCI_CAP_EXPRESS;
+    }
+    vdev->config_offset = reg_info.offset;
+
+    return 0;
+}
+
+static void vfio_vgpu_create_virtual_bar(VFIOvGPUDevice *vdev, int nr)
+{
+    VFIOBAR *bar = &vdev->bars[nr];
+    uint64_t size = bar->region.size;
+    char name[64];
+    uint32_t pci_bar;
+    uint8_t type;
+    int ret;
+
+    /* Skip both unimplemented BARs and the upper half of 64bit BARS. */
+    if (!size) {
+        return;
+    }
+
+    /* Determine what type of BAR this is for registration */
+    ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
+                vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
+    if (ret != sizeof(pci_bar)) {
+        error_report("vfio: Failed to read BAR %d (%m)", nr);
+        return;
+    }
+
+    pci_bar = le32_to_cpu(pci_bar);
+    bar->ioport = (pci_bar & PCI_BASE_ADDRESS_SPACE_IO);
+    bar->mem64 = bar->ioport ? 0 : (pci_bar & PCI_BASE_ADDRESS_MEM_TYPE_64);
+    type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
+                                    ~PCI_BASE_ADDRESS_MEM_MASK);
+
+    /* A "slow" read/write mapping underlies all BARs */
+    memory_region_init_io(&bar->region.mem, OBJECT(vdev), &vfio_region_ops,
+                          bar, name, size);
+    pci_register_bar(&vdev->pdev, nr, type, &bar->region.mem);
+
+    /* Create an mmap-able mapping for BARs the device exposes with the MMAP flag */
+    if (bar->region.flags & VFIO_REGION_INFO_FLAG_MMAP) {
+        strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
+        vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem,
+                         &bar->region.mmap_mem, &bar->region.mmap,
+                         size, 0, name);
+    }
+}
+
+static void vfio_vgpu_create_virtual_bars(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        vfio_vgpu_create_virtual_bar(vdev, i);
+    }
+}
+
+static int vfio_vgpu_initfn(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    VFIOGroup *group;
+    ssize_t len;
+    int groupid;
+    struct stat st;
+    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
+    int ret;
+    UuidInfo *uuid_info;
+
+    uuid_info = qmp_query_uuid(NULL);
+    if (strcmp(uuid_info->UUID, UUID_NONE) == 0) {
+        return -EINVAL;
+    } else {
+        vdev->vm_uuid = uuid_info->UUID;
+    }
+
+    snprintf(path, sizeof(path), 
+             "/sys/devices/virtual/vgpu/%s-0/", vdev->vm_uuid);
+
+    if (stat(path, &st) < 0) {
+        error_report("vfio-vgpu: error: no such vgpu device: %s", path);
+        return -errno;
+    } 
+
+    vdev->vbasedev.ops = &vfio_vgpu_ops;
+
+    vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
+    vdev->vbasedev.name = g_strdup_printf("%s-0", vdev->vm_uuid);
+
+    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
+
+    len = readlink(path, iommu_group_path, sizeof(iommu_group_path));
+    if (len <= 0 || len >= sizeof(iommu_group_path)) {
+        error_report("vfio-vgpu: error no iommu_group for device");
+        return len < 0 ? -errno : -ENAMETOOLONG;
+    }
+
+    iommu_group_path[len] = 0;
+    group_name = basename(iommu_group_path);
+
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_report("vfio-vgpu: error reading %s: %m", path);
+        return -errno;
+    }
+
+    /* TODO: this will only work if we *only* have VFIO_VGPU_IOMMU enabled */
+
+    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
+    if (!group) {
+        error_report("vfio: failed to get group %d", groupid);
+        return -ENOENT;
+    }
+
+    snprintf(path, sizeof(path), "%s-0", vdev->vm_uuid);
+
+    ret = vfio_get_device(group, path, &vdev->vbasedev);
+    if (ret) {
+        error_report("vfio-vgpu; failed to get device %s", vdev->vgpu_type);
+        vfio_put_group(group);
+        return ret;
+    }
+
+    ret = vfio_vgpu_populate_device(vdev);
+    if (ret) {
+        vfio_put_group(group);
+        return ret;
+    }
+
+    /* Get a copy of config space */
+    ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
+                MIN(pci_config_size(&vdev->pdev), vdev->config_size),
+                vdev->config_offset);
+    if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
+        ret = ret < 0 ? -errno : -EFAULT;
+        error_report("vfio: Failed to read device config space");
+        return ret;
+    }
+
+    vfio_vgpu_create_virtual_bars(vdev);
+
+    ret = vfio_vgpu_msi_init(vdev);
+    if (ret < 0) {
+        error_report("%s: Error setting MSI %d", __FUNCTION__, ret);
+        return ret;
+    }
+
+    if (vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
+        pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_vgpu_intx_update);
+        ret = vfio_vgpu_intx_enable(vdev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static void vfio_vgpu_exitfn(PCIDevice *pdev)
+{
+    /* TODO: teardown (disable interrupts, release the VFIO device and
+     * group) is not implemented in this PoC yet */
+}
+
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+    uint32_t val = 0;
+
+    ret = pread(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x %m", __func__, addr);
+        return 0xFFFFFFFF;
+    }
+
+    // memcpy(&vdev->emulated_config_bits + addr, &val, len);
+    return val;
+}
+
+static void vfio_vgpu_write_config(PCIDevice *pdev, uint32_t addr,
+                                  uint32_t val, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+
+    ret = pwrite(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset 0x%x, val 0x%x: %m",
+                     __func__, addr, val);
+        return;
+    }
+
+    if (pdev->cap_present & QEMU_PCI_CAP_MSI &&
+        ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size)) {
+        int is_enabled, was_enabled = msi_enabled(pdev);
+
+        pci_default_write_config(pdev, addr, val, len);
+
+        is_enabled = msi_enabled(pdev);
+
+        if (!was_enabled) {
+            if (is_enabled) {
+                vfio_vgpu_msi_enable(vdev);
+            }
+        } else {
+            if (!is_enabled) {
+                vfio_vgpu_msi_disable(vdev);
+            } else {
+                vfio_vgpu_update_msi(vdev);
+            }
+        }
+    } else {
+        /* Write everything to QEMU to keep emulated bits correct */
+        pci_default_write_config(pdev, addr, val, len);
+    }
+}
+
+static const VMStateDescription vfio_vgpu_vmstate = {
+    .name = TYPE_VFIO_VGPU,
+    .unmigratable = 1,
+};
+
+/*
+ * We don't actually need the vfio_vgpu_properties, since we can simply
+ * rely on the VM UUID to find the IOMMU group for this VM.
+ */
+
+static Property vfio_vgpu_properties[] = {
+    DEFINE_PROP_STRING("vgpu", VFIOvGPUDevice, vgpu_type),
+    DEFINE_PROP_END_OF_LIST()
+};
+
+#if 0
+
+static void vfio_vgpu_instance_init(Object *obj)
+{
+
+}
+
+static void vfio_vgpu_instance_finalize(Object *obj)
+{
+
+
+}
+
+#endif
+
+static void vfio_vgpu_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+    dc->desc = "VFIO-based vGPU";
+    dc->vmsd = &vfio_vgpu_vmstate;
+    dc->reset = vfio_vgpu_reset;
+    /* dc->cannot_instantiate_with_device_add_yet = true; */
+    dc->props = vfio_vgpu_properties;
+    set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+    pdc->init = vfio_vgpu_initfn;
+    pdc->exit = vfio_vgpu_exitfn;
+    pdc->config_read = vfio_vgpu_read_config;
+    pdc->config_write = vfio_vgpu_write_config;
+    pdc->is_express = 0; /* For now, we are not */
+
+    pdc->vendor_id = PCI_VENDOR_ID_NVIDIA;
+    /* pdc->device_id = 0x11B0; */
+    pdc->class_id = PCI_CLASS_DISPLAY_VGA;
+}
+
+static const TypeInfo vfio_vgpu_dev_info = {
+    .name = TYPE_VFIO_VGPU,
+    .parent = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(VFIOvGPUDevice),
+    .class_init = vfio_vgpu_class_init,
+};
+
+static void register_vgpu_dev_type(void)
+{
+    type_register_static(&vfio_vgpu_dev_info);
+}
+
+type_init(register_vgpu_dev_type)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 379b6e1..9af5e17 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -64,6 +64,9 @@
 #define PCI_DEVICE_ID_VMWARE_IDE         0x1729
 #define PCI_DEVICE_ID_VMWARE_VMXNET3     0x07B0
 
+/* NVIDIA (0x10de) */
+#define PCI_VENDOR_ID_NVIDIA             0x10de
+
 /* Intel (0x8086) */
 #define PCI_DEVICE_ID_INTEL_82551IT      0x1209
 #define PCI_DEVICE_ID_INTEL_82557        0x1229
-- 
1.8.3.1


[-- Attachment #4: vgpu_diagram.png --]
[-- Type: image/png, Size: 6816 bytes --]

^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
@ 2016-01-26  9:48                   ` Neo Jia
  0 siblings, 0 replies; 118+ messages in thread
From: Neo Jia @ 2016-01-26  9:48 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, kvm, igvt-g@lists.01.org, qemu-devel,
	Kirti Wankhede, Alex Williamson, Lv, Zhiyuan, Paolo Bonzini,
	Gerd Hoffmann

[-- Attachment #1: Type: text/plain, Size: 24251 bytes --]

On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, January 26, 2016 5:30 AM
> > 
> > [cc +Neo @Nvidia]
> > 
> > Hi Jike,
> > 
> > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > I would expect we can spell out next level tasks toward above
> > > > direction, upon which Alex can easily judge whether there are
> > > > some common VFIO framework changes that he can help :-)
> > >
> > > Hi Alex,
> > >
> > > Here is a draft task list after a short discussion w/ Kevin,
> > > would you please have a look?
> > >
> > > 	Bus Driver
> > >
> > > 		{ in i915/vgt/xxx.c }
> > >
> > > 		- define a subset of vfio_pci interfaces
> > > 		- selective pass-through (say aperture)
> > > 		- trap MMIO: interface w/ QEMU
> > 
> > What's included in the subset?  Certainly the bus reset ioctls really
> > don't apply, but you'll need to support the full device interface,
> > right?  That includes the region info ioctl and access through the vfio
> > device file descriptor as well as the interrupt info and setup ioctls.
> 
> That is the next level detail Jike will figure out and discuss soon.
> 
> yes, basic region info/access should be necessary. For interrupt, could
> you elaborate a bit what current interface is doing? If just about creating
> an eventfd for virtual interrupt injection, it applies to vgpu too.
> 
> > 
> > > 	IOMMU
> > >
> > > 		{ in a new vfio_xxx.c }
> > >
> > > 		- allocate: struct device & IOMMU group
> > 
> > It seems like the vgpu instance management would do this.
> > 
> > > 		- map/unmap functions for vgpu
> > > 		- rb-tree to maintain iova/hpa mappings
> > 
> > Yep, pretty much what type1 does now, but without mapping through the
> > IOMMU API.  Essentially just a database of the current userspace
> > mappings that can be accessed for page pinning and IOVA->HPA
> > translation.
> 
> The thought is to reuse iommu_type1.c, by abstracting several underlying
> operations and then put vgpu specific implementation in a vfio_vgpu.c (e.g.
> for map/unmap instead of using IOMMU API, an iova/hpa mapping is updated
> accordingly), etc.
> 
> This file will also connect between VFIO and vendor specific vgpu driver,
> e.g. exposing interfaces to allow the latter querying iova<->hpa and also 
> creating necessary VFIO structures like aforementioned device/IOMMUas...
> 
> > 
> > > 		- interacts with kvmgt.c
> > >
> > >
> > > 	vgpu instance management
> > >
> > > 		{ in i915 }
> > >
> > > 		- path, create/destroy
> > >
> > 
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices create here rather than at the point where we start
> > doing vfio "stuff".
> 
> It's invoked here, but expecting the function exposed by vfio_vgpu.c. It's
> not good to touch vfio internal structures from another module (such as
> i915.ko)
> 
> > 
> > Nvidia has also been looking at this and has some ideas how we might
> > standardize on some of the interfaces and create a vgpu framework to
> > help share code between vendors and hopefully make a more consistent
> > userspace interface for libvirt as well.  I'll let Neo provide some
> > details.  Thanks,
> > 
> 
> Nice to know that. Neo, please share your thought here.

Hi Alex, Kevin and Jike,

Thanks for adding me to this technical discussion, a great opportunity
for us to design together something that can bring both the Intel and NVIDIA
vGPU solutions to the KVM platform.

Instead of directly jumping to the proposal that we have been working on
recently for NVIDIA vGPU on KVM, I think it is better for me to put out a
couple of quick comments / thoughts regarding the existing discussion on this
thread, as fundamentally I think we are solving the same problems: DMA,
interrupts and MMIO.

Then we can look at what we have, hopefully we can reach some consensus soon.

> Yes, and since you're creating and destroying the vgpu here, this is
> where I'd expect a struct device to be created and added to an IOMMU
> group.  The lifecycle management should really include links between
> the vGPU and physical GPU, which would be much, much easier to do with
> struct devices create here rather than at the point where we start
> doing vfio "stuff".

In fact, to keep vfio-vgpu generic, vgpu device creation and management
can be centralized and done in vfio-vgpu. That also includes adding the device
to the IOMMU group and VFIO group.

The graphics driver can register with vfio-vgpu to receive management and
emulation callbacks.

We already have struct vgpu_device in our proposal that keeps a pointer to
the physical device.

> - vfio_pci will inject an IRQ to guest only when physical IRQ
> generated; whereas vfio_vgpu may inject an IRQ for emulation
> purpose. Anyway they can share the same injection interface;

The eventfd used to inject the interrupt is known to vfio-vgpu; that fd should
be made available to the graphics driver so that it can inject interrupts
directly when the physical device triggers one.
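
For illustration only, here is a minimal sketch of what that injection path
could look like on the vendor driver side, assuming vfio-vgpu hands the eventfd
down when QEMU issues VFIO_DEVICE_SET_IRQS; the my_vgpu_* names are
hypothetical, only eventfd_ctx_fdget()/eventfd_signal() are existing kernel
APIs:

#include <linux/eventfd.h>

/* Hypothetical per-vgpu interrupt state kept by the vendor driver */
struct my_vgpu_irq {
        struct eventfd_ctx *trigger;
};

/* From the vgpu_set_irqs callback: take a reference on the eventfd
 * that QEMU passed down via VFIO_DEVICE_SET_IRQS. */
static int my_vgpu_set_trigger(struct my_vgpu_irq *irq, int32_t fd)
{
        struct eventfd_ctx *trigger = eventfd_ctx_fdget(fd);

        if (IS_ERR(trigger))
                return PTR_ERR(trigger);

        irq->trigger = trigger;
        return 0;
}

/* ISR path: forward a physical (or emulated) interrupt to the guest by
 * signaling the eventfd; KVM's irqfd does the actual injection. */
static void my_vgpu_inject(struct my_vgpu_irq *irq)
{
        if (irq->trigger)
                eventfd_signal(irq->trigger, 1);
}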

Here is the proposal we have, please review.

Please note that the patches we have put out here are mainly for POC purposes,
to verify our understanding and to reduce confusion and speed up our design,
although we are very happy to refine them into something that can eventually be
used by both parties and upstreamed.

Linux vGPU kernel design
==================================================================================

Here we are proposing a generic Linux kernel module based on the VFIO framework
which allows different GPU vendors to plug in and provide their GPU
virtualization solution on KVM. The benefits of having such a generic kernel
module are:

1) Reuse QEMU VFIO driver, supporting VFIO UAPI

2) GPU HW agnostic management API for upper layer software such as libvirt

3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendor

0. High level overview
==================================================================================

 
  user space:
                                +-----------+  VFIO IOMMU IOCTLs
                      +---------| QEMU VFIO |-------------------------+
        VFIO IOCTLs   |         +-----------+                         |
                      |                                               | 
 ---------------------|-----------------------------------------------|---------
                      |                                               |
  kernel space:       |  +--->----------->---+  (callback)            V
                      |  |                   v                 +------V-----+
  +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
  |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
  | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
  |          |   |          |     | (register)           ^         ||
  +----------+   +-------+--+     |    +-----------+     |         ||
                         V        +----| i915.ko   +-----+     +---VV-------+ 
                         |             +-----^-----+           | TYPE1      |
                         |  (callback)       |                 | IOMMU      |
                         +-->------------>---+                 +------------+
 access flow:

  Guest MMIO / PCI config access
  |
  -------------------------------------------------
  |
  +-----> KVM VM_EXITs  (kernel)
          |
  -------------------------------------------------
          |
          +-----> QEMU VFIO driver (user)
                  | 
  -------------------------------------------------
                  |
                  +---->  VGPU kernel driver (kernel)
                          |  
                          | 
                          +----> vendor driver callback


1. VGPU management interface
==================================================================================

This is the interface that allows upper layer software (mostly libvirt) to
query and configure virtual GPU devices in a HW-agnostic fashion. Also, this
management interface gives the underlying GPU vendor the flexibility to support
virtual device hotplug, multiple virtual devices per VM, multiple virtual
devices from different physical devices, etc.

1.1 Under per-physical device sysfs:
----------------------------------------------------------------------------------

vgpu_supported_types - RO, lists the currently supported virtual GPU types and
their VGPU_IDs. A VGPU_ID is a vGPU type identifier returned from reads of
"vgpu_supported_types".

vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
gpu device on a target physical GPU. idx: virtual device index inside a VM

vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual gpu device on
a target physical GPU

1.3 Under vgpu class sysfs:
----------------------------------------------------------------------------------

vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to commit virtual GPU resource for
this target VM. 

Also, vgpu_start is a synchronous call; a successful return indicates that all
the requested vGPU resources have been fully committed, and the VMM should
continue.

vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to release virtual GPU resource of
this target VM.

1.4 Virtual device Hotplug
----------------------------------------------------------------------------------

To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
accessed during VM runtime, and the corresponding registration callback will be
invoked to allow GPU vendor support hotplug.

To support hotplug, the vendor driver should take the necessary action to
handle the situation where a vgpu_create is done on a VM_UUID after vgpu_start;
that implies both create and start for that vgpu device.

Likewise, vgpu_destroy implies a vgpu_shutdown on a running VM, but only if the
vendor driver supports vgpu hotplug.

If hotplug is not supported and the VM is still running, the vendor driver can
return an error code to indicate that it is not supported.

Separating create from start gives the flexibility to have:

- multiple vgpu instances for a single VM, and
- the hotplug feature.

2. GPU driver vendor registration interface
==================================================================================

2.1 Registration interface definition (include/linux/vgpu.h)
----------------------------------------------------------------------------------

extern int vgpu_register_device(struct pci_dev *dev, 
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu
 * types.
 *                              @dev : pci device structure of physical GPU. 
 *                              @config: should return string listing supported
 *                              config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resources in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which
 *                                    vgpu should be created
 *                              @vm_uuid: uuid of the VM it is intended for
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure this vgpu
 *                                    points to
 *                              @vm_uuid: uuid of the VM the vgpu belongs to
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If the VM is running and vgpu_destroy is called,
 *                              that means the vGPU is being hot-unplugged.
 *                              Return an error if the VM is running and the
 *                              graphics driver doesn't support vgpu hotplug.
 * @vgpu_start:                 Called to initiate the vGPU initialization
 *                              process in the graphics driver when the VM
 *                              boots, before qemu starts.
 *                              @vm_uuid: UUID of the VM which is booting
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to tear down vGPU-related resources for
 *                              the VM.
 *                              @vm_uuid: UUID of the VM which is shutting down
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number of bytes to read
 *                              @address_space: specifies which address space
 *                              the request is for: pci_config_space, IO
 *                              register space or MMIO space.
 *                              Returns number of bytes read on success, or an
 *                              error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number of bytes to be written
 *                              @address_space: specifies which address space
 *                              the request is for: pci_config_space, IO
 *                              register space or MMIO space.
 *                              Returns number of bytes written on success, or
 *                              an error.
 * @vgpu_set_irqs:              Called to pass on the interrupt configuration
 *                              information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data: same as
 *                              those of struct vfio_irq_set of the
 *                              VFIO_DEVICE_SET_IRQS API.
 *
 * A physical GPU that supports vGPU should register with the vgpu module
 * with a gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space,loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

2.2 Details for callbacks we haven't mentioned above.
---------------------------------------------------------------------------------

vgpu_supported_config: allows the vendor driver to specify the supported vGPU
                       type/configuration

vgpu_create          : create a virtual GPU device, can be used for device hotplug.

vgpu_destroy         : destroy a virtual GPU device, can be used for device hotplug.

vgpu_start           : callback function to notify the vendor driver that the
                       vgpu device has come to life for a given virtual machine.

vgpu_shutdown        : callback function to notify the vendor driver that the
                       VM is shutting down, so vGPU resources can be released.

read                 : callback to vendor driver to handle virtual device config
                       space or MMIO read access

write                : callback to vendor driver to handle virtual device config
                       space or MMIO write access

vgpu_set_irqs        : callback to vendor driver to pass along the interrupt
                       information for the target virtual device; the vendor
                       driver can then inject interrupts into the virtual
                       machine for this device.
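
To make the registration flow concrete, here is a minimal sketch of a vendor
driver plugging into this interface; only vgpu_register_device() and
struct gpu_device_ops come from the proposal above, while all the my_* names
and the supported-type string are invented for the example:

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/vgpu.h>

static int my_vgpu_supported_config(struct pci_dev *dev, char *config)
{
        /* One supported type, "<VGPU_ID>:<name>", as read back from the
         * vgpu_supported_types sysfs attribute */
        sprintf(config, "11:MY-VGPU-1Q");
        return 0;
}

static int my_vgpu_create(struct pci_dev *dev, uuid_le vm_uuid,
                          uint32_t instance, uint32_t vgpu_id)
{
        /* Allocate HW resources (FB carve-out, channels, ...) here */
        return 0;
}

static int my_vgpu_destroy(struct pci_dev *dev, uuid_le vm_uuid,
                           uint32_t instance)
{
        /* Free whatever my_vgpu_create() allocated */
        return 0;
}

static const struct gpu_device_ops my_gpu_ops = {
        .owner                 = THIS_MODULE,
        .vgpu_supported_config = my_vgpu_supported_config,
        .vgpu_create           = my_vgpu_create,
        .vgpu_destroy          = my_vgpu_destroy,
        /* .vgpu_start, .vgpu_shutdown, .read, .write and .vgpu_set_irqs
         * omitted for brevity */
};

/* Called from the vendor driver's PCI probe routine */
static int my_register(struct pci_dev *pdev)
{
        return vgpu_register_device(pdev, &my_gpu_ops);
}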

2.3 Potential additional virtual device configuration registration interface:
---------------------------------------------------------------------------------

callback function to describe the MMAP behavior of the virtual GPU 

callback function to allow GPU vendor driver to provide PCI config space backing
memory.

3. VGPU TYPE1 IOMMU
==================================================================================

Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track of
the <iova, hva, size, flag> tuples and save the QEMU mm for later reference.

You can find the quick/ugly implementation in the attached patch file, which is
actually just a simplified version of Alex's type1 IOMMU that does no real
mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called.

We have thought about providing another vendor driver registration interface so
that such tracking information would be sent to the vendor driver, which would
use the QEMU mm to do the get_user_pages / remap_pfn_range when required. After
doing a quick implementation within our driver, I noticed the following issues:

1) It pulls OS/VFIO logic into the vendor driver, which will be a maintenance
issue.

2) Every driver vendor has to implement their own RB tree, instead of reusing
the common existing VFIO code (vfio_find/link/unlink_dma) 

3) IOMMU_UNMAP_DMA is expected to return the number of "unmapped bytes" to the
caller/QEMU; it is better not to have anything inside a vendor driver that the
VFIO caller immediately depends on.

Based on the above considerations, we decided to implement the DMA tracking
logic within the VGPU TYPE1 IOMMU code (ideally, this should be merged into the
current TYPE1 IOMMU code) and expose two symbols: one for MMIO mapping, and one
for page translation and pinning.

Also, with an mmap MMIO interface between virtual and physical, a
para-virtualized guest driver can access its virtual MMIO without taking a mmap
fault hit, and we can support different MMIO sizes between the virtual and
physical device.

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

EXPORT_SYMBOL(vgpu_map_virtual_bar);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

EXPORT_SYMBOL(vgpu_dma_do_translate);

Still a lot to be added and modified, such as supporting multiple VMs and 
multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
kernel driver, error handling, roll-back and locked memory size per user, etc. 
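
As a usage sketch (not part of the proposal itself), a vendor driver that needs
to program guest pages into its DMA engine could call the exported translation
symbol like this; my_setup_dma() and the descriptor programming are invented:

/* gfns[] holds guest frame numbers on entry; vgpu_dma_do_translate()
 * pins the backing pages and rewrites the array with host pfns. */
static int my_setup_dma(dma_addr_t *gfns, uint32_t count)
{
        int ret = vgpu_dma_do_translate(gfns, count);

        if (ret)
                return ret;

        /* gfns[i] now holds host pfns; program them into the device's
         * page tables / DMA descriptors here */
        return 0;
}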

4. Modules
==================================================================================

Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
                           TYPE1 v1 and v2 interface. 

vgpu.ko                  - provide registration interface and virtual device
                           VFIO access.

5. QEMU note
==================================================================================

To allow us to focus on the VGPU kernel driver prototyping, we have introduced
a new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
vfio/pci.c file and can use it as a reference for our implementation. It is
basically just a quick copy & paste from vfio/pci.c to meet our needs.

Once this proposal is finalized, we will move to vfio/pci.c instead of a new
class, and probably the only thing required is to have a new way to discover the
device.

6. Examples
==================================================================================

On this server, we have two NVIDIA M60 GPUs.

[root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

After nvidia.ko gets initialized, we can query the supported vGPU type by
accessing the "vgpu_supported_types" like following:

[root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
11:GRID M60-0B
12:GRID M60-0Q
13:GRID M60-1B
14:GRID M60-1Q
15:GRID M60-2B
16:GRID M60-2Q
17:GRID M60-4Q
18:GRID M60-8Q

For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
like to create "GRID M60-4Q" VM on it.

echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create

Note: the number 0 here is the vGPU device index. So far the change has not
been tested with multiple vgpu devices, but we will support that.

At this moment, if you query the "vgpu_supported_types" it will still show all
supported virtual GPU types as no virtual GPU resource is committed yet.

Starting VM:

echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

then, the supported vGPU type query will return:

[root@cjia-vgx-kvm /home/cjia]$
> cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
17:GRID M60-4Q

So vgpu_supported_config needs to be called whenever a new virtual device gets
created, as the underlying HW might limit the supported types if there are
any existing VMs running.

Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown will
inform the GPU vendor driver to clean up resources.

Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
device sysfs.

7. What is not covered:
==================================================================================

7.1 QEMU console VNC

QEMU console VNC is not covered in this RFC as it is a pretty isolated module
and does not impact the basic vGPU functionality; also, we already had a good
discussion about the new VFIO interface that Alex is going to introduce to
allow us to describe a region for the VM surface.

8 Patches
==================================================================================

0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch - against QEMU 2.5.0

0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch  - against 4.4.0-rc5

Thanks,
Kirti and Neo


> 
> Jike will provide next level API definitions based on KVMGT requirement. 
> We can further refine it to match requirements of multi-vendors.
> 
> Thanks
> Kevin

[-- Attachment #2: 0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch --]
[-- Type: text/plain, Size: 64107 bytes --]

>From dc8ca387f7b06c6dfc85fb4bd79a760dca76e831 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Tue, 26 Jan 2016 01:21:11 -0800
Subject: [PATCH] Add VGPU and its TYPE1 IOMMU kernel module support

This is just a quick POC implementation to allow a GPU driver vendor to plug
into the VFIO framework to provide their virtual GPU support. This kernel
module provides a registration interface for GPU vendors and generic DMA
tracking APIs.

extern int vgpu_register_device(struct pci_dev *dev,
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu types.
 *                              @dev : pci device structure of physical GPU.
 *                              @config: should return string listing supported config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resources in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which vgpu
 *                                    should be created
 *                              @vm_uuid: uuid of the VM it is intended for
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: uuid of the VM the vgpu belongs to
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If the VM is running and vgpu_destroy is called,
 *                              that means the vGPU is being hot-unplugged.
 *                              Return an error if the VM is running and the
 *                              graphics driver doesn't support vgpu hotplug.
 * @vgpu_start:                 Called to initiate the vGPU initialization
 *                              process in the graphics driver when the VM
 *                              boots, before qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to tear down vGPU-related resources for
 *                              the VM.
 *                              @vm_uuid: UUID of the VM which is shutting down
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number of bytes to read
 *                              @address_space: specifies which address space
 *                              the request is for: pci_config_space, IO
 *                              register space or MMIO space.
 *                              Returns number of bytes read on success, or an error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number of bytes to be written
 *                              @address_space: specifies which address space
 *                              the request is for: pci_config_space, IO
 *                              register space or MMIO space.
 *                              Returns number of bytes written on success, or an error.
 * @vgpu_set_irqs:              Called to pass on the interrupt configuration
 *                              information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data : same as
 *                              that of struct vfio_irq_set of
 *                              VFIO_DEVICE_SET_IRQS API.
 *
 * A physical GPU that supports vGPU should register with the vgpu module
 * with a gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space,loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

Change-Id: Ib70304d9a600c311d5107a94b3fffa938926275b
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/Kconfig                      |   2 +
 drivers/Makefile                     |   1 +
 drivers/vfio/vfio.c                  |   5 +-
 drivers/vgpu/Kconfig                 |  26 ++
 drivers/vgpu/Makefile                |   5 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c | 511 ++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_dev.c              | 550 +++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h          |  47 +++
 drivers/vgpu/vgpu_sysfs.c            | 322 ++++++++++++++++++++
 drivers/vgpu/vgpu_vfio.c             | 521 +++++++++++++++++++++++++++++++++
 include/linux/vgpu.h                 | 157 ++++++++++
 11 files changed, 2144 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c
 create mode 100644 drivers/vgpu/vgpu_dev.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 drivers/vgpu/vgpu_sysfs.c
 create mode 100644 drivers/vgpu/vgpu_vfio.c
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339de85f..5fd9eae79914 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca714bf..142256b4358b 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VGPU)              += vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b793cbcb..af3ab413e119 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -947,19 +947,18 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container,
 		if (IS_ERR(data)) {
 			ret = PTR_ERR(data);
 			module_put(driver->ops->owner);
-			goto skip_drivers_unlock;
+			continue;
 		}
 
 		ret = __vfio_container_attach_groups(container, driver, data);
 		if (!ret) {
 			container->iommu_driver = driver;
 			container->iommu_data = data;
+			goto skip_drivers_unlock;
 		} else {
 			driver->ops->release(data);
 			module_put(driver->ops->owner);
 		}
-
-		goto skip_drivers_unlock;
 	}
 
 	mutex_unlock(&vfio.iommu_drivers_lock);
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 000000000000..698ddf907a16
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,26 @@
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    select VFIO_IOMMU_TYPE1_VGPU
+    help
+        VGPU provides a framework to virtualize GPUs without the SR-IOV cap.
+        See Documentation/vgpu.txt for more details.
+
+        If you don't know what to do here, say N.
+
+config VGPU
+    tristate
+    depends on VFIO
+    default n
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU 
+    default n
+
+config VFIO_IOMMU_TYPE1_VGPU
+    tristate
+    depends on VGPU_VFIO
+    default n
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 000000000000..098a3591a535
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,5 @@
+
+vgpu-y := vgpu_sysfs.o vgpu_dev.o vgpu_vfio.o
+
+obj-$(CONFIG_VGPU)	+= vgpu.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU) += vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 000000000000..6b20f1374b3b
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,511 @@
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
+
+// VFIO structures
+
+struct vfio_iommu_vgpu {
+	struct mutex lock;
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	struct rb_root dma_list;
+	struct mm_struct * vm_mm;
+};
+
+struct vgpu_vfio_dma {
+	struct rb_node node;
+	dma_addr_t iova;
+	unsigned long vaddr;
+	size_t size;
+	int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ *
+ */
+
+/*
+ * Duplicated from vfio_link_dma, just quick hack ... should
+ * reuse code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+			  struct vgpu_vfio_dma *new)
+{
+	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+	struct vgpu_vfio_dma *dma;
+
+	while (*link) {
+		parent = *link;
+		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+		if (new->iova + new->size <= dma->iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+					   dma_addr_t start, size_t size)
+{
+	struct rb_node *node = iommu->dma_list.rb_node;
+
+	while (node) {
+		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
+
+		if (start + size <= dma->iova)
+			node = node->rb_left;
+		else if (start >= dma->iova + dma->size)
+			node = node->rb_right;
+		else
+			return dma;
+	}
+
+	return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
+{
+	rb_erase(&old->node, &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+	struct vgpu_vfio_dma *c, *n;
+	uint32_t i = 0;
+
+	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
+		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_map *map)
+{
+	dma_addr_t iova = map->iova;
+	unsigned long vaddr = map->vaddr;
+	int ret = 0, prot = 0;
+	struct vgpu_vfio_dma *vgpu_dma;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -EEXIST;
+	}
+
+	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
+
+	if (!vgpu_dma) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -ENOMEM;
+	}
+
+	vgpu_dma->iova = iova;
+	vgpu_dma->vaddr = vaddr;
+	vgpu_dma->prot = prot;
+	vgpu_dma->size = map->size;
+
+	vgpu_link_dma(vgpu_iommu, vgpu_dma);
+
+	mutex_unlock(&vgpu_iommu->lock);
+	return ret;
+}
+
+static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_unmap *unmap)
+{
+	struct vgpu_vfio_dma *vgpu_dma;
+	size_t unmapped = 0;
+	int ret = 0;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
+	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
+	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
+		unmapped += vgpu_dma->size;
+		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	unmap->size = unmapped;
+
+	return ret;
+}
+
+/* Ugly hack to quickly test a single device ... */
+
+static struct vfio_iommu_vgpu *_local_iommu = NULL;
+
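+/*
+ * Map a guest-physical BAR range (virt_bar_addr, previously registered via
+ * VFIO_IOMMU_MAP_DMA) onto the physical BAR at phys_bar_addr by remapping
+ * the QEMU vma that backs that guest range.
+ */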
+int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
+			 uint32_t len, uint32_t flags)
+{
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	unsigned long remote_vaddr = 0;
+	struct vgpu_vfio_dma *vgpu_dma = NULL;
+	struct vm_area_struct *remote_vma = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+	int ret = 0;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	down_write(&mm->mmap_sem);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, virt_bar_addr, len /*  size */);
+	if (!vgpu_dma) {
+		printk(KERN_INFO "%s: fail locate guest physical:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	remote_vaddr = vgpu_dma->vaddr + virt_bar_addr - vgpu_dma->iova;
+
+	remote_vma = find_vma(mm, remote_vaddr);
+
+	if (remote_vma == NULL) {
+		printk(KERN_INFO "%s: failed to locate vma for physical addr:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	} else {
+		printk(KERN_INFO "%s: located vma, addr:0x%lx\n",
+		       __FUNCTION__, remote_vma->vm_start);
+	}
+
+	remote_vma->vm_page_prot = pgprot_noncached(remote_vma->vm_page_prot);
+
+	remote_vma->vm_pgoff = phys_bar_addr >> PAGE_SHIFT;
+
+	/* remap at the host virtual address backing this iova, not the iova itself */
+	ret = remap_pfn_range(remote_vma, remote_vaddr, remote_vma->vm_pgoff,
+			      len, remote_vma->vm_page_prot);
+
+	if (ret) {
+		printk(KERN_INFO "%s: fail to remap vma:%d\n", __FUNCTION__, ret);
+		goto unlock;
+	}
+
+unlock:
+
+	up_write(&mm->mmap_sem);
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_map_virtual_bar);
+
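+/*
+ * Translate an array of guest pfns into host pfns: for each gfn, look up
+ * the QEMU virtual address registered for that iova, pin the backing page
+ * with get_user_pages, and return the host pfn in place.
+ */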
+int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
+{
+	int i = 0, ret = 0, prot = 0;
+	unsigned long remote_vaddr = 0, pfn = 0;
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	struct vgpu_vfio_dma *vgpu_dma;
+	struct page *page[1];
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+
+	prot = IOMMU_READ | IOMMU_WRITE;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	for (i = 0; i < count; i++) {
+		dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
+		vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
+
+		if (!vgpu_dma) {
+			printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova);
+			ret = -EINVAL;
+			goto unlock;
+		}
+
+		remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
+		printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n",
+			__FUNCTION__, i, vgpu_dma->iova,
+			vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
+
+		if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
+			pfn = page_to_pfn(page[0]);
+			printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn);
+		}
+		else {
+			printk(KERN_INFO "%s: fail to pin pfn[%d]\n", __FUNCTION__, i);
+			ret = -ENOMEM;
+			goto unlock;
+		}
+
+		gfn_buffer[i] = pfn;
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_dma_do_translate);
+
+static void *vfio_iommu_vgpu_open(unsigned long arg)
+{
+	struct vfio_iommu_vgpu *iommu;
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&iommu->lock);
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	/* TODO: Keep track the v2 vs. v1, for now only assume
+	 * we are v2 due to QEMU code */
+	_local_iommu = iommu;
+	return iommu;
+}
+
+static void vfio_iommu_vgpu_release(void *iommu_data)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	kfree(iommu);
+	printk(KERN_INFO "%s", __FUNCTION__);
+}
+
+static long vfio_iommu_vgpu_ioctl(void *iommu_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
+
+	switch (cmd) {
+	case VFIO_CHECK_EXTENSION:
+	{
+		if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
+			return 1;
+		else
+			return 0;
+	}
+
+	case VFIO_IOMMU_GET_INFO:
+	{
+		struct vfio_iommu_type1_info info;
+		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+	}
+	case VFIO_IOMMU_MAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_map map;
+		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n",
+			map.flags, map.vaddr, map.iova, map.size);
+
+		/*
+		 * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally,
+		 * this should be merged into one single file and reuse data
+		 * structure
+		 *
+		 */
+		ret = vgpu_dma_do_track(vgpu_iommu, &map);
+		break;
+	}
+	case VFIO_IOMMU_UNMAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz)
+			return -EINVAL;
+
+		ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
+		break;
+	}
+	default:
+	{
+		printk(KERN_INFO "%s cmd default ", __FUNCTION__);
+		ret = -ENOTTY;
+		break;
+	}
+	}
+
+	return ret;
+}
+
+
+static int vfio_iommu_vgpu_attach_group(void *iommu_data,
+		                        struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	vgpu_dev = get_vgpu_device_from_group(iommu_group);
+	if (vgpu_dev) {
+		iommu->vgpu_dev = vgpu_dev;
+		iommu->group = iommu_group;
+
+		/* IOMMU shares the same life cycle as VM MM */
+		iommu->vm_mm = current->mm;
+
+		printk(KERN_INFO "%s index %d", __FUNCTION__, vgpu_dev->minor);
+		return 0;
+	}
+	/* no vgpu device found in this group */
+	return -ENODEV;
+}
+
+static void vfio_iommu_vgpu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+	iommu->vm_mm = NULL;
+	iommu->group = NULL;
+}
+
+
+static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
+	.name           = "vgpu_vfio",
+	.owner          = THIS_MODULE,
+	.open           = vfio_iommu_vgpu_open,
+	.release        = vfio_iommu_vgpu_release,
+	.ioctl          = vfio_iommu_vgpu_ioctl,
+	.attach_group   = vfio_iommu_vgpu_attach_group,
+	.detach_group   = vfio_iommu_vgpu_detach_group,
+};
+
+
+int vgpu_vfio_iommu_init(void)
+{
+	int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc);
+	}
+
+	return rc;
+}
+
+void vgpu_vfio_iommu_exit(void)
+{
+	// unregister vgpu_vfio driver
+	vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+}
+
+
+module_init(vgpu_vfio_iommu_init);
+module_exit(vgpu_vfio_iommu_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
diff --git a/drivers/vgpu/vgpu_dev.c b/drivers/vgpu/vgpu_dev.c
new file mode 100644
index 000000000000..1d4eb235122c
--- /dev/null
+++ b/drivers/vgpu/vgpu_dev.c
@@ -0,0 +1,550 @@
+/*
+ * VGPU core
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME		"vgpu"
+
+#define VGPU_DEV_NAME		"vgpu"
+
+// TODO remove these defines
+// minor number reserved for control device
+#define VGPU_CONTROL_DEVICE       0
+
+#define VGPU_CONTROL_DEVICE_NAME  "vgpuctl"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+	dev_t               vgpu_devt;
+	struct class        *class;
+	struct cdev         vgpu_cdev;
+	struct list_head    vgpu_devices_list;  // Head entry for the doubly linked vgpu_device list
+	struct mutex        vgpu_devices_lock;
+	struct idr          vgpu_idr;
+	struct list_head    gpu_devices_list;
+	struct mutex        gpu_devices_lock;
+} vgpu;
+
+
+/*
+ * Function prototypes
+ */
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev);
+
+unsigned int vgpu_poll(struct file *file, poll_table *wait);
+long vgpu_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long i_arg);
+int vgpu_mmap(struct file *file, struct vm_area_struct *vma);
+
+int vgpu_open(struct inode *inode, struct file *file);
+int vgpu_close(struct inode *inode, struct file *file);
+ssize_t vgpu_read(struct file *file, char __user * buf,
+		      size_t len, loff_t * ppos);
+ssize_t vgpu_write(struct file *file, const char __user *data,
+		       size_t len, loff_t *ppos);
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->group) {
+			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
+				mutex_unlock(&vgpu.vgpu_devices_lock);
+				return vdev;
+			}
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev, *tmp;
+
+	if (!dev)
+		return -EINVAL;
+
+	gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+	if (!gpu_dev)
+		return -ENOMEM;
+
+	gpu_dev->dev = dev;
+	gpu_dev->ops = ops;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+
+	/* Check for duplicates */
+	list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+		if (tmp->dev == dev) {
+			mutex_unlock(&vgpu.gpu_devices_lock);
+			kfree(gpu_dev);
+			return -EINVAL;
+		}
+	}
+
+	ret = vgpu_create_pci_device_files(dev);
+	if (ret) {
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		kfree(gpu_dev);
+		return ret;
+	}
+
+	list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev == dev) {
+			printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+			vgpu_remove_pci_device_files(dev);
+			list_del(&gpu_dev->gpu_next);
+			mutex_unlock(&vgpu.gpu_devices_lock);
+			kfree(gpu_dev);
+			return;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+
+/*
+ *  Static functions
+ */
+
+static struct file_operations vgpu_fops = {
+	.owner          = THIS_MODULE,
+};
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev)
+{
+	if (vgpu_dev->dev) {
+		device_destroy(vgpu.class, vgpu_dev->dev->devt);
+		vgpu_dev->dev = NULL;
+	}
+}
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
+{
+	struct vgpu_device *vgpu_dev = NULL;
+
+	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+	if (!vgpu_dev)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vgpu_dev->kref);
+	memcpy(&vgpu_dev->vm_uuid, &uuid, sizeof(uuid_le));
+	vgpu_dev->vgpu_instance = instance;
+	strcpy(vgpu_dev->dev_name, name);
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_del(&vgpu_dev->list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	kfree(vgpu_dev);
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+struct vgpu_device *find_vgpu_device(struct device *dev)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->dev == dev) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id)
+{
+	int minor;
+	char name[64];
+	int retval = 0;
+
+	struct iommu_group *group = NULL;
+	struct device *dev = NULL;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	struct gpu_device *gpu_dev;
+
+	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+	snprintf(name, sizeof(name), "%pUb-%d", vm_uuid.b, instance);
+
+	vgpu_dev = vgpu_device_alloc(vm_uuid, instance, name);
+	if (IS_ERR(vgpu_dev)) {
+		return PTR_ERR(vgpu_dev);
+	}
+
+	// check if VM device is present
+	// if not present, create with devt=0 and parent=NULL
+	// create device for instance with devt= MKDEV(vgpu.major, minor)
+	// and parent=VM device
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_dev->vgpu_id = vgpu_id;
+
+	// TODO on removing control device change the 3rd parameter to 0
+	minor = idr_alloc(&vgpu.vgpu_idr, vgpu_dev, 1, MINORMASK + 1, GFP_KERNEL);
+	if (minor < 0) {
+		retval = minor;
+		goto create_failed;
+	}
+
+	dev = device_create(vgpu.class, NULL, MKDEV(MAJOR(vgpu.vgpu_devt), minor), NULL, "%s", name);
+	if (IS_ERR(dev)) {
+		retval = PTR_ERR(dev);
+		goto create_failed1;
+	}
+
+	vgpu_dev->dev = dev;
+	vgpu_dev->minor = minor;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev == pdev) {
+			vgpu_dev->gpu_dev = gpu_dev;
+			if (gpu_dev->ops->vgpu_create) {
+				retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->vm_uuid,
+								   instance, vgpu_id);
+				if (retval)
+				{
+					mutex_unlock(&vgpu.gpu_devices_lock);
+					goto create_failed2;
+				}
+			}
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	if (!vgpu_dev->gpu_dev) {
+		retval = -EINVAL;
+		goto create_failed2;
+	}
+
+	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->vm_uuid.b);
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		printk(KERN_ERR "VGPU: failed to allocate group!\n");
+		retval = PTR_ERR(group);
+		goto create_failed2;
+	}
+
+	retval = iommu_group_add_device(group, dev);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+		iommu_group_put(group);
+		goto create_failed2;
+	}
+
+	retval = vgpu_group_init(vgpu_dev, group);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed vgpu_group_init \n");
+		iommu_group_put(group);
+		iommu_group_remove_device(dev);
+		goto create_failed2;
+	}
+
+	vgpu_dev->group = group;
+	printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return retval;
+
+create_failed2:
+	vgpu_device_destroy(vgpu_dev);
+
+create_failed1:
+	idr_remove(&vgpu.vgpu_idr, minor);
+
+create_failed:
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+
+	return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+	struct device *dev = vgpu_dev->dev;
+
+	if (!dev) {
+		return;
+	}
+
+	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+	if (vgpu_dev->gpu_dev->ops->vgpu_destroy) {
+		int retval = 0;
+		retval = vgpu_dev->gpu_dev->ops->vgpu_destroy(vgpu_dev->gpu_dev->dev,
+							      vgpu_dev->vm_uuid,
+							      vgpu_dev->vgpu_instance);
+	/* if vendor driver doesn't return success that means vendor driver doesn't
+	 * support hot-unplug */
+		if (retval)
+			return;
+	}
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_group_free(vgpu_dev);
+	iommu_group_put(dev->iommu_group);
+	iommu_group_remove_device(dev);
+	vgpu_device_destroy(vgpu_dev);
+	idr_remove(&vgpu.vgpu_idr, vgpu_dev->minor);
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+}
+
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev, *vgpu_dev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	// search VGPU device
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			vgpu_dev = vdev;
+			break;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	if (vgpu_dev)
+		destroy_vgpu_device(vgpu_dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (&gpu_dev->dev->dev == dev) {
+			if (gpu_dev->ops->vgpu_supported_config)
+				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_start)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_start(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_shutdown)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_shutdown(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+			   unsigned index, unsigned start, unsigned count,
+			   void *data)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_set_irqs)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_set_irqs(vgpu_dev, flags,
+							    index, start, count, data);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
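+/* device nodes are created under /dev/vgpu/<device name> */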
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static struct class vgpu_class = {
+	.name		= VGPU_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= vgpu_class_attrs,
+	.dev_groups	= vgpu_dev_groups,
+	.devnode	= vgpu_devnode,
+};
+
+static int __init vgpu_init(void)
+{
+	int rc = 0;
+
+	memset(&vgpu, 0 , sizeof(vgpu));
+
+	idr_init(&vgpu.vgpu_idr);
+	mutex_init(&vgpu.vgpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	// get major number from kernel
+	rc = alloc_chrdev_region(&vgpu.vgpu_devt, 0, MINORMASK, VGPU_DEV_NAME);
+
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu drv, err:%d\n", rc);
+		return rc;
+	}
+
+	cdev_init(&vgpu.vgpu_cdev, &vgpu_fops);
+	rc = cdev_add(&vgpu.vgpu_cdev, vgpu.vgpu_devt, MINORMASK);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to add vgpu cdev, err:%d\n", rc);
+		unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+		return rc;
+	}
+
+	printk(KERN_INFO "major_number:%d is allocated for vgpu\n", MAJOR(vgpu.vgpu_devt));
+
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	vgpu.class = &vgpu_class;
+
+	return rc;
+
+failed1:
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	// TODO: Release all unclosed fd
+	struct vgpu_device *vdev = NULL, *tmp;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry_safe(vdev, tmp, &vgpu.vgpu_devices_list, list) {
+		printk(KERN_INFO "VGPU: exit destroying device %s ", vdev->dev_name);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		destroy_vgpu_device(vdev);
+		mutex_lock(&vgpu.vgpu_devices_lock);
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	idr_destroy(&vgpu.vgpu_idr);
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+	class_unregister(&vgpu_class);
+	vgpu.class = NULL;
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 000000000000..7e3c400d29f7
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,47 @@
+/*
+ * VGPU internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group);
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev);
+
+struct vgpu_device *find_vgpu_device(struct device *dev);
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int vgpu_create_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_notify_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_remove_status_file(struct vgpu_device *vgpu_dev);
+
+int vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/drivers/vgpu/vgpu_sysfs.c b/drivers/vgpu/vgpu_sysfs.c
new file mode 100644
index 000000000000..e48cbcd6948d
--- /dev/null
+++ b/drivers/vgpu/vgpu_sysfs.c
@@ -0,0 +1,322 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+
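+/*
+ * Parse a canonical 36-character UUID string
+ * ("xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx") into a uuid_le;
+ * '-' and ':' separators are both tolerated.
+ */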
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < 36)
+		return -EINVAL;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			printk(KERN_ERR "%s err", __FUNCTION__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	char *str;
+	ssize_t n;
+
+	str = kzalloc(512, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	get_vgpu_supported_types(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
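+/*
+ * sysfs 'vgpu_create' write format: "<VM UUID>:<instance>:<vgpu type id>",
+ * e.g. echo "12345678-1234-1234-1234-123456789abc:0:1" > .../vgpu_create
+ */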
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *instance_str, *str, *str_copy;
+	uuid_le vm_uuid;
+	uint32_t instance, vgpu_id;
+	struct pci_dev *pdev;
+	ssize_t ret = -EINVAL;
+
+	/* keep the original pointer around: strsep() advances str */
+	str_copy = str = kstrndup(buf, count, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s: empty UUID or string %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s: vgpu type and instance not specified in %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if ((instance_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s: empty instance or string %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s: vgpu type not specified in %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+	vgpu_id = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s: UUID parse error in %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+
+		if (create_vgpu_device(pdev, vm_uuid, instance, vgpu_id) < 0) {
+			printk(KERN_ERR "%s: vgpu create error\n", __FUNCTION__);
+			goto out;
+		}
+	}
+
+	ret = count;
+
+out:
+	kfree(str_copy);
+	return ret;
+}
+
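+/* sysfs 'vgpu_destroy' write format: "<VM UUID>:<instance>" */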
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *str, *str_copy;
+	uuid_le vm_uuid;
+	unsigned int instance;
+	ssize_t ret = -EINVAL;
+
+	/* keep the original pointer around: strsep() advances str */
+	str_copy = str = kstrndup(buf, count, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s: empty UUID or string %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s: instance not specified in %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	instance = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s: UUID parse error in %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	printk(KERN_INFO "%s: UUID %pUb - %d\n", __FUNCTION__, vm_uuid.b, instance);
+
+	destroy_vgpu_device_by_uuid(vm_uuid, instance);
+	ret = count;
+
+out:
+	kfree(str_copy);
+	return ret;
+}
+
+static ssize_t
+vgpu_vm_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv)
+		return sprintf(buf, "%pUb \n", drv->vm_uuid.b);
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_vm_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv && drv->group)
+		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+	&dev_attr_vgpu_vm_uuid.attr,
+	&dev_attr_vgpu_group_id.attr,
+	NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+	.attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+	&vgpu_dev_group,
+	NULL,
+};
+
+
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_parse(vm_uuid_str, &vm_uuid);
+	kfree(vm_uuid_str);
+	if (ret < 0) {
+		printk(KERN_ERR "%s: UUID parse error %s\n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_ONLINE);
+
+		ret = vgpu_start_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s: vgpu_start callback failed %d\n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_parse(vm_uuid_str, &vm_uuid);
+	kfree(vm_uuid_str);
+	if (ret < 0) {
+		printk(KERN_ERR "%s: UUID parse error %s\n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_OFFLINE);
+
+		ret = vgpu_shutdown_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s: vgpu_shutdown callback failed %d\n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+	__ATTR_WO(vgpu_start),
+	__ATTR_WO(vgpu_shutdown),
+	__ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
+
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 000000000000..ef0833140d84
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,521 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
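+/*
+ * Like vfio-pci, the region index is packed into the top bits of the file
+ * offset: region 1 starts at offset 1ULL << 40, region 2 at 2ULL << 40, etc.
+ * The low 40 bits are the offset within the region.
+ */
+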
+struct vfio_vgpu_device {
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+};
+
+static int vgpu_dev_open(void *device_data)
+{
+	printk(KERN_INFO "%s ", __FUNCTION__);
+	return 0;
+}
+
+static void vgpu_dev_close(void *device_data)
+{
+
+}
+
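+/*
+ * BAR sizes are hardcoded placeholders for this POC; they should
+ * eventually be reported by the vendor GPU driver per vgpu type.
+ */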
+static uint64_t resource_len(struct vgpu_device *vgpu_dev, int bar_index)
+{
+	uint64_t size = 0;
+
+	switch (bar_index) {
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		size = 16 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		size = 256 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR2_REGION_INDEX:
+		size = 32 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR5_REGION_INDEX:
+		size = 128;
+		break;
+	default:
+		size = 0;
+		break;
+	}
+	return size;
+}
+
+static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
+{
+	return 1;
+}
+
+static long vgpu_dev_unlocked_ioctl(void *device_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd)
+	{
+		case VFIO_DEVICE_GET_INFO:
+		{
+			struct vfio_device_info info;
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index = %d", __FUNCTION__, vdev->vgpu_dev->minor);
+			minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			info.flags = VFIO_DEVICE_FLAGS_PCI;
+			info.num_regions = VFIO_PCI_NUM_REGIONS;
+			info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+			return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+		}
+
+		case VFIO_DEVICE_GET_REGION_INFO:
+		{
+			struct vfio_region_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd", __FUNCTION__);
+
+			minsz = offsetofend(struct vfio_region_info, offset);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_CONFIG_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0x100;	/* standard 256-byte PCI config space */
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+							VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+				case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = resource_len(vdev->vgpu_dev, info.index);
+					if (!info.size) {
+						info.flags = 0;
+						break;
+					}
+
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+
+					/* FIXME: PCI_BASE_ADDRESS_MEM_PREFETCH is a PCI
+					 * config space bit, not a VFIO region flag;
+					 * prefetchability belongs in the BAR register.
+					 */
+					if ((info.index == VFIO_PCI_BAR1_REGION_INDEX) ||
+					     (info.index == VFIO_PCI_BAR2_REGION_INDEX)) {
+						info.flags |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+					}
+
+					/* TODO: provides configurable setups to
+					 * GPU vendor
+					 */
+
+					if (info.index == VFIO_PCI_BAR1_REGION_INDEX)
+						info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
+
+					break;
+				case VFIO_PCI_VGA_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0xc0000;
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+
+				case VFIO_PCI_ROM_REGION_INDEX:
+				default:
+					return -EINVAL;
+			}
+
+			return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+
+		}
+		case VFIO_DEVICE_GET_IRQ_INFO:
+		{
+			struct vfio_irq_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", __FUNCTION__);
+			minsz = offsetofend(struct vfio_irq_info, count);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+				case VFIO_PCI_REQ_IRQ_INDEX:
+					break;
+					/* pass thru to return error */
+				default:
+					return -EINVAL;
+			}
+
+			info.flags = VFIO_IRQ_INFO_EVENTFD;
+			info.count = vgpu_get_irq_count(vdev, info.index);
+
+			if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+				info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+						VFIO_IRQ_INFO_AUTOMASKED);
+			else
+				info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+			return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+		}
+
+		case VFIO_DEVICE_SET_IRQS:
+		{
+			struct vfio_irq_set hdr;
+			u8 *data = NULL;
+			int ret = 0;
+
+			minsz = offsetofend(struct vfio_irq_set, count);
+
+			if (copy_from_user(&hdr, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+					hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+						VFIO_IRQ_SET_ACTION_TYPE_MASK))
+				return -EINVAL;
+
+			if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+				size_t size;
+				int max = vgpu_get_irq_count(vdev, hdr.index);
+
+				if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+					size = sizeof(uint8_t);
+				else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+					size = sizeof(int32_t);
+				else
+					return -EINVAL;
+
+				if (hdr.argsz - minsz < hdr.count * size ||
+				    hdr.start >= max || hdr.start + hdr.count > max)
+					return -EINVAL;
+
+				data = memdup_user((void __user *)(arg + minsz),
+						hdr.count * size);
+				if (IS_ERR(data))
+					return PTR_ERR(data);
+
+			}
+			ret = vgpu_set_irqs_callback(vdev->vgpu_dev, hdr.flags, hdr.index,
+					hdr.start, hdr.count, data);
+			kfree(data);
+
+			return ret;
+		}
+
+		default:
+			return -EINVAL;
+	}
+	return ret;
+}
+
+
+ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	int cfg_size = sizeof(vgpu_dev->config_space);
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= cfg_size || pos + count > cfg_size) {
+		printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto config_rw_exit;
+		}
+
+		/* FIXME: Need to save the BAR value properly */
+		switch (pos) {
+		case PCI_BASE_ADDRESS_0:
+			vgpu_dev->bar[0].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_1:
+			vgpu_dev->bar[1].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_2:
+			vgpu_dev->bar[2].start = *((uint32_t *)user_data);
+			break;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_config,
+							    pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_config,
+							   pos);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+		}
+		kfree(ret_data);
+	}
+
+config_rw_exit:
+
+	return ret;
+}
+
+ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	uint64_t end;
+	int ret = 0;
+
+	if (!vgpu_dev->bar[bar_index].start) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	end = resource_len(vgpu_dev, bar_index);
+
+	if (offset >= end) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vgpu_dev->bar[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_mmio,
+							    pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_mmio,
+							   pos);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+			}
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		case VFIO_PCI_VGA_REGION_INDEX:
+			break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, (char *)buf, count, ppos, true);
+
+	return ret;
+}
+
+/* Just create an invalid mapping without providing a fault handler */
+
+static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	printk(KERN_INFO "%s ", __FUNCTION__);
+	return 0;
+}
+
+static const struct vfio_device_ops vgpu_vfio_dev_ops = {
+	.name		= "vfio-vgpu-grp",
+	.open		= vgpu_dev_open,
+	.release	= vgpu_dev_close,
+	.ioctl		= vgpu_dev_unlocked_ioctl,
+	.read		= vgpu_dev_read,
+	.write		= vgpu_dev_write,
+	.mmap		= vgpu_dev_mmap,
+};
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group)
+{
+	struct vfio_vgpu_device *vdev;
+	int ret = 0;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		return -ENOMEM;
+	}
+
+	vdev->group = group;
+	vdev->vgpu_dev = vgpu_dev;
+
+	ret = vfio_add_group_dev(vgpu_dev->dev, &vgpu_vfio_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	return ret;
+}
+
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev)
+{
+	struct vfio_vgpu_device *vdev;
+
+	vdev = vfio_del_group_dev(vgpu_dev->dev);
+	if (!vdev)
+		return -EINVAL;
+
+	kfree(vdev);
+	return 0;
+}
+
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 000000000000..a2861c3f42e5
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,157 @@
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t end;
+	int flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0, /*!< PCI configuration space */
+	vgpu_emul_space_io = 1,     /*!< I/O register space */
+	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		*dev;
+	int minor;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			vm_uuid;
+	uint32_t		vgpu_instance;
+	uint32_t		vgpu_id;
+	atomic_t		usage_count;
+	char			config_space[0x100];	/* 256-byte PCI config space */
+	struct pci_bar_info	bar[VFIO_PCI_NUM_REGIONS];
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:			The module owner.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev : pci device structure of physical GPU.
+ *				@config: should return string listing supported config
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resources in graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which vgpu
+ *				      should be created
+ *				@vm_uuid: VM's uuid for which VM it is intended to
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_id: This represents the type of vgpu to be
+ *					  created
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure to which
+ *				this vgpu points to.
+ *				@vm_uuid: VM's uuid for which the vgpu belongs to.
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If the VM is running when vgpu_destroy is called,
+ *				the vGPU is being hot-unplugged. Return an error
+ *				if the VM is running and the graphics driver
+ *				doesn't support vGPU hotplug.
+ * @vgpu_start:			Called to initiate the vGPU initialization
+ *				process in the graphics driver when the VM boots,
+ *				before qemu starts.
+ *				@vm_uuid: VM's UUID which is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to teardown vGPU related resources for
+ *				the VM
+ *				@vm_uuid: VM's UUID which is shutting down.
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number bytes to read
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Returns the number of bytes read on success, or an error.
+ * @write:			Write emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number bytes to be written
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Returns the number of bytes written on success, or an error.
+ * @vgpu_set_irqs:		Called to convey the interrupt configuration
+ *				that qemu has set.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data : same as
+ *				that of struct vfio_irq_set of
+ *				VFIO_DEVICE_SET_IRQS API.
+ *
+ * A physical GPU that supports vGPU should register with the vgpu module
+ * using this gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module   *owner;
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
+			       uint32_t instance, uint32_t vgpu_id);
+	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
+			        uint32_t instance);
+	int     (*vgpu_start)(uuid_le vm_uuid);
+	int     (*vgpu_shutdown)(uuid_le vm_uuid);
+	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev                  *dev;
+	const struct gpu_device_ops     *ops;
+	struct list_head                gpu_next;
+};
+
+extern int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr, uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
+
-- 
1.8.1.4
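
For reference, a minimal vendor-driver registration against the
gpu_device_ops API above might look like the sketch below. This is
illustrative only -- the my_gpu_* names and the trivial callback bodies
are made up for the example, not part of the patch:

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/string.h>
#include <linux/uuid.h>
#include <linux/vgpu.h>

static int my_vgpu_supported_config(struct pci_dev *dev, char *config)
{
	/* comma-separated list of vgpu types this GPU can expose */
	strcpy(config, "my_vgpu_small,my_vgpu_large");
	return 0;
}

static int my_vgpu_create(struct pci_dev *dev, uuid_le vm_uuid,
			  uint32_t instance, uint32_t vgpu_id)
{
	/* allocate hardware resources for this vGPU instance */
	return 0;
}

static int my_vgpu_destroy(struct pci_dev *dev, uuid_le vm_uuid,
			   uint32_t instance)
{
	/* release the resources allocated in my_vgpu_create() */
	return 0;
}

static ssize_t my_vgpu_read(struct vgpu_device *vdev, char *buf,
			    size_t count, uint32_t address_space, loff_t pos)
{
	/* emulate config/MMIO reads; this sketch returns all-zero registers */
	memset(buf, 0, count);
	return count;
}

static ssize_t my_vgpu_write(struct vgpu_device *vdev, char *buf,
			     size_t count, uint32_t address_space, loff_t pos)
{
	/* a real driver would decode and apply the access; discard it here */
	return count;
}

static const struct gpu_device_ops my_gpu_ops = {
	.owner			= THIS_MODULE,
	.vgpu_supported_config	= my_vgpu_supported_config,
	.vgpu_create		= my_vgpu_create,
	.vgpu_destroy		= my_vgpu_destroy,
	.read			= my_vgpu_read,
	.write			= my_vgpu_write,
};

/* called from the vendor driver's pci_driver .probe() callback */
static int my_gpu_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	return vgpu_register_device(pdev, &my_gpu_ops);
}

Once registered, the vgpu core creates the vgpu_create/vgpu_destroy sysfs
files on the physical PCI device, and a vGPU instance can then be created
by writing "<VM UUID>:<instance>:<type>" to vgpu_create.
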


[-- Attachment #3: 0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch --]
[-- Type: text/plain, Size: 30722 bytes --]

From 380156ade7053664bdb318af0659708357f40050 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Sun, 24 Jan 2016 11:24:13 -0800
Subject: [PATCH] Add VGPU VFIO driver class support in QEMU

This is just a quick POC change to allow us to experiment with the VGPU VFIO
support; the next step is to merge this into the current vfio/pci.c, which
currently assumes physical backing devices.

Within the current POC implementation, we have copied lots of functions
directly from the vfio/pci.c code; we should merge them together later.

    - Basic MMIO and PCI config access is supported

    - MMAP'ed GPU BAR is supported

    - INTx and MSI using eventfd are supported; we don't think we should
      inject interrupts when vector->kvm_interrupt is not enabled.

Change-Id: I99c34ac44524cd4d7d2abbcc4d43634297b96e80

Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/Makefile.objs |   1 +
 hw/vfio/vgpu.c        | 991 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci.h  |   3 +
 3 files changed, 995 insertions(+)
 create mode 100644 hw/vfio/vgpu.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d324863..17f2ef1 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,7 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o pci-quirks.o
+obj-$(CONFIG_PCI) += vgpu.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/vgpu.c b/hw/vfio/vgpu.c
new file mode 100644
index 0000000..56ebce0
--- /dev/null
+++ b/hw/vfio/vgpu.c
@@ -0,0 +1,991 @@
+/*
+ * vGPU VFIO device
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <dirent.h>
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "config.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/pci.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
+#include "trace.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio-common.h"
+#include "qmp-commands.h"
+
+#define TYPE_VFIO_VGPU "vfio-vgpu"
+
+typedef struct VFIOvGPUDevice {
+    PCIDevice pdev;
+    VFIODevice vbasedev;
+    VFIOINTx intx;
+    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
+    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
+    unsigned int config_size;
+    char  *vgpu_type;
+    char *vm_uuid;
+    off_t config_offset; /* Offset of config space region within device fd */
+    int msi_cap_size;
+    EventNotifier req_notifier;
+    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
+    int interrupt; /* Current interrupt type */
+    VFIOMSIVector *msi_vectors;
+} VFIOvGPUDevice;
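+
+/*
+ * VFIOvGPUDevice intentionally mirrors the interrupt and BAR state kept in
+ * VFIOPCIDevice, from which much of this file is copied.
+ */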
+
+/*
+ * Local functions
+ */
+
+// function prototypes
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev);
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len);
+
+
+// INTx functions
+
+static void vfio_vgpu_intx_interrupt(void *opaque)
+{
+    VFIOvGPUDevice *vdev = opaque;
+
+    if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) {
+        return;
+    }
+
+    vdev->intx.pending = true;
+    pci_irq_assert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, false);
+}
+
+static void vfio_vgpu_intx_eoi(VFIODevice *vbasedev)
+{
+    VFIOvGPUDevice *vdev = container_of(vbasedev, VFIOvGPUDevice, vbasedev);
+
+    if (!vdev->intx.pending) {
+        return;
+    }
+
+    trace_vfio_intx_eoi(vbasedev->name);
+
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+    vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+}
+
+static void vfio_vgpu_intx_enable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_RESAMPLE,
+    };
+    struct vfio_irq_set *irq_set;
+    int ret, argsz;
+    int32_t *pfd;
+
+    if (!kvm_irqfds_enabled() ||
+        vdev->intx.route.mode != PCI_INTX_ENABLED ||
+        !kvm_resamplefds_enabled()) {
+        return;
+    }
+
+    /* Get to a known interrupt state */
+    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Get an eventfd for resample/unmask */
+    if (event_notifier_init(&vdev->intx.unmask, 0)) {
+        error_report("vfio: Error: event_notifier_init failed eoi");
+        goto fail;
+    }
+
+    /* KVM triggers it, VFIO listens for it */
+    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
+
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to setup resample irqfd: %m");
+        goto fail_irqfd;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = irqfd.resamplefd;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx unmask fd: %m");
+        goto fail_vfio;
+    }
+
+    /* Let'em rip */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    vdev->intx.kvm_accel = true;
+
+    trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
+
+    return;
+
+fail_vfio:
+    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
+    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+fail_irqfd:
+    event_notifier_cleanup(&vdev->intx.unmask);
+fail:
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+#endif
+}
+
+static void vfio_vgpu_intx_disable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_DEASSIGN,
+    };
+
+    if (!vdev->intx.kvm_accel) {
+        return;
+    }
+
+    /*
+     * Get to a known state, hardware masked, QEMU ready to accept new
+     * interrupts, QEMU IRQ de-asserted.
+     */
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Tell KVM to stop listening for an INTx irqfd */
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to disable INTx irqfd: %m");
+    }
+
+    /* We only need to close the eventfd for VFIO to cleanup the kernel side */
+    event_notifier_cleanup(&vdev->intx.unmask);
+
+    /* QEMU starts listening for interrupt events. */
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    vdev->intx.kvm_accel = false;
+
+    /* If we've missed an event, let it re-fire through QEMU */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    trace_vfio_intx_disable_kvm(vdev->vbasedev.name);
+#endif
+}
+
+static void vfio_vgpu_intx_update(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    PCIINTxRoute route;
+
+    if (vdev->interrupt != VFIO_INT_INTx) {
+        return;
+    }
+
+    route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin);
+
+    if (!pci_intx_route_changed(&vdev->intx.route, &route)) {
+        return; /* Nothing changed */
+    }
+
+    trace_vfio_intx_update(vdev->vbasedev.name,
+                           vdev->intx.route.irq, route.irq);
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+
+    vdev->intx.route = route;
+
+    if (route.mode != PCI_INTX_ENABLED) {
+        return;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    /* Re-enable the interrupt in case we missed an EOI */
+    vfio_vgpu_intx_eoi(&vdev->vbasedev);
+}
+
+static int vfio_vgpu_intx_enable(VFIOvGPUDevice *vdev)
+{
+    uint8_t pin = vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
+    int ret, argsz;
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+
+    if (!pin) {
+        return 0;
+    }
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
+    pci_config_set_interrupt_pin(vdev->pdev.config, pin);
+
+#ifdef CONFIG_KVM
+    /*
+     * Only conditional to avoid generating error messages on platforms
+     * where we won't actually use the result anyway.
+     */
+    if (kvm_irqfds_enabled() && kvm_resamplefds_enabled()) {
+        vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,
+                                                        vdev->intx.pin);
+    }
+#endif
+
+    ret = event_notifier_init(&vdev->intx.interrupt, 0);
+    if (ret) {
+        error_report("vfio: Error: event_notifier_init failed");
+        return ret;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(*pfd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx fd: %m");
+        /* irq_set was freed above; fetch the fd from the notifier instead */
+        qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
+                            NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->intx.interrupt);
+        return -errno;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    vdev->interrupt = VFIO_INT_INTx;
+
+    trace_vfio_intx_enable(vdev->vbasedev.name);
+
+    return 0;
+}
+
+static void vfio_vgpu_intx_disable(VFIOvGPUDevice *vdev)
+{
+    int fd;
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, true);
+
+    fd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(fd, NULL, NULL, vdev);
+    event_notifier_cleanup(&vdev->intx.interrupt);
+
+    vdev->interrupt = VFIO_INT_NONE;
+
+    trace_vfio_intx_disable(vdev->vbasedev.name);
+}
+
+/* MSI functions */
+static void vfio_vgpu_remove_kvm_msi_virq(VFIOMSIVector *vector)
+{
+    kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                          vector->virq);
+    kvm_irqchip_release_virq(kvm_state, vector->virq);
+    vector->virq = -1;
+    event_notifier_cleanup(&vector->kvm_interrupt);
+}
+
+static void vfio_vgpu_msi_disable_common(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        if (vdev->msi_vectors[i].use) {
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+    }
+
+    g_free(vdev->msi_vectors);
+    vdev->msi_vectors = NULL;
+    vdev->nr_vectors = 0;
+    vdev->interrupt = VFIO_INT_NONE;
+
+    vfio_vgpu_intx_enable(vdev);
+}
+
+static void vfio_vgpu_msi_disable(VFIOvGPUDevice *vdev)
+{
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSI_IRQ_INDEX);
+    vfio_vgpu_msi_disable_common(vdev);
+}
+
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev)
+{
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        vfio_vgpu_msi_disable(vdev);
+    }
+
+    if (vdev->interrupt == VFIO_INT_INTx) {
+        vfio_vgpu_intx_disable(vdev);
+    }
+}
+
+
+static void vfio_vgpu_msi_interrupt(void *opaque)
+{
+    VFIOMSIVector *vector = opaque;
+    VFIOvGPUDevice *vdev = (VFIOvGPUDevice *)vector->vdev;
+    MSIMessage (*get_msg)(PCIDevice *dev, unsigned vector);
+    void (*notify)(PCIDevice *dev, unsigned vector);
+    MSIMessage msg;
+    int nr = vector - vdev->msi_vectors;
+
+    if (!event_notifier_test_and_clear(&vector->interrupt)) {
+        return;
+    }
+
+    if (vdev->interrupt == VFIO_INT_MSIX) {
+        get_msg = msix_get_message;
+        notify = msix_notify;
+    } else if (vdev->interrupt == VFIO_INT_MSI) {
+        get_msg = msi_get_message;
+        notify = msi_notify;
+    } else {
+        abort();
+    }
+
+    msg = get_msg(&vdev->pdev, nr);
+    trace_vfio_msi_interrupt(vdev->vbasedev.name, nr, msg.address, msg.data);
+    notify(&vdev->pdev, nr);
+}
+
+static int vfio_vgpu_enable_vectors(VFIOvGPUDevice *vdev, bool msix)
+{
+    struct vfio_irq_set *irq_set;
+    int ret = 0, i, argsz;
+    int32_t *fds;
+
+    argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds));
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = msix ? VFIO_PCI_MSIX_IRQ_INDEX : VFIO_PCI_MSI_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = vdev->nr_vectors;
+    fds = (int32_t *)&irq_set->data;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        int fd = -1;
+
+        /*
+         * MSI vs MSI-X - The guest has direct access to MSI mask and pending
+         * bits, therefore we always use the KVM signaling path when setup.
+         * MSI-X mask and pending bits are emulated, so we want to use the
+         * KVM signaling path only when configured and unmasked.
+         */
+        if (vdev->msi_vectors[i].use) {
+            if (vdev->msi_vectors[i].virq < 0 ||
+                (msix && msix_is_masked(&vdev->pdev, i))) {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
+            } else {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].kvm_interrupt);
+            }
+        }
+
+        fds[i] = fd;
+    }
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+
+    g_free(irq_set);
+
+    return ret;
+}
+
+static void vfio_vgpu_add_kvm_msi_virq(VFIOvGPUDevice *vdev, VFIOMSIVector *vector,
+                                  MSIMessage *msg, bool msix)
+{
+    int virq;
+
+    if (!msg) {
+        return;
+    }
+
+    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+        return;
+    }
+
+    virq = kvm_irqchip_add_msi_route(kvm_state, *msg, &vdev->pdev);
+    if (virq < 0) {
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                       NULL, virq) < 0) {
+        kvm_irqchip_release_virq(kvm_state, virq);
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    vector->virq = virq;
+}
+
+static void vfio_vgpu_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
+                                     PCIDevice *pdev)
+{
+    kvm_irqchip_update_msi_route(kvm_state, vector->virq, msg, pdev);
+}
+
+
+static void vfio_vgpu_msi_enable(VFIOvGPUDevice *vdev)
+{
+    int ret, i;
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->nr_vectors = msi_nr_vectors_allocated(&vdev->pdev);
+retry:
+    vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->nr_vectors);
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg = msi_get_message(&vdev->pdev, i);
+
+        /* VFIOMSIVector.vdev is typed as VFIOPCIDevice *; stash ours there */
+        vector->vdev = (VFIOPCIDevice *)vdev;
+        vector->virq = -1;
+        vector->use = true;
+
+        if (event_notifier_init(&vector->interrupt, 0)) {
+            error_report("vfio: Error: event_notifier_init failed");
+        }
+        qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                            vfio_vgpu_msi_interrupt, NULL, vector);
+
+        /*
+         * Attempt to enable route through KVM irqchip,
+         * default to userspace handling if unavailable.
+         */
+        vfio_vgpu_add_kvm_msi_virq(vdev, vector, &msg, false);
+    }
+
+    /* Set interrupt type prior to possible interrupts */
+    vdev->interrupt = VFIO_INT_MSI;
+
+    ret = vfio_vgpu_enable_vectors(vdev, false);
+    if (ret) {
+        if (ret < 0) {
+            error_report("vfio: Error: Failed to setup MSI fds: %m");
+        } else if (ret != vdev->nr_vectors) {
+            error_report("vfio: Error: Failed to enable %d "
+                         "MSI vectors, retry with %d", vdev->nr_vectors, ret);
+        }
+
+        for (i = 0; i < vdev->nr_vectors; i++) {
+            VFIOMSIVector *vector = &vdev->msi_vectors[i];
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+
+        g_free(vdev->msi_vectors);
+
+        if (ret > 0 && ret != vdev->nr_vectors) {
+            vdev->nr_vectors = ret;
+            goto retry;
+        }
+        vdev->nr_vectors = 0;
+
+        /*
+         * Failing to setup MSI doesn't really fall within any specification.
+         * Let's try leaving interrupts disabled and hope the guest figures
+         * out to fall back to INTx for this device.
+         */
+        error_report("vfio: Error: Failed to enable MSI");
+        vdev->interrupt = VFIO_INT_NONE;
+
+        return;
+    }
+}
+
+static void vfio_vgpu_update_msi(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg;
+
+        if (!vector->use || vector->virq < 0) {
+            continue;
+        }
+
+        msg = msi_get_message(&vdev->pdev, i);
+        vfio_vgpu_update_kvm_msi_virq(vector, msg, &vdev->pdev);
+    }
+}
+
+static int vfio_vgpu_msi_setup(VFIOvGPUDevice *vdev, int pos)
+{
+    uint16_t ctrl;
+    bool msi_64bit, msi_maskbit;
+    int ret, entries;
+
+    if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
+              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
+        return -errno;
+    }
+    ctrl = le16_to_cpu(ctrl);
+
+    msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
+    msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
+    entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
+
+    ret = msi_init(&vdev->pdev, pos, entries, msi_64bit, msi_maskbit);
+    if (ret < 0) {
+        if (ret == -ENOTSUP) {
+            return 0;
+        }
+        error_report("vfio: msi_init failed");
+        return ret;
+    }
+    vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 0);
+
+    return 0;
+}
+
+
+static int vfio_vgpu_msi_init(VFIOvGPUDevice *vdev)
+{
+    uint8_t pos;
+    int ret;
+
+    pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSI);
+    if (!pos) {
+        return 0;
+    }
+
+    ret = vfio_vgpu_msi_setup(vdev, pos);
+    if (ret < 0) {
+        error_report("vgpu: Error setting MSI@0x%x: %d", pos, ret);
+        return ret;
+    }
+
+    return 0;
+}
+
+/*
+ * VGPU device class functions
+ */
+
+static void vfio_vgpu_reset(DeviceState *dev)
+{
+    /* TODO: reset vGPU state */
+}
+
+static void vfio_vgpu_eoi(VFIODevice *vbasedev)
+{
+    /* Nothing to do */
+}
+
+static int vfio_vgpu_hot_reset_multi(VFIODevice *vbasedev)
+{
+    /* Nothing to be reset */
+    return 0;
+}
+
+static void vfio_vgpu_compute_needs_reset(VFIODevice *vbasedev)
+{
+    vbasedev->needs_reset = false;
+}
+
+static VFIODeviceOps vfio_vgpu_ops = {
+    .vfio_compute_needs_reset = vfio_vgpu_compute_needs_reset,
+    .vfio_hot_reset_multi = vfio_vgpu_hot_reset_multi,
+    .vfio_eoi = vfio_vgpu_eoi,
+};
+
+static int vfio_vgpu_populate_device(VFIOvGPUDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+    int i, ret = -1;
+
+    for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
+        reg_info.index = i;
+
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        if (ret) {
+            error_report("vfio: Error getting region %d info: %m", i);
+            return ret;
+        }
+
+        trace_vfio_populate_device_region(vbasedev->name, i,
+                                          (unsigned long)reg_info.size,
+                                          (unsigned long)reg_info.offset,
+                                          (unsigned long)reg_info.flags);
+
+        vdev->bars[i].region.vbasedev = vbasedev;
+        vdev->bars[i].region.flags = reg_info.flags;
+        vdev->bars[i].region.size = reg_info.size;
+        vdev->bars[i].region.fd_offset = reg_info.offset;
+        vdev->bars[i].region.nr = i;
+        QLIST_INIT(&vdev->bars[i].quirks);
+    }
+
+    reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+    if (ret) {
+        error_report("vfio: Error getting config info: %m");
+        return ret;
+    }
+
+    vdev->config_size = reg_info.size;
+    if (vdev->config_size == PCI_CONFIG_SPACE_SIZE) {
+        vdev->pdev.cap_present &= ~QEMU_PCI_CAP_EXPRESS;
+    }
+    vdev->config_offset = reg_info.offset;
+
+    return 0;
+}
+
+static void vfio_vgpu_create_virtual_bar(VFIOvGPUDevice *vdev, int nr)
+{
+    VFIOBAR *bar = &vdev->bars[nr];
+    uint64_t size = bar->region.size;
+    char name[64];
+    uint32_t pci_bar;
+    uint8_t type;
+    int ret;
+
+    /* Skip both unimplemented BARs and the upper half of 64-bit BARs. */
+    if (!size) {
+        return;
+    }
+
+    /* Determine what type of BAR this is for registration */
+    ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
+                vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
+    if (ret != sizeof(pci_bar)) {
+        error_report("vfio: Failed to read BAR %d (%m)", nr);
+        return;
+    }
+
+    pci_bar = le32_to_cpu(pci_bar);
+    bar->ioport = (pci_bar & PCI_BASE_ADDRESS_SPACE_IO);
+    bar->mem64 = bar->ioport ? 0 : (pci_bar & PCI_BASE_ADDRESS_MEM_TYPE_64);
+    type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
+                                    ~PCI_BASE_ADDRESS_MEM_MASK);
+
+    /* A "slow" read/write mapping underlies all BARs */
+    memory_region_init_io(&bar->region.mem, OBJECT(vdev), &vfio_region_ops,
+                          bar, name, size);
+    pci_register_bar(&vdev->pdev, nr, type, &bar->region.mem);
+
+    /* Set up an mmap mapping when the region supports it */
+    if (bar->region.flags & VFIO_REGION_INFO_FLAG_MMAP) {
+        strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
+        vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem,
+                         &bar->region.mmap_mem, &bar->region.mmap,
+                         size, 0, name);
+    }
+}
+
+static void vfio_vgpu_create_virtual_bars(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        vfio_vgpu_create_virtual_bar(vdev, i);
+    }
+}
+
+static int vfio_vgpu_initfn(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    VFIOGroup *group;
+    ssize_t len;
+    int groupid;
+    struct stat st;
+    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
+    int ret;
+    UuidInfo *uuid_info;
+
+    uuid_info = qmp_query_uuid(NULL);
+    if (strcmp(uuid_info->UUID, UUID_NONE) == 0) {
+        error_report("vfio-vgpu: VM UUID is required to locate the vgpu device");
+        return -EINVAL;
+    }
+    vdev->vm_uuid = uuid_info->UUID;
+
+    snprintf(path, sizeof(path), 
+             "/sys/devices/virtual/vgpu/%s-0/", vdev->vm_uuid);
+
+    if (stat(path, &st) < 0) {
+        error_report("vfio-vgpu: error: no such vgpu device: %s", path);
+        return -errno;
+    } 
+
+    vdev->vbasedev.ops = &vfio_vgpu_ops;
+
+    vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
+    vdev->vbasedev.name = g_strdup_printf("%s-0", vdev->vm_uuid);
+
+    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
+
+    len = readlink(path, iommu_group_path, sizeof(iommu_group_path));
+    if (len <= 0 || len >= sizeof(iommu_group_path)) {
+        error_report("vfio-vgpu: error no iommu_group for device");
+        return len < 0 ? -errno : -ENAMETOOLONG;
+    }
+
+    iommu_group_path[len] = 0;
+    group_name = basename(iommu_group_path);
+
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_report("vfio-vgpu: error reading %s: %m", path);
+        return -errno;
+    }
+
+    /* TODO: This will only work if we *only* have VFIO_VGPU_IOMMU enabled */
+
+    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
+    if (!group) {
+        error_report("vfio: failed to get group %d", groupid);
+        return -ENOENT;
+    }
+
+    snprintf(path, sizeof(path), "%s-0", vdev->vm_uuid);
+
+    ret = vfio_get_device(group, path, &vdev->vbasedev);
+    if (ret) {
+        error_report("vfio-vgpu; failed to get device %s", vdev->vgpu_type);
+        vfio_put_group(group);
+        return ret;
+    }
+
+    ret = vfio_vgpu_populate_device(vdev);
+    if (ret) {
+        vfio_put_group(group);
+        return ret;
+    }
+
+    /* Get a copy of config space */
+    ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
+                MIN(pci_config_size(&vdev->pdev), vdev->config_size),
+                vdev->config_offset);
+    if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
+        ret = ret < 0 ? -errno : -EFAULT;
+        error_report("vfio: Failed to read device config space");
+        return ret;
+    }
+
+    vfio_vgpu_create_virtual_bars(vdev);
+
+    ret = vfio_vgpu_msi_init(vdev);
+    if (ret < 0) {
+        error_report("%s: Error setting MSI %d", __FUNCTION__, ret);
+        return ret;
+    }
+
+    if (vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
+        pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_vgpu_intx_update);
+        ret = vfio_vgpu_intx_enable(vdev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+
+static void vfio_vgpu_exitfn(PCIDevice *pdev)
+{
+    /* TODO: release the vfio device and group references taken in initfn */
+}
+
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+    uint32_t val = 0;
+
+    ret = pread(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x %m", __func__, addr);
+        return 0xFFFFFFFF;
+    }
+
+    // memcpy(&vdev->emulated_config_bits + addr, &val, len);
+    return val;
+}
+
+static void vfio_vgpu_write_config(PCIDevice *pdev, uint32_t addr,
+                                  uint32_t val, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+
+    ret = pwrite(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x, val:0x%0x %m",
+                     __func__, addr, val);
+        return;
+    }
+
+    if (pdev->cap_present & QEMU_PCI_CAP_MSI &&
+        ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size)) {
+        int is_enabled, was_enabled = msi_enabled(pdev);
+
+        pci_default_write_config(pdev, addr, val, len);
+
+        is_enabled = msi_enabled(pdev);
+
+        if (!was_enabled) {
+            if (is_enabled) {
+                vfio_vgpu_msi_enable(vdev);
+            }
+        } else {
+            if (!is_enabled) {
+                vfio_vgpu_msi_disable(vdev);
+            } else {
+                vfio_vgpu_update_msi(vdev);
+            }
+        }
+    } else {
+        /* Write everything to QEMU to keep emulated bits correct */
+        pci_default_write_config(pdev, addr, val, len);
+    }
+}
+
+static const VMStateDescription vfio_vgpu_vmstate = {
+    .name = TYPE_VFIO_VGPU,
+    .unmigratable = 1,
+};
+
+/*
+ * We don't actually need vfio_vgpu_properties: we can simply rely on
+ * the VM UUID to find the IOMMU group for this VM.
+ */
+
+static Property vfio_vgpu_properties[] = {
+    DEFINE_PROP_STRING("vgpu", VFIOvGPUDevice, vgpu_type),
+    DEFINE_PROP_END_OF_LIST()
+};
+
+#if 0
+
+static void vfio_vgpu_instance_init(Object *obj)
+{
+
+}
+
+static void vfio_vgpu_instance_finalize(Object *obj)
+{
+
+
+}
+
+#endif
+
+static void vfio_vgpu_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+    dc->desc = "VFIO-based vGPU";
+    dc->vmsd = &vfio_vgpu_vmstate;
+    dc->reset = vfio_vgpu_reset;
+    // dc->cannot_instantiate_with_device_add_yet = true; 
+    dc->props = vfio_vgpu_properties;
+    set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+    pdc->init = vfio_vgpu_initfn;
+    pdc->exit = vfio_vgpu_exitfn;
+    pdc->config_read = vfio_vgpu_read_config;
+    pdc->config_write = vfio_vgpu_write_config;
+    pdc->is_express = 0; /* For now, we are not */
+
+    pdc->vendor_id = PCI_VENDOR_ID_NVIDIA;
+    // pdc->device_id = 0x11B0;
+    pdc->class_id = PCI_CLASS_DISPLAY_VGA;
+}
+
+static const TypeInfo vfio_vgpu_dev_info = {
+    .name = TYPE_VFIO_VGPU,
+    .parent = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(VFIOvGPUDevice),
+    .class_init = vfio_vgpu_class_init,
+};
+
+static void register_vgpu_dev_type(void)
+{
+    type_register_static(&vfio_vgpu_dev_info);
+}
+
+type_init(register_vgpu_dev_type)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 379b6e1..9af5e17 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -64,6 +64,9 @@
 #define PCI_DEVICE_ID_VMWARE_IDE         0x1729
 #define PCI_DEVICE_ID_VMWARE_VMXNET3     0x07B0
 
+/* NVIDIA (0x10de) */
+#define PCI_VENDOR_ID_NVIDIA             0x10de
+
 /* Intel (0x8086) */
 #define PCI_DEVICE_ID_INTEL_82551IT      0x1209
 #define PCI_DEVICE_ID_INTEL_82557        0x1229
-- 
1.8.3.1


[-- Attachment #4: vgpu_diagram.png --]
[-- Type: image/png, Size: 6816 bytes --]

^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-25 21:45                 ` [Qemu-devel] " Tian, Kevin
@ 2016-01-26 10:20                   ` Neo Jia
  -1 siblings, 0 replies; 118+ messages in thread
From: Neo Jia @ 2016-01-26 10:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, kvm, igvt-g@lists.01.org, qemu-devel,
	Kirti Wankhede, Alex Williamson, Lv, Zhiyuan, Paolo Bonzini,
	Gerd Hoffmann

On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, January 26, 2016 5:30 AM
> > 
> > [cc +Neo @Nvidia]
> > 
> > Hi Jike,
> > 
> > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > I would expect we can spell out next level tasks toward above
> > > > direction, upon which Alex can easily judge whether there are
> > > > some common VFIO framework changes that he can help :-)
> > >
> > > Hi Alex,
> > >
> > > Here is a draft task list after a short discussion w/ Kevin,
> > > would you please have a look?
> > >
> > > 	Bus Driver
> > >
> > > 		{ in i915/vgt/xxx.c }
> > >
> > > 		- define a subset of vfio_pci interfaces
> > > 		- selective pass-through (say aperture)
> > > 		- trap MMIO: interface w/ QEMU
> > 
> > What's included in the subset?  Certainly the bus reset ioctls really
> > don't apply, but you'll need to support the full device interface,
> > right?  That includes the region info ioctl and access through the vfio
> > device file descriptor as well as the interrupt info and setup ioctls.
> 
> That is the next level detail Jike will figure out and discuss soon.
> 
> yes, basic region info/access should be necessary. For interrupt, could
> you elaborate a bit what current interface is doing? If just about creating
> an eventfd for virtual interrupt injection, it applies to vgpu too.
> 
> > 
> > > 	IOMMU
> > >
> > > 		{ in a new vfio_xxx.c }
> > >
> > > 		- allocate: struct device & IOMMU group
> > 
> > It seems like the vgpu instance management would do this.
> > 
> > > 		- map/unmap functions for vgpu
> > > 		- rb-tree to maintain iova/hpa mappings
> > 
> > Yep, pretty much what type1 does now, but without mapping through the
> > IOMMU API.  Essentially just a database of the current userspace
> > mappings that can be accessed for page pinning and IOVA->HPA
> > translation.
> 
> The thought is to reuse iommu_type1.c, by abstracting several underlying
> operations and then put vgpu specific implementation in a vfio_vgpu.c (e.g.
> for map/unmap instead of using IOMMU API, an iova/hpa mapping is updated
> accordingly), etc.
> 
> This file will also connect between VFIO and vendor specific vgpu driver,
> e.g. exposing interfaces to allow the latter querying iova<->hpa and also 
> creating necessary VFIO structures like aforementioned device/IOMMUas...
> 
> > 
> > > 		- interacts with kvmgt.c
> > >
> > >
> > > 	vgpu instance management
> > >
> > > 		{ in i915 }
> > >
> > > 		- path, create/destroy
> > >
> > 
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices create here rather than at the point where we start
> > doing vfio "stuff".
> 
> It's invoked here, but expecting the function exposed by vfio_vgpu.c. It's
> not good to touch vfio internal structures from another module (such as
> i915.ko)
> 
> > 
> > Nvidia has also been looking at this and has some ideas how we might
> > standardize on some of the interfaces and create a vgpu framework to
> > help share code between vendors and hopefully make a more consistent
> > userspace interface for libvirt as well.  I'll let Neo provide some
> > details.  Thanks,
> > 
> 
> Nice to know that. Neo, please share your thought here.

Hi Alex, Kevin and Jike,

(It seems I shouldn't use attachments; resending again to the list, patches are
inline at the end)

Thanks for adding me to this technical discussion; it is a great opportunity
for us to design together and bring both the Intel and NVIDIA vGPU solutions
to the KVM platform.

Instead of jumping directly to the proposal that we have been working on
recently for NVIDIA vGPU on KVM, I think it is better for me to put out a
couple of quick comments / thoughts regarding the existing discussions on this
thread, as fundamentally I think we are solving the same problems: DMA,
interrupts and MMIO.

Then we can look at what we have, hopefully we can reach some consensus soon.

> Yes, and since you're creating and destroying the vgpu here, this is
> where I'd expect a struct device to be created and added to an IOMMU
> group.  The lifecycle management should really include links between
> the vGPU and physical GPU, which would be much, much easier to do with
> struct devices create here rather than at the point where we start
> doing vfio "stuff".

In fact, to keep vfio-vgpu generic, vgpu device creation and management
can be centralized and done in vfio-vgpu. That also includes adding the device
to the IOMMU group and the VFIO group.

The graphics driver can register with vfio-vgpu to receive management and
emulation callbacks.

We already have struct vgpu_device in our proposal, which keeps a pointer to
the physical device.

> - vfio_pci will inject an IRQ to guest only when physical IRQ
> generated; whereas vfio_vgpu may inject an IRQ for emulation
> purpose. Anyway they can share the same injection interface;

The eventfd used to inject the interrupt is known to vfio-vgpu; that fd should
be made available to the graphics driver so that it can inject interrupts
directly when the physical device triggers one.
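
As a minimal sketch of the vendor-driver side of such an injection (assuming a
hypothetical msi_trigger field on struct vgpu_device that holds the eventfd
context resolved from the VFIO_DEVICE_SET_IRQS call; the my_* name is a
placeholder, not part of the proposal):

#include <linux/eventfd.h>

/* Hypothetical: signal the eventfd registered for this vgpu's MSI vector */
static void my_vgpu_inject_msi(struct vgpu_device *vdev)
{
        if (vdev->msi_trigger)
                eventfd_signal(vdev->msi_trigger, 1);
}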

Here is the proposal we have, please review.

Please note the patches we have put out here are mainly for POC purposes, to
verify our understanding; they can also serve to reduce confusion and speed up
our design, although we are very happy to refine them into something that can
eventually be used by both parties and upstreamed.

Linux vGPU kernel design
==================================================================================

Here we are proposing a generic Linux kernel module based on the VFIO framework
which allows different GPU vendors to plug in and provide their GPU
virtualization solutions on KVM. The benefits of having such a generic kernel
module are:

1) Reuse QEMU VFIO driver, supporting VFIO UAPI

2) GPU HW agnostic management API for upper layer software such as libvirt

3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendors

0. High level overview
==================================================================================

 
  user space:
                                +-----------+  VFIO IOMMU IOCTLs
                      +---------| QEMU VFIO |-------------------------+
        VFIO IOCTLs   |         +-----------+                         |
                      |                                               | 
 ---------------------|-----------------------------------------------|---------
                      |                                               |
  kernel space:       |  +--->----------->---+  (callback)            V
                      |  |                   v                 +------V-----+
  +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
  |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
  | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
  |          |   |          |     | (register)           ^         ||
  +----------+   +-------+--+     |    +-----------+     |         ||
                         V        +----| i915.ko   +-----+     +---VV-------+ 
                         |             +-----^-----+           | TYPE1      |
                         |  (callback)       |                 | IOMMU      |
                         +-->------------>---+                 +------------+
 access flow:

  Guest MMIO / PCI config access
  |
  -------------------------------------------------
  |
  +-----> KVM VM_EXITs  (kernel)
          |
  -------------------------------------------------
          |
          +-----> QEMU VFIO driver (user)
                  | 
  -------------------------------------------------
                  |
                  +---->  VGPU kernel driver (kernel)
                          |  
                          | 
                          +----> vendor driver callback


1. VGPU management interface
==================================================================================

This is the interface that allows upper-layer software (mostly libvirt) to
query and configure virtual GPU devices in a HW-agnostic fashion. Also, this
management interface gives the underlying GPU vendor the flexibility to support
virtual device hotplug, multiple virtual devices per VM, multiple virtual
devices from different physical devices, etc.

1.1 Under per-physical device sysfs:
----------------------------------------------------------------------------------

vgpu_supported_types - RO, lists the currently supported virtual GPU types and
their VGPU_IDs. A VGPU_ID is a vGPU type identifier returned from reads of
"vgpu_supported_types".

vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual GPU
device on a target physical GPU. idx: virtual device index inside a VM.

vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual GPU device on
a target physical GPU.

1.3 Under vgpu class sysfs:
----------------------------------------------------------------------------------

vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to commit virtual GPU resources for
this target VM.

Also, vgpu_start is a synchronous call; a successful return indicates that all
the requested vGPU resources have been fully committed, and the VMM should
continue.

vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to release the virtual GPU resources
of this target VM.

1.4 Virtual device Hotplug
----------------------------------------------------------------------------------

To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
accessed during VM runtime, and the corresponding registration callback will be
invoked to allow the GPU vendor to support hotplug.

To support hotplug, the vendor driver would take the necessary action to handle
the situation when a vgpu_create is done on a VM_UUID after vgpu_start; that
implies both create and start for that vgpu device.

Similarly, vgpu_destroy implies a vgpu_shutdown on a running VM, but only if
the vendor driver supports vgpu hotplug.

If hotplug is not supported and the VM is still running, the vendor driver can
return an error code to indicate that it is not supported.

Separating create from start gives the flexibility to have:

- multiple vgpu instances for single VM and
- hotplug feature.

2. GPU driver vendor registration interface
==================================================================================

2.1 Registration interface definition (include/linux/vgpu.h)
----------------------------------------------------------------------------------

extern int vgpu_register_device(struct pci_dev *dev, 
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu
 * types.
 *                              @dev : pci device structure of physical GPU. 
 *                              @config: should return string listing supported
 *                              config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resources in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which
 *                              vgpu 
 *                                    should be created
 *                              @vm_uuid: VM's uuid for which VM it is intended
 *                              to
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: VM's uuid for which the vgpu belongs
 *                              to.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If VM is running and vgpu_destroy is called that 
 *                              means the vGPU is being hotunpluged. Return
 *                              error
 *                              if VM is running and graphics driver doesn't
 *                              support vgpu hotplug.
 * @vgpu_start:                 Called to do initiate vGPU initialization
 *                              process in graphics driver when VM boots before
 *                              qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to teardown vGPU related resources for
 *                              the VM
 *                              @vm_uuid: VM's UUID which is shutting down .
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number bytes to read 
 *                              @address_space: specifies for which address
 *                              space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Returns number of bytes read on success, or error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number bytes to be written
 *                              @address_space: specifies for which address
 *                              space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Returns number of bytes written on success, or
 *                              error.
 * @vgpu_set_irqs:              Called to pass along the interrupt
 *                              configuration information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data : same as
 *                              that of struct vfio_irq_set of
 *                              VFIO_DEVICE_SET_IRQS API. 
 *
 * A physical GPU that supports vGPU should be registered with the vgpu module
 * with a gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space,loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};
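
For illustration, here is a minimal sketch of how a vendor driver might
register with this interface; the my_* callback names and the PCI probe/remove
wiring are hypothetical placeholders, not part of the proposal:

static const struct gpu_device_ops my_gpu_ops = {
        .owner                 = THIS_MODULE,
        .vgpu_supported_config = my_vgpu_supported_config,
        .vgpu_create           = my_vgpu_create,
        .vgpu_destroy          = my_vgpu_destroy,
        .vgpu_start            = my_vgpu_start,
        .vgpu_shutdown         = my_vgpu_shutdown,
        .read                  = my_vgpu_read,
        .write                 = my_vgpu_write,
        .vgpu_set_irqs         = my_vgpu_set_irqs,
};

static int my_gpu_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        /* after normal PCI bring-up, register this physical GPU for vGPU */
        return vgpu_register_device(pdev, &my_gpu_ops);
}

static void my_gpu_remove(struct pci_dev *pdev)
{
        vgpu_unregister_device(pdev);
}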

2.2 Details for callbacks we haven't mentioned above.
---------------------------------------------------------------------------------

vgpu_supported_config: allows the vendor driver to specify the supported vGPU
                       type/configuration

vgpu_create          : create a virtual GPU device, can be used for device hotplug.

vgpu_destroy         : destroy a virtual GPU device, can be used for device hotplug.

vgpu_start           : callback function to notify the vendor driver that the
                       vgpu devices for a given virtual machine are coming to
                       life.

vgpu_shutdown        : callback function to notify the vendor driver to release
                       vGPU resources for the target VM.

read                 : callback to vendor driver to handle virtual device config
                       space or MMIO read access (see the dispatch sketch after
                       this list)

write                : callback to vendor driver to handle virtual device config
                       space or MMIO write access

vgpu_set_irqs        : callback to vendor driver to pass along the interrupt
                       configuration for the target virtual device, so that the
                       vendor driver can inject interrupts into the virtual
                       machine for this device.
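
To make the emulation path concrete, here is a hedged sketch of how vgpu.ko
might forward a guest config-space read on the VFIO device fd to the vendor
read callback; the gpu_ops field and the VGPU_ADDR_PCI_CONFIG constant are
assumptions, not settled interfaces:

#include <linux/uaccess.h>      /* copy_to_user() */
#include <linux/vgpu.h>

static ssize_t vgpu_dev_config_read(struct vgpu_device *vdev,
                                    char __user *ubuf, size_t count,
                                    loff_t pos)
{
        char buf[4];
        ssize_t ret;

        if (count > sizeof(buf))
                return -EINVAL;

        /* let the vendor driver emulate the config-space access */
        ret = vdev->gpu_ops->read(vdev, buf, count,
                                  VGPU_ADDR_PCI_CONFIG, pos);
        if (ret > 0 && copy_to_user(ubuf, buf, ret))
                ret = -EFAULT;

        return ret;
}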

2.3 Potential additional virtual device configuration registration interface:
---------------------------------------------------------------------------------

callback function to describe the MMAP behavior of the virtual GPU 

callback function to allow GPU vendor driver to provide PCI config space backing
memory.

3. VGPU TYPE1 IOMMU
==================================================================================

Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track of
the <iova, hva, size, flag> tuples and save the QEMU mm for later reference.

You can find the quick/ugly implementation in the attached patch file, which is
actually just a simple version of Alex's type1 IOMMU without any real mapping
being established when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called.

We have thought about providing another vendor driver registration interface so
that such tracking information would be sent to the vendor driver, which would
use the QEMU mm to do get_user_pages / remap_pfn_range when required. After
doing a quick implementation within our driver, I noticed the following issues:

1) It moves OS/VFIO logic into the vendor driver, which will be a maintenance
issue.

2) Every driver vendor has to implement their own RB tree, instead of reusing
the common existing VFIO code (vfio_find/link/unlink_dma).

3) IOMMU_UNMAP_DMA is expected to return the number of "unmapped bytes" to the
caller/QEMU; it is better not to have anything inside a vendor driver that the
VFIO caller immediately depends on.

Based on the above considerations, we decided to implement the DMA tracking
logic within the VGPU TYPE1 IOMMU code (ideally, this should be merged into the
current TYPE1 IOMMU code) and expose two symbols for MMIO mapping and for page
translation and pinning.

Also, with an mmap MMIO interface between virtual and physical, a
para-virtualized guest driver can access its virtual MMIO without taking an
mmap fault hit, and we can support different MMIO sizes between the virtual and
physical device.

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

EXPORT_SYMBOL(vgpu_map_virtual_bar);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

EXPORT_SYMBOL(vgpu_dma_do_translate);
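
As a hedged sketch of how a vendor driver might consume this export (assuming
vgpu_dma_do_translate() translates the gfn array to host addresses in place;
the my_* helper and batch size are hypothetical):

#define MY_XLATE_BATCH 16

/* Hypothetical vendor-driver helper: pin and translate a batch of guest
 * page frame numbers before programming them into a DMA engine. */
static int my_vgpu_pin_batch(dma_addr_t addrs[MY_XLATE_BATCH])
{
        int ret;

        /* addrs[] arrives filled with guest page frame numbers */
        ret = vgpu_dma_do_translate(addrs, MY_XLATE_BATCH);
        if (ret)
                return ret;

        /* addrs[] now holds pinned host addresses usable for DMA */
        return 0;
}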

There is still a lot to be added and modified, such as supporting multiple VMs
and multiple virtual devices, tracking the mapped / pinned regions within the
VGPU IOMMU kernel driver, error handling, roll-back and locked memory size per
user, etc.

4. Modules
==================================================================================

Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
                           TYPE1 v1 and v2 interface. 

vgpu.ko                  - provide registration interface and virtual device
                           VFIO access.

5. QEMU note
==================================================================================

To allow us to focus on prototyping the VGPU kernel driver, we have introduced
a new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
vfio/pci.c file and can use it as a reference for our implementation. It is
basically just a quick copy & paste from vfio/pci.c to quickly meet our needs.

Once this proposal is finalized, we will move to vfio/pci.c instead of a new
class, and probably the only thing required will be a new way to discover the
device.

6. Examples
==================================================================================

On this server, we have two NVIDIA M60 GPUs.

[root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

After nvidia.ko gets initialized, we can query the supported vGPU type by
accessing the "vgpu_supported_types" like following:

[root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
11:GRID M60-0B
12:GRID M60-0Q
13:GRID M60-1B
14:GRID M60-1Q
15:GRID M60-2B
16:GRID M60-2Q
17:GRID M60-4Q
18:GRID M60-8Q

For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
like to create "GRID M60-4Q" VM on it.

echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create

Note: the number 0 here is the vGPU device index. The change has not been
tested with multiple vgpu devices yet, but we will support that.

At this moment, if you query the "vgpu_supported_types" it will still show all
supported virtual GPU types as no virtual GPU resource is committed yet.

Starting VM:

echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

then, the supported vGPU type query will return:

[root@cjia-vgx-kvm /home/cjia]$
> cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
17:GRID M60-4Q

So vgpu_supported_config needs to be called whenever a new virtual device gets
created, as the underlying HW might limit the supported types if there are
any existing VMs running.

Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown will
inform the GPU vendor driver to clean up its resources.

Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
device sysfs.

7. What is not covered:
==================================================================================

7.1 QEMU console VNC

QEMU console VNC is not covered in this RFC, as it is a pretty isolated module
and does not impact the basic vGPU functionality; also, we already have a good
discussion about the new VFIO interface that Alex is going to introduce to
allow us to describe a region for the VM surface.

8 Patches
==================================================================================

0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch - against QEMU 2.5.0

0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch  - against 4.4.0-rc5

Thanks,
Kirti and Neo

From dc8ca387f7b06c6dfc85fb4bd79a760dca76e831 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Tue, 26 Jan 2016 01:21:11 -0800
Subject: [PATCH] Add VGPU and its TYPE1 IOMMU kernel module support

This is just a quick POC implementation to allow GPU driver vendors to plug
into the VFIO framework to provide their virtual GPU support. This kernel
module provides a registration interface for GPU vendors and generic DMA
tracking APIs.

extern int vgpu_register_device(struct pci_dev *dev,
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu types.
 *                              @dev : pci device structure of physical GPU.
 *                              @config: should return string listing supported config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resources in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which vgpu
 *                                    should be created
 *                              @vm_uuid: VM's uuid for which VM it is intended to
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: VM's uuid for which the vgpu belongs to.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If VM is running and vgpu_destroy is called that
 *                              means the vGPU is being hotunpluged. Return error
 *                              if VM is running and graphics driver doesn't
 *                              support vgpu hotplug.
 * @vgpu_start:                 Called to do initiate vGPU initialization
 *                              process in graphics driver when VM boots before
 *                              qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to teardown vGPU related resources for
 *                              the VM
 *                              @vm_uuid: VM's UUID which is shutting down .
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number bytes to read
 *                              @address_space: specifies for which address space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Returns number of bytes read on success, or error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number bytes to be written
 *                              @address_space: specifies for which address space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Returns number of bytes written on success, or error.
 * @vgpu_set_irqs:              Called to pass along the interrupt
 *                              configuration information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data : same as
 *                              that of struct vfio_irq_set of
 *                              VFIO_DEVICE_SET_IRQS API.
 *
 * A physical GPU that supports vGPU should be registered with the vgpu module
 * with a gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space,loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

Change-Id: Ib70304d9a600c311d5107a94b3fffa938926275b
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/Kconfig                      |   2 +
 drivers/Makefile                     |   1 +
 drivers/vfio/vfio.c                  |   5 +-
 drivers/vgpu/Kconfig                 |  26 ++
 drivers/vgpu/Makefile                |   5 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c | 511 ++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_dev.c              | 550 +++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h          |  47 +++
 drivers/vgpu/vgpu_sysfs.c            | 322 ++++++++++++++++++++
 drivers/vgpu/vgpu_vfio.c             | 521 +++++++++++++++++++++++++++++++++
 include/linux/vgpu.h                 | 157 ++++++++++
 11 files changed, 2144 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c
 create mode 100644 drivers/vgpu/vgpu_dev.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 drivers/vgpu/vgpu_sysfs.c
 create mode 100644 drivers/vgpu/vgpu_vfio.c
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339de85f..5fd9eae79914 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca714bf..142256b4358b 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VGPU)              += vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b793cbcb..af3ab413e119 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -947,19 +947,18 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container,
 		if (IS_ERR(data)) {
 			ret = PTR_ERR(data);
 			module_put(driver->ops->owner);
-			goto skip_drivers_unlock;
+			continue;
 		}
 
 		ret = __vfio_container_attach_groups(container, driver, data);
 		if (!ret) {
 			container->iommu_driver = driver;
 			container->iommu_data = data;
+			goto skip_drivers_unlock;
 		} else {
 			driver->ops->release(data);
 			module_put(driver->ops->owner);
 		}
-
-		goto skip_drivers_unlock;
 	}
 
 	mutex_unlock(&vfio.iommu_drivers_lock);
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 000000000000..698ddf907a16
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,26 @@
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    select VFIO_IOMMU_TYPE1_VGPU
+    help
+        VGPU provides a framework to virtualize a GPU without SR-IOV capability.
+        See Documentation/vgpu.txt for more details.
+
+        If you don't know what to do here, say N.
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU 
+    default n
+
+config VFIO_IOMMU_TYPE1_VGPU
+    tristate
+    depends on VGPU_VFIO
+    default n
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 000000000000..098a3591a535
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,5 @@
+
+vgpu-y := vgpu_sysfs.o vgpu_dev.o vgpu_vfio.o
+
+obj-$(CONFIG_VGPU)	+= vgpu.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU) += vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 000000000000..6b20f1374b3b
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,511 @@
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
+
+// VFIO structures
+
+struct vfio_iommu_vgpu {
+	struct mutex lock;
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	struct rb_root dma_list;
+	struct mm_struct * vm_mm;
+};
+
+struct vgpu_vfio_dma {
+	struct rb_node node;
+	dma_addr_t iova;
+	unsigned long vaddr;
+	size_t size;
+	int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ *
+ */
+
+/*
+ * Duplicated from vfio_link_dma; just a quick hack, we should
+ * reuse the code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+			  struct vgpu_vfio_dma *new)
+{
+	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+	struct vgpu_vfio_dma *dma;
+
+	while (*link) {
+		parent = *link;
+		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+		if (new->iova + new->size <= dma->iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+					   dma_addr_t start, size_t size)
+{
+	struct rb_node *node = iommu->dma_list.rb_node;
+
+	while (node) {
+		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
+
+		if (start + size <= dma->iova)
+			node = node->rb_left;
+		else if (start >= dma->iova + dma->size)
+			node = node->rb_right;
+		else
+			return dma;
+	}
+
+	return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
+{
+	rb_erase(&old->node, &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+	struct vgpu_vfio_dma *c, *n;
+	uint32_t i = 0;
+
+	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
+		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_map *map)
+{
+	dma_addr_t iova = map->iova;
+	unsigned long vaddr = map->vaddr;
+	int ret = 0, prot = 0;
+	struct vgpu_vfio_dma *vgpu_dma;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -EEXIST;
+	}
+
+	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
+
+	if (!vgpu_dma) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -ENOMEM;
+	}
+
+	vgpu_dma->iova = iova;
+	vgpu_dma->vaddr = vaddr;
+	vgpu_dma->prot = prot;
+	vgpu_dma->size = map->size;
+
+	vgpu_link_dma(vgpu_iommu, vgpu_dma);
+
+	mutex_unlock(&vgpu_iommu->lock);
+	return ret;
+}
+
+static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_unmap *unmap)
+{
+	struct vgpu_vfio_dma *vgpu_dma;
+	size_t unmapped = 0;
+	int ret = 0;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
+	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
+	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
+		unmapped += vgpu_dma->size;
+		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	unmap->size = unmapped;
+
+	return ret;
+}
+
+/* Ugly hack to quickly test a single device ... */
+
+static struct vfio_iommu_vgpu *_local_iommu = NULL;
+
+int vgpu_map_virtual_bar
+(
+	uint64_t virt_bar_addr,
+	uint64_t phys_bar_addr,
+	uint32_t len,
+	uint32_t flags
+)
+{
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	unsigned long remote_vaddr = 0;
+	struct vgpu_vfio_dma *vgpu_dma = NULL;
+	struct vm_area_struct *remote_vma = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+	int ret = 0;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	down_write(&mm->mmap_sem);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, virt_bar_addr, len /*  size */);
+	if (!vgpu_dma) {
+		printk(KERN_INFO "%s: fail to locate guest physical:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	remote_vaddr = vgpu_dma->vaddr + virt_bar_addr - vgpu_dma->iova;
+
+	remote_vma = find_vma(mm, remote_vaddr);
+
+	if (remote_vma == NULL) {
+		printk(KERN_INFO "%s: fail to locate vma, physical addr:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	} else {
+		printk(KERN_INFO "%s: located vma, addr:0x%lx\n",
+		       __FUNCTION__, remote_vma->vm_start);
+	}
+
+	remote_vma->vm_page_prot = pgprot_noncached(remote_vma->vm_page_prot);
+
+	remote_vma->vm_pgoff = phys_bar_addr >> PAGE_SHIFT;
+
+	ret = remap_pfn_range(remote_vma, virt_bar_addr, remote_vma->vm_pgoff,
+			len, remote_vma->vm_page_prot);
+
+	if (ret) {
+		printk(KERN_INFO "%s: fail to remap vma:%d\n", __FUNCTION__, ret);
+		goto unlock;
+	}
+
+unlock:
+
+	up_write(&mm->mmap_sem);
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_map_virtual_bar);
+
+int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
+{
+	int i = 0, ret = 0, prot = 0;
+	unsigned long remote_vaddr = 0, pfn = 0;
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	struct vgpu_vfio_dma *vgpu_dma;
+	struct page *page[1];
+	// unsigned long * addr = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+
+	prot = IOMMU_READ | IOMMU_WRITE;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	for (i = 0; i < count; i++) {
+		dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
+		vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
+
+		if (!vgpu_dma) {
+			printk(KERN_INFO "%s: fail to locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova);
+			ret = -EINVAL;
+			goto unlock;
+		}
+
+		remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
+		printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n",
+			__FUNCTION__, i, vgpu_dma->iova,
+			vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
+
+		if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
+			pfn = page_to_pfn(page[0]);
+			printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn);
+			// addr = vmap(page, 1, VM_MAP, PAGE_KERNEL);
+		}
+		else {
+			printk(KERN_INFO "%s: fail to pin pfn[%d]\n", __FUNCTION__, i);
+			ret = -ENOMEM;
+			goto unlock;
+		}
+
+		gfn_buffer[i] = pfn;
+		// vunmap(addr);
+
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_dma_do_translate);
+
+static void *vfio_iommu_vgpu_open(unsigned long arg)
+{
+	struct vfio_iommu_vgpu *iommu;
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&iommu->lock);
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	/* TODO: keep track of v2 vs. v1; for now assume
+	 * we are v2 due to the QEMU code */
+	_local_iommu = iommu;
+	return iommu;
+}
+
+static void vfio_iommu_vgpu_release(void *iommu_data)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	kfree(iommu);
+	printk(KERN_INFO "%s", __FUNCTION__);
+}
+
+static long vfio_iommu_vgpu_ioctl(void *iommu_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
+
+	switch (cmd) {
+	case VFIO_CHECK_EXTENSION:
+	{
+		if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
+			return 1;
+		else
+			return 0;
+	}
+
+	case VFIO_IOMMU_GET_INFO:
+	{
+		struct vfio_iommu_type1_info info;
+		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_IOMMU_MAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_map map;
+		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n",
+			map.flags, map.vaddr, map.iova, map.size);
+
+		/*
+		 * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally,
+		 * this should be merged into one single file and reuse data
+		 * structure
+		 *
+		 */
+		ret = vgpu_dma_do_track(vgpu_iommu, &map);
+		break;
+	}
+	case VFIO_IOMMU_UNMAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz)
+			return -EINVAL;
+
+		ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
+		break;
+	}
+	default:
+	{
+		printk(KERN_INFO "%s cmd default ", __FUNCTION__);
+		ret = -ENOTTY;
+		break;
+	}
+	}
+
+	return ret;
+}
+
+
+static int vfio_iommu_vgpu_attach_group(void *iommu_data,
+		                        struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	vgpu_dev = get_vgpu_device_from_group(iommu_group);
+	if (vgpu_dev) {
+		iommu->vgpu_dev = vgpu_dev;
+		iommu->group = iommu_group;
+
+		/* IOMMU shares the same life cycle as the VM's mm */
+		iommu->vm_mm = current->mm;
+
+		printk(KERN_INFO "%s index %d", __FUNCTION__, vgpu_dev->minor);
+		return 0;
+	}
+	iommu->group = iommu_group;
+	return 1;
+}
+
+static void vfio_iommu_vgpu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+	iommu->vm_mm = NULL;
+	iommu->group = NULL;
+
+	return;
+}
+
+
+static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
+	.name           = "vgpu_vfio",
+	.owner          = THIS_MODULE,
+	.open           = vfio_iommu_vgpu_open,
+	.release        = vfio_iommu_vgpu_release,
+	.ioctl          = vfio_iommu_vgpu_ioctl,
+	.attach_group   = vfio_iommu_vgpu_attach_group,
+	.detach_group   = vfio_iommu_vgpu_detach_group,
+};
+
+
+int vgpu_vfio_iommu_init(void)
+{
+	int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc);
+	}
+
+	return rc;
+}
+
+void vgpu_vfio_iommu_exit(void)
+{
+	// unregister vgpu_vfio driver
+	vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+}
+
+
+module_init(vgpu_vfio_iommu_init);
+module_exit(vgpu_vfio_iommu_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
diff --git a/drivers/vgpu/vgpu_dev.c b/drivers/vgpu/vgpu_dev.c
new file mode 100644
index 000000000000..1d4eb235122c
--- /dev/null
+++ b/drivers/vgpu/vgpu_dev.c
@@ -0,0 +1,550 @@
+/*
+ * VGPU core
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME		"vgpu"
+
+#define VGPU_DEV_NAME		"vgpu"
+
+// TODO remove these defines
+// minor number reserved for control device
+#define VGPU_CONTROL_DEVICE       0
+
+#define VGPU_CONTROL_DEVICE_NAME  "vgpuctl"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+	dev_t               vgpu_devt;
+	struct class        *class;
+	struct cdev         vgpu_cdev;
+	struct list_head    vgpu_devices_list;  // Head entry for the doubly linked vgpu_device list
+	struct mutex        vgpu_devices_lock;
+	struct idr          vgpu_idr;
+	struct list_head    gpu_devices_list;
+	struct mutex        gpu_devices_lock;
+} vgpu;
+
+
+/*
+ * Function prototypes
+ */
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev);
+
+unsigned int vgpu_poll(struct file *file, poll_table *wait);
+long vgpu_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long i_arg);
+int vgpu_mmap(struct file *file, struct vm_area_struct *vma);
+
+int vgpu_open(struct inode *inode, struct file *file);
+int vgpu_close(struct inode *inode, struct file *file);
+ssize_t vgpu_read(struct file *file, char __user * buf,
+		      size_t len, loff_t * ppos);
+ssize_t vgpu_write(struct file *file, const char __user *data,
+		       size_t len, loff_t *ppos);
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->group) {
+			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
+				mutex_unlock(&vgpu.vgpu_devices_lock);
+				return vdev;
+			}
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev, *tmp;
+
+	if (!dev)
+		return -EINVAL;
+
+	gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+	if (!gpu_dev)
+		return -ENOMEM;
+
+	gpu_dev->dev = dev;
+	gpu_dev->ops = ops;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+
+	/* Check for duplicates */
+	list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+		if (tmp->dev == dev) {
+			mutex_unlock(&vgpu.gpu_devices_lock);
+			kfree(gpu_dev);
+			return -EINVAL;
+		}
+	}
+
+	ret = vgpu_create_pci_device_files(dev);
+	if (ret) {
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		kfree(gpu_dev);
+		return ret;
+	}
+	list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev == dev) {
+			printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+			vgpu_remove_pci_device_files(dev);
+			list_del(&gpu_dev->gpu_next);
+			mutex_unlock(&vgpu.gpu_devices_lock);
+			kfree(gpu_dev);
+			return;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+
+/*
+ *  Static functions
+ */
+
+static struct file_operations vgpu_fops = {
+	.owner          = THIS_MODULE,
+};
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev)
+{
+	if (vgpu_dev->dev) {
+		device_destroy(vgpu.class, vgpu_dev->dev->devt);
+		vgpu_dev->dev = NULL;
+	}
+}
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
+{
+	struct vgpu_device *vgpu_dev = NULL;
+
+	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+	if (!vgpu_dev)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vgpu_dev->kref);
+	memcpy(&vgpu_dev->vm_uuid, &uuid, sizeof(uuid_le));
+	vgpu_dev->vgpu_instance = instance;
+	strcpy(vgpu_dev->dev_name, name);
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_del(&vgpu_dev->list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	kfree(vgpu_dev);
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+struct vgpu_device *find_vgpu_device(struct device *dev)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->dev == dev) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id)
+{
+	int minor;
+	char name[64];
+	int numChar = 0;
+	int retval = 0;
+
+	struct iommu_group *group = NULL;
+	struct device *dev = NULL;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	struct gpu_device *gpu_dev;
+
+	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+	numChar = sprintf(name, "%pUb-%d", vm_uuid.b, instance);
+	name[numChar] = '\0';
+
+	vgpu_dev = vgpu_device_alloc(vm_uuid, instance, name);
+	if (IS_ERR(vgpu_dev)) {
+		return PTR_ERR(vgpu_dev);
+	}
+
+	// check if VM device is present
+	// if not present, create with devt=0 and parent=NULL
+	// create device for instance with devt= MKDEV(vgpu.major, minor)
+	// and parent=VM device
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_dev->vgpu_id = vgpu_id;
+
+	// TODO: when the control device is removed, change the 3rd parameter to 0
+	minor = idr_alloc(&vgpu.vgpu_idr, vgpu_dev, 1, MINORMASK + 1, GFP_KERNEL);
+	if (minor < 0) {
+		retval = minor;
+		goto create_failed;
+	}
+
+	dev = device_create(vgpu.class, NULL, MKDEV(MAJOR(vgpu.vgpu_devt), minor), NULL, "%s", name);
+	if (IS_ERR(dev)) {
+		retval = PTR_ERR(dev);
+		goto create_failed1;
+	}
+
+	vgpu_dev->dev = dev;
+	vgpu_dev->minor = minor;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev == pdev) {
+			vgpu_dev->gpu_dev = gpu_dev;
+			if (gpu_dev->ops->vgpu_create) {
+				retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->vm_uuid,
+								   instance, vgpu_id);
+				if (retval)
+				{
+					mutex_unlock(&vgpu.gpu_devices_lock);
+					goto create_failed2;
+				}
+			}
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	if (!vgpu_dev->gpu_dev) {
+		retval = -EINVAL;
+		goto create_failed2;
+	}
+
+	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->vm_uuid.b);
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		printk(KERN_ERR "VGPU: failed to allocate group!\n");
+		retval = PTR_ERR(group);
+		goto create_failed2;
+	}
+
+	retval = iommu_group_add_device(group, dev);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+		iommu_group_put(group);
+		goto create_failed2;
+	}
+
+	retval = vgpu_group_init(vgpu_dev, group);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed vgpu_group_init \n");
+		iommu_group_put(group);
+		iommu_group_remove_device(dev);
+		goto create_failed2;
+	}
+
+	vgpu_dev->group = group;
+	printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return retval;
+
+create_failed2:
+	vgpu_device_destroy(vgpu_dev);
+
+create_failed1:
+	idr_remove(&vgpu.vgpu_idr, minor);
+
+create_failed:
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+
+	return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+	struct device *dev = vgpu_dev->dev;
+
+	if (!dev) {
+		return;
+	}
+
+	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+	if (vgpu_dev->gpu_dev->ops->vgpu_destroy) {
+		int retval = 0;
+		retval = vgpu_dev->gpu_dev->ops->vgpu_destroy(vgpu_dev->gpu_dev->dev,
+							      vgpu_dev->vm_uuid,
+							      vgpu_dev->vgpu_instance);
+	/* if vendor driver doesn't return success that means vendor driver doesn't
+	 * support hot-unplug */
+		if (retval)
+			return;
+	}
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_group_free(vgpu_dev);
+	iommu_group_put(dev->iommu_group);
+	iommu_group_remove_device(dev);
+	vgpu_device_destroy(vgpu_dev);
+	idr_remove(&vgpu.vgpu_idr, vgpu_dev->minor);
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+}
+
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev, *vgpu_dev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	// search VGPU device
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			vgpu_dev = vdev;
+			break;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	if (vgpu_dev)
+		destroy_vgpu_device(vgpu_dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (&gpu_dev->dev->dev == dev) {
+			if (gpu_dev->ops->vgpu_supported_config)
+				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_start)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_start(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_shutdown)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_shutdown(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_set_irqs)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_set_irqs(vgpu_dev, flags,
+							    index, start, count, data);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static struct class vgpu_class = {
+	.name		= VGPU_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= vgpu_class_attrs,
+	.dev_groups	= vgpu_dev_groups,
+	.devnode	= vgpu_devnode,
+};
+
+static int __init vgpu_init(void)
+{
+	int rc = 0;
+
+	memset(&vgpu, 0, sizeof(vgpu));
+
+	idr_init(&vgpu.vgpu_idr);
+	mutex_init(&vgpu.vgpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	// get major number from kernel
+	rc = alloc_chrdev_region(&vgpu.vgpu_devt, 0, MINORMASK, VGPU_DEV_NAME);
+
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu drv, err:%d\n", rc);
+		return rc;
+	}
+
+	cdev_init(&vgpu.vgpu_cdev, &vgpu_fops);
+	cdev_add(&vgpu.vgpu_cdev, vgpu.vgpu_devt, MINORMASK);
+
+	printk(KERN_ALERT "major_number:%d is allocated for vgpu\n", MAJOR(vgpu.vgpu_devt));
+
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	vgpu.class = &vgpu_class;
+
+	return rc;
+
+failed1:
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	// TODO: Release all unclosed fd
+	struct vgpu_device *vdev = NULL, *tmp;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry_safe(vdev, tmp, &vgpu.vgpu_devices_list, list) {
+		printk(KERN_INFO "VGPU: exit destroying device %s ", vdev->dev_name);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		destroy_vgpu_device(vdev);
+		mutex_lock(&vgpu.vgpu_devices_lock);
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	idr_destroy(&vgpu.vgpu_idr);
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+	class_unregister(&vgpu_class);
+	vgpu.class = NULL;
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 000000000000..7e3c400d29f7
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,47 @@
+/*
+ * VGPU internal definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group * group);
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev);
+
+struct vgpu_device *find_vgpu_device(struct device *dev);
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int vgpu_create_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_notify_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_remove_status_file(struct vgpu_device *vgpu_dev);
+
+int vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/drivers/vgpu/vgpu_sysfs.c b/drivers/vgpu/vgpu_sysfs.c
new file mode 100644
index 000000000000..e48cbcd6948d
--- /dev/null
+++ b/drivers/vgpu/vgpu_sysfs.c
@@ -0,0 +1,322 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < 36)
+		return -1;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			printk(KERN_ERR "%s err", __FUNCTION__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	char *str;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	get_vgpu_supported_types(dev, str);
+
+	n = sprintf(buf,"%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *instance_str, *str;
+	uuid_le vm_uuid;
+	uint32_t instance, vgpu_id;
+	struct pci_dev *pdev;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s vgpu type and instance not specified %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if ((instance_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty instance or string %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s vgpu type not specified %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+
+	}
+
+	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+
+	vgpu_id = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+
+		if (create_vgpu_device(pdev, vm_uuid, instance, vgpu_id) < 0) {
+			printk(KERN_ERR "%s vgpu create error \n", __FUNCTION__);
+			return -EINVAL;
+		}
+	}
+
+	return count;
+}
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *str;
+	uuid_le vm_uuid;
+	unsigned int instance;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s instance not specified %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	instance = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, vm_uuid.b, instance);
+
+	destroy_vgpu_device_by_uuid(vm_uuid, instance);
+
+	return count;
+}
+
+static ssize_t
+vgpu_vm_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv)
+		return sprintf(buf, "%pUb \n", drv->vm_uuid.b);
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_vm_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv && drv->group)
+		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+	&dev_attr_vgpu_vm_uuid.attr,
+	&dev_attr_vgpu_group_id.attr,
+	NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+	.attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+	&vgpu_dev_group,
+	NULL,
+};
+
+
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_ONLINE);
+
+		ret = vgpu_start_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_start callback failed  %d \n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_OFFLINE);
+
+		ret = vgpu_shutdown_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_shutdown callback failed  %d \n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+	__ATTR_WO(vgpu_start),
+	__ATTR_WO(vgpu_shutdown),
+	__ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
+
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 000000000000..ef0833140d84
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,521 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+struct vfio_vgpu_device {
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+};
+
+static int vgpu_dev_open(void *device_data)
+{
+	printk(KERN_INFO "%s ", __FUNCTION__);
+	return 0;
+}
+
+static void vgpu_dev_close(void *device_data)
+{
+
+}
+
+static uint64_t resource_len(struct vgpu_device *vgpu_dev, int bar_index)
+{
+	uint64_t size = 0;
+
+	switch (bar_index) {
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		size = 16 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		size = 256 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR2_REGION_INDEX:
+		size = 32 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR5_REGION_INDEX:
+		size = 128;
+		break;
+	default:
+		size = 0;
+		break;
+	}
+	return size;
+}
+
+static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
+{
+	return 1;
+}
+
+static long vgpu_dev_unlocked_ioctl(void *device_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd)
+	{
+		case VFIO_DEVICE_GET_INFO:
+		{
+			struct vfio_device_info info;
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index = %d", __FUNCTION__, vdev->vgpu_dev->minor);
+			minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			info.flags = VFIO_DEVICE_FLAGS_PCI;
+			info.num_regions = VFIO_PCI_NUM_REGIONS;
+			info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+		}
+
+		case VFIO_DEVICE_GET_REGION_INFO:
+		{
+			struct vfio_region_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd", __FUNCTION__);
+
+			minsz = offsetofend(struct vfio_region_info, offset);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_CONFIG_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0x100;     // 256B standard config space
+					//                    info.size = sizeof(vdev->vgpu_dev->config_space);
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+							VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+				case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = resource_len(vdev->vgpu_dev, info.index);
+					if (!info.size) {
+						info.flags = 0;
+						break;
+					}
+
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+
+					if ((info.index == VFIO_PCI_BAR1_REGION_INDEX) ||
+					     (info.index == VFIO_PCI_BAR2_REGION_INDEX)) {
+						info.flags |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+					}
+
+					/* TODO: provide configurable setups to
+					 * the GPU vendor
+					 */
+
+					if (info.index == VFIO_PCI_BAR1_REGION_INDEX)
+						info.flags = VFIO_REGION_INFO_FLAG_MMAP;
+
+					break;
+				case VFIO_PCI_VGA_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0xc0000;
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+
+				case VFIO_PCI_ROM_REGION_INDEX:
+				default:
+					return -EINVAL;
+			}
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+
+		}
+		case VFIO_DEVICE_GET_IRQ_INFO:
+		{
+			struct vfio_irq_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", __FUNCTION__);
+			minsz = offsetofend(struct vfio_irq_info, count);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+				case VFIO_PCI_REQ_IRQ_INDEX:
+					break;
+					/* pass thru to return error */
+				default:
+					return -EINVAL;
+			}
+
+
+			info.flags = VFIO_IRQ_INFO_EVENTFD;
+			info.count = vgpu_get_irq_count(vdev, info.index);
+
+			if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+				info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+						VFIO_IRQ_INFO_AUTOMASKED);
+			else
+				info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+		}
+
+		case VFIO_DEVICE_SET_IRQS:
+		{
+			struct vfio_irq_set hdr;
+			u8 *data = NULL;
+			int ret = 0;
+
+			minsz = offsetofend(struct vfio_irq_set, count);
+
+			if (copy_from_user(&hdr, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+					hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+						VFIO_IRQ_SET_ACTION_TYPE_MASK))
+				return -EINVAL;
+
+			if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+				size_t size;
+				int max = vgpu_get_irq_count(vdev, hdr.index);
+
+				if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+					size = sizeof(uint8_t);
+				else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+					size = sizeof(int32_t);
+				else
+					return -EINVAL;
+
+				if (hdr.argsz - minsz < hdr.count * size ||
+				    hdr.start >= max || hdr.start + hdr.count > max)
+					return -EINVAL;
+
+				data = memdup_user((void __user *)(arg + minsz),
+						hdr.count * size);
+				if (IS_ERR(data))
+					return PTR_ERR(data);
+
+			}
+			ret = vgpu_set_irqs_callback(vdev->vgpu_dev, hdr.flags, hdr.index,
+					hdr.start, hdr.count, data);
+			kfree(data);
+
+
+			return ret;
+		}
+
+		default:
+			return -EINVAL;
+	}
+	return ret;
+}
+
+
+ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	int cfg_size = sizeof(vgpu_dev->config_space);
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= cfg_size || pos + count > cfg_size) {
+		printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto config_rw_exit;
+		}
+
+		/* FIXME: Need to save the BAR value properly */
+		switch (pos) {
+		case PCI_BASE_ADDRESS_0:
+			vgpu_dev->bar[0].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_1:
+			vgpu_dev->bar[1].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_2:
+			vgpu_dev->bar[2].start = *((uint32_t *)user_data);
+			break;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_config,
+							    pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_config,
+							   pos);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+		}
+		kfree(ret_data);
+	}
+
+config_rw_exit:
+
+	return ret;
+}
+
+ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	uint64_t end;
+	int ret = 0;
+
+	if (!vgpu_dev->bar[bar_index].start) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	end = resource_len(vgpu_dev, bar_index);
+
+	if (offset >= end) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vgpu_dev->bar[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_mmio,
+							    pos);
+		}
+
+		kfree(user_data);
+	} else {
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_mmio,
+							   pos);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+			}
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		case VFIO_PCI_VGA_REGION_INDEX:
+			break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, (char *)buf, count, ppos, true);
+
+	return ret;
+}
+
+/* Just create an invalid mapping without providing a fault handler */
+
+static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	printk(KERN_INFO "%s ", __FUNCTION__);
+	return 0;
+}
+
+static const struct vfio_device_ops vgpu_vfio_dev_ops = {
+	.name		= "vfio-vgpu-grp",
+	.open		= vgpu_dev_open,
+	.release	= vgpu_dev_close,
+	.ioctl		= vgpu_dev_unlocked_ioctl,
+	.read		= vgpu_dev_read,
+	.write		= vgpu_dev_write,
+	.mmap		= vgpu_dev_mmap,
+};
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group)
+{
+	struct vfio_vgpu_device *vdev;
+	int ret = 0;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		return -ENOMEM;
+	}
+
+	vdev->group = group;
+	vdev->vgpu_dev = vgpu_dev;
+
+	ret = vfio_add_group_dev(vgpu_dev->dev, &vgpu_vfio_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	return ret;
+}
+
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev)
+{
+	struct vfio_vgpu_device *vdev;
+
+	vdev = vfio_del_group_dev(vgpu_dev->dev);
+	if (!vdev)
+		return -1;
+
+	kfree(vdev);
+	return 0;
+}
+
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 000000000000..a2861c3f42e5
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,157 @@
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t end;
+	int flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0, /*!< PCI configuration space */
+	vgpu_emul_space_io = 1,     /*!< I/O register space */
+	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		*dev;
+	int minor;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			vm_uuid;
+	uint32_t		vgpu_instance;
+	uint32_t		vgpu_id;
+	atomic_t		usage_count;
+	char			config_space[0x100];          // 256-byte standard PCI config space
+	struct pci_bar_info	bar[VFIO_PCI_NUM_REGIONS];
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:			The module owner.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev : pci device structure of physical GPU.
+ *				@config: should return string listing supported config
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resources in the graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which vgpu
+ *				      should be created
+ *				@vm_uuid: VM's uuid for which VM it is intended to
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_id: This represents the type of vgpu to be
+ *					  created
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure to which
+ *				this vgpu points to.
+ *				@vm_uuid: VM's uuid for which the vgpu belongs to.
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If the VM is running when vgpu_destroy is called,
+ *				the vGPU is being hot-unplugged. Return an error
+ *				if the VM is running and the graphics driver
+ *				doesn't support vgpu hot-unplug.
+ * @vgpu_start:			Called to initiate the vGPU initialization
+ *				process in the graphics driver when the VM boots,
+ *				before qemu starts.
+ *				@vm_uuid: VM's UUID which is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to teardown vGPU related resources for
+ *				the VM
+ *				@vm_uuid: UUID of the VM which is shutting down.
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number bytes to read
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Returns number of bytes read on success or error.
+ * @write:			Write emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number bytes to be written
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Returns number of bytes written on success or error.
+ * @vgpu_set_irqs:		Called to convey the interrupt configuration
+ *				information that qemu set.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data : same as
+ *				that of struct vfio_irq_set of
+ *				VFIO_DEVICE_SET_IRQS API.
+ *
+ * A physical GPU that supports vGPU should be registered with the vgpu module
+ * with a gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module   *owner;
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
+			       uint32_t instance, uint32_t vgpu_id);
+	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
+			        uint32_t instance);
+	int     (*vgpu_start)(uuid_le vm_uuid);
+	int     (*vgpu_shutdown)(uuid_le vm_uuid);
+	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev                  *dev;
+	const struct gpu_device_ops     *ops;
+	struct list_head                gpu_next;
+};
+
+extern int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr, uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
+
-- 
1.8.1.4
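
Before moving to the QEMU side, here is a hypothetical userspace sketch of
how the sysfs interface from the patch above is driven. vgpu_create_store()
parses "UUID:instance:vgpu_id"; the PCI address and the UUID below are made
up for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* "UUID:instance:vgpu_id", as parsed by vgpu_create_store() */
        const char *req = "a3f415ae-6f91-4c9e-9e3d-8f0a2b3c4d5e:0:1";
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/vgpu_create",
                      O_WRONLY);

        if (fd < 0) {
                perror("open vgpu_create");
                return 1;
        }
        if (write(fd, req, strlen(req)) < 0)
                perror("write vgpu_create");
        close(fd);
        return 0;
}

Writing "UUID:instance" to vgpu_destroy, and the UUID alone to the class
attributes vgpu_start/vgpu_shutdown, follows the same pattern.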

>From 380156ade7053664bdb318af0659708357f40050 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Sun, 24 Jan 2016 11:24:13 -0800
Subject: [PATCH] Add VGPU VFIO driver class support in QEMU

This is just a quick POC change to allow us to experiment with the VGPU VFIO
support; the next step is to merge this into the current vfio/pci.c, which
currently has a physical backing device.

Within the current POC implementation we have copied lots of functions directly
from the vfio/pci.c code; we should merge them together later.

    - Basic MMIO and PCI config accesses are supported

    - MMAP'ed GPU bar is supported

    - INTx and MSI using eventfd are supported; we don't think we should
      inject interrupts when vector->kvm_interrupt is not enabled.
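
For reference, device regions use the fixed 40-bit offset encoding defined in
vgpu_vfio.c of the kernel patch above. QEMU actually uses the offsets returned
by VFIO_DEVICE_GET_REGION_INFO; the sketch below (with a hypothetical device
fd and register offset) only illustrates how such an offset is composed:

#define _FILE_OFFSET_BITS 64  /* make off_t 64-bit */
#include <stdint.h>
#include <unistd.h>

#define VFIO_PCI_OFFSET_SHIFT  40
#define VFIO_PCI_INDEX_TO_OFFSET(index) \
        ((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_BAR0_REGION_INDEX 0  /* from <linux/vfio.h> */

static ssize_t read_bar0_reg(int device_fd, uint64_t reg,
                             void *val, size_t len)
{
        /* region index in the top bits, register offset in the low 40 bits */
        off_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX) + reg;

        return pread(device_fd, val, len, pos);
}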

Change-Id: I99c34ac44524cd4d7d2abbcc4d43634297b96e80

Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/Makefile.objs |   1 +
 hw/vfio/vgpu.c        | 991 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci.h  |   3 +
 3 files changed, 995 insertions(+)
 create mode 100644 hw/vfio/vgpu.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d324863..17f2ef1 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,7 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o pci-quirks.o
+obj-$(CONFIG_PCI) += vgpu.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/vgpu.c b/hw/vfio/vgpu.c
new file mode 100644
index 0000000..56ebce0
--- /dev/null
+++ b/hw/vfio/vgpu.c
@@ -0,0 +1,991 @@
+/*
+ * vGPU VFIO device
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <dirent.h>
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "config.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/pci.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
+#include "trace.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio-common.h"
+#include "qmp-commands.h"
+
+#define TYPE_VFIO_VGPU "vfio-vgpu"
+
+typedef struct VFIOvGPUDevice {
+    PCIDevice pdev;
+    VFIODevice vbasedev;
+    VFIOINTx intx;
+    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
+    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
+    unsigned int config_size;
+    char  *vgpu_type;
+    char *vm_uuid;
+    off_t config_offset; /* Offset of config space region within device fd */
+    int msi_cap_size;
+    EventNotifier req_notifier;
+    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
+    int interrupt; /* Current interrupt type */
+    VFIOMSIVector *msi_vectors;
+} VFIOvGPUDevice;
+
+/*
+ * Local functions
+ */
+
+// function prototypes
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev);
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len);
+
+
+// INTx functions
+
+static void vfio_vgpu_intx_interrupt(void *opaque)
+{
+    VFIOvGPUDevice *vdev = opaque;
+
+    if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) {
+        return;
+    }
+
+    vdev->intx.pending = true;
+    pci_irq_assert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, false);
+
+}
+
+static void vfio_vgpu_intx_eoi(VFIODevice *vbasedev)
+{
+    VFIOvGPUDevice *vdev = container_of(vbasedev, VFIOvGPUDevice, vbasedev);
+
+    if (!vdev->intx.pending) {
+        return;
+    }
+
+    trace_vfio_intx_eoi(vbasedev->name);
+
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+    vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+}
+
+static void vfio_vgpu_intx_enable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_RESAMPLE,
+    };
+    struct vfio_irq_set *irq_set;
+    int ret, argsz;
+    int32_t *pfd;
+
+    if (!kvm_irqfds_enabled() ||
+        vdev->intx.route.mode != PCI_INTX_ENABLED ||
+        !kvm_resamplefds_enabled()) {
+        return;
+    }
+
+    /* Get to a known interrupt state */
+    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Get an eventfd for resample/unmask */
+    if (event_notifier_init(&vdev->intx.unmask, 0)) {
+        error_report("vfio: Error: event_notifier_init failed eoi");
+        goto fail;
+    }
+
+    /* KVM triggers it, VFIO listens for it */
+    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
+
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to setup resample irqfd: %m");
+        goto fail_irqfd;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = irqfd.resamplefd;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx unmask fd: %m");
+        goto fail_vfio;
+    }
+
+    /* Let'em rip */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    vdev->intx.kvm_accel = true;
+
+    trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
+
+    return;
+
+fail_vfio:
+    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
+    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+fail_irqfd:
+    event_notifier_cleanup(&vdev->intx.unmask);
+fail:
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+#endif
+}
+
+static void vfio_vgpu_intx_disable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_DEASSIGN,
+    };
+
+    if (!vdev->intx.kvm_accel) {
+        return;
+    }
+
+    /*
+     * Get to a known state, hardware masked, QEMU ready to accept new
+     * interrupts, QEMU IRQ de-asserted.
+     */
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Tell KVM to stop listening for an INTx irqfd */
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to disable INTx irqfd: %m");
+    }
+
+    /* We only need to close the eventfd for VFIO to cleanup the kernel side */
+    event_notifier_cleanup(&vdev->intx.unmask);
+
+    /* QEMU starts listening for interrupt events. */
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    vdev->intx.kvm_accel = false;
+
+    /* If we've missed an event, let it re-fire through QEMU */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    trace_vfio_intx_disable_kvm(vdev->vbasedev.name);
+#endif
+}
+
+static void vfio_vgpu_intx_update(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    PCIINTxRoute route;
+
+    if (vdev->interrupt != VFIO_INT_INTx) {
+        return;
+    }
+
+    route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin);
+
+    if (!pci_intx_route_changed(&vdev->intx.route, &route)) {
+        return; /* Nothing changed */
+    }
+
+    trace_vfio_intx_update(vdev->vbasedev.name,
+                           vdev->intx.route.irq, route.irq);
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+
+    vdev->intx.route = route;
+
+    if (route.mode != PCI_INTX_ENABLED) {
+        return;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    /* Re-enable the interrupt in case we missed an EOI */
+    vfio_vgpu_intx_eoi(&vdev->vbasedev);
+}
+
+static int vfio_vgpu_intx_enable(VFIOvGPUDevice *vdev)
+{
+    uint8_t pin = vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
+    int ret, argsz;
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+
+    if (!pin) {
+        return 0;
+    }
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
+    pci_config_set_interrupt_pin(vdev->pdev.config, pin);
+
+#ifdef CONFIG_KVM
+    /*
+     * Only conditional to avoid generating error messages on platforms
+     * where we won't actually use the result anyway.
+     */
+    if (kvm_irqfds_enabled() && kvm_resamplefds_enabled()) {
+        vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,
+                                                        vdev->intx.pin);
+    }
+#endif
+
+    ret = event_notifier_init(&vdev->intx.interrupt, 0);
+    if (ret) {
+        error_report("vfio: Error: event_notifier_init failed");
+        return ret;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(*pfd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx fd: %m");
+        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->intx.interrupt);
+        return -errno;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    vdev->interrupt = VFIO_INT_INTx;
+
+    trace_vfio_intx_enable(vdev->vbasedev.name);
+
+    return 0;
+}
+
+static void vfio_vgpu_intx_disable(VFIOvGPUDevice *vdev)
+{
+    int fd;
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, true);
+
+    fd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(fd, NULL, NULL, vdev);
+    event_notifier_cleanup(&vdev->intx.interrupt);
+
+    vdev->interrupt = VFIO_INT_NONE;
+
+    trace_vfio_intx_disable(vdev->vbasedev.name);
+}
+
+//MSI functions
+static void vfio_vgpu_remove_kvm_msi_virq(VFIOMSIVector *vector)
+{
+    kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                          vector->virq);
+    kvm_irqchip_release_virq(kvm_state, vector->virq);
+    vector->virq = -1;
+    event_notifier_cleanup(&vector->kvm_interrupt);
+}
+
+static void vfio_vgpu_msi_disable_common(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        if (vdev->msi_vectors[i].use) {
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+    }
+
+    g_free(vdev->msi_vectors);
+    vdev->msi_vectors = NULL;
+    vdev->nr_vectors = 0;
+    vdev->interrupt = VFIO_INT_NONE;
+
+    vfio_vgpu_intx_enable(vdev);
+}
+
+static void vfio_vgpu_msi_disable(VFIOvGPUDevice *vdev)
+{
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSI_IRQ_INDEX);
+    vfio_vgpu_msi_disable_common(vdev);
+}
+
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev)
+{
+
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        vfio_vgpu_msi_disable(vdev);
+    }
+
+    if (vdev->interrupt == VFIO_INT_INTx) {
+        vfio_vgpu_intx_disable(vdev);
+    }
+}
+
+
+static void vfio_vgpu_msi_interrupt(void *opaque)
+{
+    VFIOMSIVector *vector = opaque;
+    VFIOvGPUDevice *vdev = (VFIOvGPUDevice *)vector->vdev;
+    MSIMessage (*get_msg)(PCIDevice *dev, unsigned vector);
+    void (*notify)(PCIDevice *dev, unsigned vector);
+    MSIMessage msg;
+    int nr = vector - vdev->msi_vectors;
+
+    if (!event_notifier_test_and_clear(&vector->interrupt)) {
+        return;
+    }
+
+    if (vdev->interrupt == VFIO_INT_MSIX) {
+        get_msg = msix_get_message;
+        notify = msix_notify;
+    } else if (vdev->interrupt == VFIO_INT_MSI) {
+        get_msg = msi_get_message;
+        notify = msi_notify;
+    } else {
+        abort();
+    }
+
+    msg = get_msg(&vdev->pdev, nr);
+    trace_vfio_msi_interrupt(vdev->vbasedev.name, nr, msg.address, msg.data);
+    notify(&vdev->pdev, nr);
+}
+
+static int vfio_vgpu_enable_vectors(VFIOvGPUDevice *vdev, bool msix)
+{
+    struct vfio_irq_set *irq_set;
+    int ret = 0, i, argsz;
+    int32_t *fds;
+
+    argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds));
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = msix ? VFIO_PCI_MSIX_IRQ_INDEX : VFIO_PCI_MSI_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = vdev->nr_vectors;
+    fds = (int32_t *)&irq_set->data;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        int fd = -1;
+
+        /*
+         * MSI vs MSI-X - The guest has direct access to MSI mask and pending
+         * bits, therefore we always use the KVM signaling path when setup.
+         * MSI-X mask and pending bits are emulated, so we want to use the
+         * KVM signaling path only when configured and unmasked.
+         */
+        if (vdev->msi_vectors[i].use) {
+            if (vdev->msi_vectors[i].virq < 0 ||
+                (msix && msix_is_masked(&vdev->pdev, i))) {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
+            } else {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].kvm_interrupt);
+            }
+        }
+
+        fds[i] = fd;
+    }
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+
+    g_free(irq_set);
+
+    return ret;
+}
+
+static void vfio_vgpu_add_kvm_msi_virq(VFIOvGPUDevice *vdev, VFIOMSIVector *vector,
+                                  MSIMessage *msg, bool msix)
+{
+    int virq;
+
+    if (!msg) {
+        return;
+    }
+
+    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+        return;
+    }
+
+    virq = kvm_irqchip_add_msi_route(kvm_state, *msg, &vdev->pdev);
+    if (virq < 0) {
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                       NULL, virq) < 0) {
+        kvm_irqchip_release_virq(kvm_state, virq);
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    vector->virq = virq;
+}
+
+static void vfio_vgpu_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
+                                     PCIDevice *pdev)
+{
+    kvm_irqchip_update_msi_route(kvm_state, vector->virq, msg, pdev);
+}
+
+
+static void vfio_vgpu_msi_enable(VFIOvGPUDevice *vdev)
+{
+    int ret, i;
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->nr_vectors = msi_nr_vectors_allocated(&vdev->pdev);
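+
+    /*
+     * The host may be able to enable fewer vectors than requested; in that
+     * case vfio_vgpu_enable_vectors() below reports the count actually
+     * available and we retry here with that number.
+     */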
+retry:
+    vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->nr_vectors);
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg = msi_get_message(&vdev->pdev, i);
+
+        vector->vdev = (VFIOPCIDevice *)vdev;
+        vector->virq = -1;
+        vector->use = true;
+
+        if (event_notifier_init(&vector->interrupt, 0)) {
+            error_report("vfio: Error: event_notifier_init failed");
+        }
+        qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                            vfio_vgpu_msi_interrupt, NULL, vector);
+
+        /*
+         * Attempt to enable route through KVM irqchip,
+         * default to userspace handling if unavailable.
+         */
+        vfio_vgpu_add_kvm_msi_virq(vdev, vector, &msg, false);
+    }
+
+    /* Set interrupt type prior to possible interrupts */
+    vdev->interrupt = VFIO_INT_MSI;
+
+    ret = vfio_vgpu_enable_vectors(vdev, false);
+    if (ret) {
+        if (ret < 0) {
+            error_report("vfio: Error: Failed to setup MSI fds: %m");
+        } else if (ret != vdev->nr_vectors) {
+            error_report("vfio: Error: Failed to enable %d "
+                         "MSI vectors, retry with %d", vdev->nr_vectors, ret);
+        }
+
+        for (i = 0; i < vdev->nr_vectors; i++) {
+            VFIOMSIVector *vector = &vdev->msi_vectors[i];
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+
+        g_free(vdev->msi_vectors);
+
+        if (ret > 0 && ret != vdev->nr_vectors) {
+            vdev->nr_vectors = ret;
+            goto retry;
+        }
+        vdev->nr_vectors = 0;
+
+        /*
+         * Failing to setup MSI doesn't really fall within any specification.
+         * Let's try leaving interrupts disabled and hope the guest figures
+         * out to fall back to INTx for this device.
+         */
+        error_report("vfio: Error: Failed to enable MSI");
+        vdev->interrupt = VFIO_INT_NONE;
+
+        return;
+    }
+}
+
+static void vfio_vgpu_update_msi(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg;
+
+        if (!vector->use || vector->virq < 0) {
+            continue;
+        }
+
+        msg = msi_get_message(&vdev->pdev, i);
+        vfio_vgpu_update_kvm_msi_virq(vector, msg, &vdev->pdev);
+    }
+}
+
+static int vfio_vgpu_msi_setup(VFIOvGPUDevice *vdev, int pos)
+{
+    uint16_t ctrl;
+    bool msi_64bit, msi_maskbit;
+    int ret, entries;
+
+    if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
+              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
+        return -errno;
+    }
+    ctrl = le16_to_cpu(ctrl);
+
+    msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
+    msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
+    entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
+
+    ret = msi_init(&vdev->pdev, pos, entries, msi_64bit, msi_maskbit);
+    if (ret < 0) {
+        if (ret == -ENOTSUP) {
+            return 0;
+        }
+        error_report("vfio: msi_init failed");
+        return ret;
+    }
+    vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 0);
+
+    return 0;
+}
+
+
+static int vfio_vgpu_msi_init(VFIOvGPUDevice *vdev)
+{
+    uint8_t pos;
+    int ret;
+
+    pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSI);
+    if (!pos) {
+        return 0;
+    }
+
+    ret = vfio_vgpu_msi_setup(vdev, pos);
+    if (ret < 0) {
+        error_report("vgpu: Error setting MSI@0x%x: %d", pos, ret);
+        return ret;
+    }
+
+    return 0;
+}
+
+/*
+ * VGPU device class functions
+ */
+
+static void vfio_vgpu_reset(DeviceState *dev)
+{
+
+
+}
+
+static void vfio_vgpu_eoi(VFIODevice *vbasedev)
+{
+    return;
+}
+
+static int vfio_vgpu_hot_reset_multi(VFIODevice *vbasedev)
+{
+    /* Nothing to be reset */
+    return 0;
+}
+
+static void vfio_vgpu_compute_needs_reset(VFIODevice *vbasedev)
+{
+    vbasedev->needs_reset = false;
+}
+
+static VFIODeviceOps vfio_vgpu_ops = {
+    .vfio_compute_needs_reset = vfio_vgpu_compute_needs_reset,
+    .vfio_hot_reset_multi = vfio_vgpu_hot_reset_multi,
+    .vfio_eoi = vfio_vgpu_eoi,
+};
+
+static int vfio_vgpu_populate_device(VFIOvGPUDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+    int i, ret = -1;
+
+    for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
+        reg_info.index = i;
+
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        if (ret) {
+            error_report("vfio: Error getting region %d info: %m", i);
+            return ret;
+        }
+
+        trace_vfio_populate_device_region(vbasedev->name, i,
+                                          (unsigned long)reg_info.size,
+                                          (unsigned long)reg_info.offset,
+                                          (unsigned long)reg_info.flags);
+
+        vdev->bars[i].region.vbasedev = vbasedev;
+        vdev->bars[i].region.flags = reg_info.flags;
+        vdev->bars[i].region.size = reg_info.size;
+        vdev->bars[i].region.fd_offset = reg_info.offset;
+        vdev->bars[i].region.nr = i;
+        QLIST_INIT(&vdev->bars[i].quirks);
+    }
+
+    reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+    if (ret) {
+        error_report("vfio: Error getting config info: %m");
+        return ret;
+    }
+
+    vdev->config_size = reg_info.size;
+    if (vdev->config_size == PCI_CONFIG_SPACE_SIZE) {
+        vdev->pdev.cap_present &= ~QEMU_PCI_CAP_EXPRESS;
+    }
+    vdev->config_offset = reg_info.offset;
+
+    return 0;
+}
+
+static void vfio_vgpu_create_virtual_bar(VFIOvGPUDevice *vdev, int nr)
+{
+    VFIOBAR *bar = &vdev->bars[nr];
+    uint64_t size = bar->region.size;
+    char name[64];
+    uint32_t pci_bar;
+    uint8_t type;
+    int ret;
+
+    /* Skip both unimplemented BARs and the upper half of 64bit BARS. */
+    if (!size) {
+        return;
+    }
+
+    /* Determine what type of BAR this is for registration */
+    ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
+                vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
+    if (ret != sizeof(pci_bar)) {
+        error_report("vfio: Failed to read BAR %d (%m)", nr);
+        return;
+    }
+
+    pci_bar = le32_to_cpu(pci_bar);
+    bar->ioport = (pci_bar & PCI_BASE_ADDRESS_SPACE_IO);
+    bar->mem64 = bar->ioport ? 0 : (pci_bar & PCI_BASE_ADDRESS_MEM_TYPE_64);
+    type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
+                                    ~PCI_BASE_ADDRESS_MEM_MASK);
+
+    /* Build a name for the memory region before handing it to the core */
+    snprintf(name, sizeof(name), "VFIO vGPU %s BAR %d",
+             vdev->vbasedev.name, nr);
+
+    /* A "slow" read/write mapping underlies all BARs */
+    memory_region_init_io(&bar->region.mem, OBJECT(vdev), &vfio_region_ops,
+                          bar, name, size);
+    pci_register_bar(&vdev->pdev, nr, type, &bar->region.mem);
+
+    /* Create the mmap mapping for this BAR when the region supports it */
+    if (bar->region.flags & VFIO_REGION_INFO_FLAG_MMAP) {
+        strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
+        vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem,
+                         &bar->region.mmap_mem, &bar->region.mmap,
+                         size, 0, name);
+    }
+}
+
+static void vfio_vgpu_create_virtual_bars(VFIOvGPUDevice *vdev)
+{
+
+    int i = 0;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        vfio_vgpu_create_virtual_bar(vdev, i);
+    }
+}
+
+static int vfio_vgpu_initfn(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    VFIOGroup *group;
+    ssize_t len;
+    int groupid;
+    struct stat st;
+    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
+    int ret;
+    UuidInfo *uuid_info;
+
+    uuid_info = qmp_query_uuid(NULL);
+    if (strcmp(uuid_info->UUID, UUID_NONE) == 0) {
+        return -EINVAL;
+    } else {
+        vdev->vm_uuid = uuid_info->UUID;
+    }
+
+
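+    /*
+     * The vgpu device created by the vgpu kernel module is exposed under
+     * /sys/devices/virtual/vgpu/ as <VM_UUID>-<instance>; this POC assumes
+     * a single vgpu per VM, i.e. instance 0.
+     */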
+    snprintf(path, sizeof(path), 
+             "/sys/devices/virtual/vgpu/%s-0/", vdev->vm_uuid);
+
+    if (stat(path, &st) < 0) {
+        error_report("vfio-vgpu: error: no such vgpu device: %s", path);
+        return -errno;
+    } 
+
+    vdev->vbasedev.ops = &vfio_vgpu_ops;
+
+    vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
+    vdev->vbasedev.name = g_strdup_printf("%s-0", vdev->vm_uuid);
+
+    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
+
+    len = readlink(path, iommu_group_path, sizeof(path));
+    if (len <= 0 || len >= sizeof(path)) {
+        error_report("vfio-vgpu: error no iommu_group for device");
+        return len < 0 ? -errno : -ENAMETOOLONG;
+    }
+
+    iommu_group_path[len] = 0;
+    group_name = basename(iommu_group_path);
+
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_report("vfio-vgpu: error reading %s: %m", path);
+        return -errno;
+    }
+
+    // TODO: This will only work if we *only* have VFIO_VGPU_IOMMU enabled
+
+    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
+    if (!group) {
+        error_report("vfio: failed to get group %d", groupid);
+        return -ENOENT;
+    }
+
+    snprintf(path, sizeof(path), "%s-0", vdev->vm_uuid);
+
+    ret = vfio_get_device(group, path, &vdev->vbasedev);
+    if (ret) {
+        error_report("vfio-vgpu: failed to get device %s", vdev->vgpu_type);
+        vfio_put_group(group);
+        return ret;
+    }
+
+    ret = vfio_vgpu_populate_device(vdev);
+    if (ret) {
+        return ret;
+    }
+
+    /* Get a copy of config space */
+    ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
+                MIN(pci_config_size(&vdev->pdev), vdev->config_size),
+                vdev->config_offset);
+    if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
+        ret = ret < 0 ? -errno : -EFAULT;
+        error_report("vfio: Failed to read device config space");
+        return ret;
+    }
+
+    vfio_vgpu_create_virtual_bars(vdev);
+
+    ret = vfio_vgpu_msi_init(vdev);
+    if (ret < 0) {
+        error_report("%s: Error setting MSI %d", __FUNCTION__, ret);
+        return ret;
+    }
+
+    if (vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
+        pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_vgpu_intx_update);
+        ret = vfio_vgpu_intx_enable(vdev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+
+static void vfio_vgpu_exitfn(PCIDevice *pdev)
+{
+
+
+}
+
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+    uint32_t val = 0;
+
+    ret = pread(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x %m", __func__, addr);
+        return 0xFFFFFFFF;
+    }
+
+    // memcpy(&vdev->emulated_config_bits + addr, &val, len);
+    return val;
+}
+
+static void vfio_vgpu_write_config(PCIDevice *pdev, uint32_t addr,
+                                  uint32_t val, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+
+    ret = pwrite(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x, val:0x%0x %m",
+                     __func__, addr, val);
+        return;
+    }
+
+    if (pdev->cap_present & QEMU_PCI_CAP_MSI &&
+        ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size)) {
+        int is_enabled, was_enabled = msi_enabled(pdev);
+
+        pci_default_write_config(pdev, addr, val, len);
+
+        is_enabled = msi_enabled(pdev);
+
+        if (!was_enabled) {
+            if (is_enabled) {
+                vfio_vgpu_msi_enable(vdev);
+            }
+        } else {
+            if (!is_enabled) {
+                vfio_vgpu_msi_disable(vdev);
+            } else {
+                vfio_vgpu_update_msi(vdev);
+            }
+        }
+    } else {
+        /* Write everything to QEMU to keep emulated bits correct */
+        pci_default_write_config(pdev, addr, val, len);
+    }
+
+    return;
+}
+
+static const VMStateDescription vfio_vgpu_vmstate = {
+    .name = TYPE_VFIO_VGPU,
+    .unmigratable = 1,
+};
+
+//
+// We don't actually need the vfio_vgpu_properties
+// as we can just simply rely on VM UUID to find
+// the IOMMU group for this VM
+//
+
+
+static Property vfio_vgpu_properties[] = {
+
+    DEFINE_PROP_STRING("vgpu", VFIOvGPUDevice, vgpu_type),
+    DEFINE_PROP_END_OF_LIST()
+};
+
+#if 0
+
+static void vfio_vgpu_instance_init(Object *obj)
+{
+
+}
+
+static void vfio_vgpu_instance_finalize(Object *obj)
+{
+
+
+}
+
+#endif
+
+static void vfio_vgpu_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+    // vgpudc->parent_realize = dc->realize;
+    // dc->realize = calxeda_xgmac_realize;
+    dc->desc = "VFIO-based vGPU";
+    dc->vmsd = &vfio_vgpu_vmstate;
+    dc->reset = vfio_vgpu_reset;
+    // dc->cannot_instantiate_with_device_add_yet = true; 
+    dc->props = vfio_vgpu_properties;
+    set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+    pdc->init = vfio_vgpu_initfn;
+    pdc->exit = vfio_vgpu_exitfn;
+    pdc->config_read = vfio_vgpu_read_config;
+    pdc->config_write = vfio_vgpu_write_config;
+    pdc->is_express = 0; /* For now, we are not */
+
+    pdc->vendor_id = PCI_DEVICE_ID_NVIDIA;
+    // pdc->device_id = 0x11B0;
+    pdc->class_id = PCI_CLASS_DISPLAY_VGA;
+}
+
+static const TypeInfo vfio_vgpu_dev_info = {
+    .name = TYPE_VFIO_VGPU,
+    .parent = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(VFIOvGPUDevice),
+    .class_init = vfio_vgpu_class_init,
+};
+
+static void register_vgpu_dev_type(void)
+{
+    type_register_static(&vfio_vgpu_dev_info);
+}
+
+type_init(register_vgpu_dev_type)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 379b6e1..9af5e17 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -64,6 +64,9 @@
 #define PCI_DEVICE_ID_VMWARE_IDE         0x1729
 #define PCI_DEVICE_ID_VMWARE_VMXNET3     0x07B0
 
+/* NVIDIA (0x10de) */
+#define PCI_DEVICE_ID_NVIDIA             0x10de
+
 /* Intel (0x8086) */
 #define PCI_DEVICE_ID_INTEL_82551IT      0x1209
 #define PCI_DEVICE_ID_INTEL_82557        0x1229
-- 
1.8.3.1



> 
> Jike will provide next level API definitions based on KVMGT requirement. 
> We can further refine it to match requirements of multi-vendors.
> 
> Thanks
> Kevin

^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
@ 2016-01-26 10:20                   ` Neo Jia
  0 siblings, 0 replies; 118+ messages in thread
From: Neo Jia @ 2016-01-26 10:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, kvm, igvt-g@lists.01.org, qemu-devel,
	Kirti Wankhede, Alex Williamson, Lv, Zhiyuan, Paolo Bonzini,
	Gerd Hoffmann

On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, January 26, 2016 5:30 AM
> > 
> > [cc +Neo @Nvidia]
> > 
> > Hi Jike,
> > 
> > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > I would expect we can spell out next level tasks toward above
> > > > direction, upon which Alex can easily judge whether there are
> > > > some common VFIO framework changes that he can help :-)
> > >
> > > Hi Alex,
> > >
> > > Here is a draft task list after a short discussion w/ Kevin,
> > > would you please have a look?
> > >
> > > 	Bus Driver
> > >
> > > 		{ in i915/vgt/xxx.c }
> > >
> > > 		- define a subset of vfio_pci interfaces
> > > 		- selective pass-through (say aperture)
> > > 		- trap MMIO: interface w/ QEMU
> > 
> > What's included in the subset?  Certainly the bus reset ioctls really
> > don't apply, but you'll need to support the full device interface,
> > right?  That includes the region info ioctl and access through the vfio
> > device file descriptor as well as the interrupt info and setup ioctls.
> 
> That is the next level detail Jike will figure out and discuss soon.
> 
> yes, basic region info/access should be necessary. For interrupt, could
> you elaborate a bit what current interface is doing? If just about creating
> an eventfd for virtual interrupt injection, it applies to vgpu too.
> 
> > 
> > > 	IOMMU
> > >
> > > 		{ in a new vfio_xxx.c }
> > >
> > > 		- allocate: struct device & IOMMU group
> > 
> > It seems like the vgpu instance management would do this.
> > 
> > > 		- map/unmap functions for vgpu
> > > 		- rb-tree to maintain iova/hpa mappings
> > 
> > Yep, pretty much what type1 does now, but without mapping through the
> > IOMMU API.  Essentially just a database of the current userspace
> > mappings that can be accessed for page pinning and IOVA->HPA
> > translation.
> 
> The thought is to reuse iommu_type1.c, by abstracting several underlying
> operations and then put vgpu specific implementation in a vfio_vgpu.c (e.g.
> for map/unmap instead of using IOMMU API, an iova/hpa mapping is updated
> accordingly), etc.
> 
> This file will also connect between VFIO and vendor specific vgpu driver,
> e.g. exposing interfaces to allow the latter querying iova<->hpa and also 
> creating necessary VFIO structures like aforementioned device/IOMMUas...
> 
> > 
> > > 		- interacts with kvmgt.c
> > >
> > >
> > > 	vgpu instance management
> > >
> > > 		{ in i915 }
> > >
> > > 		- path, create/destroy
> > >
> > 
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices create here rather than at the point where we start
> > doing vfio "stuff".
> 
> It's invoked here, but expecting the function exposed by vfio_vgpu.c. It's
> not good to touch vfio internal structures from another module (such as
> i915.ko)
> 
> > 
> > Nvidia has also been looking at this and has some ideas how we might
> > standardize on some of the interfaces and create a vgpu framework to
> > help share code between vendors and hopefully make a more consistent
> > userspace interface for libvirt as well.  I'll let Neo provide some
> > details.  Thanks,
> > 
> 
> Nice to know that. Neo, please share your thought here.

Hi Alex, Kevin and Jike,

(Seems I shouldn't use attachments; resending to the list with the patches
inline at the end)

Thanks for adding me to this technical discussion; it is a great opportunity
for us to design together and bring both the Intel and NVIDIA vGPU solutions
to the KVM platform.

Instead of directly jumping to the proposal that we have been working on
recently for NVIDIA vGPU on KVM, I think it is better for me to put out a
couple of quick comments / thoughts on the existing discussion in this thread,
as fundamentally I think we are solving the same problems: DMA, interrupts and
MMIO.

Then we can look at what we have, and hopefully we can reach consensus soon.

> Yes, and since you're creating and destroying the vgpu here, this is
> where I'd expect a struct device to be created and added to an IOMMU
> group.  The lifecycle management should really include links between
> the vGPU and physical GPU, which would be much, much easier to do with
> struct devices create here rather than at the point where we start
> doing vfio "stuff".

In fact, to keep vfio-vgpu generic, vgpu device creation and management can be
centralized and done in vfio-vgpu. That also includes adding the device to the
IOMMU group and VFIO group.

The graphics driver can register with vfio-vgpu to receive management and
emulation callbacks.

We already have struct vgpu_device in our proposal, which keeps a pointer to
the physical device.

> - vfio_pci will inject an IRQ to guest only when physical IRQ
> generated; whereas vfio_vgpu may inject an IRQ for emulation
> purpose. Anyway they can share the same injection interface;

The eventfd used to inject the interrupt is known to vfio-vgpu; that fd should
be made available to the graphics driver so that it can inject interrupts
directly when the physical device triggers an interrupt.
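
Below is a minimal sketch of that flow on the kernel side. The structure and
helper names (my_*) are hypothetical and only illustrate the idea; the real
kernel APIs used are eventfd_ctx_fdget() and eventfd_signal(), and the fd is
assumed to be the trigger eventfd QEMU handed down via the vgpu_set_irqs
callback:

#include <linux/err.h>
#include <linux/eventfd.h>

/* Hypothetical per-vgpu interrupt state; the real layout is vendor defined. */
struct my_vgpu_irq_state {
	struct eventfd_ctx *trigger;	/* saved from the vgpu_set_irqs() data */
};

/* Called from the vendor's vgpu_set_irqs() callback with the eventfd that
 * QEMU passed down via VFIO_DEVICE_SET_IRQS. */
static int my_vgpu_save_trigger(struct my_vgpu_irq_state *s, int32_t fd)
{
	struct eventfd_ctx *ctx = eventfd_ctx_fdget(fd);

	if (IS_ERR(ctx))
		return PTR_ERR(ctx);
	s->trigger = ctx;
	return 0;
}

/* Called from the physical device's ISR (or an emulation path) to inject
 * the virtual interrupt into the guest. */
static void my_vgpu_inject_irq(struct my_vgpu_irq_state *s)
{
	if (s->trigger)
		eventfd_signal(s->trigger, 1);
}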

Here is the proposal we have, please review.

Please note that the patches we have put out here are mainly for POC purposes,
to verify our understanding and also to reduce confusion and speed up our
design, although we are very happy to refine them into something that can
eventually be used by both parties and upstreamed.

Linux vGPU kernel design
==================================================================================

Here we are proposing a generic Linux kernel module based on VFIO framework
which allows different GPU vendors to plugin and provide their GPU virtualization
solution on KVM, the benefits of having such generic kernel module are:

1) Reuse QEMU VFIO driver, supporting VFIO UAPI

2) GPU HW agnostic management API for upper layer software such as libvirt

3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendors

0. High level overview
==================================================================================

 
  user space:
                                +-----------+  VFIO IOMMU IOCTLs
                      +---------| QEMU VFIO |-------------------------+
        VFIO IOCTLs   |         +-----------+                         |
                      |                                               | 
 ---------------------|-----------------------------------------------|---------
                      |                                               |
  kernel space:       |  +--->----------->---+  (callback)            V
                      |  |                   v                 +------V-----+
  +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
  |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
  | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
  |          |   |          |     | (register)           ^         ||
  +----------+   +-------+--+     |    +-----------+     |         ||
                         V        +----| i915.ko   +-----+     +---VV-------+ 
                         |             +-----^-----+           | TYPE1      |
                         |  (callback)       |                 | IOMMU      |
                         +-->------------>---+                 +------------+
 access flow:

  Guest MMIO / PCI config access
  |
  -------------------------------------------------
  |
  +-----> KVM VM_EXITs  (kernel)
          |
  -------------------------------------------------
          |
          +-----> QEMU VFIO driver (user)
                  | 
  -------------------------------------------------
                  |
                  +---->  VGPU kernel driver (kernel)
                          |  
                          | 
                          +----> vendor driver callback


1. VGPU management interface
==================================================================================

This is the interface that allows upper layer software (mostly libvirt) to
query and configure virtual GPU devices in a HW-agnostic fashion. This
management interface also gives the underlying GPU vendor the flexibility to
support virtual device hotplug, multiple virtual devices per VM, multiple
virtual devices from different physical devices, etc.

1.1 Under per-physical device sysfs:
----------------------------------------------------------------------------------

vgpu_supported_types - RO, lists the currently supported virtual GPU types and
their VGPU_IDs. A VGPU_ID is a vGPU type identifier returned from reads of
"vgpu_supported_types".

vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
gpu device on a target physical GPU. idx: virtual device index inside a VM

vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
target physical GPU

1.3 Under vgpu class sysfs:
----------------------------------------------------------------------------------

vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to commit virtual GPU resources for
the target VM.

Also, vgpu_start is a synchronous call; a successful return indicates that all
the requested vGPU resources have been fully committed, and the VMM should
continue.

vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
interface to notify the GPU vendor driver to release the virtual GPU resources
of the target VM.

1.4 Virtual device Hotplug
----------------------------------------------------------------------------------

To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
accessed during VM runtime, and the corresponding registration callback will
be invoked to allow the GPU vendor to support hotplug.

To support hotplug, the vendor driver would take the necessary action to
handle the situation when a vgpu_create is done on a VM_UUID after vgpu_start;
that implies both create and start for that vgpu device.

Similarly, vgpu_destroy implies a vgpu_shutdown on a running VM, but only if
the vendor driver supports vgpu hotplug.

If hotplug is not supported and the VM is still running, the vendor driver can
return an error code to indicate that it is not supported.

Separating create from start gives the flexibility to have:

- multiple vgpu instances for single VM and
- hotplug feature.

2. GPU driver vendor registration interface
==================================================================================

2.1 Registration interface definition (include/linux/vgpu.h)
----------------------------------------------------------------------------------

extern int vgpu_register_device(struct pci_dev *dev, 
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu
 *                              types.
 *                              @dev: pci device structure of physical GPU.
 *                              @config: should return string listing supported
 *                              config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resources in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which
 *                              the vgpu should be created
 *                              @vm_uuid: uuid of the VM for which the vgpu is
 *                              intended
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: uuid of the VM to which the vgpu
 *                              belongs.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If the VM is running and vgpu_destroy is
 *                              called, the vGPU is being hot-unplugged.
 *                              Return an error if the VM is running and the
 *                              graphics driver doesn't support vgpu hotplug.
 * @vgpu_start:                 Called to initiate the vGPU initialization
 *                              process in the graphics driver when the VM
 *                              boots, before qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to tear down vGPU-related resources for
 *                              the VM.
 *                              @vm_uuid: VM's UUID which is shutting down.
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number of bytes to read
 *                              @address_space: specifies which address space
 *                              the request is for: pci_config_space, IO
 *                              register space or MMIO space.
 *                              Returns number of bytes read on success or
 *                              error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number of bytes to be written
 *                              @address_space: specifies which address space
 *                              the request is for: pci_config_space, IO
 *                              register space or MMIO space.
 *                              Returns number of bytes written on success or
 *                              error.
 * @vgpu_set_irqs:              Called to pass along the interrupt
 *                              configuration information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data : same as
 *                              that of struct vfio_irq_set of
 *                              VFIO_DEVICE_SET_IRQS API. 
 *
 * A physical GPU that supports vGPU should be registered with the vgpu module
 * along with a gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

2.2 Details for callbacks we haven't mentioned above.
---------------------------------------------------------------------------------

vgpu_supported_config: allows the vendor driver to specify the supported vGPU
                       type/configuration

vgpu_create          : create a virtual GPU device, can be used for device hotplug.

vgpu_destroy         : destroy a virtual GPU device, can be used for device hotplug.

vgpu_start           : callback function to notify the vendor driver that the
                       vgpu device comes to life for a given virtual machine.

vgpu_shutdown        : callback function to notify the vendor driver to tear
                       down the vGPU resources for the VM that is shutting
                       down.
read                 : callback to vendor driver to handle virtual device config
                       space or MMIO read access

write                : callback to vendor driver to handle virtual device config
                       space or MMIO write access

vgpu_set_irqs        : callback to the vendor driver to pass along the
                       interrupt information for the target virtual device;
                       the vendor driver can then inject interrupts into the
                       virtual machine for this device.
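
To make the registration flow concrete, here is a minimal sketch of how a
vendor GPU driver might plug into this interface at probe time. All my_*
names are hypothetical; only vgpu_register_device()/vgpu_unregister_device()
and struct gpu_device_ops come from the proposal above:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/uuid.h>
#include <linux/vgpu.h>

static int my_vgpu_supported_config(struct pci_dev *dev, char *config)
{
	/* Report the vGPU types this physical GPU can host. */
	sprintf(config, "11:GRID M60-0B,12:GRID M60-0Q");
	return 0;
}

static int my_vgpu_create(struct pci_dev *dev, uuid_le vm_uuid,
			  uint32_t instance, uint32_t vgpu_id)
{
	/* Allocate framebuffer, channels, etc. for this vgpu instance. */
	return 0;
}

static const struct gpu_device_ops my_gpu_ops = {
	.owner			= THIS_MODULE,
	.vgpu_supported_config	= my_vgpu_supported_config,
	.vgpu_create		= my_vgpu_create,
	/* .vgpu_destroy, .vgpu_start, .vgpu_shutdown, .read, .write and
	 * .vgpu_set_irqs would be wired up the same way. */
};

static int my_gpu_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* ... normal device bring-up, then hand the GPU to the vgpu core ... */
	return vgpu_register_device(pdev, &my_gpu_ops);
}

static void my_gpu_remove(struct pci_dev *pdev)
{
	vgpu_unregister_device(pdev);
	/* ... normal teardown ... */
}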

2.3 Potential additional virtual device configuration registration interface:
---------------------------------------------------------------------------------

callback function to describe the MMAP behavior of the virtual GPU 

callback function to allow GPU vendor driver to provide PCI config space backing
memory.

3. VGPU TYPE1 IOMMU
==================================================================================

Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track
of <iova, hva, size, flag> and save the QEMU mm for later reference.

You can find the quick/ugly implementation in the attached patch file, which
is actually just a simplified version of Alex's type1 IOMMU without real
mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called.

We have thought about providing another vendor driver registration interface
so that such tracking information is sent to the vendor driver, which would
then use the QEMU mm to do the get_user_pages / remap_pfn_range when required.
After doing a quick implementation within our driver, I noticed the following
issues:

1) It pushes OS/VFIO logic into the vendor driver, which will be a maintenance
issue.

2) Every driver vendor has to implement its own RB tree, instead of reusing
the common existing VFIO code (vfio_find/link/unlink_dma).

3) IOMMU_UNMAP_DMA is expected to return the number of "unmapped bytes" to the
caller/QEMU; it is better not to have anything inside a vendor driver that the
VFIO caller immediately depends on.

Based on the above considerations, we decided to implement the DMA tracking
logic within the VGPU TYPE1 IOMMU code (ideally, this should be merged into
the current TYPE1 IOMMU code) and expose two symbols for MMIO mapping and for
page translation and pinning.

Also, with a mmap MMIO interface between virtual and physical, a
para-virtualized guest driver can access its virtual MMIO without taking a
mmap fault, and we can support different MMIO sizes between the virtual and
physical device.

int vgpu_map_virtual_bar
(
    uint64_t virt_bar_addr,
    uint64_t phys_bar_addr,
    uint32_t len,
    uint32_t flags
)

EXPORT_SYMBOL(vgpu_map_virtual_bar);

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)

EXPORT_SYMBOL(vgpu_dma_do_translate);
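
As a usage sketch (the caller below is hypothetical; only
vgpu_dma_do_translate() is from the proposal, and we assume it replaces each
guest frame number in the array with the pinned host address, per the
description above):

/* Translate and pin an array of guest frame numbers before programming
 * the DMA engine with the resulting host addresses. */
static int my_vgpu_pin_and_translate(dma_addr_t *gfn_buffer, uint32_t count)
{
	int ret = vgpu_dma_do_translate(gfn_buffer, count);

	if (ret)
		return ret;	/* translation/pinning failed */

	/* gfn_buffer[] now holds host addresses usable by the device. */
	return 0;
}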

There is still a lot to be added and modified, such as supporting multiple VMs
and multiple virtual devices, tracking the mapped / pinned regions within the
VGPU IOMMU kernel driver, error handling, roll-back, locked memory size per
user, etc.

4. Modules
==================================================================================

Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko

vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
                           TYPE1 v1 and v2 interface. 

vgpu.ko                  - provide registration interface and virtual device
                           VFIO access.

5. QEMU note
==================================================================================

To allow us to focus on prototyping the VGPU kernel driver, we have introduced
a new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
vfio/pci.c file and can use it as a reference for our implementation. It is
basically just a quick copy & paste from vfio/pci.c to meet our needs.

Once this proposal is finalized, we will move to vfio/pci.c instead of a new
class, and probably the only thing required is a new way to discover the
device.

6. Examples
==================================================================================

On this server, we have two NVIDIA M60 GPUs.

[root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

After nvidia.ko gets initialized, we can query the supported vGPU types by
accessing "vgpu_supported_types" as follows:

[root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
11:GRID M60-0B
12:GRID M60-0Q
13:GRID M60-1B
14:GRID M60-1Q
15:GRID M60-2B
16:GRID M60-2Q
17:GRID M60-4Q
18:GRID M60-8Q

For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
like to create a "GRID M60-4Q" vGPU for it:

echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create

Note: the number 0 here is the vGPU device index. So far the change has not
been tested with multiple vgpu devices, but we will support that.

At this moment, if you query "vgpu_supported_types" it will still show all
supported virtual GPU types, as no virtual GPU resources have been committed
yet.

Starting VM:

echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start

then, the supported vGPU type query will return:

[root@cjia-vgx-kvm /home/cjia]$
> cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
17:GRID M60-4Q

So vgpu_supported_config needs to be called whenever a new virtual device gets
created, as the underlying HW might limit the supported types if there are
any existing VMs running.

Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown
will inform the GPU vendor driver to clean up resources.

Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
device sysfs.

7. What is not covered:
==================================================================================

QEMU console VNC is not covered in this RFC, as it is a pretty isolated module
and does not impact the basic vGPU functionality; also, we already had a good
discussion about the new VFIO interface that Alex is going to introduce to
allow us to describe a region for the VM surface.

8 Patches
==================================================================================

0001-Add-VGPU-VFIO-driver-class-support-in-QEMU.patch - against QEMU 2.5.0

0001-Add-VGPU-and-its-TYPE1-IOMMU-kernel-module-support.patch  - against 4.4.0-rc5

Thanks,
Kirti and Neo

From dc8ca387f7b06c6dfc85fb4bd79a760dca76e831 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Tue, 26 Jan 2016 01:21:11 -0800
Subject: [PATCH] Add VGPU and its TYPE1 IOMMU kernel module support

This is just a quick POC implementation to allow GPU driver vendors to plug
into the VFIO framework to provide their virtual GPU support. This kernel
module provides a registration interface for GPU vendors and generic DMA
tracking APIs.

extern int vgpu_register_device(struct pci_dev *dev,
                                const struct gpu_device_ops *ops);

extern void vgpu_unregister_device(struct pci_dev *dev);

/**
 * struct gpu_device_ops - Structure to be registered for each physical GPU to
 * register the device to vgpu module.
 *
 * @owner:                      The module owner.
 * @vgpu_supported_config:      Called to get information about supported vgpu types.
 *                              @dev : pci device structure of physical GPU.
 *                              @config: should return string listing supported config
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_create:                Called to allocate basic resources in graphics
 *                              driver for a particular vgpu.
 *                              @dev: physical pci device structure on which vgpu
 *                                    should be created
 *                              @vm_uuid: uuid of the VM for which the vgpu is
 *                              intended
 *                              @instance: vgpu instance in that VM
 *                              @vgpu_id: This represents the type of vgpu to be
 *                                        created
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_destroy:               Called to free resources in graphics driver for
 *                              a vgpu instance of that VM.
 *                              @dev: physical pci device structure to which
 *                              this vgpu points to.
 *                              @vm_uuid: uuid of the VM to which the vgpu
 *                              belongs.
 *                              @instance: vgpu instance in that VM
 *                              Returns integer: success (0) or error (< 0)
 *                              If the VM is running and vgpu_destroy is called,
 *                              the vGPU is being hot-unplugged. Return an error
 *                              if the VM is running and the graphics driver
 *                              doesn't support vgpu hotplug.
 * @vgpu_start:                 Called to initiate the vGPU initialization
 *                              process in the graphics driver when the VM boots,
 *                              before qemu starts.
 *                              @vm_uuid: VM's UUID which is booting.
 *                              Returns integer: success (0) or error (< 0)
 * @vgpu_shutdown:              Called to tear down vGPU-related resources for
 *                              the VM.
 *                              @vm_uuid: VM's UUID which is shutting down.
 *                              Returns integer: success (0) or error (< 0)
 * @read:                       Read emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: read buffer
 *                              @count: number of bytes to read
 *                              @address_space: specifies for which address space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Retuns number on bytes read on success or error.
 * @write:                      Write emulation callback
 *                              @vdev: vgpu device structure
 *                              @buf: write buffer
 *                              @count: number bytes to be written
 *                              @address_space: specifies for which address space
 *                              the request is: pci_config_space, IO register
 *                              space or MMIO space.
 *                              Retuns number on bytes written on success or error.
 * @vgpu_set_irqs:              Called to send about interrupts configuration
 *                              information that qemu set.
 *                              @vdev: vgpu device structure
 *                              @flags, index, start, count and *data : same as
 *                              that of struct vfio_irq_set of
 *                              VFIO_DEVICE_SET_IRQS API.
 *
 * Physical GPU that support vGPU should be register with vgpu module with
 * gpu_device_ops structure.
 */

struct gpu_device_ops {
        struct module   *owner;
        int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance, uint32_t vgpu_id);
        int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
                                uint32_t instance);
        int     (*vgpu_start)(uuid_le vm_uuid);
        int     (*vgpu_shutdown)(uuid_le vm_uuid);
        ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                         uint32_t address_space, loff_t pos);
        int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
                                 unsigned index, unsigned start, unsigned count,
                                 void *data);

};

int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
                         uint32_t len, uint32_t flags)

int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
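
As a usage sketch, a vendor GPU driver would plug into this interface from
its PCI probe path roughly as below; the my_* names and callback bodies are
hypothetical, only the registration sequence comes from the API above:

static int my_vgpu_create(struct pci_dev *dev, uuid_le vm_uuid,
			  uint32_t instance, uint32_t vgpu_id)
{
	/* allocate per-vGPU state in the vendor driver */
	return 0;
}

static int my_vgpu_destroy(struct pci_dev *dev, uuid_le vm_uuid,
			   uint32_t instance)
{
	/* free per-vGPU state; fail if hot-unplug is unsupported */
	return 0;
}

static const struct gpu_device_ops my_gpu_ops = {
	.owner        = THIS_MODULE,
	.vgpu_create  = my_vgpu_create,
	.vgpu_destroy = my_vgpu_destroy,
	/* .read, .write, .vgpu_set_irqs etc. filled in likewise */
};

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* hand the physical GPU over to the vgpu core */
	return vgpu_register_device(pdev, &my_gpu_ops);
}

static void my_remove(struct pci_dev *pdev)
{
	vgpu_unregister_device(pdev);
}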

Change-Id: Ib70304d9a600c311d5107a94b3fffa938926275b
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/Kconfig                      |   2 +
 drivers/Makefile                     |   1 +
 drivers/vfio/vfio.c                  |   5 +-
 drivers/vgpu/Kconfig                 |  26 ++
 drivers/vgpu/Makefile                |   5 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c | 511 ++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_dev.c              | 550 +++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h          |  47 +++
 drivers/vgpu/vgpu_sysfs.c            | 322 ++++++++++++++++++++
 drivers/vgpu/vgpu_vfio.c             | 521 +++++++++++++++++++++++++++++++++
 include/linux/vgpu.h                 | 157 ++++++++++
 11 files changed, 2144 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c
 create mode 100644 drivers/vgpu/vgpu_dev.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 drivers/vgpu/vgpu_sysfs.c
 create mode 100644 drivers/vgpu/vgpu_vfio.c
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339de85f..5fd9eae79914 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca714bf..142256b4358b 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VGPU)              += vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b793cbcb..af3ab413e119 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -947,19 +947,18 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container,
 		if (IS_ERR(data)) {
 			ret = PTR_ERR(data);
 			module_put(driver->ops->owner);
-			goto skip_drivers_unlock;
+			continue;
 		}
 
 		ret = __vfio_container_attach_groups(container, driver, data);
 		if (!ret) {
 			container->iommu_driver = driver;
 			container->iommu_data = data;
+			goto skip_drivers_unlock;
 		} else {
 			driver->ops->release(data);
 			module_put(driver->ops->owner);
 		}
-
-		goto skip_drivers_unlock;
 	}
 
 	mutex_unlock(&vfio.iommu_drivers_lock);
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 000000000000..698ddf907a16
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,26 @@
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    select VFIO_IOMMU_TYPE1_VGPU
+    help
+        VGPU provides a framework to virtualize GPU without SR-IOV cap
+        See Documentation/vgpu.txt for more details.
+
+        If you don't know what to do here, say N.
+
+config VGPU
+    tristate
+    depends on VFIO
+    default n
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU 
+    default n
+
+config VFIO_IOMMU_TYPE1_VGPU
+    tristate
+    depends on VGPU_VFIO
+    default n
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 000000000000..098a3591a535
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,5 @@
+
+vgpu-y := vgpu_sysfs.o vgpu_dev.o vgpu_vfio.o
+
+obj-$(CONFIG_VGPU)	+= vgpu.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU) += vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 000000000000..6b20f1374b3b
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,511 @@
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
+
+// VFIO structures
+
+struct vfio_iommu_vgpu {
+	struct mutex lock;
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	struct rb_root dma_list;
+	struct mm_struct * vm_mm;
+};
+
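+/* One range registered via VFIO_IOMMU_MAP_DMA (iova -> QEMU vaddr),
+ * tracked in vfio_iommu_vgpu.dma_list and searched by iova. */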
+struct vgpu_vfio_dma {
+	struct rb_node node;
+	dma_addr_t iova;
+	unsigned long vaddr;
+	size_t size;
+	int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ *
+ */
+
+/*
+ * Duplicated from vfio_link_dma; just a quick hack, should be
+ * reworked to reuse the type1 code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+			  struct vgpu_vfio_dma *new)
+{
+	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+	struct vgpu_vfio_dma *dma;
+
+	while (*link) {
+		parent = *link;
+		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+		if (new->iova + new->size <= dma->iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+					   dma_addr_t start, size_t size)
+{
+	struct rb_node *node = iommu->dma_list.rb_node;
+
+	while (node) {
+		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
+
+		if (start + size <= dma->iova)
+			node = node->rb_left;
+		else if (start >= dma->iova + dma->size)
+			node = node->rb_right;
+		else
+			return dma;
+	}
+
+	return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
+{
+	rb_erase(&old->node, &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+	struct vgpu_vfio_dma *c, *n;
+	uint32_t i = 0;
+
+	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
+		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_map *map)
+{
+	dma_addr_t iova = map->iova;
+	unsigned long vaddr = map->vaddr;
+	int ret = 0, prot = 0;
+	struct vgpu_vfio_dma *vgpu_dma;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -EEXIST;
+	}
+
+	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
+
+	if (!vgpu_dma) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -ENOMEM;
+	}
+
+	vgpu_dma->iova = iova;
+	vgpu_dma->vaddr = vaddr;
+	vgpu_dma->prot = prot;
+	vgpu_dma->size = map->size;
+
+	vgpu_link_dma(vgpu_iommu, vgpu_dma);
+
+	mutex_unlock(&vgpu_iommu->lock);
+	return ret;
+}
+
+static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_unmap *unmap)
+{
+	struct vgpu_vfio_dma *vgpu_dma;
+	size_t unmapped = 0;
+	int ret = 0;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
+	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
+	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
+		unmapped += vgpu_dma->size;
+		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	unmap->size = unmapped;
+
+	return ret;
+}
+
+/* Ugly hack to quickly test single device ... */
+
+static struct vfio_iommu_vgpu *_local_iommu = NULL;
+
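+/*
+ * Remap the physical BAR at phys_bar_addr into the QEMU VMA that backs the
+ * guest-physical range at virt_bar_addr.  The range must already have been
+ * registered via VFIO_IOMMU_MAP_DMA so the iova->vaddr lookup below can
+ * find the backing VMA.
+ */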
+int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
+			 uint32_t len, uint32_t flags)
+{
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	unsigned long remote_vaddr = 0;
+	struct vgpu_vfio_dma *vgpu_dma = NULL;
+	struct vm_area_struct *remote_vma = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+	int ret = 0;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	down_write(&mm->mmap_sem);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, virt_bar_addr, len /*  size */);
+	if (!vgpu_dma) {
+		printk(KERN_INFO "%s: fail locate guest physical:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	remote_vaddr = vgpu_dma->vaddr + virt_bar_addr - vgpu_dma->iova;
+
+	remote_vma = find_vma(mm, remote_vaddr);
+
+	if (remote_vma == NULL) {
+		printk(KERN_INFO "%s: failed to locate vma, physical addr:0x%llx\n",
+		       __FUNCTION__, virt_bar_addr);
+		ret = -EINVAL;
+		goto unlock;
+	} else {
+		printk(KERN_INFO "%s: located vma, addr:0x%lx\n",
+		       __FUNCTION__, remote_vma->vm_start);
+	}
+
+	remote_vma->vm_page_prot = pgprot_noncached(remote_vma->vm_page_prot);
+
+	remote_vma->vm_pgoff = phys_bar_addr >> PAGE_SHIFT;
+
+	ret = remap_pfn_range(remote_vma, virt_bar_addr, remote_vma->vm_pgoff,
+			len, remote_vma->vm_page_prot);
+
+	if (ret) {
+		printk(KERN_INFO "%s: fail to remap vma:%d\n", __FUNCTION__, ret);
+		goto unlock;
+	}
+
+unlock:
+
+	up_write(&mm->mmap_sem);
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_map_virtual_bar);
+
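+/*
+ * Translate an array of guest page frame numbers into host PFNs: each GFN
+ * is looked up in the tracked iova->vaddr mappings, the backing page is
+ * pinned with get_user_pages_unlocked(), and the host PFN is written back
+ * into gfn_buffer in place.
+ */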
+int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
+{
+	int i = 0, ret = 0, prot = 0;
+	unsigned long remote_vaddr = 0, pfn = 0;
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	struct vgpu_vfio_dma *vgpu_dma;
+	struct page *page[1];
+	// unsigned long * addr = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+
+	prot = IOMMU_READ | IOMMU_WRITE;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	for (i = 0; i < count; i++) {
+		dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
+		vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
+
+		if (!vgpu_dma) {
+			printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova);
+			ret = -EINVAL;
+			goto unlock;
+		}
+
+		remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
+		printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n",
+			__FUNCTION__, i, vgpu_dma->iova,
+			vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
+
+		if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
+			pfn = page_to_pfn(page[0]);
+			printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn);
+			// addr = vmap(page, 1, VM_MAP, PAGE_KERNEL);
+		} else {
+			printk(KERN_INFO "%s: failed to pin pfn[%d]\n", __FUNCTION__, i);
+			ret = -ENOMEM;
+			goto unlock;
+		}
+
+		gfn_buffer[i] = pfn;
+		// vunmap(addr);
+
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+	return ret;
+}
+
+EXPORT_SYMBOL(vgpu_dma_do_translate);
+
+static void *vfio_iommu_vgpu_open(unsigned long arg)
+{
+	struct vfio_iommu_vgpu *iommu;
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&iommu->lock);
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	/* TODO: Keep track of v2 vs. v1; for now assume v2,
+	 * matching the QEMU code */
+	_local_iommu = iommu;
+	return iommu;
+}
+
+static void vfio_iommu_vgpu_release(void *iommu_data)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	kfree(iommu);
+	printk(KERN_INFO "%s", __FUNCTION__);
+}
+
+static long vfio_iommu_vgpu_ioctl(void *iommu_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
+
+	switch (cmd) {
+	case VFIO_CHECK_EXTENSION:
+	{
+		if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
+			return 1;
+		else
+			return 0;
+	}
+
+	case VFIO_IOMMU_GET_INFO:
+	{
+		struct vfio_iommu_type1_info info;
+		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_IOMMU_MAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_map map;
+		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n",
+			map.flags, map.vaddr, map.iova, map.size);
+
+		/*
+		 * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally,
+		 * this should be merged into one single file and reuse data
+		 * structure
+		 *
+		 */
+		ret = vgpu_dma_do_track(vgpu_iommu, &map);
+		break;
+	}
+	case VFIO_IOMMU_UNMAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz)
+			return -EINVAL;
+
+		ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
+		break;
+	}
+	default:
+	{
+		printk(KERN_INFO "%s cmd default ", __FUNCTION__);
+		ret = -ENOTTY;
+		break;
+	}
+	}
+
+	return ret;
+}
+
+
+static int vfio_iommu_vgpu_attach_group(void *iommu_data,
+		                        struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	vgpu_dev = get_vgpu_device_from_group(iommu_group);
+	if (vgpu_dev) {
+		iommu->vgpu_dev = vgpu_dev;
+		iommu->group = iommu_group;
+
+		/* IOMMU shares the same life cycle as VM MM */
+		iommu->vm_mm = current->mm;
+
+		printk(KERN_INFO "%s index %d", __FUNCTION__, vgpu_dev->minor);
+		return 0;
+	}
+	iommu->group = iommu_group;
+	return -ENODEV;
+}
+
+static void vfio_iommu_vgpu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+	iommu->vm_mm = NULL;
+	iommu->group = NULL;
+
+	return;
+}
+
+
+static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
+	.name           = "vgpu_vfio",
+	.owner          = THIS_MODULE,
+	.open           = vfio_iommu_vgpu_open,
+	.release        = vfio_iommu_vgpu_release,
+	.ioctl          = vfio_iommu_vgpu_ioctl,
+	.attach_group   = vfio_iommu_vgpu_attach_group,
+	.detach_group   = vfio_iommu_vgpu_detach_group,
+};
+
+
+int vgpu_vfio_iommu_init(void)
+{
+	int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc);
+	}
+
+	return rc;
+}
+
+void vgpu_vfio_iommu_exit(void)
+{
+	// unregister vgpu_vfio driver
+	vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+}
+
+
+module_init(vgpu_vfio_iommu_init);
+module_exit(vgpu_vfio_iommu_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
diff --git a/drivers/vgpu/vgpu_dev.c b/drivers/vgpu/vgpu_dev.c
new file mode 100644
index 000000000000..1d4eb235122c
--- /dev/null
+++ b/drivers/vgpu/vgpu_dev.c
@@ -0,0 +1,550 @@
+/*
+ * VGPU core
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME		"vgpu"
+
+#define VGPU_DEV_NAME		"vgpu"
+
+// TODO remove these defines
+// minor number reserved for control device
+#define VGPU_CONTROL_DEVICE       0
+
+#define VGPU_CONTROL_DEVICE_NAME  "vgpuctl"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+	dev_t               vgpu_devt;
+	struct class        *class;
+	struct cdev         vgpu_cdev;
+	struct list_head    vgpu_devices_list;  // Head entry for the doubly linked vgpu_device list
+	struct mutex        vgpu_devices_lock;
+	struct idr          vgpu_idr;
+	struct list_head    gpu_devices_list;
+	struct mutex        gpu_devices_lock;
+} vgpu;
+
+
+/*
+ * Function prototypes
+ */
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev);
+
+unsigned int vgpu_poll(struct file *file, poll_table *wait);
+long vgpu_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long i_arg);
+int vgpu_mmap(struct file *file, struct vm_area_struct *vma);
+
+int vgpu_open(struct inode *inode, struct file *file);
+int vgpu_close(struct inode *inode, struct file *file);
+ssize_t vgpu_read(struct file *file, char __user * buf,
+		      size_t len, loff_t * ppos);
+ssize_t vgpu_write(struct file *file, const char __user *data,
+		       size_t len, loff_t *ppos);
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->group) {
+			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
+				mutex_unlock(&vgpu.vgpu_devices_lock);
+				return vdev;
+			}
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev, *tmp;
+
+	if (!dev)
+		return -EINVAL;
+
+	gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+	if (!gpu_dev)
+		return -ENOMEM;
+
+	gpu_dev->dev = dev;
+	gpu_dev->ops = ops;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+
+	/* Check for duplicates */
+	list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+		if (tmp->dev == dev) {
+			mutex_unlock(&vgpu.gpu_devices_lock);
+			kfree(gpu_dev);
+			return -EINVAL;
+		}
+	}
+
+	ret = vgpu_create_pci_device_files(dev);
+	if (ret) {
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		kfree(gpu_dev);
+		return ret;
+	}
+	list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev == dev) {
+			printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n", dev->vendor, dev->device, dev->class);
+			vgpu_remove_pci_device_files(dev);
+			list_del(&gpu_dev->gpu_next);
+			mutex_unlock(&vgpu.gpu_devices_lock);
+			kfree(gpu_dev);
+			return;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+
+/*
+ *  Static functions
+ */
+
+static struct file_operations vgpu_fops = {
+	.owner          = THIS_MODULE,
+};
+
+static void  vgpu_device_destroy(struct vgpu_device *vgpu_dev)
+{
+	if (vgpu_dev->dev) {
+		device_destroy(vgpu.class, vgpu_dev->dev->devt);
+		vgpu_dev->dev = NULL;
+	}
+}
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
+{
+	struct vgpu_device *vgpu_dev = NULL;
+
+	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+	if (!vgpu_dev)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vgpu_dev->kref);
+	memcpy(&vgpu_dev->vm_uuid, &uuid, sizeof(uuid_le));
+	vgpu_dev->vgpu_instance = instance;
+	strcpy(vgpu_dev->dev_name, name);
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_del(&vgpu_dev->list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	kfree(vgpu_dev);
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+struct vgpu_device *find_vgpu_device(struct device *dev)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->dev == dev) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id)
+{
+	int minor;
+	char name[64];
+	int numChar = 0;
+	int retval = 0;
+
+	struct iommu_group *group = NULL;
+	struct device *dev = NULL;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	struct gpu_device *gpu_dev;
+
+	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+	numChar = sprintf(name, "%pUb-%d", vm_uuid.b, instance);
+	name[numChar] = '\0';
+
+	vgpu_dev = vgpu_device_alloc(vm_uuid, instance, name);
+	if (IS_ERR(vgpu_dev)) {
+		return PTR_ERR(vgpu_dev);
+	}
+
+	// check if VM device is present
+	// if not present, create with devt=0 and parent=NULL
+	// create device for instance with devt= MKDEV(vgpu.major, minor)
+	// and parent=VM device
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_dev->vgpu_id = vgpu_id;
+
+	// TODO on removing control device change the 3rd parameter to 0
+	minor = idr_alloc(&vgpu.vgpu_idr, vgpu_dev, 1, MINORMASK + 1, GFP_KERNEL);
+	if (minor < 0) {
+		retval = minor;
+		goto create_failed;
+	}
+
+	dev = device_create(vgpu.class, NULL, MKDEV(MAJOR(vgpu.vgpu_devt), minor), NULL, "%s", name);
+	if (IS_ERR(dev)) {
+		retval = PTR_ERR(dev);
+		goto create_failed1;
+	}
+
+	vgpu_dev->dev = dev;
+	vgpu_dev->minor = minor;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev == pdev) {
+			vgpu_dev->gpu_dev = gpu_dev;
+			if (gpu_dev->ops->vgpu_create) {
+				retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->vm_uuid,
+								   instance, vgpu_id);
+				if (retval)
+				{
+					mutex_unlock(&vgpu.gpu_devices_lock);
+					goto create_failed2;
+				}
+			}
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	if (!vgpu_dev->gpu_dev) {
+		retval = -EINVAL;
+		goto create_failed2;
+	}
+
+	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->vm_uuid.b);
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		printk(KERN_ERR "VGPU: failed to allocate group!\n");
+		retval = PTR_ERR(group);
+		goto create_failed2;
+	}
+
+	retval = iommu_group_add_device(group, dev);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+		iommu_group_put(group);
+		goto create_failed2;
+	}
+
+	retval = vgpu_group_init(vgpu_dev, group);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed vgpu_group_init \n");
+		iommu_group_put(group);
+		iommu_group_remove_device(dev);
+		goto create_failed2;
+	}
+
+	vgpu_dev->group = group;
+	printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return retval;
+
+create_failed2:
+	vgpu_device_destroy(vgpu_dev);
+
+create_failed1:
+	idr_remove(&vgpu.vgpu_idr, minor);
+
+create_failed:
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+
+	return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+	struct device *dev = vgpu_dev->dev;
+
+	if (!dev) {
+		return;
+	}
+
+	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+	if (vgpu_dev->gpu_dev->ops->vgpu_destroy) {
+		int retval = vgpu_dev->gpu_dev->ops->vgpu_destroy(vgpu_dev->gpu_dev->dev,
+								  vgpu_dev->vm_uuid,
+								  vgpu_dev->vgpu_instance);
+		/* if the vendor driver doesn't return success, it doesn't
+		 * support hot-unplug */
+		if (retval)
+			return;
+	}
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	vgpu_group_free(vgpu_dev);
+	iommu_group_put(dev->iommu_group);
+	iommu_group_remove_device(dev);
+	vgpu_device_destroy(vgpu_dev);
+	idr_remove(&vgpu.vgpu_idr, vgpu_dev->minor);
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	vgpu_device_free(vgpu_dev);
+}
+
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev, *vgpu_dev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+
+	// search VGPU device
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->vm_uuid, uuid) == 0) &&
+				(vdev->vgpu_instance == instance)) {
+			vgpu_dev = vdev;
+			break;
+		}
+	}
+
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	if (vgpu_dev)
+		destroy_vgpu_device(vgpu_dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (&gpu_dev->dev->dev == dev) {
+			if (gpu_dev->ops->vgpu_supported_config)
+				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_start)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_start(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_shutdown)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_shutdown(vgpu_dev->vm_uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data)
+{
+	int ret = 0;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (vgpu_dev->gpu_dev->ops->vgpu_set_irqs)
+		ret = vgpu_dev->gpu_dev->ops->vgpu_set_irqs(vgpu_dev, flags,
+							    index, start, count, data);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static struct class vgpu_class = {
+	.name		= VGPU_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= vgpu_class_attrs,
+	.dev_groups	= vgpu_dev_groups,
+	.devnode	= vgpu_devnode,
+};
+
+static int __init vgpu_init(void)
+{
+	int rc = 0;
+
+	memset(&vgpu, 0, sizeof(vgpu));
+
+	idr_init(&vgpu.vgpu_idr);
+	mutex_init(&vgpu.vgpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	// get major number from kernel
+	rc = alloc_chrdev_region(&vgpu.vgpu_devt, 0, MINORMASK, VGPU_DEV_NAME);
+
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu drv, err:%d\n", rc);
+		return rc;
+	}
+
+	cdev_init(&vgpu.vgpu_cdev, &vgpu_fops);
+	cdev_add(&vgpu.vgpu_cdev, vgpu.vgpu_devt, MINORMASK);
+
+	printk(KERN_INFO "major_number:%d is allocated for vgpu\n", MAJOR(vgpu.vgpu_devt));
+
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	vgpu.class = &vgpu_class;
+
+	return rc;
+
+failed1:
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	// TODO: Release all unclosed fd
+	struct vgpu_device *vdev = NULL, *tmp;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry_safe(vdev, tmp, &vgpu.vgpu_devices_list, list) {
+		printk(KERN_INFO "VGPU: exit destroying device %s ", vdev->dev_name);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		destroy_vgpu_device(vdev);
+		mutex_lock(&vgpu.vgpu_devices_lock);
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	idr_destroy(&vgpu.vgpu_idr);
+	cdev_del(&vgpu.vgpu_cdev);
+	unregister_chrdev_region(vgpu.vgpu_devt, MINORMASK);
+	class_destroy(vgpu.class);
+	vgpu.class = NULL;
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 000000000000..7e3c400d29f7
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,47 @@
+/*
+ * VGPU internal definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group * group);
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev);
+
+struct vgpu_device *find_vgpu_device(struct device *dev);
+
+struct vgpu_device *vgpu_drv_get_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le vm_uuid, uint32_t instance, uint32_t vgpu_id);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+void destroy_vgpu_device_by_uuid(uuid_le uuid, int instance);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int vgpu_create_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_notify_status_file(struct vgpu_device *vgpu_dev);
+void vgpu_remove_status_file(struct vgpu_device *vgpu_dev);
+
+int vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+int vgpu_set_irqs_callback(struct vgpu_device *vgpu_dev, uint32_t flags,
+                           unsigned index, unsigned start, unsigned count,
+                           void *data);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/drivers/vgpu/vgpu_sysfs.c b/drivers/vgpu/vgpu_sysfs.c
new file mode 100644
index 000000000000..e48cbcd6948d
--- /dev/null
+++ b/drivers/vgpu/vgpu_sysfs.c
@@ -0,0 +1,322 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < 36)
+		return -1;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			printk(KERN_ERR "%s err", __FUNCTION__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	char *str;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	get_vgpu_supported_types(dev, str);
+
+	n = sprintf(buf,"%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
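+/*
+ * Input format: "<VM-UUID>:<instance>:<vgpu_id>", e.g. (path and values
+ * illustrative):
+ *   echo "d2c4e1a0-8b3f-4f8e-9c2a-1b2c3d4e5f60:0:1" > \
+ *     /sys/bus/pci/devices/0000:01:00.0/vgpu_create
+ */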
+static ssize_t vgpu_create_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *instance_str, *str, *str_orig;
+	uuid_le vm_uuid;
+	uint32_t instance, vgpu_id;
+	struct pci_dev *pdev;
+	ssize_t ret = -EINVAL;
+
+	str_orig = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s vgpu type and instance not specified %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if ((instance_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty instance or string %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s vgpu type not specified %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+	vgpu_id = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+
+		if (create_vgpu_device(pdev, vm_uuid, instance, vgpu_id) < 0) {
+			printk(KERN_ERR "%s vgpu create error\n", __FUNCTION__);
+			goto out;
+		}
+		ret = count;
+	}
+
+out:
+	kfree(str_orig);
+	return ret;
+}
+
+static ssize_t vgpu_destroy_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+{
+	char *vm_uuid_str, *str, *str_orig;
+	uuid_le vm_uuid;
+	unsigned int instance;
+	ssize_t ret = -EINVAL;
+
+	str_orig = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((vm_uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s instance not specified %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	instance = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error %s\n", __FUNCTION__, buf);
+		goto out;
+	}
+
+	printk(KERN_INFO "%s UUID %pUb - %d\n", __FUNCTION__, vm_uuid.b, instance);
+
+	destroy_vgpu_device_by_uuid(vm_uuid, instance);
+	ret = count;
+
+out:
+	kfree(str_orig);
+	return ret;
+}
+
+static ssize_t
+vgpu_vm_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv)
+		return sprintf(buf, "%pUb \n", drv->vm_uuid.b);
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_vm_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = find_vgpu_device(dev);
+
+	if (drv && drv->group)
+		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+	&dev_attr_vgpu_vm_uuid.attr,
+	&dev_attr_vgpu_group_id.attr,
+	NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+	.attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+	&vgpu_dev_group,
+	NULL,
+};
+
+
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error %s\n", __FUNCTION__, buf);
+		kfree(vm_uuid_str);
+		return -EINVAL;
+	}
+	kfree(vm_uuid_str);
+
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_ONLINE);
+
+		ret = vgpu_start_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_start callback failed %d\n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+		const char *buf, size_t count)
+{
+	char *vm_uuid_str;
+	uuid_le vm_uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	vm_uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!vm_uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(vm_uuid_str, &vm_uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error %s\n", __FUNCTION__, buf);
+		kfree(vm_uuid_str);
+		return -EINVAL;
+	}
+	kfree(vm_uuid_str);
+
+	vgpu_dev = vgpu_drv_get_vgpu_device_by_uuid(vm_uuid, 0);
+
+	if (vgpu_dev && vgpu_dev->dev) {
+		kobject_uevent(&vgpu_dev->dev->kobj, KOBJ_OFFLINE);
+
+		ret = vgpu_shutdown_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_shutdown callback failed %d\n", __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+	__ATTR_WO(vgpu_start),
+	__ATTR_WO(vgpu_shutdown),
+	__ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
+
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 000000000000..ef0833140d84
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,521 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
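+
+/*
+ * Device fd offsets follow the vfio-pci convention: the upper bits of the
+ * file offset encode the region index, and the low 40 bits the offset
+ * within that region.
+ */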
+
+struct vfio_vgpu_device {
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+};
+
+static int vgpu_dev_open(void *device_data)
+{
+	printk(KERN_INFO "%s ", __FUNCTION__);
+	return 0;
+}
+
+static void vgpu_dev_close(void *device_data)
+{
+
+}
+
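+/*
+ * BAR sizes are hardcoded for this prototype; a complete implementation
+ * would query them from the vendor driver (see the TODO about configurable
+ * setups in vgpu_dev_unlocked_ioctl()).
+ */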
+static uint64_t resource_len(struct vgpu_device *vgpu_dev, int bar_index)
+{
+	uint64_t size = 0;
+
+	switch (bar_index) {
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		size = 16 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		size = 256 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR2_REGION_INDEX:
+		size = 32 * 1024 * 1024;
+		break;
+	case VFIO_PCI_BAR5_REGION_INDEX:
+		size = 128;
+		break;
+	default:
+		size = 0;
+		break;
+	}
+	return size;
+}
+
+static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
+{
+       return 1;
+}
+
+static long vgpu_dev_unlocked_ioctl(void *device_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd)
+	{
+		case VFIO_DEVICE_GET_INFO:
+		{
+			struct vfio_device_info info;
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index = %d", __FUNCTION__, vdev->vgpu_dev->minor);
+			minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			info.flags = VFIO_DEVICE_FLAGS_PCI;
+			info.num_regions = VFIO_PCI_NUM_REGIONS;
+			info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+		}
+
+		case VFIO_DEVICE_GET_REGION_INFO:
+		{
+			struct vfio_region_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd", __FUNCTION__);
+
+			minsz = offsetofend(struct vfio_region_info, offset);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_CONFIG_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0x100;     /* 256-byte standard PCI config space */
+					/* info.size = sizeof(vdev->vgpu_dev->config_space); */
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+							VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+				case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = resource_len(vdev->vgpu_dev, info.index);
+					if (!info.size) {
+						info.flags = 0;
+						break;
+					}
+
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+
+					if ((info.index == VFIO_PCI_BAR1_REGION_INDEX) ||
+					     (info.index == VFIO_PCI_BAR2_REGION_INDEX)) {
+						info.flags |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+					}
+
+					/* TODO: provides configurable setups to
+					 * GPU vendor
+					 */
+
+					if (info.index == VFIO_PCI_BAR1_REGION_INDEX)
+						info.flags = VFIO_REGION_INFO_FLAG_MMAP;
+
+					break;
+				case VFIO_PCI_VGA_REGION_INDEX:
+					info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+					info.size = 0xc0000;
+					info.flags = VFIO_REGION_INFO_FLAG_READ |
+						VFIO_REGION_INFO_FLAG_WRITE;
+					break;
+
+				case VFIO_PCI_ROM_REGION_INDEX:
+				default:
+					return -EINVAL;
+			}
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+
+		}
+		case VFIO_DEVICE_GET_IRQ_INFO:
+		{
+			struct vfio_irq_info info;
+
+			printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", __FUNCTION__);
+			minsz = offsetofend(struct vfio_irq_info, count);
+
+			if (copy_from_user(&info, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+				return -EINVAL;
+
+			switch (info.index) {
+				case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+				case VFIO_PCI_REQ_IRQ_INDEX:
+					break;
+					/* pass thru to return error */
+				default:
+					return -EINVAL;
+			}
+
+			info.flags = VFIO_IRQ_INFO_EVENTFD;
+			info.count = vgpu_get_irq_count(vdev, info.index);
+
+			if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+				info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+						VFIO_IRQ_INFO_AUTOMASKED);
+			else
+				info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+			return copy_to_user((void __user *)arg, &info, minsz);
+		}
+
+		case VFIO_DEVICE_SET_IRQS:
+		{
+			struct vfio_irq_set hdr;
+			u8 *data = NULL;
+			int ret = 0;
+
+			minsz = offsetofend(struct vfio_irq_set, count);
+
+			if (copy_from_user(&hdr, (void __user *)arg, minsz))
+				return -EFAULT;
+
+			if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+					hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+						VFIO_IRQ_SET_ACTION_TYPE_MASK))
+				return -EINVAL;
+
+			if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+				size_t size;
+				int max = vgpu_get_irq_count(vdev, hdr.index);
+
+				if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+					size = sizeof(uint8_t);
+				else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+					size = sizeof(int32_t);
+				else
+					return -EINVAL;
+
+				if (hdr.argsz - minsz < hdr.count * size ||
+				    hdr.start >= max || hdr.start + hdr.count > max)
+					return -EINVAL;
+
+				data = memdup_user((void __user *)(arg + minsz),
+						hdr.count * size);
+				if (IS_ERR(data))
+					return PTR_ERR(data);
+
+			}
+			ret = vgpu_set_irqs_callback(vdev->vgpu_dev, hdr.flags, hdr.index,
+					hdr.start, hdr.count, data);
+			kfree(data);
+
+
+			return ret;
+		}
+
+		default:
+			return -EINVAL;
+	}
+	return ret;
+}
+
+
+ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	int cfg_size = sizeof(vgpu_dev->config_space);
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= cfg_size || pos + count > cfg_size) {
+		printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto config_rw_exit;
+		}
+
+		/* FIXME: Need to save the BAR value properly */
+		switch (pos) {
+		case PCI_BASE_ADDRESS_0:
+			vgpu_dev->bar[0].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_1:
+			vgpu_dev->bar[1].start = *((uint32_t *)user_data);
+			break;
+		case PCI_BASE_ADDRESS_2:
+			vgpu_dev->bar[2].start = *((uint32_t *)user_data);
+			break;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_config,
+							    pos);
+		}
+
+		kfree(user_data);
+	} else {
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_config,
+							   pos);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+		}
+		kfree(ret_data);
+	}
+
+config_rw_exit:
+
+	return ret;
+}
+
+ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	uint64_t end;
+	int ret = 0;
+
+	if (!vgpu_dev->bar[bar_index].start) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	end = resource_len(vgpu_dev, bar_index);
+
+	if (offset >= end) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vgpu_dev->bar[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (vgpu_dev->gpu_dev->ops->write) {
+			ret = vgpu_dev->gpu_dev->ops->write(vgpu_dev,
+							    user_data,
+							    count,
+							    vgpu_emul_space_mmio,
+							    pos);
+		}
+
+		kfree(user_data);
+	} else {
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (vgpu_dev->gpu_dev->ops->read) {
+			ret = vgpu_dev->gpu_dev->ops->read(vgpu_dev,
+							   ret_data,
+							   count,
+							   vgpu_emul_space_mmio,
+							   pos);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+			}
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+			return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		case VFIO_PCI_VGA_REGION_INDEX:
+			break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, (char __user *)buf, count, ppos, true);
+
+	return ret;
+}
+
+/* Just create an invalid mapping without providing a fault handler */
+
+static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	printk(KERN_INFO "%s ", __FUNCTION__);
+	return 0;
+}
+
+static const struct vfio_device_ops vgpu_vfio_dev_ops = {
+	.name		= "vfio-vgpu-grp",
+	.open		= vgpu_dev_open,
+	.release	= vgpu_dev_close,
+	.ioctl		= vgpu_dev_unlocked_ioctl,
+	.read		= vgpu_dev_read,
+	.write		= vgpu_dev_write,
+	.mmap		= vgpu_dev_mmap,
+};
+
+int vgpu_group_init(struct vgpu_device *vgpu_dev, struct iommu_group *group)
+{
+	struct vfio_vgpu_device *vdev;
+	int ret = 0;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		return -ENOMEM;
+	}
+
+	vdev->group = group;
+	vdev->vgpu_dev = vgpu_dev;
+
+	ret = vfio_add_group_dev(vgpu_dev->dev, &vgpu_vfio_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	return ret;
+}
+
+
+int vgpu_group_free(struct vgpu_device *vgpu_dev)
+{
+	struct vfio_vgpu_device *vdev;
+
+	vdev = vfio_del_group_dev(vgpu_dev->dev);
+	if (!vdev)
+		return -1;
+
+	kfree(vdev);
+	return 0;
+}
+
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 000000000000..a2861c3f42e5
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,157 @@
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t end;
+	int flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0, /*!< PCI configuration space */
+	vgpu_emul_space_io = 1,     /*!< I/O register space */
+	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		*dev;
+	int minor;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			vm_uuid;
+	uint32_t		vgpu_instance;
+	uint32_t		vgpu_id;
+	atomic_t		usage_count;
+	char			config_space[0x100];          // 256-byte standard PCI config space
+	struct pci_bar_info	bar[VFIO_PCI_NUM_REGIONS];
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:			The module owner.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev : pci device structure of physical GPU.
+ *				@config: should return string listing supported config
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resouces in graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which vgpu
+ *				      should be created
+ *				@vm_uuid: uuid of the VM for which it is intended
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_id: This represents the type of vgpu to be
+ *					  created
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure to which
+ *				this vgpu points to.
+ *				@vm_uuid: uuid of the VM to which the vgpu belongs.
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If vgpu_destroy is called while the VM is
+ *				running, the vGPU is being hot-unplugged. Return
+ *				an error if the VM is running and the graphics
+ *				driver doesn't support vgpu hotplug.
+ * @vgpu_start:			Called to initiate the vGPU initialization
+ *				process in the graphics driver when the VM
+ *				boots, before qemu starts.
+ *				@vm_uuid: UUID of the VM which is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to tear down vGPU related resources for
+ *				the VM.
+ *				@vm_uuid: UUID of the VM which is shutting down.
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number of bytes to read
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Returns the number of bytes read on success, or an error.
+ * @write:			Write emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number of bytes to be written
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				Returns the number of bytes written on success, or an error.
+ * @vgpu_set_irqs:		Called to pass on the interrupt configuration
+ *				information that qemu has set.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data : same as
+ *				that of struct vfio_irq_set of
+ *				VFIO_DEVICE_SET_IRQS API.
+ *
+ * A physical GPU that supports vGPU should be registered with the vgpu
+ * module using a gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module   *owner;
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
+			       uint32_t instance, uint32_t vgpu_id);
+	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
+			        uint32_t instance);
+	int     (*vgpu_start)(uuid_le vm_uuid);
+	int     (*vgpu_shutdown)(uuid_le vm_uuid);
+	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev                  *dev;
+	const struct gpu_device_ops     *ops;
+	struct list_head                gpu_next;
+};
+
+extern int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr, uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
+
-- 
1.8.1.4
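
For illustration only, here is a minimal sketch -- not part of the patch -- of
how a vendor GPU driver might hook into the gpu_device_ops interface defined
above. All my_* names are hypothetical placeholders and error handling is
trimmed:

    #include <linux/module.h>
    #include <linux/pci.h>
    #include <linux/uuid.h>
    #include <linux/vgpu.h>

    static int my_vgpu_create(struct pci_dev *dev, uuid_le vm_uuid,
                              uint32_t instance, uint32_t vgpu_id)
    {
            /* carve out physical GPU resources for this vgpu instance */
            return 0;
    }

    static int my_vgpu_destroy(struct pci_dev *dev, uuid_le vm_uuid,
                               uint32_t instance)
    {
            /* release whatever my_vgpu_create() allocated */
            return 0;
    }

    static const struct gpu_device_ops my_gpu_ops = {
            .owner        = THIS_MODULE,
            .vgpu_create  = my_vgpu_create,
            .vgpu_destroy = my_vgpu_destroy,
            /* .read, .write, .vgpu_set_irqs, etc. omitted for brevity */
    };

    /* called from the vendor driver's PCI probe routine */
    static int my_register(struct pci_dev *pdev)
    {
            return vgpu_register_device(pdev, &my_gpu_ops);
    }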

>From 380156ade7053664bdb318af0659708357f40050 Mon Sep 17 00:00:00 2001
From: Neo Jia <cjia@nvidia.com>
Date: Sun, 24 Jan 2016 11:24:13 -0800
Subject: [PATCH] Add VGPU VFIO driver class support in QEMU

This is just a quick POC change to allow us to experiment with the VGPU VFIO
support; the next step is to merge this into the current vfio/pci.c, which
currently assumes physical backing devices.

Within the current POC implementation, we have copied & pasted many functions
directly from the vfio/pci.c code; we should merge them together later.

    - Basic MMIO and PCI config access is supported

    - MMAP'ed GPU BAR is supported

    - INTx and MSI using eventfd are supported; we don't think we should
      support interrupts when vector->kvm_interrupt is not enabled.

Change-Id: I99c34ac44524cd4d7d2abbcc4d43634297b96e80

Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 hw/vfio/Makefile.objs |   1 +
 hw/vfio/vgpu.c        | 991 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci.h  |   3 +
 3 files changed, 995 insertions(+)
 create mode 100644 hw/vfio/vgpu.c

diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
index d324863..17f2ef1 100644
--- a/hw/vfio/Makefile.objs
+++ b/hw/vfio/Makefile.objs
@@ -1,6 +1,7 @@
 ifeq ($(CONFIG_LINUX), y)
 obj-$(CONFIG_SOFTMMU) += common.o
 obj-$(CONFIG_PCI) += pci.o pci-quirks.o
+obj-$(CONFIG_PCI) += vgpu.o
 obj-$(CONFIG_SOFTMMU) += platform.o
 obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
 endif
diff --git a/hw/vfio/vgpu.c b/hw/vfio/vgpu.c
new file mode 100644
index 0000000..56ebce0
--- /dev/null
+++ b/hw/vfio/vgpu.c
@@ -0,0 +1,991 @@
+/*
+ * vGPU VFIO device
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <dirent.h>
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "config.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/pci.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+#include "qemu/queue.h"
+#include "qemu/range.h"
+#include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
+#include "trace.h"
+#include "hw/vfio/vfio.h"
+#include "hw/vfio/pci.h"
+#include "hw/vfio/vfio-common.h"
+#include "qmp-commands.h"
+
+#define TYPE_VFIO_VGPU "vfio-vgpu"
+
+typedef struct VFIOvGPUDevice {
+    PCIDevice pdev;
+    VFIODevice vbasedev;
+    VFIOINTx intx;
+    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
+    uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
+    unsigned int config_size;
+    char  *vgpu_type;
+    char *vm_uuid;
+    off_t config_offset; /* Offset of config space region within device fd */
+    int msi_cap_size;
+    EventNotifier req_notifier;
+    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
+    int interrupt; /* Current interrupt type */
+    VFIOMSIVector *msi_vectors;
+} VFIOvGPUDevice;
+
+/*
+ * Local functions
+ */
+
+// function prototypes
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev);
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len);
+
+
+// INTx functions
+
+static void vfio_vgpu_intx_interrupt(void *opaque)
+{
+    VFIOvGPUDevice *vdev = opaque;
+
+    if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) {
+        return;
+    }
+
+    vdev->intx.pending = true;
+    pci_irq_assert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, false);
+
+}
+
+static void vfio_vgpu_intx_eoi(VFIODevice *vbasedev)
+{
+    VFIOvGPUDevice *vdev = container_of(vbasedev, VFIOvGPUDevice, vbasedev);
+
+    if (!vdev->intx.pending) {
+        return;
+    }
+
+    trace_vfio_intx_eoi(vbasedev->name);
+
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+    vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+}
+
+static void vfio_vgpu_intx_enable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_RESAMPLE,
+    };
+    struct vfio_irq_set *irq_set;
+    int ret, argsz;
+    int32_t *pfd;
+
+    if (!kvm_irqfds_enabled() ||
+        vdev->intx.route.mode != PCI_INTX_ENABLED ||
+        !kvm_resamplefds_enabled()) {
+        return;
+    }
+
+    /* Get to a known interrupt state */
+    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Get an eventfd for resample/unmask */
+    if (event_notifier_init(&vdev->intx.unmask, 0)) {
+        error_report("vfio: Error: event_notifier_init failed eoi");
+        goto fail;
+    }
+
+    /* KVM triggers it, VFIO listens for it */
+    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
+
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to setup resample irqfd: %m");
+        goto fail_irqfd;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = irqfd.resamplefd;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx unmask fd: %m");
+        goto fail_vfio;
+    }
+
+    /* Let'em rip */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    vdev->intx.kvm_accel = true;
+
+    trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
+
+    return;
+
+fail_vfio:
+    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
+    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+fail_irqfd:
+    event_notifier_cleanup(&vdev->intx.unmask);
+fail:
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+#endif
+}
+
+static void vfio_vgpu_intx_disable_kvm(VFIOvGPUDevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_DEASSIGN,
+    };
+
+    if (!vdev->intx.kvm_accel) {
+        return;
+    }
+
+    /*
+     * Get to a known state, hardware masked, QEMU ready to accept new
+     * interrupts, QEMU IRQ de-asserted.
+     */
+    vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+
+    /* Tell KVM to stop listening for an INTx irqfd */
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to disable INTx irqfd: %m");
+    }
+
+    /* We only need to close the eventfd for VFIO to cleanup the kernel side */
+    event_notifier_cleanup(&vdev->intx.unmask);
+
+    /* QEMU starts listening for interrupt events. */
+    qemu_set_fd_handler(irqfd.fd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    vdev->intx.kvm_accel = false;
+
+    /* If we've missed an event, let it re-fire through QEMU */
+    vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    trace_vfio_intx_disable_kvm(vdev->vbasedev.name);
+#endif
+}
+
+static void vfio_vgpu_intx_update(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    PCIINTxRoute route;
+
+    if (vdev->interrupt != VFIO_INT_INTx) {
+        return;
+    }
+
+    route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin);
+
+    if (!pci_intx_route_changed(&vdev->intx.route, &route)) {
+        return; /* Nothing changed */
+    }
+
+    trace_vfio_intx_update(vdev->vbasedev.name,
+                           vdev->intx.route.irq, route.irq);
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+
+    vdev->intx.route = route;
+
+    if (route.mode != PCI_INTX_ENABLED) {
+        return;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    /* Re-enable the interrupt in case we missed an EOI */
+    vfio_vgpu_intx_eoi(&vdev->vbasedev);
+}
+
+static int vfio_vgpu_intx_enable(VFIOvGPUDevice *vdev)
+{
+    uint8_t pin = vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
+    int ret, argsz;
+    struct vfio_irq_set *irq_set;
+    int32_t *pfd;
+
+    if (!pin) {
+        return 0;
+    }
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
+    pci_config_set_interrupt_pin(vdev->pdev.config, pin);
+
+#ifdef CONFIG_KVM
+    /*
+     * Only conditional to avoid generating error messages on platforms
+     * where we won't actually use the result anyway.
+     */
+    if (kvm_irqfds_enabled() && kvm_resamplefds_enabled()) {
+        vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,
+                                                        vdev->intx.pin);
+    }
+#endif
+
+    ret = event_notifier_init(&vdev->intx.interrupt, 0);
+    if (ret) {
+        error_report("vfio: Error: event_notifier_init failed");
+        return ret;
+    }
+
+    argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = 1;
+    pfd = (int32_t *)&irq_set->data;
+
+    *pfd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(*pfd, vfio_vgpu_intx_interrupt, NULL, vdev);
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+    g_free(irq_set);
+    if (ret) {
+        error_report("vfio: Error: Failed to setup INTx fd: %m");
+        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
+        event_notifier_cleanup(&vdev->intx.interrupt);
+        return -errno;
+    }
+
+    vfio_vgpu_intx_enable_kvm(vdev);
+
+    vdev->interrupt = VFIO_INT_INTx;
+
+    trace_vfio_intx_enable(vdev->vbasedev.name);
+
+    return 0;
+}
+
+static void vfio_vgpu_intx_disable(VFIOvGPUDevice *vdev)
+{
+    int fd;
+
+    vfio_vgpu_intx_disable_kvm(vdev);
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+    vdev->intx.pending = false;
+    pci_irq_deassert(&vdev->pdev);
+//    vfio_mmap_set_enabled(vdev, true);
+
+    fd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(fd, NULL, NULL, vdev);
+    event_notifier_cleanup(&vdev->intx.interrupt);
+
+    vdev->interrupt = VFIO_INT_NONE;
+
+    trace_vfio_intx_disable(vdev->vbasedev.name);
+}
+
+// MSI functions
+static void vfio_vgpu_remove_kvm_msi_virq(VFIOMSIVector *vector)
+{
+    kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                          vector->virq);
+    kvm_irqchip_release_virq(kvm_state, vector->virq);
+    vector->virq = -1;
+    event_notifier_cleanup(&vector->kvm_interrupt);
+}
+
+static void vfio_vgpu_msi_disable_common(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        if (vdev->msi_vectors[i].use) {
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+    }
+
+    g_free(vdev->msi_vectors);
+    vdev->msi_vectors = NULL;
+    vdev->nr_vectors = 0;
+    vdev->interrupt = VFIO_INT_NONE;
+
+    vfio_vgpu_intx_enable(vdev);
+}
+
+static void vfio_vgpu_msi_disable(VFIOvGPUDevice *vdev)
+{
+    vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSI_IRQ_INDEX);
+    vfio_vgpu_msi_disable_common(vdev);
+}
+
+static void vfio_vgpu_disable_interrupts(VFIOvGPUDevice *vdev)
+{
+
+    if (vdev->interrupt == VFIO_INT_MSI) {
+        vfio_vgpu_msi_disable(vdev);
+    }
+
+    if (vdev->interrupt == VFIO_INT_INTx) {
+        vfio_vgpu_intx_disable(vdev);
+    }
+}
+
+
+static void vfio_vgpu_msi_interrupt(void *opaque)
+{
+    VFIOMSIVector *vector = opaque;
+    VFIOvGPUDevice *vdev = (VFIOvGPUDevice *)vector->vdev;
+    MSIMessage (*get_msg)(PCIDevice *dev, unsigned vector);
+    void (*notify)(PCIDevice *dev, unsigned vector);
+    MSIMessage msg;
+    int nr = vector - vdev->msi_vectors;
+
+    if (!event_notifier_test_and_clear(&vector->interrupt)) {
+        return;
+    }
+
+    if (vdev->interrupt == VFIO_INT_MSIX) {
+        get_msg = msix_get_message;
+        notify = msix_notify;
+    } else if (vdev->interrupt == VFIO_INT_MSI) {
+        get_msg = msi_get_message;
+        notify = msi_notify;
+    } else {
+        abort();
+    }
+
+    msg = get_msg(&vdev->pdev, nr);
+    trace_vfio_msi_interrupt(vdev->vbasedev.name, nr, msg.address, msg.data);
+    notify(&vdev->pdev, nr);
+}
+
+static int vfio_vgpu_enable_vectors(VFIOvGPUDevice *vdev, bool msix)
+{
+    struct vfio_irq_set *irq_set;
+    int ret = 0, i, argsz;
+    int32_t *fds;
+
+    argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds));
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = msix ? VFIO_PCI_MSIX_IRQ_INDEX : VFIO_PCI_MSI_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = vdev->nr_vectors;
+    fds = (int32_t *)&irq_set->data;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        int fd = -1;
+
+        /*
+         * MSI vs MSI-X - The guest has direct access to MSI mask and pending
+         * bits, therefore we always use the KVM signaling path when setup.
+         * MSI-X mask and pending bits are emulated, so we want to use the
+         * KVM signaling path only when configured and unmasked.
+         */
+        if (vdev->msi_vectors[i].use) {
+            if (vdev->msi_vectors[i].virq < 0 ||
+                (msix && msix_is_masked(&vdev->pdev, i))) {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
+            } else {
+                fd = event_notifier_get_fd(&vdev->msi_vectors[i].kvm_interrupt);
+            }
+        }
+
+        fds[i] = fd;
+    }
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+
+    g_free(irq_set);
+
+    return ret;
+}
+
+static void vfio_vgpu_add_kvm_msi_virq(VFIOvGPUDevice *vdev, VFIOMSIVector *vector,
+                                  MSIMessage *msg, bool msix)
+{
+    int virq;
+
+    if (!msg) {
+        return;
+    }
+
+    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+        return;
+    }
+
+    virq = kvm_irqchip_add_msi_route(kvm_state, *msg, &vdev->pdev);
+    if (virq < 0) {
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
+                                       NULL, virq) < 0) {
+        kvm_irqchip_release_virq(kvm_state, virq);
+        event_notifier_cleanup(&vector->kvm_interrupt);
+        return;
+    }
+
+    vector->virq = virq;
+}
+
+static void vfio_vgpu_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
+                                     PCIDevice *pdev)
+{
+    kvm_irqchip_update_msi_route(kvm_state, vector->virq, msg, pdev);
+}
+
+
+static void vfio_vgpu_msi_enable(VFIOvGPUDevice *vdev)
+{
+    int ret, i;
+
+    vfio_vgpu_disable_interrupts(vdev);
+
+    vdev->nr_vectors = msi_nr_vectors_allocated(&vdev->pdev);
+retry:
+    vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->nr_vectors);
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg = msi_get_message(&vdev->pdev, i);
+
+        vector->vdev = (VFIOPCIDevice *)vdev;
+        vector->virq = -1;
+        vector->use = true;
+
+        if (event_notifier_init(&vector->interrupt, 0)) {
+            error_report("vfio: Error: event_notifier_init failed");
+        }
+        qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                            vfio_vgpu_msi_interrupt, NULL, vector);
+
+        /*
+         * Attempt to enable route through KVM irqchip,
+         * default to userspace handling if unavailable.
+         */
+        vfio_vgpu_add_kvm_msi_virq(vdev, vector, &msg, false);
+    }
+
+    /* Set interrupt type prior to possible interrupts */
+    vdev->interrupt = VFIO_INT_MSI;
+
+    ret = vfio_vgpu_enable_vectors(vdev, false);
+    if (ret) {
+        if (ret < 0) {
+            error_report("vfio: Error: Failed to setup MSI fds: %m");
+        } else if (ret != vdev->nr_vectors) {
+            error_report("vfio: Error: Failed to enable %d "
+                         "MSI vectors, retry with %d", vdev->nr_vectors, ret);
+        }
+
+        for (i = 0; i < vdev->nr_vectors; i++) {
+            VFIOMSIVector *vector = &vdev->msi_vectors[i];
+            if (vector->virq >= 0) {
+                vfio_vgpu_remove_kvm_msi_virq(vector);
+            }
+            qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
+                                NULL, NULL, NULL);
+            event_notifier_cleanup(&vector->interrupt);
+        }
+
+        g_free(vdev->msi_vectors);
+
+        if (ret > 0 && ret != vdev->nr_vectors) {
+            vdev->nr_vectors = ret;
+            goto retry;
+        }
+        vdev->nr_vectors = 0;
+
+        /*
+         * Failing to setup MSI doesn't really fall within any specification.
+         * Let's try leaving interrupts disabled and hope the guest figures
+         * out to fall back to INTx for this device.
+         */
+        error_report("vfio: Error: Failed to enable MSI");
+        vdev->interrupt = VFIO_INT_NONE;
+
+        return;
+    }
+}
+
+static void vfio_vgpu_update_msi(VFIOvGPUDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        MSIMessage msg;
+
+        if (!vector->use || vector->virq < 0) {
+            continue;
+        }
+
+        msg = msi_get_message(&vdev->pdev, i);
+        vfio_vgpu_update_kvm_msi_virq(vector, msg, &vdev->pdev);
+    }
+}
+
+static int vfio_vgpu_msi_setup(VFIOvGPUDevice *vdev, int pos)
+{
+    uint16_t ctrl;
+    bool msi_64bit, msi_maskbit;
+    int ret, entries;
+
+    if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
+              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
+        return -errno;
+    }
+    ctrl = le16_to_cpu(ctrl);
+
+    msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
+    msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
+    entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
+
+    ret = msi_init(&vdev->pdev, pos, entries, msi_64bit, msi_maskbit);
+    if (ret < 0) {
+        if (ret == -ENOTSUP) {
+            return 0;
+        }
+        error_report("vfio: msi_init failed");
+        return ret;
+    }
+    vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 0);
+
+    return 0;
+}
+
+
+static int vfio_vgpu_msi_init(VFIOvGPUDevice *vdev)
+{
+    uint8_t pos;
+    int ret;
+
+    pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSI);
+    if (!pos) {
+        return 0;
+    }
+
+    ret = vfio_vgpu_msi_setup(vdev, pos);
+    if (ret < 0) {
+        error_report("vgpu: Error setting MSI@0x%x: %d", pos, ret);
+        return ret;
+    }
+
+    return 0;
+}
+
+/*
+ * VGPU device class functions
+ */
+
+static void vfio_vgpu_reset(DeviceState *dev)
+{
+
+
+}
+
+static void vfio_vgpu_eoi(VFIODevice *vbasedev)
+{
+    return;
+}
+
+static int vfio_vgpu_hot_reset_multi(VFIODevice *vbasedev)
+{
+    /* Nothing to be reset */
+    return 0;
+}
+
+static void vfio_vgpu_compute_needs_reset(VFIODevice *vbasedev)
+{
+    vbasedev->needs_reset = false;
+}
+
+static VFIODeviceOps vfio_vgpu_ops = {
+    .vfio_compute_needs_reset = vfio_vgpu_compute_needs_reset,
+    .vfio_hot_reset_multi = vfio_vgpu_hot_reset_multi,
+    .vfio_eoi = vfio_vgpu_eoi,
+};
+
+static int vfio_vgpu_populate_device(VFIOvGPUDevice *vdev)
+{
+    VFIODevice *vbasedev = &vdev->vbasedev;
+    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+    int i, ret = -1;
+
+    for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
+        reg_info.index = i;
+
+        ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        if (ret) {
+            error_report("vfio: Error getting region %d info: %m", i);
+            return ret;
+        }
+
+        trace_vfio_populate_device_region(vbasedev->name, i,
+                                          (unsigned long)reg_info.size,
+                                          (unsigned long)reg_info.offset,
+                                          (unsigned long)reg_info.flags);
+
+        vdev->bars[i].region.vbasedev = vbasedev;
+        vdev->bars[i].region.flags = reg_info.flags;
+        vdev->bars[i].region.size = reg_info.size;
+        vdev->bars[i].region.fd_offset = reg_info.offset;
+        vdev->bars[i].region.nr = i;
+        QLIST_INIT(&vdev->bars[i].quirks);
+    }
+
+    reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
+
+    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+    if (ret) {
+        error_report("vfio: Error getting config info: %m");
+        return ret;
+    }
+
+    vdev->config_size = reg_info.size;
+    if (vdev->config_size == PCI_CONFIG_SPACE_SIZE) {
+        vdev->pdev.cap_present &= ~QEMU_PCI_CAP_EXPRESS;
+    }
+    vdev->config_offset = reg_info.offset;
+
+    return 0;
+}
+
+static void vfio_vgpu_create_virtual_bar(VFIOvGPUDevice *vdev, int nr)
+{
+    VFIOBAR *bar = &vdev->bars[nr];
+    uint64_t size = bar->region.size;
+    char name[64];
+    uint32_t pci_bar;
+    uint8_t type;
+    int ret;
+
+    /* Skip both unimplemented BARs and the upper half of 64bit BARS. */
+    if (!size) {
+        return;
+    }
+
+    /* Determine what type of BAR this is for registration */
+    ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
+                vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
+    if (ret != sizeof(pci_bar)) {
+        error_report("vfio: Failed to read BAR %d (%m)", nr);
+        return;
+    }
+
+    pci_bar = le32_to_cpu(pci_bar);
+    bar->ioport = (pci_bar & PCI_BASE_ADDRESS_SPACE_IO);
+    bar->mem64 = bar->ioport ? 0 : (pci_bar & PCI_BASE_ADDRESS_MEM_TYPE_64);
+    type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
+                                    ~PCI_BASE_ADDRESS_MEM_MASK);
+
+    /* A "slow" read/write mapping underlies all BARs */
+    memory_region_init_io(&bar->region.mem, OBJECT(vdev), &vfio_region_ops,
+                          bar, name, size);
+    pci_register_bar(&vdev->pdev, nr, type, &bar->region.mem);
+
+    // Create an invalid BAR1 mapping
+    if (bar->region.flags & VFIO_REGION_INFO_FLAG_MMAP) {
+        strncat(name, " mmap", sizeof(name) - strlen(name) - 1);
+        vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem,
+                         &bar->region.mmap_mem, &bar->region.mmap,
+                         size, 0, name);
+    }
+}
+
+static void vfio_vgpu_create_virtual_bars(VFIOvGPUDevice *vdev)
+{
+
+    int i = 0;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        vfio_vgpu_create_virtual_bar(vdev, i);
+    }
+}
+
+static int vfio_vgpu_initfn(PCIDevice *pdev)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    VFIOGroup *group;
+    ssize_t len;
+    int groupid;
+    struct stat st;
+    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
+    int ret;
+    UuidInfo *uuid_info;
+
+    uuid_info = qmp_query_uuid(NULL);
+    if (strcmp(uuid_info->UUID, UUID_NONE) == 0) {
+        return -EINVAL;
+    } else {
+        vdev->vm_uuid = uuid_info->UUID;
+    }
+
+
+    snprintf(path, sizeof(path), 
+             "/sys/devices/virtual/vgpu/%s-0/", vdev->vm_uuid);
+
+    if (stat(path, &st) < 0) {
+        error_report("vfio-vgpu: error: no such vgpu device: %s", path);
+        return -errno;
+    } 
+
+    vdev->vbasedev.ops = &vfio_vgpu_ops;
+
+    vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
+    vdev->vbasedev.name = g_strdup_printf("%s-0", vdev->vm_uuid);
+
+    strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);
+
+    len = readlink(path, iommu_group_path, sizeof(path));
+    if (len <= 0 || len >= sizeof(path)) {
+        error_report("vfio-vgpu: error no iommu_group for device");
+        return len < 0 ? -errno : -ENAMETOOLONG;
+    }
+
+    iommu_group_path[len] = 0;
+    group_name = basename(iommu_group_path);
+
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_report("vfio-vgpu: error reading %s: %m", path);
+        return -errno;
+    }
+
+    // TODO: This will only work if we *only* have VFIO_VGPU_IOMMU enabled
+
+    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
+    if (!group) {
+        error_report("vfio: failed to get group %d", groupid);
+        return -ENOENT;
+    }
+
+    snprintf(path, sizeof(path), "%s-0", vdev->vm_uuid);
+
+    ret = vfio_get_device(group, path, &vdev->vbasedev);
+    if (ret) {
+        error_report("vfio-vgpu; failed to get device %s", vdev->vgpu_type);
+        vfio_put_group(group);
+        return ret;
+    }
+
+    ret = vfio_vgpu_populate_device(vdev);
+    if (ret) {
+        return ret;
+    }
+
+    /* Get a copy of config space */
+    ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
+                MIN(pci_config_size(&vdev->pdev), vdev->config_size),
+                vdev->config_offset);
+    if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
+        ret = ret < 0 ? -errno : -EFAULT;
+        error_report("vfio: Failed to read device config space");
+        return ret;
+    }
+
+    vfio_vgpu_create_virtual_bars(vdev);
+
+    ret = vfio_vgpu_msi_init(vdev);
+    if (ret < 0) {
+        error_report("%s: Error setting MSI %d", __FUNCTION__, ret);
+        return ret;
+    }
+
+    if (vfio_vgpu_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
+        pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_vgpu_intx_update);
+        ret = vfio_vgpu_intx_enable(vdev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+
+static void vfio_vgpu_exitfn(PCIDevice *pdev)
+{
+
+
+}
+
+static uint32_t vfio_vgpu_read_config(PCIDevice *pdev, uint32_t addr, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+    uint32_t val = 0;
+
+    ret = pread(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x %m", __func__, addr);
+        return 0xFFFFFFFF;
+    }
+
+    // memcpy(&vdev->emulated_config_bits + addr, &val, len);
+    return val;
+}
+
+static void vfio_vgpu_write_config(PCIDevice *pdev, uint32_t addr,
+                                  uint32_t val, int len)
+{
+    VFIOvGPUDevice *vdev = DO_UPCAST(VFIOvGPUDevice, pdev, pdev);
+    ssize_t ret;
+
+    ret = pwrite(vdev->vbasedev.fd, &val, len, vdev->config_offset + addr);
+
+    if (ret != len) {
+        error_report("%s: failed at offset:0x%0x, val:0x%0x %m",
+                     __func__, addr, val);
+        return;
+    }
+
+    if (pdev->cap_present & QEMU_PCI_CAP_MSI &&
+        ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size)) {
+        int is_enabled, was_enabled = msi_enabled(pdev);
+
+        pci_default_write_config(pdev, addr, val, len);
+
+        is_enabled = msi_enabled(pdev);
+
+        if (!was_enabled) {
+            if (is_enabled) {
+                vfio_vgpu_msi_enable(vdev);
+            }
+        } else {
+            if (!is_enabled) {
+                vfio_vgpu_msi_disable(vdev);
+            } else {
+                vfio_vgpu_update_msi(vdev);
+            }
+        }
+    } else {
+        /* Write everything to QEMU to keep emulated bits correct */
+        pci_default_write_config(pdev, addr, val, len);
+    }
+}
+
+static const VMStateDescription vfio_vgpu_vmstate = {
+    .name = TYPE_VFIO_VGPU,
+    .unmigratable = 1,
+};
+
+/*
+ * We don't actually need the vfio_vgpu_properties,
+ * as we can simply rely on the VM UUID to find
+ * the IOMMU group for this VM.
+ */
+
+
+static Property vfio_vgpu_properties[] = {
+
+    DEFINE_PROP_STRING("vgpu", VFIOvGPUDevice, vgpu_type),
+    DEFINE_PROP_END_OF_LIST()
+};
+
+#if 0
+
+static void vfio_vgpu_instance_init(Object *obj)
+{
+
+}
+
+static void vfio_vgpu_instance_finalize(Object *obj)
+{
+
+
+}
+
+#endif
+
+static void vfio_vgpu_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+    // vgpudc->parent_realize = dc->realize;
+    // dc->realize = calxeda_xgmac_realize;
+    dc->desc = "VFIO-based vGPU";
+    dc->vmsd = &vfio_vgpu_vmstate;
+    dc->reset = vfio_vgpu_reset;
+    // dc->cannot_instantiate_with_device_add_yet = true; 
+    dc->props = vfio_vgpu_properties;
+    set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+    pdc->init = vfio_vgpu_initfn;
+    pdc->exit = vfio_vgpu_exitfn;
+    pdc->config_read = vfio_vgpu_read_config;
+    pdc->config_write = vfio_vgpu_write_config;
+    pdc->is_express = 0; /* For now, we are not */
+
+    pdc->vendor_id = PCI_VENDOR_ID_NVIDIA;
+    // pdc->device_id = 0x11B0;
+    pdc->class_id = PCI_CLASS_DISPLAY_VGA;
+}
+
+static const TypeInfo vfio_vgpu_dev_info = {
+    .name = TYPE_VFIO_VGPU,
+    .parent = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(VFIOvGPUDevice),
+    .class_init = vfio_vgpu_class_init,
+};
+
+static void register_vgpu_dev_type(void)
+{
+    type_register_static(&vfio_vgpu_dev_info);
+}
+
+type_init(register_vgpu_dev_type)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 379b6e1..9af5e17 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -64,6 +64,9 @@
 #define PCI_DEVICE_ID_VMWARE_IDE         0x1729
 #define PCI_DEVICE_ID_VMWARE_VMXNET3     0x07B0
 
+/* NVIDIA (0x10de) */
+#define PCI_VENDOR_ID_NVIDIA             0x10de
+
 /* Intel (0x8086) */
 #define PCI_DEVICE_ID_INTEL_82551IT      0x1209
 #define PCI_DEVICE_ID_INTEL_82557        0x1229
-- 
1.8.3.1
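
For reference, a hypothetical invocation of the device added above -- untested,
and assuming a vendor-defined vgpu type string exposed through the "vgpu"
property:

    qemu-system-x86_64 ... -device vfio-vgpu,vgpu=<vendor-vgpu-type>

(Note the patch's own comment that the property may ultimately be unnecessary,
since the VM UUID alone is enough to locate the IOMMU group.)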



> 
> Jike will provide next level API definitions based on KVMGT requirement. 
> We can further refine it to match requirements of multi-vendors.
> 
> Thanks
> Kevin

^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26  7:41                 ` [Qemu-devel] " Jike Song
@ 2016-01-26 14:05                   ` Yang Zhang
  -1 siblings, 0 replies; 118+ messages in thread
From: Yang Zhang @ 2016-01-26 14:05 UTC (permalink / raw)
  To: Jike Song, Alex Williamson
  Cc: Tian, Kevin, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org, Neo Jia

On 2016/1/26 15:41, Jike Song wrote:
> On 01/26/2016 05:30 AM, Alex Williamson wrote:
>> [cc +Neo @Nvidia]
>>
>> Hi Jike,
>>
>> On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
>>> On 01/20/2016 05:05 PM, Tian, Kevin wrote:
>>>> I would expect we can spell out next level tasks toward above
>>>> direction, upon which Alex can easily judge whether there are
>>>> some common VFIO framework changes that he can help :-)
>>>
>>> Hi Alex,
>>>
>>> Here is a draft task list after a short discussion w/ Kevin,
>>> would you please have a look?
>>>
>>> 	Bus Driver
>>>
>>> 		{ in i915/vgt/xxx.c }
>>>
>>> 		- define a subset of vfio_pci interfaces
>>> 		- selective pass-through (say aperture)
>>> 		- trap MMIO: interface w/ QEMU
>>
>> What's included in the subset?  Certainly the bus reset ioctls really
>> don't apply, but you'll need to support the full device interface,
>> right?  That includes the region info ioctl and access through the vfio
>> device file descriptor as well as the interrupt info and setup ioctls.
>>
>
> [All interfaces I thought are via ioctl:)  For other stuff like file
> descriptor we'll definitely keep it.]
>
> The list of ioctl commands provided by vfio_pci:
>
> 	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> 	- VFIO_DEVICE_PCI_HOT_RESET
>
> As you said, above 2 don't apply. But for this:
>
> 	- VFIO_DEVICE_RESET
>
> In my opinion it should be kept, no matter what will be provided in
> the bus driver.
>
> 	- VFIO_PCI_ROM_REGION_INDEX
> 	- VFIO_PCI_VGA_REGION_INDEX
>
> I suppose the above 2 don't apply either? For a vgpu we don't provide a
> ROM BAR or VGA region.
>
> 	- VFIO_DEVICE_GET_INFO
> 	- VFIO_DEVICE_GET_REGION_INFO
> 	- VFIO_DEVICE_GET_IRQ_INFO
> 	- VFIO_DEVICE_SET_IRQS
>
> Above 4 are needed of course.
>
> We will need to extend:
>
> 	- VFIO_DEVICE_GET_REGION_INFO
>
>
> a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> should be trapped instead of being mmap-ed.

I may not have the full context, but I am curious how DONT_MAP would be
handled in the vfio driver. Since no real MMIO is mapped into the region, I
suppose accesses to the region should be handled by the vgpu code in the i915
driver, but currently most of the MMIO accesses are handled by QEMU.


>
> b) adding other information. For example, for the OpRegion, QEMU needs
> to do more than mmap a region, it has to:
>
> 	- allocate a region
> 	- copy contents from somewhere in host to that region
> 	- mmap it to guest
>
>
> I remember you already have a prototype for this?
>
>
>>> 	IOMMU
>>>
>>> 		{ in a new vfio_xxx.c }
>>>
>>> 		- allocate: struct device & IOMMU group
>>
>> It seems like the vgpu instance management would do this.
>>
>
> Yes, it can be removed from here.
>
>>> 		- map/unmap functions for vgpu
>>> 		- rb-tree to maintain iova/hpa mappings
>>
>> Yep, pretty much what type1 does now, but without mapping through the
>> IOMMU API.  Essentially just a database of the current userspace
>> mappings that can be accessed for page pinning and IOVA->HPA
>> translation.
>>
>
> Yes.
>
>>> 		- interacts with kvmgt.c
>>>
>>>
>>> 	vgpu instance management
>>>
>>> 		{ in i915 }
>>>
>>> 		- path, create/destroy
>>>
>>
>> Yes, and since you're creating and destroying the vgpu here, this is
>> where I'd expect a struct device to be created and added to an IOMMU
>> group.  The lifecycle management should really include links between
>> the vGPU and physical GPU, which would be much, much easier to do with
>> struct devices create here rather than at the point where we start
>> doing vfio "stuff".
>>
>
> Yes, just like the SRIOV does.
>
>
>> Nvidia has also been looking at this and has some ideas how we might
>> standardize on some of the interfaces and create a vgpu framework to
>> help share code between vendors and hopefully make a more consistent
>> userspace interface for libvirt as well.  I'll let Neo provide some
>> details.  Thanks,
>
> Good to know that, so we can possibly cooperate on some common part,
> e.g. the instance management :)
>
>>
>> Alex
>>
>
> --
> Thanks,
> Jike
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26  7:41                 ` [Qemu-devel] " Jike Song
@ 2016-01-26 16:12                   ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-26 16:12 UTC (permalink / raw)
  To: Jike Song
  Cc: Tian, Kevin, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org, Neo Jia

On Tue, 2016-01-26 at 15:41 +0800, Jike Song wrote:
> On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > [cc +Neo @Nvidia]
> > 
> > Hi Jike,
> > 
> > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > I would expect we can spell out next level tasks toward above
> > > > direction, upon which Alex can easily judge whether there are
> > > > some common VFIO framework changes that he can help :-)
> > > 
> > > Hi Alex,
> > > 
> > > Here is a draft task list after a short discussion w/ Kevin,
> > > would you please have a look?
> > > 
> > > 	Bus Driver
> > > 
> > > 		{ in i915/vgt/xxx.c }
> > > 
> > > 		- define a subset of vfio_pci interfaces
> > > 		- selective pass-through (say aperture)
> > > 		- trap MMIO: interface w/ QEMU
> > 
> > What's included in the subset?  Certainly the bus reset ioctls really
> > don't apply, but you'll need to support the full device interface,
> > right?  That includes the region info ioctl and access through the vfio
> > device file descriptor as well as the interrupt info and setup ioctls.
> > 
> 
> [All interfaces I thought are via ioctl:)  For other stuff like file
> descriptor we'll definitely keep it.]
> 
> The list of ioctl commands provided by vfio_pci:
> 
> 	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> 	- VFIO_DEVICE_PCI_HOT_RESET
> 
> As you said, above 2 don't apply. But for this:
> 
> 	- VFIO_DEVICE_RESET
> 
> In my opinion it should be kept, no matter what will be provided in
> the bus driver.

Yes, the DEVICE_INFO ioctl describes whether it's present, I would
encourage implementing it.

> 	- VFIO_PCI_ROM_REGION_INDEX
> 	- VFIO_PCI_VGA_REGION_INDEX
> 
> I suppose the above 2 don't apply either? For a vgpu we don't provide a
> ROM BAR or VGA region.

Right, these aren't ioctls, just indexes into the REGION_INFO ioctl,
they're optional.

> 	- VFIO_DEVICE_GET_INFO
> 	- VFIO_DEVICE_GET_REGION_INFO
> 	- VFIO_DEVICE_GET_IRQ_INFO
> 	- VFIO_DEVICE_SET_IRQS
> 
> Above 4 are needed of course.
> 
> We will need to extend:
> 
> 	- VFIO_DEVICE_GET_REGION_INFO
> 
> 
> a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> should be trapped instead of being mmap-ed.

There's already an MMAP flag, mmap is only allowed when this is set, so
there's no need for the anti-flag.  I'm also working on support for
sparse mmap capabilities so that within a region some portions can
support mmap.

> b) adding other information. For example, for the OpRegion, QEMU needs
> to do more than mmap a region, it has to:
> 
> 	- allocate a region
> 	- copy contents from somewhere in host to that region
> 	- mmap it to guest
> 
> 
> I remember you already have a prototype for this?

Yes, I'm working on this currently; it will be a device-specific region
and QEMU can either copy the contents to a new buffer in guest memory
or provide trapped access to the host opregion.  I thought vgpus
weren't going to need opregions though, I figured it was more for GVT-d
support.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 118+ messages in thread
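
To make the DEVICE_INFO point above concrete, here is a small userspace
sketch -- illustrative only, assuming an already-open VFIO device fd -- that
probes whether reset is implemented before issuing VFIO_DEVICE_RESET:

    #include <linux/vfio.h>
    #include <sys/ioctl.h>

    static int maybe_reset(int device_fd)
    {
        struct vfio_device_info info = { .argsz = sizeof(info) };

        if (ioctl(device_fd, VFIO_DEVICE_GET_INFO, &info))
            return -1;

        /* the flags advertise whether VFIO_DEVICE_RESET is supported */
        if (info.flags & VFIO_DEVICE_FLAGS_RESET)
            return ioctl(device_fd, VFIO_DEVICE_RESET);

        return 0;
    }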

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 14:05                   ` [Qemu-devel] " Yang Zhang
@ 2016-01-26 16:37                     ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-26 16:37 UTC (permalink / raw)
  To: Yang Zhang, Jike Song
  Cc: Tian, Kevin, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org, Neo Jia

On Tue, 2016-01-26 at 22:05 +0800, Yang Zhang wrote:
> On 2016/1/26 15:41, Jike Song wrote:
> > On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > > [cc +Neo @Nvidia]
> > > 
> > > Hi Jike,
> > > 
> > > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > > I would expect we can spell out next level tasks toward above
> > > > > direction, upon which Alex can easily judge whether there are
> > > > > some common VFIO framework changes that he can help :-)
> > > > 
> > > > Hi Alex,
> > > > 
> > > > Here is a draft task list after a short discussion w/ Kevin,
> > > > would you please have a look?
> > > > 
> > > > 	Bus Driver
> > > > 
> > > > 		{ in i915/vgt/xxx.c }
> > > > 
> > > > 		- define a subset of vfio_pci interfaces
> > > > 		- selective pass-through (say aperture)
> > > > 		- trap MMIO: interface w/ QEMU
> > > 
> > > What's included in the subset?  Certainly the bus reset ioctls really
> > > don't apply, but you'll need to support the full device interface,
> > > right?  That includes the region info ioctl and access through the vfio
> > > device file descriptor as well as the interrupt info and setup ioctls.
> > > 
> > 
> > [All interfaces I thought are via ioctl:)  For other stuff like file
> > descriptor we'll definitely keep it.]
> > 
> > The list of ioctl commands provided by vfio_pci:
> > 
> > 	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > 	- VFIO_DEVICE_PCI_HOT_RESET
> > 
> > As you said, above 2 don't apply. But for this:
> > 
> > 	- VFIO_DEVICE_RESET
> > 
> > In my opinion it should be kept, no matter what will be provided in
> > the bus driver.
> > 
> > 	- VFIO_PCI_ROM_REGION_INDEX
> > 	- VFIO_PCI_VGA_REGION_INDEX
> > 
> > I suppose the above 2 don't apply either? For a vgpu we don't provide a
> > ROM BAR or VGA region.
> > 
> > 	- VFIO_DEVICE_GET_INFO
> > 	- VFIO_DEVICE_GET_REGION_INFO
> > 	- VFIO_DEVICE_GET_IRQ_INFO
> > 	- VFIO_DEVICE_SET_IRQS
> > 
> > Above 4 are needed of course.
> > 
> > We will need to extend:
> > 
> > 	- VFIO_DEVICE_GET_REGION_INFO
> > 
> > 
> > a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> > should be trapped instead of being mmap-ed.
> 
> I may not in the context, but i am curious how to handle the DONT_MAP in 
> vfio driver? Since there are no real MMIO maps into the region and i 
> suppose the access to the region should be handled by vgpu in i915 
> driver, but currently most of the mmio accesses are handled by Qemu.

VFIO supports the following region attributes:

#define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
#define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
#define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */

If MMAP is not set, then the QEMU driver will do pread and/or pwrite to
the specified offsets of the device file descriptor, depending on what
accesses are supported.  This is all reported through the REGION_INFO
ioctl for a given index.  If mmap is supported, the VM will have direct
access to the area, without faulting to KVM other than to populate the
mapping.  Without mmap support, a VM MMIO access traps into KVM, which
returns out to QEMU to service the request, which then finds the
MemoryRegion serviced through vfio, which will then perform a
pread/pwrite through to the kernel vfio bus driver to handle the
access.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 118+ messages in thread
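
To make the mmap-vs-trap behavior above concrete, here is a userspace
sketch -- illustrative only, with a hypothetical helper name and an
already-open VFIO device fd assumed -- that consumes REGION_INFO the way a
QEMU-like driver would: mmap when the flag allows it, otherwise fall back
to pread through the device file descriptor:

    #include <linux/vfio.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static int read_bar0_dword(int device_fd, uint32_t *val)
    {
        struct vfio_region_info info = {
            .argsz = sizeof(info),
            .index = VFIO_PCI_BAR0_REGION_INDEX,
        };

        if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
            return -1;

        if (info.flags & VFIO_REGION_INFO_FLAG_MMAP) {
            /* direct mapping: guest accesses need not trap per access */
            void *map = mmap(NULL, info.size, PROT_READ, MAP_SHARED,
                             device_fd, info.offset);
            if (map == MAP_FAILED)
                return -1;
            *val = *(volatile uint32_t *)map;
            return munmap(map, info.size);
        }

        if (info.flags & VFIO_REGION_INFO_FLAG_READ) {
            /* trapped path: serviced by the kernel vfio bus driver */
            return pread(device_fd, val, sizeof(*val), info.offset) ==
                   sizeof(*val) ? 0 : -1;
        }

        return -1;
    }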

* RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 10:20                   ` [Qemu-devel] " Neo Jia
@ 2016-01-26 19:24                     ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-26 19:24 UTC (permalink / raw)
  To: Neo Jia
  Cc: Alex Williamson, Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Kirti Wankhede

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Tuesday, January 26, 2016 6:21 PM
> 
> 0. High level overview
> ==================================================================================
> 
> 
>   user space:
>                                 +-----------+  VFIO IOMMU IOCTLs
>                       +---------| QEMU VFIO |-------------------------+
>         VFIO IOCTLs   |         +-----------+                         |
>                       |                                               |
>  ---------------------|-----------------------------------------------|---------
>                       |                                               |
>   kernel space:       |  +--->----------->---+  (callback)            V
>                       |  |                   v                 +------V-----+
>   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
>   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
>   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+
>   |          |   |          |     | (register)           ^         ||
>   +----------+   +-------+--+     |    +-----------+     |         ||
>                          V        +----| i915.ko   +-----+     +---VV-------+
>                          |             +-----^-----+           | TYPE1      |
>                          |  (callback)       |                 | IOMMU      |
>                          +-->------------>---+                 +------------+
>  access flow:
> 
>   Guest MMIO / PCI config access
>   |
>   -------------------------------------------------
>   |
>   +-----> KVM VM_EXITs  (kernel)
>           |
>   -------------------------------------------------
>           |
>           +-----> QEMU VFIO driver (user)
>                   |
>   -------------------------------------------------
>                   |
>                   +---->  VGPU kernel driver (kernel)
>                           |
>                           |
>                           +----> vendor driver callback
> 
> 

There is one difference between the nvidia and intel implementations. We have
the vgpu device model in the kernel, as part of i915.ko, so I/O emulation
requests are forwarded directly on the kernel side.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 19:24                     ` [Qemu-devel] " Tian, Kevin
@ 2016-01-26 19:29                       ` Neo Jia
  -1 siblings, 0 replies; 118+ messages in thread
From: Neo Jia @ 2016-01-26 19:29 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Kirti Wankhede

On Tue, Jan 26, 2016 at 07:24:52PM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Tuesday, January 26, 2016 6:21 PM
> > 
> > 0. High level overview
> > ==================================================================================
> > 
> > 
> >   user space:
> >                                 +-----------+  VFIO IOMMU IOCTLs
> >                       +---------| QEMU VFIO |-------------------------+
> >         VFIO IOCTLs   |         +-----------+                         |
> >                       |                                               |
> >  ---------------------|-----------------------------------------------|---------
> >                       |                                               |
> >   kernel space:       |  +--->----------->---+  (callback)            V
> >                       |  |                   v                 +------V-----+
> >   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
> >   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
> >   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+
> >   |          |   |          |     | (register)           ^         ||
> >   +----------+   +-------+--+     |    +-----------+     |         ||
> >                          V        +----| i915.ko   +-----+     +---VV-------+
> >                          |             +-----^-----+           | TYPE1      |
> >                          |  (callback)       |                 | IOMMU      |
> >                          +-->------------>---+                 +------------+
> >  access flow:
> > 
> >   Guest MMIO / PCI config access
> >   |
> >   -------------------------------------------------
> >   |
> >   +-----> KVM VM_EXITs  (kernel)
> >           |
> >   -------------------------------------------------
> >           |
> >           +-----> QEMU VFIO driver (user)
> >                   |
> >   -------------------------------------------------
> >                   |
> >                   +---->  VGPU kernel driver (kernel)
> >                           |
> >                           |
> >                           +----> vendor driver callback
> > 
> > 
> 
> There is one difference between nvidia and intel implementations. We have
> vgpu device model in kernel, as part of i915.ko. So I/O emulation requests
> are forwarded directly in kernel side. 

Hi Kevin,

With the vendor driver callback, it will always be forwarded to the kernel
driver. If you are talking about the QEMU VFIO driver (user) part I put in the
above diagram, that is how QEMU VFIO handles MMIO or PCI config accesses today;
we don't change anything there in this design.

Thanks,
Neo


> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 10:20                   ` [Qemu-devel] " Neo Jia
@ 2016-01-26 20:06                     ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-26 20:06 UTC (permalink / raw)
  To: Neo Jia, Tian, Kevin
  Cc: Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org, Kirti Wankhede

On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> 
> Hi Alex, Kevin and Jike,
> 
> (Seems I shouldn't use attachments; resending to the list, patches are
> inline at the end)
> 
> Thanks for adding me to this technical discussion, a great opportunity
> for us to design together and bring both the Intel and NVIDIA vGPU
> solutions to the KVM platform.
> 
> Instead of directly jumping to the proposal that we have been working on
> recently for NVIDIA vGPU on KVM, I think it is better for me to put out a
> couple of quick comments / thoughts on the existing discussion in this
> thread, as fundamentally I think we are solving the same problems: DMA,
> interrupts and MMIO.
> 
> Then we can look at what we have; hopefully we can reach consensus soon.
> 
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices created here rather than at the point where we start
> > doing vfio "stuff".
> 
> In fact, to keep vfio-vgpu more generic, vgpu device creation and management
> can be centralized and done in vfio-vgpu. That also includes adding the
> device to the IOMMU group and VFIO group.

Is this really a good idea?  The concept of a vgpu is not unique to
vfio, we want vfio to be a driver for a vgpu, not an integral part of
the lifecycle of a vgpu.  That certainly doesn't exclude adding
infrastructure to make lifecycle management of a vgpu more consistent
between drivers, but it should be done independently of vfio.  I'll go
back to the SR-IOV model: vfio is often used with SR-IOV VFs, but vfio
does not create the VF; that's done in coordination with the PF, making
use of some PCI infrastructure for consistency between drivers.

It seems like we need to take more advantage of the class and driver
core support to perhaps setup a vgpu bus and class with vfio-vgpu just
being a driver for those devices.
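
As a rough sketch of that driver-core direction (purely illustrative, not
from any posted patch): vgpu.ko would own a "vgpu" bus, vendor code would
create devices on it, and vfio-vgpu would be an ordinary driver that binds
to those devices.

#include <linux/device.h>
#include <linux/module.h>

static struct bus_type vgpu_bus_type = {
        .name = "vgpu",                 /* appears as /sys/bus/vgpu */
};

static int __init vgpu_init(void)
{
        /* vgpu.ko owns the bus; vfio-vgpu registers a driver on this
         * bus instead of managing the vgpu lifecycle itself. */
        return bus_register(&vgpu_bus_type);
}

static void __exit vgpu_exit(void)
{
        bus_unregister(&vgpu_bus_type);
}

module_init(vgpu_init);
module_exit(vgpu_exit);
MODULE_LICENSE("GPL");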

> The graphics driver can register with vfio-vgpu to receive management and
> emulation callbacks.
> 
> We already have struct vgpu_device in our proposal, which keeps a pointer
> to the physical device.
> 
> > - vfio_pci will inject an IRQ to guest only when physical IRQ
> > generated; whereas vfio_vgpu may inject an IRQ for emulation
> > purpose. Anyway they can share the same injection interface;
> 
> The eventfd used to inject the interrupt is known to vfio-vgpu; that fd
> should be made available to the graphics driver so that it can inject
> interrupts directly when the physical device triggers an interrupt.
> 
> Here is the proposal we have, please review.
> 
> Please note the patches we have put out here are mainly for POC purposes,
> to verify our understanding, reduce confusion and speed up our design,
> although we are very happy to refine them into something that can
> eventually be used by both parties and upstreamed.
> 
> Linux vGPU kernel design
> ==================================================================================
> 
> Here we are proposing a generic Linux kernel module based on the VFIO
> framework which allows different GPU vendors to plug in and provide their
> GPU virtualization solutions on KVM; the benefits of such a generic kernel
> module are:
> 
> 1) Reuse QEMU VFIO driver, supporting VFIO UAPI
> 
> 2) GPU HW agnostic management API for upper layer software such as libvirt
> 
> 3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendors
> 
> 0. High level overview
> ==================================================================================
> 
>  
>   user space:
>                                 +-----------+  VFIO IOMMU IOCTLs
>                       +---------| QEMU VFIO |-------------------------+
>         VFIO IOCTLs   |         +-----------+                         |
>                       |                                               | 
>  ---------------------|-----------------------------------------------|---------
>                       |                                               |
>   kernel space:       |  +--->----------->---+  (callback)            V
>                       |  |                   v                 +------V-----+
>   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
>   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
>   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
>   |          |   |          |     | (register)           ^         ||
>   +----------+   +-------+--+     |    +-----------+     |         ||
>                          V        +----| i915.ko   +-----+     +---VV-------+ 
>                          |             +-----^-----+           | TYPE1      |
>                          |  (callback)       |                 | IOMMU      |
>                          +-->------------>---+                 +------------+
>  access flow:
> 
>   Guest MMIO / PCI config access
>   |
>   -------------------------------------------------
>   |
>   +-----> KVM VM_EXITs  (kernel)
>           |
>   -------------------------------------------------
>           |
>           +-----> QEMU VFIO driver (user)
>                   | 
>   -------------------------------------------------
>                   |
>                   +---->  VGPU kernel driver (kernel)
>                           |  
>                           | 
>                           +----> vendor driver callback
> 
> 
> 1. VGPU management interface
> ==================================================================================
> 
> This is the interface that allows upper-layer software (mostly libvirt) to
> query and configure virtual GPU devices in a HW-agnostic fashion. Also, this
> management interface provides the flexibility for the underlying GPU vendor
> to support virtual device hotplug, multiple virtual devices per VM, multiple
> virtual devices from different physical devices, etc.
> 
> 1.1 Under per-physical device sysfs:
> ----------------------------------------------------------------------------------
> 
> vgpu_supported_types - RO, lists the currently supported virtual GPU types
> and their VGPU_IDs. VGPU_ID is a vGPU type identifier returned from reads of
> "vgpu_supported_types".
> 
> vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
> gpu device on a target physical GPU. idx is the virtual device index inside a VM.
> 
> vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual gpu device
> on a target physical GPU.


I've noted in previous discussions that we need to separate user policy
from kernel policy here; the kernel policy should not require a "VM
UUID".  A UUID simply represents a set of one or more devices and an
index picks the device within the set.  Whether that UUID matches a VM
or is independently used is up to the user policy when creating the
device.

Personally I'd also prefer to get rid of the concept of indexes within a
UUID set of devices and instead have each device be independent.  This
seems to be an imposition of the nvidia implementation onto the kernel
interface design.


> 1.3 Under vgpu class sysfs:
> ----------------------------------------------------------------------------------
> 
> vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
> interface to notify the GPU vendor driver to commit virtual GPU resources for
> this target VM.
> 
> Also, vgpu_start is a synchronous call; a successful return indicates that
> all the requested vGPU resources have been fully committed, and the VMM
> should continue.
> 
> vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
> interface to notify the GPU vendor driver to release the virtual GPU resources
> of this target VM.
> 
> 1.4 Virtual device Hotplug
> ----------------------------------------------------------------------------------
> 
> To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
> accessed during VM runtime, and the corresponding registration callback will
> be invoked to allow the GPU vendor to support hotplug.
> 
> To support hotplug, the vendor driver would take the necessary action to
> handle a vgpu_create done on a VM_UUID after vgpu_start; that implies both
> create and start for that vgpu device.
> 
> Likewise, vgpu_destroy implies a vgpu_shutdown on a running VM, but only if
> the vendor driver supports vgpu hotplug.
> 
> If hotplug is not supported and the VM is still running, the vendor driver
> can return an error code to indicate that it is not supported.
> 
> Separating create from start gives the flexibility to have:
> 
> - multiple vgpu instances for a single VM, and
> - the hotplug feature.
> 
> 2. GPU driver vendor registration interface
> ==================================================================================
> 
> 2.1 Registration interface definition (include/linux/vgpu.h)
> ----------------------------------------------------------------------------------
> 
> extern int vgpu_register_device(struct pci_dev *dev, 
>                                 const struct gpu_device_ops *ops);
> 
> extern void vgpu_unregister_device(struct pci_dev *dev);
> 
> /**
>  * struct gpu_device_ops - Structure to be registered for each physical GPU to
>  * register the device to vgpu module.
>  *
>  * @owner:                      The module owner.
>  * @vgpu_supported_config:      Called to get information about supported vgpu
>  * types.
>  *                              @dev : pci device structure of physical GPU. 
>  *                              @config: should return string listing supported
>  *                              config
>  *                              Returns integer: success (0) or error (< 0)
>  * @vgpu_create:                Called to allocate basic resources in graphics
>  *                              driver for a particular vgpu.
>  *                              @dev: physical pci device structure on which
>  *                              vgpu 
>  *                                    should be created
>  *                              @vm_uuid: VM's uuid for which VM it is intended
>  *                              to
>  *                              @instance: vgpu instance in that VM
>  *                              @vgpu_id: This represents the type of vgpu to be
>  *                                        created
>  *                              Returns integer: success (0) or error (< 0)
>  * @vgpu_destroy:               Called to free resources in graphics driver for
>  *                              a vgpu instance of that VM.
>  *                              @dev: physical pci device structure to which
>  *                              this vgpu points to.
>  *                              @vm_uuid: VM's uuid for which the vgpu belongs
>  *                              to.
>  *                              @instance: vgpu instance in that VM
>  *                              Returns integer: success (0) or error (< 0)
>  *                              If VM is running and vgpu_destroy is called that 
>  *                              means the vGPU is being hotunpluged. Return
>  *                              error
>  *                              if VM is running and graphics driver doesn't
>  *                              support vgpu hotplug.
>  * @vgpu_start:                 Called to do initiate vGPU initialization
>  *                              process in graphics driver when VM boots before
>  *                              qemu starts.
>  *                              @vm_uuid: VM's UUID which is booting.
>  *                              Returns integer: success (0) or error (< 0)
>  * @vgpu_shutdown:              Called to teardown vGPU related resources for
>  *                              the VM
>  *                              @vm_uuid: VM's UUID which is shutting down .
>  *                              Returns integer: success (0) or error (< 0)
>  * @read:                       Read emulation callback
>  *                              @vdev: vgpu device structure
>  *                              @buf: read buffer
>  *                              @count: number bytes to read 
>  *                              @address_space: specifies for which address
>  *                              space
>  *                              the request is: pci_config_space, IO register
>  *                              space or MMIO space.
>  *                              Returns number of bytes read on success or error.
>  * @write:                      Write emulation callback
>  *                              @vdev: vgpu device structure
>  *                              @buf: write buffer
>  *                              @count: number bytes to be written
>  *                              @address_space: specifies for which address
>  *                              space
>  *                              the request is: pci_config_space, IO register
>  *                              space or MMIO space.
>  *                              Returns number of bytes written on success or
>  *                              error.
>  * @vgpu_set_irqs:              Called to pass along the interrupt
>  *                              configuration information that QEMU set.
>  *                              @vdev: vgpu device structure
>  *                              @flags, index, start, count and *data : same as
>  *                              that of struct vfio_irq_set of
>  *                              VFIO_DEVICE_SET_IRQS API. 
>  *
>  * A physical GPU that supports vGPU should be registered with the vgpu
>  * module using a gpu_device_ops structure.
>  */
> 
> struct gpu_device_ops {
>         struct module   *owner;
>         int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
>         int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
>                                uint32_t instance, uint32_t vgpu_id);
>         int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
>                                 uint32_t instance);
>         int     (*vgpu_start)(uuid_le vm_uuid);
>         int     (*vgpu_shutdown)(uuid_le vm_uuid);
>         ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
>                          uint32_t address_space, loff_t pos);
>         ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
>                          uint32_t address_space,loff_t pos);
>         int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
>                                  unsigned index, unsigned start, unsigned count,
>                                  void *data);
> 
> };


I wonder if it shouldn't be vfio-vgpu sub-drivers (i.e., Intel and Nvidia)
that register these ops with the main vfio-vgpu driver and they should
also include a probe() function which allows us to associate a given
vgpu device with a set of vendor ops.
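
As a concrete sketch of that, here is a hypothetical vendor-side
registration against the interface quoted above; only the names taken from
the proposal (vgpu_register_device, struct gpu_device_ops, struct
vgpu_device) come from the posting, everything else is illustrative:

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/vgpu.h>         /* the proposed include/linux/vgpu.h */

static ssize_t my_read(struct vgpu_device *vdev, char *buf, size_t count,
                       uint32_t address_space, loff_t pos)
{
        /* Decode (address_space, pos), emulate the access, fill buf. */
        return count;
}

static const struct gpu_device_ops my_ops = {
        .owner = THIS_MODULE,
        .read  = my_read,
        /* .vgpu_create, .vgpu_destroy, .write, .vgpu_set_irqs, ... */
};

/* Called from the vendor driver's PCI probe path: */
static int my_gpu_probe(struct pci_dev *pdev)
{
        return vgpu_register_device(pdev, &my_ops);
}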


> 
> 2.2 Details for callbacks we haven't mentioned above.
> ---------------------------------------------------------------------------------
> 
> vgpu_supported_config: allows the vendor driver to specify the supported vGPU
>                        type/configuration
> 
> vgpu_create          : create a virtual GPU device, can be used for device hotplug.
> 
> vgpu_destroy         : destroy a virtual GPU device, can be used for device hotplug.
> 
> vgpu_start           : callback function to notify the vendor driver that a
>                        vgpu device has come alive for a given virtual machine.
> 
> vgpu_shutdown        : callback function to notify the vendor driver to tear
>                        down vGPU resources for the VM.
> 
> read                 : callback to vendor driver to handle virtual device config
>                        space or MMIO read access
> 
> write                : callback to vendor driver to handle virtual device config
>                        space or MMIO write access
> 
> vgpu_set_irqs        : callback to the vendor driver to pass along the
>                        interrupt information for the target virtual device,
>                        so that the vendor driver can inject interrupts into
>                        the virtual machine for this device.
> 
> 2.3 Potential additional virtual device configuration registration interface:
> ---------------------------------------------------------------------------------
> 
> callback function to describe the MMAP behavior of the virtual GPU 
> 
> callback function to allow GPU vendor driver to provide PCI config space backing
> memory.
> 
> 3. VGPU TYPE1 IOMMU
> ==================================================================================
> 
> Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track
> of <iova, hva, size, flag> tuples and save the QEMU mm for later reference.
> 
> You can find the quick/ugly implementation in the attached patch file, which
> is actually just a simplified version of Alex's type1 IOMMU without the real
> mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called.
> 
> We have thought about providing another vendor driver registration interface
> so that such tracking information is sent to the vendor driver, which would
> use the QEMU mm to do get_user_pages / remap_pfn_range when required. After
> doing a quick implementation within our driver, I noticed the following
> issues:
> 
> 1) It puts OS/VFIO logic into the vendor driver, which will be a maintenance
> issue.
> 
> 2) Every driver vendor has to implement its own RB tree, instead of reusing
> the common existing VFIO code (vfio_find/link/unlink_dma).
> 
> 3) IOMMU_UNMAP_DMA is expected to return the number of "unmapped bytes" to
> the caller/QEMU; it is better not to have anything inside a vendor driver
> that the VFIO caller immediately depends on.
> 
> Based on the above considerations, we decided to implement the DMA tracking
> logic within the VGPU TYPE1 IOMMU code (ideally, this should be merged into
> the current TYPE1 IOMMU code) and expose two symbols for MMIO mapping and
> for page translation and pinning.
> 
> Also, with a mmap MMIO interface between virtual and physical, a
> para-virtualized guest driver can access its virtual MMIO without taking a
> mmap fault hit, and we can support different MMIO sizes between the virtual
> and physical device.
> 
> int vgpu_map_virtual_bar
> (
>     uint64_t virt_bar_addr,
>     uint64_t phys_bar_addr,
>     uint32_t len,
>     uint32_t flags
> )
> 
> EXPORT_SYMBOL(vgpu_map_virtual_bar);


Per the implementation provided, this needs to be implemented in the
vfio device driver, not in the iommu interface.  Finding the DMA mapping
of the device and replacing it is wrong.  It should be remapped at the
vfio device file interface using vm_ops.
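
A rough sketch of that suggestion (using the circa-4.x fault handler
signature; all names here are illustrative, not from the posted patches):
the vgpu device driver's mmap callback installs vm_ops, and the fault
handler inserts the backing pfn, so any remapping happens at the device
file rather than in the IOMMU path.

#include <linux/mm.h>

static int my_vgpu_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        unsigned long pfn = 0;  /* look up the backing physical page here */

        if (vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn))
                return VM_FAULT_SIGBUS;
        return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct my_vgpu_vm_ops = {
        .fault = my_vgpu_fault,
};

/* The vfio device mmap callback (struct vfio_device_ops .mmap): */
static int my_vgpu_mmap(void *device_data, struct vm_area_struct *vma)
{
        vma->vm_flags |= VM_IO | VM_PFNMAP;
        vma->vm_ops = &my_vgpu_vm_ops;
        return 0;
}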


> int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> 
> EXPORT_SYMBOL(vgpu_dma_do_translate);
> 
> There is still a lot to be added and modified, such as supporting multiple
> VMs and multiple virtual devices, tracking the mapped / pinned regions within
> the VGPU IOMMU kernel driver, error handling, rollback, locked memory size
> per user, etc.

Particularly, handling of mapping changes is completely missing.  This
cannot be a point-in-time translation; the user is free to remap
addresses whenever they wish, and device translations need to be updated
accordingly.
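
To make the proposed symbol concrete, a hypothetical vendor-side use; the
in-place translate semantics are an assumption based on the description
above, and, per the point just made, the results are only valid until the
user remaps the addresses:

#include <linux/errno.h>
#include <linux/types.h>

static int my_translate_batch(void)
{
        dma_addr_t gfns[16];

        /* Fill gfns[] with guest frame numbers taken from the guest's
         * GPU page tables (elided), then translate/pin them in place
         * via the proposed exported symbol. */
        if (vgpu_dma_do_translate(gfns, 16))
                return -EFAULT;

        /* Point-in-time only: these results must be refreshed (and the
         * pages unpinned) whenever the user's mappings change. */
        return 0;
}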


> 4. Modules
> ==================================================================================
> 
> Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> 
> vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
>                            TYPE1 v1 and v2 interface. 

Depending on how intrusive it is, this can possibly be done within the
existing type1 driver.  Either that, or we can split out common code for
use by a separate module.

> vgpu.ko                  - provide registration interface and virtual device
>                            VFIO access.
> 
> 5. QEMU note
> ==================================================================================
> 
> To allow us to focus on the VGPU kernel driver prototyping, we have introduced
> a new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
> vfio/pci.c file and can use it as a reference for our implementation. It is
> basically just a quick copy & paste from vfio/pci.c to quickly meet our needs.
> 
> Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> class, and probably the only thing required is to have a new way to discover the
> device.
> 
> 6. Examples
> ==================================================================================
> 
> On this server, we have two NVIDIA M60 GPUs.
> 
> [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> 
> After nvidia.ko gets initialized, we can query the supported vGPU types by
> reading "vgpu_supported_types" as follows:
> 
> [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
> 11:GRID M60-0B
> 12:GRID M60-0Q
> 13:GRID M60-1B
> 14:GRID M60-1Q
> 15:GRID M60-2B
> 16:GRID M60-2Q
> 17:GRID M60-4Q
> 18:GRID M60-8Q
> 
> For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> like to create a "GRID M60-4Q" vGPU for it.
> 
> echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> 
> Note: the number 0 here is the vGPU device index. The change has not been
> tested with multiple vgpu devices yet, but we will support that.
> 
> At this moment, if you query "vgpu_supported_types" it will still show all
> supported virtual GPU types, as no virtual GPU resources have been committed yet.
> 
> Starting VM:
> 
> echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> 
> then, the supported vGPU type query will return:
> 
> [root@cjia-vgx-kvm /home/cjia]$
> > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> 17:GRID M60-4Q
> 
> So vgpu_supported_config needs to be called whenever a new virtual device
> gets created, as the underlying HW might limit the supported types if there
> are any existing VMs running.
> 
> Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown
> will inform the GPU vendor driver to clean up resources.
> 
> Eventually, those virtual GPUs can be removed by writing to vgpu_destroy
> under the device sysfs.


I'd like to hear Intel's thoughts on this interface.  Are there
different vgpu capacities or priority classes that would necessitate
different types of vgpus on Intel?

I think there are some gaps in translating from named vgpu types to
indexes here, along with my previous mention of the UUID/set oddity.

Does Intel have a need for start and shutdown interfaces?

Neo, wasn't there at some point information about how many of each type
could be supported through these interfaces?  How does a user know their
capacity limits?

Thanks,
Alex


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
@ 2016-01-26 20:06                     ` Alex Williamson
  0 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-26 20:06 UTC (permalink / raw)
  To: Neo Jia, Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, kvm, igvt-g@lists.01.org, qemu-devel,
	Kirti Wankhede, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan

On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> 
> Hi Alex, Kevin and Jike,
> 
> (Seems I shouldn't use attachment, resend it again to the list, patches are
> inline at the end)
> 
> Thanks for adding me to this technical discussion, a great opportunity
> for us to design together which can bring both Intel and NVIDIA vGPU solution to
> KVM platform.
> 
> Instead of directly jumping to the proposal that we have been working on
> recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
> quick comments / thoughts regarding the existing discussions on this thread as
> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
> 
> Then we can look at what we have, hopefully we can reach some consensus soon.
> 
> > Yes, and since you're creating and destroying the vgpu here, this is
> > where I'd expect a struct device to be created and added to an IOMMU
> > group.  The lifecycle management should really include links between
> > the vGPU and physical GPU, which would be much, much easier to do with
> > struct devices create here rather than at the point where we start
> > doing vfio "stuff".
> 
> Infact to keep vfio-vgpu to be more generic, vgpu device creation and management
> can be centralized and done in vfio-vgpu. That also include adding to IOMMU
> group and VFIO group.

Is this really a good idea?  The concept of a vgpu is not unique to
vfio, we want vfio to be a driver for a vgpu, not an integral part of
the lifecycle of a vgpu.  That certainly doesn't exclude adding
infrastructure to make lifecycle management of a vgpu more consistent
between drivers, but it should be done independently of vfio.  I'll go
back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
does not create the VF, that's done in coordination with the PF making
use of some PCI infrastructure for consistency between drivers.

It seems like we need to take more advantage of the class and driver
core support to perhaps setup a vgpu bus and class with vfio-vgpu just
being a driver for those devices.

> Graphics driver can register with vfio-vgpu to get management and emulation call
> backs to graphics driver.   
> 
> We already have struct vgpu_device in our proposal that keeps pointer to
> physical device.  
> 
> > - vfio_pci will inject an IRQ to guest only when physical IRQ
> > generated; whereas vfio_vgpu may inject an IRQ for emulation
> > purpose. Anyway they can share the same injection interface;
> 
> eventfd to inject the interrupt is known to vfio-vgpu, that fd should be
> available to graphics driver so that graphics driver can inject interrupts
> directly when physical device triggers interrupt. 
> 
> Here is the proposal we have, please review.
> 
> Please note the patches we have put out here is mainly for POC purpose to
> verify our understanding also can serve the purpose to reduce confusions and speed up 
> our design, although we are very happy to refine that to something eventually
> can be used for both parties and upstreamed.
> 
> Linux vGPU kernel design
> ==================================================================================
> 
> Here we are proposing a generic Linux kernel module based on VFIO framework
> which allows different GPU vendors to plugin and provide their GPU virtualization
> solution on KVM, the benefits of having such generic kernel module are:
> 
> 1) Reuse QEMU VFIO driver, supporting VFIO UAPI
> 
> 2) GPU HW agnostic management API for upper layer software such as libvirt
> 
> 3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendor
> 
> 0. High level overview
> ==================================================================================
> 
>  
>   user space:
>                                 +-----------+  VFIO IOMMU IOCTLs
>                       +---------| QEMU VFIO |-------------------------+
>         VFIO IOCTLs   |         +-----------+                         |
>                       |                                               | 
>  ---------------------|-----------------------------------------------|---------
>                       |                                               |
>   kernel space:       |  +--->----------->---+  (callback)            V
>                       |  |                   v                 +------V-----+
>   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
>   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
>   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
>   |          |   |          |     | (register)           ^         ||
>   +----------+   +-------+--+     |    +-----------+     |         ||
>                          V        +----| i915.ko   +-----+     +---VV-------+ 
>                          |             +-----^-----+           | TYPE1      |
>                          |  (callback)       |                 | IOMMU      |
>                          +-->------------>---+                 +------------+
>  access flow:
> 
>   Guest MMIO / PCI config access
>   |
>   -------------------------------------------------
>   |
>   +-----> KVM VM_EXITs  (kernel)
>           |
>   -------------------------------------------------
>           |
>           +-----> QEMU VFIO driver (user)
>                   | 
>   -------------------------------------------------
>                   |
>                   +---->  VGPU kernel driver (kernel)
>                           |  
>                           | 
>                           +----> vendor driver callback
> 
> 
> 1. VGPU management interface
> ==================================================================================
> 
> This is the interface allows upper layer software (mostly libvirt) to query and
> configure virtual GPU device in a HW agnostic fashion. Also, this management
> interface has provided flexibility to underlying GPU vendor to support virtual
> device hotplug, multiple virtual devices per VM, multiple virtual devices from
> different physical devices, etc.
> 
> 1.1 Under per-physical device sysfs:
> ----------------------------------------------------------------------------------
> 
> vgpu_supported_types - RO, list the current supported virtual GPU types and its
> VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> "vgpu_supported_types".
>                             
> vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> gpu device on a target physical GPU. idx: virtual device index inside a VM
> 
> vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> target physical GPU


I've noted in previous discussions that we need to separate user policy
from kernel policy here, the kernel policy should not require a "VM
UUID".  A UUID simply represents a set of one or more devices and an
index picks the device within the set.  Whether that UUID matches a VM
or is independently used is up to the user policy when creating the
device.

Personally I'd also prefer to get rid of the concept of indexes within a
UUID set of devices and instead have each device be independent.  This
seems to be an imposition on the nvidia implementation into the kernel
interface design.


> 1.3 Under vgpu class sysfs:
> ----------------------------------------------------------------------------------
> 
> vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
> interface to notify the GPU vendor driver to commit virtual GPU resource for
> this target VM. 
> 
> Also, the vgpu_start function is a synchronized call, the successful return of
> this call will indicate all the requested vGPU resource has been fully
> committed, the VMM should continue.
> 
> vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
> interface to notify the GPU vendor driver to release virtual GPU resource of
> this target VM.
> 
> 1.4 Virtual device Hotplug
> ----------------------------------------------------------------------------------
> 
> To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
> accessed during VM runtime, and the corresponding registration callback will be
> invoked to allow GPU vendor support hotplug.
> 
> To support hotplug, vendor driver would take necessary action to handle the
> situation when a vgpu_create is done on a VM_UUID after vgpu_start, and that
> implies both create and start for that vgpu device.
> 
> Same, vgpu_destroy implies a vgpu_shudown on a running VM only if vendor driver
> supports vgpu hotplug.
> 
> If hotplug is not supported and VM is still running, vendor driver can return
> error code to indicate not supported.
> 
> Separate create from start gives flixibility to have:
> 
> - multiple vgpu instances for single VM and
> - hotplug feature.
> 
> 2. GPU driver vendor registration interface
> ==================================================================================
> 
> 2.1 Registration interface definition (include/linux/vgpu.h)
> ----------------------------------------------------------------------------------
> 
> extern int vgpu_register_device(struct pci_dev *dev, 
>                                 const struct gpu_device_ops *ops);
> 
> extern void vgpu_unregister_device(struct pci_dev *dev);
> 
> /**
>  * struct gpu_device_ops - Structure to be registered for each physical GPU to
>  * register the device to vgpu module.
>  *
>  * @owner:                      The module owner.
>  * @vgpu_supported_config:      Called to get information about supported vgpu
>  * types.
>  *                              @dev : pci device structure of physical GPU. 
>  *                              @config: should return string listing supported
>  *                              config
>  *                              Returns integer: success (0) or error (< 0)
>  * @vgpu_create:                Called to allocate basic resouces in graphics
>  *                              driver for a particular vgpu.
>  *                              @dev: physical pci device structure on which
>  *                              vgpu 
>  *                                    should be created
>  *                              @vm_uuid: VM's uuid for which VM it is intended
>  *                              to
>  *                              @instance: vgpu instance in that VM
>  *                              @vgpu_id: This represents the type of vgpu to be
>  *                                        created
>  *                              Returns integer: success (0) or error (< 0)
>  * @vgpu_destroy:               Called to free resources in graphics driver for
>  *                              a vgpu instance of that VM.
>  *                              @dev: physical pci device structure to which
>  *                              this vgpu points to.
>  *                              @vm_uuid: VM's uuid for which the vgpu belongs
>  *                              to.
>  *                              @instance: vgpu instance in that VM
>  *                              Returns integer: success (0) or error (< 0)
>  *                              If VM is running and vgpu_destroy is called that 
>  *                              means the vGPU is being hotunpluged. Return
>  *                              error
>  *                              if VM is running and graphics driver doesn't
>  *                              support vgpu hotplug.
>  * @vgpu_start:                 Called to do initiate vGPU initialization
>  *                              process in graphics driver when VM boots before
>  *                              qemu starts.
>  *                              @vm_uuid: VM's UUID which is booting.
>  *                              Returns integer: success (0) or error (< 0)
>  * @vgpu_shutdown:              Called to teardown vGPU related resources for
>  *                              the VM
>  *                              @vm_uuid: VM's UUID which is shutting down .
>  *                              Returns integer: success (0) or error (< 0)
>  * @read:                       Read emulation callback
>  *                              @vdev: vgpu device structure
>  *                              @buf: read buffer
>  *                              @count: number bytes to read 
>  *                              @address_space: specifies for which address
>  *                              space
>  *                              the request is: pci_config_space, IO register
>  *                              space or MMIO space.
>  *                              Retuns number on bytes read on success or error.
>  * @write:                      Write emulation callback
>  *                              @vdev: vgpu device structure
>  *                              @buf: write buffer
>  *                              @count: number bytes to be written
>  *                              @address_space: specifies for which address
>  *                              space
>  *                              the request is: pci_config_space, IO register
>  *                              space or MMIO space.
>  *                              Retuns number on bytes written on success or
>  *                              error.
>  * @vgpu_set_irqs:              Called to send about interrupts configuration
>  *                              information that qemu set. 
>  *                              @vdev: vgpu device structure
>  *                              @flags, index, start, count and *data : same as
>  *                              that of struct vfio_irq_set of
>  *                              VFIO_DEVICE_SET_IRQS API. 
>  *
>  * Physical GPU that support vGPU should be register with vgpu module with 
>  * gpu_device_ops structure.
>  */
> 
> struct gpu_device_ops {
>         struct module   *owner;
>         int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
>         int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
>                                uint32_t instance, uint32_t vgpu_id);
>         int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
>                                 uint32_t instance);
>         int     (*vgpu_start)(uuid_le vm_uuid);
>         int     (*vgpu_shutdown)(uuid_le vm_uuid);
>         ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
>                          uint32_t address_space, loff_t pos);
>         ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
>                          uint32_t address_space,loff_t pos);
>         int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
>                                  unsigned index, unsigned start, unsigned count,
>                                  void *data);
> 
> };


I wonder if it shouldn't be vfio-vgpu sub-drivers (ie, Intel and Nvidia)
that register these ops with the main vfio-vgpu driver and they should
also include a probe() function which allows us to associate a given
vgpu device with a set of vendor ops.


> 
> 2.2 Details for callbacks we haven't mentioned above.
> ---------------------------------------------------------------------------------
> 
> vgpu_supported_config: allows the vendor driver to specify the supported vGPU
>                        type/configuration
> 
> vgpu_create          : create a virtual GPU device, can be used for device hotplug.
> 
> vgpu_destroy         : destroy a virtual GPU device, can be used for device hotplug.
> 
> vgpu_start           : callback function to notify vendor driver vgpu device
>                        come to live for a given virtual machine.
> 
> vgpu_shutdown        : callback function to notify vendor driver 
> 
> read                 : callback to vendor driver to handle virtual device config
>                        space or MMIO read access
> 
> write                : callback to vendor driver to handle virtual device config
>                        space or MMIO write access
> 
> vgpu_set_irqs        : callback to vendor driver to pass along the interrupt
>                        information for the target virtual device, then vendor
>                        driver can inject interrupt into virtual machine for this
>                        device.
> 
> 2.3 Potential additional virtual device configuration registration interface:
> ---------------------------------------------------------------------------------
> 
> callback function to describe the MMAP behavior of the virtual GPU 
> 
> callback function to allow GPU vendor driver to provide PCI config space backing
> memory.
> 
> 3. VGPU TYPE1 IOMMU
> ==================================================================================
> 
> Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track the 
> <iova, hva, size, flag> and save the QEMU mm for later reference.
> 
> You can find the quick/ugly implementation in the attached patch file, which is
> actually just a simple version Alex's type1 IOMMU without actual real
> mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called. 
> 
> We have thought about providing another vendor driver registration interface so
> such tracking information will be sent to vendor driver and he will use the QEMU
> mm to do the get_user_pages / remap_pfn_range when it is required. After doing a
> quick implementation within our driver, I noticed following issues:
> 
> 1) OS/VFIO logic into vendor driver which will be a maintenance issue.
> 
> 2) Every driver vendor has to implement their own RB tree, instead of reusing
> the common existing VFIO code (vfio_find/link/unlink_dma) 
> 
> 3) IOMMU_UNMAP_DMA is expecting to get "unmapped bytes" back to the caller/QEMU,
> better not have anything inside a vendor driver that the VFIO caller immediately
> depends on.
> 
> Based on the above consideration, we decide to implement the DMA tracking logic
> within VGPU TYPE1 IOMMU code (ideally, this should be merged into current TYPE1
> IOMMU code) and expose two symbols to outside for MMIO mapping and page
> translation and pinning. 
> 
> Also, with a mmap MMIO interface between virtual and physical, this allows
> para-virtualized guest driver can access his virtual MMIO without taking a MMAP
> fault hit, also we can support different MMIO size between virtual and physical
> device.
> 
> int vgpu_map_virtual_bar
> (
>     uint64_t virt_bar_addr,
>     uint64_t phys_bar_addr,
>     uint32_t len,
>     uint32_t flags
> )
> 
> EXPORT_SYMBOL(vgpu_map_virtual_bar);


Per the implementation provided, this needs to be implemented in the
vfio device driver, not in the iommu interface.  Finding the DMA mapping
of the device and replacing it is wrong.  It should be remapped at the
vfio device file interface using vm_ops.
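
A minimal sketch of that approach, using the vm_ops conventions of this era
(vgpu_bar_pfn(), the vgpu_device type, and the vm_private_data layout are
hypothetical):

#include <linux/mm.h>

/* The vfio device driver's mmap handler installs these vm_ops; each
 * fault is resolved to whatever host pfn the vendor driver chooses to
 * back this BAR offset with, so remapping stays in the device driver. */
static int vgpu_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        struct vgpu_device *vgpu = vma->vm_private_data;
        unsigned long pfn = vgpu_bar_pfn(vgpu, vmf->pgoff); /* hypothetical */

        if (vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn))
                return VM_FAULT_SIGBUS;
        return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct vgpu_vm_ops = {
        .fault = vgpu_vma_fault,
};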


> int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> 
> EXPORT_SYMBOL(vgpu_dma_do_translate);
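
Based on the description above, a vendor driver would presumably call the
translation symbol along these lines (a sketch; the helper that sources the
guest frame numbers is hypothetical):

/* translate and pin a batch of guest frame numbers before starting DMA */
dma_addr_t gfns[8];
int ret;

fill_gfns_from_guest_cmdbuf(gfns, 8);   /* hypothetical helper */

ret = vgpu_dma_do_translate(gfns, 8);
if (ret)
        return ret;
/* on success, gfns[] now hold pinned host addresses usable for DMA */
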
> 
> Still a lot to be added and modified, such as supporting multiple VMs and
> multiple virtual devices, tracking the mapped / pinned regions within the VGPU
> IOMMU kernel driver, error handling, roll-back, and the locked memory size per
> user, etc.

Particularly, handling of mapping changes is completely missing.  This
cannot be a point-in-time translation; the user is free to remap
addresses whenever they wish, and device translations need to be updated
accordingly.
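
One way to keep the translations current is an mmu_notifier registered on the
saved QEMU mm, so a remap by the user invalidates any pinned pages and device
translations covering that range; a rough sketch (the vgpu_iommu struct
embedding a struct mmu_notifier mn, and the invalidate helper, are assumptions):

#include <linux/mmu_notifier.h>

static void vgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
                                           struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end)
{
        struct vgpu_iommu *iommu = container_of(mn, struct vgpu_iommu, mn);

        /* unpin pages and drop device translations in [start, end) */
        vgpu_invalidate_hva_range(iommu, start, end); /* hypothetical */
}

static const struct mmu_notifier_ops vgpu_mn_ops = {
        .invalidate_range_start = vgpu_mn_invalidate_range_start,
};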


> 4. Modules
> ==================================================================================
> 
> Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> 
> vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
>                            TYPE1 v1 and v2 interface. 

Depending on how intrusive it is, this can possibly be done within the
existing type1 driver.  Either that or we can split out common code for
use by a separate module.

> vgpu.ko                  - provide registration interface and virtual device
>                            VFIO access.
> 
> 5. QEMU note
> ==================================================================================
> 
> To allow us to focus on the VGPU kernel driver prototyping, we have introduced
> a new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
> vfio/pci.c file and can use it as a reference for our implementation. It is
> basically just a quick copy & paste from vfio/pci.c to quickly meet our needs.
> 
> Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> class, and probably the only thing required is to have a new way to discover the
> device.
> 
> 6. Examples
> ==================================================================================
> 
> On this server, we have two NVIDIA M60 GPUs.
> 
> [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> 
> After nvidia.ko gets initialized, we can query the supported vGPU types by
> accessing "vgpu_supported_types" as follows:
> 
> [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
> 11:GRID M60-0B
> 12:GRID M60-0Q
> 13:GRID M60-1B
> 14:GRID M60-1Q
> 15:GRID M60-2B
> 16:GRID M60-2Q
> 17:GRID M60-4Q
> 18:GRID M60-8Q
> 
> For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> like to create a "GRID M60-4Q" VM on it.
> 
> echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> 
> Note: the number 0 here is for the vGPU device index. So far the change has
> not been tested with multiple vgpu devices yet, but we will support them.
> 
> At this moment, if you query the "vgpu_supported_types" it will still show all
> supported virtual GPU types, as no virtual GPU resources are committed yet.
> 
> Starting VM:
> 
> echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> 
> then, the supported vGPU type query will return:
> 
> [root@cjia-vgx-kvm /home/cjia]$
> > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> 17:GRID M60-4Q
> 
> So vgpu_supported_config needs to be called whenever a new virtual device gets
> created, as the underlying HW might limit the supported types if there are any
> existing VMs running.
> 
> Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown
> will inform the GPU vendor driver to clean up its resources.
> 
> Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> device sysfs.


I'd like to hear Intel's thoughts on this interface.  Are there
different vgpu capacities or priority classes that would necessitate
different types of vgpus on Intel?

I think there are some gaps in translating from named vgpu types to
indexes here, along with my previous mention of the UUID/set oddity.

Does Intel have a need for start and shutdown interfaces?

Neo, wasn't there at some point information about how many of each type
could be supported through these interfaces?  How does a user know their
capacity limits?

Thanks,
Alex

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 16:37                     ` [Qemu-devel] " Alex Williamson
@ 2016-01-26 21:21                       ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-26 21:21 UTC (permalink / raw)
  To: Alex Williamson, Yang Zhang, Song, Jike
  Cc: Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm,
	qemu-devel, igvt-g@lists.01.org, Neo Jia

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, January 27, 2016 12:37 AM
> 
> On Tue, 2016-01-26 at 22:05 +0800, Yang Zhang wrote:
> > On 2016/1/26 15:41, Jike Song wrote:
> > > On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > > > [cc +Neo @Nvidia]
> > > >
> > > > Hi Jike,
> > > >
> > > > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > > > I would expect we can spell out the next-level tasks in the above
> > > > > > direction, upon which Alex can easily judge whether there are
> > > > > > some common VFIO framework changes that he can help with :-)
> > > > >
> > > > > Hi Alex,
> > > > >
> > > > > Here is a draft task list after a short discussion w/ Kevin,
> > > > > would you please have a look?
> > > > >
> > > > > 	Bus Driver
> > > > >
> > > > > 		{ in i915/vgt/xxx.c }
> > > > >
> > > > > 		- define a subset of vfio_pci interfaces
> > > > > 		- selective pass-through (say aperture)
> > > > > 		- trap MMIO: interface w/ QEMU
> > > >
> > > > What's included in the subset?  Certainly the bus reset ioctls really
> > > > don't apply, but you'll need to support the full device interface,
> > > > right?  That includes the region info ioctl and access through the vfio
> > > > device file descriptor as well as the interrupt info and setup ioctls.
> > > >
> > >
> > > [I thought all interfaces were via ioctl :)  For other stuff like the
> > > file descriptor we'll definitely keep it.]
> > >
> > > The list of ioctl commands provided by vfio_pci:
> > >
> > > 	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > 	- VFIO_DEVICE_PCI_HOT_RESET
> > >
> > > As you said, above 2 don't apply. But for this:
> > >
> > > 	- VFIO_DEVICE_RESET
> > >
> > > In my opinion it should be kept, no matter what will be provided in
> > > the bus driver.
> > >
> > > 	- VFIO_PCI_ROM_REGION_INDEX
> > > 	- VFIO_PCI_VGA_REGION_INDEX
> > >
> > > I suppose the above 2 don't apply either? For a vgpu we don't provide a
> > > ROM BAR or VGA region.
> > >
> > > 	- VFIO_DEVICE_GET_INFO
> > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > 	- VFIO_DEVICE_GET_IRQ_INFO
> > > 	- VFIO_DEVICE_SET_IRQS
> > >
> > > Above 4 are needed of course.
> > >
> > > We will need to extend:
> > >
> > > 	- VFIO_DEVICE_GET_REGION_INFO
> > >
> > >
> > > a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> > > should be trapped instead of being mmap-ed.
> >
> > I may not be in the context, but I am curious how to handle DONT_MAP in
> > the vfio driver. There is no real MMIO mapped into the region, and I
> > suppose access to the region should be handled by the vgpu code in the
> > i915 driver, but currently most of the MMIO accesses are handled by QEMU.
> 
> VFIO supports the following region attributes:
> 
> #define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
> #define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
> #define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */
> 
> If MMAP is not set, then the QEMU driver will do pread and/or pwrite to
> the specified offsets of the device file descriptor, depending on what
> accesses are supported.  This is all reported through the REGION_INFO
> ioctl for a given index.  If mmap is supported, the VM will have direct
> access to the area, without faulting to KVM other than to populate the
> mapping.  Without mmap support, a VM MMIO access traps into KVM, which
> returns out to QEMU to service the request, which then finds the
> MemoryRegion serviced through vfio, which will then perform a
> pread/pwrite through to the kernel vfio bus driver to handle the
> access.  Thanks,
> 
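
In userspace terms, the flow described above reduces to something like this
sketch against the VFIO uapi (error handling omitted):

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

struct vfio_region_info info = { .argsz = sizeof(info) };

info.index = VFIO_PCI_BAR0_REGION_INDEX;
ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);

if (info.flags & VFIO_REGION_INFO_FLAG_MMAP) {
        /* direct path: the VM gets the BAR mapped straight through */
        void *map = mmap(NULL, info.size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, device_fd, info.offset);
} else if (info.flags & VFIO_REGION_INFO_FLAG_READ) {
        /* trapped path: each VM access becomes a pread/pwrite on the fd */
        uint32_t val;
        pread(device_fd, &val, sizeof(val), info.offset + 0x10);
}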

Today KVMGT (not using VFIO yet) registers I/O emulation callbacks with
KVM, so VM MMIO accesses are forwarded to KVMGT directly for emulation
in the kernel. If we reuse the above R/W flags, the whole emulation path
would be unnecessarily long, with an obvious performance impact. We
either need a new flag here to indicate in-kernel emulation (a bias from
passthrough support), or alternatively just hide the region (letting
KVMGT handle I/O emulation itself, as it does today).

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 21:21                       ` [Qemu-devel] " Tian, Kevin
@ 2016-01-26 21:30                         ` Neo Jia
  -1 siblings, 0 replies; 118+ messages in thread
From: Neo Jia @ 2016-01-26 21:30 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Yang Zhang, Song, Jike, Gerd Hoffmann,
	Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm, qemu-devel,
	igvt-g@lists.01.org

On Tue, Jan 26, 2016 at 09:21:42PM +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, January 27, 2016 12:37 AM
> > 
> > On Tue, 2016-01-26 at 22:05 +0800, Yang Zhang wrote:
> > > On 2016/1/26 15:41, Jike Song wrote:
> > > > On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > > > > [cc +Neo @Nvidia]
> > > > >
> > > > > Hi Jike,
> > > > >
> > > > > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > > > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > > > > I would expect we can spell out the next-level tasks in the above
> > > > > > > direction, upon which Alex can easily judge whether there are
> > > > > > > some common VFIO framework changes that he can help with :-)
> > > > > >
> > > > > > Hi Alex,
> > > > > >
> > > > > > Here is a draft task list after a short discussion w/ Kevin,
> > > > > > would you please have a look?
> > > > > >
> > > > > > 	Bus Driver
> > > > > >
> > > > > > 		{ in i915/vgt/xxx.c }
> > > > > >
> > > > > > 		- define a subset of vfio_pci interfaces
> > > > > > 		- selective pass-through (say aperture)
> > > > > > 		- trap MMIO: interface w/ QEMU
> > > > >
> > > > > What's included in the subset?  Certainly the bus reset ioctls really
> > > > > don't apply, but you'll need to support the full device interface,
> > > > > right?  That includes the region info ioctl and access through the vfio
> > > > > device file descriptor as well as the interrupt info and setup ioctls.
> > > > >
> > > >
> > > > [I thought all interfaces were via ioctl :)  For other stuff like the
> > > > file descriptor we'll definitely keep it.]
> > > >
> > > > The list of ioctl commands provided by vfio_pci:
> > > >
> > > > 	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > 	- VFIO_DEVICE_PCI_HOT_RESET
> > > >
> > > > As you said, above 2 don't apply. But for this:
> > > >
> > > > 	- VFIO_DEVICE_RESET
> > > >
> > > > In my opinion it should be kept, no matter what will be provided in
> > > > the bus driver.
> > > >
> > > > 	- VFIO_PCI_ROM_REGION_INDEX
> > > > 	- VFIO_PCI_VGA_REGION_INDEX
> > > >
> > > > I suppose the above 2 don't apply either? For a vgpu we don't provide a
> > > > ROM BAR or VGA region.
> > > >
> > > > 	- VFIO_DEVICE_GET_INFO
> > > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > > 	- VFIO_DEVICE_GET_IRQ_INFO
> > > > 	- VFIO_DEVICE_SET_IRQS
> > > >
> > > > Above 4 are needed of course.
> > > >
> > > > We will need to extend:
> > > >
> > > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > >
> > > >
> > > > a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> > > > should be trapped instead of being mmap-ed.
> > >
> > > I may not be in the context, but I am curious how to handle DONT_MAP in
> > > the vfio driver. There is no real MMIO mapped into the region, and I
> > > suppose access to the region should be handled by the vgpu code in the
> > > i915 driver, but currently most of the MMIO accesses are handled by QEMU.
> > 
> > VFIO supports the following region attributes:
> > 
> > #define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
> > #define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
> > #define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */
> > 
> > If MMAP is not set, then the QEMU driver will do pread and/or pwrite to
> > the specified offsets of the device file descriptor, depending on what
> > accesses are supported.  This is all reported through the REGION_INFO
> > ioctl for a given index.  If mmap is supported, the VM will have direct
> > access to the area, without faulting to KVM other than to populate the
> > mapping.  Without mmap support, a VM MMIO access traps into KVM, which
> > returns out to QEMU to service the request, which then finds the
> > MemoryRegion serviced through vfio, which will then perform a
> > pread/pwrite through to the kernel vfio bus driver to handle the
> > access.  Thanks,
> > 
> 
> > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks with
> > KVM, so VM MMIO accesses are forwarded to KVMGT directly for emulation
> > in the kernel. If we reuse the above R/W flags, the whole emulation path
> > would be unnecessarily long, with an obvious performance impact. We
> > either need a new flag here to indicate in-kernel emulation (a bias from
> > passthrough support), or alternatively just hide the region (letting
> > KVMGT handle I/O emulation itself, as it does today).
> 

Hi Kevin,

Maybe there is some confusion about the VFIO interface that we are going to use
here. I thought we were going to adopt VFIO so nobody would need to plug
directly into the kvm module.

Thanks,
Neo


> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 20:06                     ` [Qemu-devel] " Alex Williamson
@ 2016-01-26 21:38                       ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-26 21:38 UTC (permalink / raw)
  To: Alex Williamson, Neo Jia
  Cc: Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org, Kirti Wankhede

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, January 27, 2016 4:06 AM
> 
> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> >
> > Hi Alex, Kevin and Jike,
> >
> > (Seems I shouldn't use attachments; resending to the list, patches are
> > inline at the end)
> >
> > Thanks for adding me to this technical discussion, a great opportunity for
> > us to design together something that can bring both the Intel and NVIDIA
> > vGPU solutions to the KVM platform.
> >
> > Instead of directly jumping to the proposal that we have been working on
> > recently for NVIDIA vGPU on KVM, I think it is better for me to put out a
> > couple of quick comments / thoughts regarding the existing discussion on this
> > thread, as fundamentally I think we are solving the same problems: DMA,
> > interrupts and MMIO.
> >
> > Then we can look at what we have, hopefully we can reach some consensus soon.
> >
> > > Yes, and since you're creating and destroying the vgpu here, this is
> > > where I'd expect a struct device to be created and added to an IOMMU
> > > group.  The lifecycle management should really include links between
> > > the vGPU and physical GPU, which would be much, much easier to do with
> > > struct devices created here rather than at the point where we start
> > > doing vfio "stuff".
> >
> > In fact, to keep vfio-vgpu more generic, vgpu device creation and management
> > can be centralized and done in vfio-vgpu. That also includes adding to the
> > IOMMU group and VFIO group.
> 
> Is this really a good idea?  The concept of a vgpu is not unique to
> vfio; we want vfio to be a driver for a vgpu, not an integral part of
> the lifecycle of a vgpu.  That certainly doesn't exclude adding
> infrastructure to make lifecycle management of a vgpu more consistent
> between drivers, but it should be done independently of vfio.  I'll go
> back to the SR-IOV model: vfio is often used with SR-IOV VFs, but vfio
> does not create the VF; that's done in coordination with the PF, making
> use of some PCI infrastructure for consistency between drivers.
> 
> It seems like we need to take more advantage of the class and driver
> core support to perhaps setup a vgpu bus and class with vfio-vgpu just
> being a driver for those devices.

Agree with Alex here. Even if we want to do more abstraction of overall
vgpu management, here let's stick to necessary changes within VFIO 
scope.
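
For reference, the driver-core direction suggested above would start roughly
like this sketch (names are illustrative only), with vfio-vgpu then binding as
just another driver on that bus:

#include <linux/device.h>
#include <linux/module.h>

static struct bus_type vgpu_bus_type = {
        .name = "vgpu",
};

static int __init vgpu_bus_init(void)
{
        /* vendor drivers create vgpu devices on this bus; vfio-vgpu
         * (or any other driver) matches and binds to them */
        return bus_register(&vgpu_bus_type);
}

static void __exit vgpu_bus_exit(void)
{
        bus_unregister(&vgpu_bus_type);
}

module_init(vgpu_bus_init);
module_exit(vgpu_bus_exit);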


> >
> > 6. Examples
> >
> > ==================================================================================
> >
> > On this server, we have two NVIDIA M60 GPUs.
> >
> > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> >
> > After nvidia.ko gets initialized, we can query the supported vGPU types by
> > accessing "vgpu_supported_types" as follows:
> >
> > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > 11:GRID M60-0B
> > 12:GRID M60-0Q
> > 13:GRID M60-1B
> > 14:GRID M60-1Q
> > 15:GRID M60-2B
> > 16:GRID M60-2Q
> > 17:GRID M60-4Q
> > 18:GRID M60-8Q
> >
> > For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> > like to create a "GRID M60-4Q" VM on it.
> >
> > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" >
> /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> >
> > Note: the number 0 here is for the vGPU device index. So far the change has
> > not been tested with multiple vgpu devices yet, but we will support them.
> >
> > At this moment, if you query the "vgpu_supported_types" it will still show all
> > supported virtual GPU types, as no virtual GPU resources are committed yet.
> >
> > Starting VM:
> >
> > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> >
> > then, the supported vGPU type query will return:
> >
> > [root@cjia-vgx-kvm /home/cjia]$
> > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > 17:GRID M60-4Q
> >
> > So vgpu_supported_config needs to be called whenever a new virtual device gets
> > created, as the underlying HW might limit the supported types if there are any
> > existing VMs running.
> >
> > Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown
> > will inform the GPU vendor driver to clean up its resources.
> >
> > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> > device sysfs.
> 
> 
> I'd like to hear Intel's thoughts on this interface.  Are there
> different vgpu capacities or priority classes that would necessitate
> different types of vgpus on Intel?

We'll evaluate this proposal against our requirements. A quick comment is
that we don't need such a type scheme. We just expose the same type of
vgpu as the underlying platform. On the other hand, our implementation
gives the user the flexibility to control resource allocation (e.g. video
memory) across different VMs, instead of using a fixed partition scheme,
so we have an interface to query the remaining free resources.

> 
> Does Intel have a need for start and shutdown interfaces?

Not for now. But we can extend to support such an interface, which would
provide more flexibility by separating resource allocation from run-time
control.

Given that NVIDIA/Intel do have specific requirements on vgpu management,
I'd suggest that we focus on the VFIO changes first. After that we can
evaluate how much commonality there is in vgpu management, and based on
that decide whether to have a common vgpu framework or to stay with
vendor-specific implementations for that part.

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 21:30                         ` [Qemu-devel] " Neo Jia
@ 2016-01-26 21:43                           ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-26 21:43 UTC (permalink / raw)
  To: Neo Jia
  Cc: Alex Williamson, Yang Zhang, Song, Jike, Gerd Hoffmann,
	Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm, qemu-devel,
	igvt-g@lists.01.org

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Wednesday, January 27, 2016 5:31 AM
> 
> On Tue, Jan 26, 2016 at 09:21:42PM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Wednesday, January 27, 2016 12:37 AM
> > >
> > > On Tue, 2016-01-26 at 22:05 +0800, Yang Zhang wrote:
> > > > On 2016/1/26 15:41, Jike Song wrote:
> > > > > On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > > > > > [cc +Neo @Nvidia]
> > > > > >
> > > > > > Hi Jike,
> > > > > >
> > > > > > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > > > > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > > > > > I would expect we can spell out the next-level tasks in the above
> > > > > > > > direction, upon which Alex can easily judge whether there are
> > > > > > > > some common VFIO framework changes that he can help with :-)
> > > > > > >
> > > > > > > Hi Alex,
> > > > > > >
> > > > > > > Here is a draft task list after a short discussion w/ Kevin,
> > > > > > > would you please have a look?
> > > > > > >
> > > > > > > 	Bus Driver
> > > > > > >
> > > > > > > 		{ in i915/vgt/xxx.c }
> > > > > > >
> > > > > > > 		- define a subset of vfio_pci interfaces
> > > > > > > 		- selective pass-through (say aperture)
> > > > > > > 		- trap MMIO: interface w/ QEMU
> > > > > >
> > > > > > What's included in the subset?  Certainly the bus reset ioctls really
> > > > > > don't apply, but you'll need to support the full device interface,
> > > > > > right?  That includes the region info ioctl and access through the vfio
> > > > > > device file descriptor as well as the interrupt info and setup ioctls.
> > > > > >
> > > > >
> > > > > [I thought all interfaces were via ioctl :)  For other stuff like the
> > > > > file descriptor we'll definitely keep it.]
> > > > >
> > > > > The list of ioctl commands provided by vfio_pci:
> > > > >
> > > > > 	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > > 	- VFIO_DEVICE_PCI_HOT_RESET
> > > > >
> > > > > As you said, above 2 don't apply. But for this:
> > > > >
> > > > > 	- VFIO_DEVICE_RESET
> > > > >
> > > > > In my opinion it should be kept, no matter what will be provided in
> > > > > the bus driver.
> > > > >
> > > > > 	- VFIO_PCI_ROM_REGION_INDEX
> > > > > 	- VFIO_PCI_VGA_REGION_INDEX
> > > > >
> > > > > I suppose the above 2 don't apply either? For a vgpu we don't provide a
> > > > > ROM BAR or VGA region.
> > > > >
> > > > > 	- VFIO_DEVICE_GET_INFO
> > > > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > > > 	- VFIO_DEVICE_GET_IRQ_INFO
> > > > > 	- VFIO_DEVICE_SET_IRQS
> > > > >
> > > > > Above 4 are needed of course.
> > > > >
> > > > > We will need to extend:
> > > > >
> > > > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > > >
> > > > >
> > > > > a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> > > > > should be trapped instead of being mmap-ed.
> > > >
> > > > I may not be in the context, but I am curious how to handle DONT_MAP in
> > > > the vfio driver. There is no real MMIO mapped into the region, and I
> > > > suppose access to the region should be handled by the vgpu code in the
> > > > i915 driver, but currently most of the MMIO accesses are handled by QEMU.
> > >
> > > VFIO supports the following region attributes:
> > >
> > > #define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
> > > #define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
> > > #define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */
> > >
> > > If MMAP is not set, then the QEMU driver will do pread and/or pwrite to
> > > the specified offsets of the device file descriptor, depending on what
> > > accesses are supported.  This is all reported through the REGION_INFO
> > > ioctl for a given index.  If mmap is supported, the VM will have direct
> > > access to the area, without faulting to KVM other than to populate the
> > > mapping.  Without mmap support, a VM MMIO access traps into KVM, which
> > > returns out to QEMU to service the request, which then finds the
> > > MemoryRegion serviced through vfio, which will then perform a
> > > pread/pwrite through to the kernel vfio bus driver to handle the
> > > access.  Thanks,
> > >
> >
> > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks with
> > KVM, so VM MMIO accesses are forwarded to KVMGT directly for emulation
> > in the kernel. If we reuse the above R/W flags, the whole emulation path
> > would be unnecessarily long, with an obvious performance impact. We
> > either need a new flag here to indicate in-kernel emulation (a bias from
> > passthrough support), or alternatively just hide the region (letting
> > KVMGT handle I/O emulation itself, as it does today).
> >
> 
> Hi Kevin,
> 
> Maybe there is some confusion about the VFIO interface that we are going to use
> here. I thought we were going to adopt VFIO so nobody would need to directly
> plug into kvm module.
> 

We have reason to do so, since looping kernel->user->kernel incurs
several times the emulation overhead per trap, which can have an obvious
impact on some performance-critical paths. We discussed the rationale
behind this with the KVM maintainer (IIRC Paolo) last year.

We can extend the VFIO interface to support such a model in general.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 21:21                       ` [Qemu-devel] " Tian, Kevin
@ 2016-01-26 21:43                         ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-26 21:43 UTC (permalink / raw)
  To: Tian, Kevin, Yang Zhang, Song, Jike
  Cc: Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm,
	qemu-devel, igvt-g@lists.01.org, Neo Jia

On Tue, 2016-01-26 at 21:21 +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, January 27, 2016 12:37 AM
> > 
> > On Tue, 2016-01-26 at 22:05 +0800, Yang Zhang wrote:
> > > On 2016/1/26 15:41, Jike Song wrote:
> > > > On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > > > > [cc +Neo @Nvidia]
> > > > > 
> > > > > Hi Jike,
> > > > > 
> > > > > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > > > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > > > > I would expect we can spell out the next-level tasks in the above
> > > > > > > direction, upon which Alex can easily judge whether there are
> > > > > > > some common VFIO framework changes that he can help with :-)
> > > > > > 
> > > > > > Hi Alex,
> > > > > > 
> > > > > > Here is a draft task list after a short discussion w/ Kevin,
> > > > > > would you please have a look?
> > > > > > 
> > > > > > 	Bus Driver
> > > > > > 
> > > > > > 		{ in i915/vgt/xxx.c }
> > > > > > 
> > > > > > 		- define a subset of vfio_pci interfaces
> > > > > > 		- selective pass-through (say aperture)
> > > > > > 		- trap MMIO: interface w/ QEMU
> > > > > 
> > > > > What's included in the subset?  Certainly the bus reset ioctls really
> > > > > don't apply, but you'll need to support the full device interface,
> > > > > right?  That includes the region info ioctl and access through the vfio
> > > > > device file descriptor as well as the interrupt info and setup ioctls.
> > > > > 
> > > > 
> > > > [I thought all interfaces were via ioctl :)  For other stuff like the
> > > > file descriptor we'll definitely keep it.]
> > > > 
> > > > The list of ioctl commands provided by vfio_pci:
> > > > 
> > > > 	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > 	- VFIO_DEVICE_PCI_HOT_RESET
> > > > 
> > > > As you said, above 2 don't apply. But for this:
> > > > 
> > > > 	- VFIO_DEVICE_RESET
> > > > 
> > > > In my opinion it should be kept, no matter what will be provided in
> > > > the bus driver.
> > > > 
> > > > 	- VFIO_PCI_ROM_REGION_INDEX
> > > > 	- VFIO_PCI_VGA_REGION_INDEX
> > > > 
> > > > I suppose the above 2 don't apply either? For a vgpu we don't provide a
> > > > ROM BAR or VGA region.
> > > > 
> > > > 	- VFIO_DEVICE_GET_INFO
> > > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > > 	- VFIO_DEVICE_GET_IRQ_INFO
> > > > 	- VFIO_DEVICE_SET_IRQS
> > > > 
> > > > Above 4 are needed of course.
> > > > 
> > > > We will need to extend:
> > > > 
> > > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > > 
> > > > 
> > > > a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> > > > should be trapped instead of being mmap-ed.
> > > 
> > > I may not be in the context, but I am curious how to handle DONT_MAP in
> > > the vfio driver. There is no real MMIO mapped into the region, and I
> > > suppose access to the region should be handled by the vgpu code in the
> > > i915 driver, but currently most of the MMIO accesses are handled by QEMU.
> > 
> > VFIO supports the following region attributes:
> > 
> > #define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
> > #define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
> > #define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */
> > 
> > If MMAP is not set, then the QEMU driver will do pread and/or pwrite to
> > the specified offsets of the device file descriptor, depending on what
> > accesses are supported.  This is all reported through the REGION_INFO
> > ioctl for a given index.  If mmap is supported, the VM will have direct
> > access to the area, without faulting to KVM other than to populate the
> > mapping.  Without mmap support, a VM MMIO access traps into KVM, which
> > returns out to QEMU to service the request, which then finds the
> > MemoryRegion serviced through vfio, which will then perform a
> > pread/pwrite through to the kernel vfio bus driver to handle the
> > access.  Thanks,
> > 
> 
> Today KVMGT (not using VFIO yet) registers I/O emulation callbacks with
> KVM, so VM MMIO accesses are forwarded to KVMGT directly for emulation
> in the kernel. If we reuse the above R/W flags, the whole emulation path
> would be unnecessarily long, with an obvious performance impact. We
> either need a new flag here to indicate in-kernel emulation (a bias from
> passthrough support), or alternatively just hide the region (letting
> KVMGT handle I/O emulation itself, as it does today).

That sounds like a future optimization TBH.  There's very strict
layering between vfio and kvm.  Physical device assignment could make
use of it as well, avoiding a round trip through userspace when an
ioread/write would do.  Userspace also needs to orchestrate those kinds
of accelerators; there might be cases where userspace wants to see those
transactions for debugging or manipulating the device.  We can't simply
take shortcuts to provide such direct access.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
@ 2016-01-26 21:43                         ` Alex Williamson
  0 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-26 21:43 UTC (permalink / raw)
  To: Tian, Kevin, Yang Zhang, Song, Jike
  Cc: Ruan, Shuai, Neo Jia, kvm, igvt-g@lists.01.org, qemu-devel,
	Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan

On Tue, 2016-01-26 at 21:21 +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, January 27, 2016 12:37 AM
> > 
> > On Tue, 2016-01-26 at 22:05 +0800, Yang Zhang wrote:
> > > On 2016/1/26 15:41, Jike Song wrote:
> > > > On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > > > > [cc +Neo @Nvidia]
> > > > > 
> > > > > Hi Jike,
> > > > > 
> > > > > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > > > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > > > > I would expect we can spell out next level tasks toward above
> > > > > > > direction, upon which Alex can easily judge whether there are
> > > > > > > some common VFIO framework changes that he can help :-)
> > > > > > 
> > > > > > Hi Alex,
> > > > > > 
> > > > > > Here is a draft task list after a short discussion w/ Kevin,
> > > > > > would you please have a look?
> > > > > > 
> > > > > > 	Bus Driver
> > > > > > 
> > > > > > 		{ in i915/vgt/xxx.c }
> > > > > > 
> > > > > > 		- define a subset of vfio_pci interfaces
> > > > > > 		- selective pass-through (say aperture)
> > > > > > 		- trap MMIO: interface w/ QEMU
> > > > > 
> > > > > What's included in the subset?  Certainly the bus reset ioctls really
> > > > > don't apply, but you'll need to support the full device interface,
> > > > > right?  That includes the region info ioctl and access through the vfio
> > > > > device file descriptor as well as the interrupt info and setup ioctls.
> > > > > 
> > > > 
> > > > [All interfaces I thought are via ioctl:)  For other stuff like file
> > > > descriptor we'll definitely keep it.]
> > > > 
> > > > The list of ioctl commands provided by vfio_pci:
> > > > 
> > > > 	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > 	- VFIO_DEVICE_PCI_HOT_RESET
> > > > 
> > > > As you said, above 2 don't apply. But for this:
> > > > 
> > > > 	- VFIO_DEVICE_RESET
> > > > 
> > > > In my opinion it should be kept, no matter what will be provided in
> > > > the bus driver.
> > > > 
> > > > 	- VFIO_PCI_ROM_REGION_INDEX
> > > > 	- VFIO_PCI_VGA_REGION_INDEX
> > > > 
> > > > I suppose above 2 don't apply neither? For a vgpu we don't provide a
> > > > ROM BAR or VGA region.
> > > > 
> > > > 	- VFIO_DEVICE_GET_INFO
> > > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > > 	- VFIO_DEVICE_GET_IRQ_INFO
> > > > 	- VFIO_DEVICE_SET_IRQS
> > > > 
> > > > Above 4 are needed of course.
> > > > 
> > > > We will need to extend:
> > > > 
> > > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > > 
> > > > 
> > > > a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> > > > should be trapped instead of being mmap-ed.
> > > 
> > > I may not in the context, but i am curious how to handle the DONT_MAP in
> > > vfio driver? Since there are no real MMIO maps into the region and i
> > > suppose the access to the region should be handled by vgpu in i915
> > > driver, but currently most of the mmio accesses are handled by Qemu.
> > 
> > VFIO supports the following region attributes:
> > 
> > #define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
> > #define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
> > #define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */
> > 
> > If MMAP is not set, then the QEMU driver will do pread and/or pwrite to
> > the specified offsets of the device file descriptor, depending on what
> > accesses are supported.  This is all reported through the REGION_INFO
> > ioctl for a given index.  If mmap is supported, the VM will have direct
> > access to the area, without faulting to KVM other than to populate the
> > mapping.  Without mmap support, a VM MMIO access traps into KVM, which
> > returns out to QEMU to service the request, which then finds the
> > MemoryRegion serviced through vfio, which will then perform a
> > pread/pwrite through to the kernel vfio bus driver to handle the
> > access.  Thanks,
> > 
> 
> Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> KVM, so VM MMIO accesses are forwarded to KVMGT directly for
> emulation in the kernel. If we reuse the above R/W flags, the whole emulation
> path would be unnecessarily long, with obvious performance impact. We
> either need a new flag here to indicate in-kernel emulation (a deviation
> from passthrough support), or alternatively just hide the region (letting
> KVMGT handle I/O emulation itself like today).

That sounds like a future optimization TBH.  There's very strict
layering between vfio and kvm.  Physical device assignment could make
use of it as well, avoiding a round trip through userspace when an
ioread/write would do.  Userspace also needs to orchestrate those kinds
of accelerators, there might be cases where userspace wants to see those
transactions for debugging or manipulating the device.  We can't simply
take shortcuts to provide such direct access.  Thanks,

Alex
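
A minimal userspace sketch of the trapped-vs-mmap decision described above,
using only the standard VFIO UAPI; the helper name and error handling are
illustrative rather than QEMU's actual code:

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <stdio.h>

/* Query a region's info and choose between direct mapping and trapped
 * access, exactly as the flags above dictate. */
static void *map_or_trap_region(int device_fd, unsigned int index)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = index,
	};
	void *p;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return NULL;

	if (info.flags & VFIO_REGION_INFO_FLAG_MMAP) {
		/* Direct guest access; faults only populate the mapping. */
		p = mmap(NULL, info.size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, device_fd, info.offset);
		return p == MAP_FAILED ? NULL : p;
	}

	/* No MMAP flag: each guest access traps to KVM, exits to QEMU,
	 * and is serviced with pread()/pwrite() at info.offset. */
	printf("region %u is trapped; emulate via pread/pwrite at 0x%llx\n",
	       index, (unsigned long long)info.offset);
	return NULL;
}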


* RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 21:43                         ` [Qemu-devel] " Alex Williamson
@ 2016-01-26 21:50                           ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-26 21:50 UTC (permalink / raw)
  To: Alex Williamson, Yang Zhang, Song, Jike
  Cc: Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm,
	qemu-devel, igvt-g@lists.01.org, Neo Jia

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, January 27, 2016 5:44 AM
> 
> On Tue, 2016-01-26 at 21:21 +0000, Tian, Kevin wrote:
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Wednesday, January 27, 2016 12:37 AM
> > >
> > > On Tue, 2016-01-26 at 22:05 +0800, Yang Zhang wrote:
> > > > On 2016/1/26 15:41, Jike Song wrote:
> > > > > On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > > > > > [cc +Neo @Nvidia]
> > > > > >
> > > > > > Hi Jike,
> > > > > >
> > > > > > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > > > > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > > > > > I would expect we can spell out next level tasks toward above
> > > > > > > > direction, upon which Alex can easily judge whether there are
> > > > > > > > some common VFIO framework changes that he can help :-)
> > > > > > >
> > > > > > > Hi Alex,
> > > > > > >
> > > > > > > Here is a draft task list after a short discussion w/ Kevin,
> > > > > > > would you please have a look?
> > > > > > >
> > > > > > > 	Bus Driver
> > > > > > >
> > > > > > > 		{ in i915/vgt/xxx.c }
> > > > > > >
> > > > > > > 		- define a subset of vfio_pci interfaces
> > > > > > > 		- selective pass-through (say aperture)
> > > > > > > 		- trap MMIO: interface w/ QEMU
> > > > > >
> > > > > > What's included in the subset?  Certainly the bus reset ioctls really
> > > > > > don't apply, but you'll need to support the full device interface,
> > > > > > right?  That includes the region info ioctl and access through the vfio
> > > > > > device file descriptor as well as the interrupt info and setup ioctls.
> > > > > >
> > > > >
> > > > > [All the interfaces I had in mind are via ioctl :)  Other things, like the file
> > > > > descriptor, we'll definitely keep.]
> > > > >
> > > > > The list of ioctl commands provided by vfio_pci:
> > > > >
> > > > > 	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > > 	- VFIO_DEVICE_PCI_HOT_RESET
> > > > >
> > > > > As you said, above 2 don't apply. But for this:
> > > > >
> > > > > 	- VFIO_DEVICE_RESET
> > > > >
> > > > > In my opinion it should be kept, no matter what will be provided in
> > > > > the bus driver.
> > > > >
> > > > > 	- VFIO_PCI_ROM_REGION_INDEX
> > > > > 	- VFIO_PCI_VGA_REGION_INDEX
> > > > >
> > > > > I suppose the above 2 don't apply either? For a vgpu we don't provide a
> > > > > ROM BAR or VGA region.
> > > > >
> > > > > 	- VFIO_DEVICE_GET_INFO
> > > > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > > > 	- VFIO_DEVICE_GET_IRQ_INFO
> > > > > 	- VFIO_DEVICE_SET_IRQS
> > > > >
> > > > > Above 4 are needed of course.
> > > > >
> > > > > We will need to extend:
> > > > >
> > > > > 	- VFIO_DEVICE_GET_REGION_INFO
> > > > >
> > > > >
> > > > > a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> > > > > should be trapped instead of being mmap-ed.
> > > >
> > > > I may not be in the context, but I am curious how DONT_MAP would be handled
> > > > in the vfio driver? Since no real MMIO is mapped into the region, I suppose
> > > > accesses to the region should be handled by the vgpu code in the i915
> > > > driver, but currently most of the MMIO accesses are handled by QEMU.
> > >
> > > VFIO supports the following region attributes:
> > >
> > > #define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
> > > #define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
> > > #define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */
> > >
> > > If MMAP is not set, then the QEMU driver will do pread and/or pwrite to
> > > the specified offsets of the device file descriptor, depending on what
> > > accesses are supported.  This is all reported through the REGION_INFO
> > > ioctl for a given index.  If mmap is supported, the VM will have direct
> > > access to the area, without faulting to KVM other than to populate the
> > > mapping.  Without mmap support, a VM MMIO access traps into KVM, which
> > > returns out to QEMU to service the request, which then finds the
> > > MemoryRegion serviced through vfio, which will then perform a
> > > pread/pwrite through to the kernel vfio bus driver to handle the
> > > access.  Thanks,
> > >
> >
> > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > KVM, so VM MMIO accesses are forwarded to KVMGT directly for
> > emulation in the kernel. If we reuse the above R/W flags, the whole emulation
> > path would be unnecessarily long, with obvious performance impact. We
> > either need a new flag here to indicate in-kernel emulation (a deviation
> > from passthrough support), or alternatively just hide the region (letting
> > KVMGT handle I/O emulation itself like today).
> 
> That sounds like a future optimization TBH.  There's very strict
> layering between vfio and kvm.  Physical device assignment could make
> use of it as well, avoiding a round trip through userspace when an
> ioread/write would do.  Userspace also needs to orchestrate those kinds
> of accelerators, there might be cases where userspace wants to see those
> transactions for debugging or manipulating the device.  We can't simply
> take shortcuts to provide such direct access.  Thanks,
> 

But we have to balance such debugging flexibility against acceptable performance.
To me the latter is more important, otherwise there'd be no real usage
of this technique, while for debugging there are other alternatives (e.g.
ftrace). Consider some extreme case with 100k traps/second and then see
how much impact a 2-3x longer emulation path can bring...
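
(For a rough sense of scale, with purely illustrative numbers: if an
in-kernel-emulated trap costs ~1 us while the full KVM -> QEMU -> vfio round
trip costs ~3 us, then 100k traps/second means roughly 10% versus 30% of a
CPU core spent on emulation overhead alone.)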

Thanks
Kevin


* RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 16:12                   ` [Qemu-devel] " Alex Williamson
@ 2016-01-26 21:57                     ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-26 21:57 UTC (permalink / raw)
  To: Alex Williamson, Song, Jike
  Cc: Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm,
	qemu-devel, igvt-g@lists.01.org, Neo Jia

> From: Alex Williamson
> Sent: Wednesday, January 27, 2016 12:13 AM
> > b) adding other information. For example, for the OpRegion, QEMU needs
> > to do more than mmap a region; it has to:
> >
> > 	- allocate a region
> > 	- copy contents from somewhere in the host to that region
> > 	- mmap it to the guest
> >
> >
> > I remember you already have a prototype for this?
> 
> Yes, I'm working on this currently; it will be a device-specific region
> and QEMU can either copy the contents to a new buffer in guest memory
> or provide trapped access to the host opregion.  I thought vgpus
> weren't going to need opregions though; I figured it was more for GVT-d
> support.  Thanks,
> 

It's beneficial to vgpu too. Anyway, the same graphics driver runs inside the VM
in both cases, so any driver assumption about passthrough also applies to vgpu.

Thanks
Kevin
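
A minimal sketch of the copy-based option described above, assuming the
OpRegion shows up as a device-specific vfio region; the region index, helper
name and error handling are hypothetical, since the device-specific-region
mechanism was still a prototype at this point:

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Copy the host OpRegion contents, exposed through a device-specific
 * vfio region, into a buffer that will be mapped into the guest. */
static int vgpu_copy_opregion(int device_fd, unsigned int region_index,
			      void *guest_buf, size_t buf_size)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = region_index,
	};
	size_t len;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return -1;

	len = info.size < buf_size ? info.size : buf_size;
	if (pread(device_fd, guest_buf, len, info.offset) != (ssize_t)len)
		return -1;

	/* guest_buf is then exposed to the guest as its OpRegion. */
	return 0;
}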


* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 21:50                           ` [Qemu-devel] " Tian, Kevin
@ 2016-01-26 22:07                             ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-26 22:07 UTC (permalink / raw)
  To: Tian, Kevin, Yang Zhang, Song, Jike
  Cc: Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm,
	qemu-devel, igvt-g@lists.01.org, Neo Jia

On Tue, 2016-01-26 at 21:50 +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, January 27, 2016 5:44 AM
> > 
> > On Tue, 2016-01-26 at 21:21 +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, January 27, 2016 12:37 AM
> > > > 
> > > > On Tue, 2016-01-26 at 22:05 +0800, Yang Zhang wrote:
> > > > > On 2016/1/26 15:41, Jike Song wrote:
> > > > > > On 01/26/2016 05:30 AM, Alex Williamson wrote:
> > > > > > > [cc +Neo @Nvidia]
> > > > > > > 
> > > > > > > Hi Jike,
> > > > > > > 
> > > > > > > On Mon, 2016-01-25 at 19:34 +0800, Jike Song wrote:
> > > > > > > > On 01/20/2016 05:05 PM, Tian, Kevin wrote:
> > > > > > > > > I would expect we can spell out next level tasks toward above
> > > > > > > > > direction, upon which Alex can easily judge whether there are
> > > > > > > > > some common VFIO framework changes that he can help :-)
> > > > > > > > 
> > > > > > > > Hi Alex,
> > > > > > > > 
> > > > > > > > Here is a draft task list after a short discussion w/ Kevin,
> > > > > > > > would you please have a look?
> > > > > > > > 
> > > > > > > >  	Bus Driver
> > > > > > > > 
> > > > > > > >  		{ in i915/vgt/xxx.c }
> > > > > > > > 
> > > > > > > >  		- define a subset of vfio_pci interfaces
> > > > > > > >  		- selective pass-through (say aperture)
> > > > > > > >  		- trap MMIO: interface w/ QEMU
> > > > > > > 
> > > > > > > What's included in the subset?  Certainly the bus reset ioctls really
> > > > > > > don't apply, but you'll need to support the full device interface,
> > > > > > > right?  That includes the region info ioctl and access through the vfio
> > > > > > > device file descriptor as well as the interrupt info and setup ioctls.
> > > > > > > 
> > > > > > 
> > > > > > [All the interfaces I had in mind are via ioctl :)  Other things, like the file
> > > > > > descriptor, we'll definitely keep.]
> > > > > > 
> > > > > > The list of ioctl commands provided by vfio_pci:
> > > > > > 
> > > > > >  	- VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > > > >  	- VFIO_DEVICE_PCI_HOT_RESET
> > > > > > 
> > > > > > As you said, above 2 don't apply. But for this:
> > > > > > 
> > > > > >  	- VFIO_DEVICE_RESET
> > > > > > 
> > > > > > In my opinion it should be kept, no matter what will be provided in
> > > > > > the bus driver.
> > > > > > 
> > > > > >  	- VFIO_PCI_ROM_REGION_INDEX
> > > > > >  	- VFIO_PCI_VGA_REGION_INDEX
> > > > > > 
> > > > > > I suppose the above 2 don't apply either? For a vgpu we don't provide a
> > > > > > ROM BAR or VGA region.
> > > > > > 
> > > > > >  	- VFIO_DEVICE_GET_INFO
> > > > > >  	- VFIO_DEVICE_GET_REGION_INFO
> > > > > >  	- VFIO_DEVICE_GET_IRQ_INFO
> > > > > >  	- VFIO_DEVICE_SET_IRQS
> > > > > > 
> > > > > > Above 4 are needed of course.
> > > > > > 
> > > > > > We will need to extend:
> > > > > > 
> > > > > >  	- VFIO_DEVICE_GET_REGION_INFO
> > > > > > 
> > > > > > 
> > > > > > a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
> > > > > > should be trapped instead of being mmap-ed.
> > > > > 
> > > > > I may not be in the context, but I am curious how DONT_MAP would be handled
> > > > > in the vfio driver? Since no real MMIO is mapped into the region, I suppose
> > > > > accesses to the region should be handled by the vgpu code in the i915
> > > > > driver, but currently most of the MMIO accesses are handled by QEMU.
> > > > 
> > > > VFIO supports the following region attributes:
> > > > 
> > > > #define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
> > > > #define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
> > > > #define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */
> > > > 
> > > > If MMAP is not set, then the QEMU driver will do pread and/or pwrite to
> > > > the specified offsets of the device file descriptor, depending on what
> > > > accesses are supported.  This is all reported through the REGION_INFO
> > > > ioctl for a given index.  If mmap is supported, the VM will have direct
> > > > access to the area, without faulting to KVM other than to populate the
> > > > mapping.  Without mmap support, a VM MMIO access traps into KVM, which
> > > > returns out to QEMU to service the request, which then finds the
> > > > MemoryRegion serviced through vfio, which will then perform a
> > > > pread/pwrite through to the kernel vfio bus driver to handle the
> > > > access.  Thanks,
> > > > 
> > > 
> > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > KVM, so VM MMIO accesses are forwarded to KVMGT directly for
> > > emulation in the kernel. If we reuse the above R/W flags, the whole emulation
> > > path would be unnecessarily long, with obvious performance impact. We
> > > either need a new flag here to indicate in-kernel emulation (a deviation
> > > from passthrough support), or alternatively just hide the region (letting
> > > KVMGT handle I/O emulation itself like today).
> > 
> > That sounds like a future optimization TBH.  There's very strict
> > layering between vfio and kvm.  Physical device assignment could make
> > use of it as well, avoiding a round trip through userspace when an
> > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > of accelerators, there might be cases where userspace wants to see those
> > transactions for debugging or manipulating the device.  We can't simply
> > take shortcuts to provide such direct access.  Thanks,
> > 
> 
> But we have to balance such debugging flexibility against acceptable performance.
> To me the latter is more important, otherwise there'd be no real usage
> of this technique, while for debugging there are other alternatives (e.g.
> ftrace). Consider some extreme case with 100k traps/second and then see
> how much impact a 2-3x longer emulation path can bring...

Are you jumping to the conclusion that it cannot be done with proper
layering in place?  Performance is important, but it's not an excuse to
abandon designing interfaces between independent components.  Thanks,

Alex



* RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 22:07                             ` [Qemu-devel] " Alex Williamson
@ 2016-01-26 22:15                               ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-26 22:15 UTC (permalink / raw)
  To: Alex Williamson, Yang Zhang, Song, Jike
  Cc: Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm,
	qemu-devel, igvt-g@lists.01.org, Neo Jia

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, January 27, 2016 6:08 AM
> 
> > > > >
> > > >
> > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > KVM, so VM MMIO accesses are forwarded to KVMGT directly for
> > > > emulation in the kernel. If we reuse the above R/W flags, the whole emulation
> > > > path would be unnecessarily long, with obvious performance impact. We
> > > > either need a new flag here to indicate in-kernel emulation (a deviation
> > > > from passthrough support), or alternatively just hide the region (letting
> > > > KVMGT handle I/O emulation itself like today).
> > >
> > > That sounds like a future optimization TBH.  There's very strict
> > > layering between vfio and kvm.  Physical device assignment could make
> > > use of it as well, avoiding a round trip through userspace when an
> > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > of accelerators, there might be cases where userspace wants to see those
> > > transactions for debugging or manipulating the device.  We can't simply
> > > take shortcuts to provide such direct access.  Thanks,
> > >
> >
> > But we have to balance such debugging flexibility against acceptable performance.
> > To me the latter is more important, otherwise there'd be no real usage
> > of this technique, while for debugging there are other alternatives (e.g.
> > ftrace). Consider some extreme case with 100k traps/second and then see
> > how much impact a 2-3x longer emulation path can bring...
> 
> Are you jumping to the conclusion that it cannot be done with proper
> layering in place?  Performance is important, but it's not an excuse to
> abandon designing interfaces between independent components.  Thanks,
> 

The two are not contradictory. My point is to remove the unnecessarily long
trip where possible. On second thought, yes, we can reuse the existing
read/write flags (a rough sketch follows the list):
	- KVMGT will expose a private control variable indicating whether
in-kernel delivery is required;
	- when the variable is true, KVMGT will register in-kernel MMIO
emulation callbacks, so VM MMIO requests will be delivered to KVMGT
directly;
	- when the variable is false, KVMGT will not register anything.
VM MMIO requests will then be delivered to Qemu, and ioread/write
will be used to finally reach the KVMGT emulation logic;
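
A rough kernel-side sketch of that control flow; every name below
(kvmgt_in_kernel_emul, kvmgt_vgpu, kvmgt_register_mmio_cb) is hypothetical,
since no such KVMGT interface exists today:

#include <linux/module.h>

struct kvmgt_vgpu;					/* per-vgpu state */
int kvmgt_register_mmio_cb(struct kvmgt_vgpu *vgpu);	/* registers with KVM */

static bool kvmgt_in_kernel_emul = true;	/* the private control variable */
module_param(kvmgt_in_kernel_emul, bool, 0444);

static int kvmgt_setup_mmio(struct kvmgt_vgpu *vgpu)
{
	if (kvmgt_in_kernel_emul)
		/* Register in-kernel MMIO emulation callbacks: VM MMIO
		 * requests are delivered to KVMGT directly. */
		return kvmgt_register_mmio_cb(vgpu);

	/* Register nothing: VM MMIO requests exit to Qemu, which then
	 * reaches the KVMGT emulation logic via ioread/write. */
	return 0;
}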

Thanks
Kevin


* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 22:15                               ` [Qemu-devel] " Tian, Kevin
@ 2016-01-26 22:27                                 ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-26 22:27 UTC (permalink / raw)
  To: Tian, Kevin, Yang Zhang, Song, Jike
  Cc: Ruan, Shuai, Neo Jia, kvm, igvt-g@lists.01.org, qemu-devel,
	Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan

On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, January 27, 2016 6:08 AM
> > 
> > > > > > 
> > > > > 
> > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > KVM, so VM MMIO accesses are forwarded to KVMGT directly for
> > > > > emulation in the kernel. If we reuse the above R/W flags, the whole emulation
> > > > > path would be unnecessarily long, with obvious performance impact. We
> > > > > either need a new flag here to indicate in-kernel emulation (a deviation
> > > > > from passthrough support), or alternatively just hide the region (letting
> > > > > KVMGT handle I/O emulation itself like today).
> > > > 
> > > > That sounds like a future optimization TBH.  There's very strict
> > > > layering between vfio and kvm.  Physical device assignment could make
> > > > use of it as well, avoiding a round trip through userspace when an
> > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > of accelerators, there might be cases where userspace wants to see those
> > > > transactions for debugging or manipulating the device.  We can't simply
> > > > take shortcuts to provide such direct access.  Thanks,
> > > > 
> > > 
> > > But we have to balance such debugging flexibility against acceptable performance.
> > > To me the latter is more important, otherwise there'd be no real usage
> > > of this technique, while for debugging there are other alternatives (e.g.
> > > ftrace). Consider some extreme case with 100k traps/second and then see
> > > how much impact a 2-3x longer emulation path can bring...
> > 
> > Are you jumping to the conclusion that it cannot be done with proper
> > layering in place?  Performance is important, but it's not an excuse to
> > abandon designing interfaces between independent components.  Thanks,
> > 
> 
> The two are not contradictory. My point is to remove the unnecessarily long
> trip where possible. On second thought, yes, we can reuse the existing
> read/write flags:
> 	- KVMGT will expose a private control variable indicating whether
> in-kernel delivery is required;

But in-kernel delivery is never *required*.  Wouldn't userspace want to
deliver in-kernel any time it possibly could?

> 	- when the variable is true, KVMGT will register in-kernel MMIO
> emulation callbacks, so VM MMIO requests will be delivered to KVMGT
> directly;
> 	- when the variable is false, KVMGT will not register anything.
> VM MMIO requests will then be delivered to Qemu, and ioread/write
> will be used to finally reach the KVMGT emulation logic;

No, that means the interface is entirely dependent on a backdoor through
KVM.  Why can't userspace (QEMU) do something like register an MMIO
region with KVM, handled via a provided file descriptor and offset, so
that KVM could then call the file ops without exiting the kernel?  Thanks,

Alex
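
To make the suggestion concrete, a purely hypothetical UAPI sketch of such a
registration; neither this struct nor the ioctl exists in KVM at this point:

#include <linux/types.h>

/* Hypothetical: QEMU registers a trapped MMIO range with KVM, backed by
 * a file descriptor + offset (e.g. a vfio device fd and region offset).
 * KVM would then service guest accesses to this range by calling the
 * fd's read/write file ops directly, without exiting to userspace. */
struct kvm_fd_mmio_region {
	__u64 guest_phys_addr;	/* start of the trapped MMIO range */
	__u64 len;		/* length of the range */
	__s32 fd;		/* fd whose file ops handle the access */
	__u32 pad;
	__u64 offset;		/* offset within that fd */
};

/* Usage would be something like:
 *	ioctl(vm_fd, KVM_SET_FD_MMIO_REGION, &region);
 */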


* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 20:06                     ` [Qemu-devel] " Alex Williamson
@ 2016-01-26 22:28                       ` Neo Jia
  -1 siblings, 0 replies; 118+ messages in thread
From: Neo Jia @ 2016-01-26 22:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Kirti Wankhede

On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > 
> > Hi Alex, Kevin and Jike,
> > 
> > (It seems I shouldn't use attachments; resending to the list, with the patches
> > inline at the end)
> > 
> > Thanks for adding me to this technical discussion, a great opportunity
> > for us to design together, which can bring both the Intel and NVIDIA vGPU
> > solutions to the KVM platform.
> > 
> > Instead of directly jumping to the proposal that we have been working on
> > recently for NVIDIA vGPU on KVM, I think it is better for me to put out a
> > couple of quick comments/thoughts regarding the existing discussions on this
> > thread, as fundamentally I think we are solving the same problems: DMA,
> > interrupts and MMIO.
> > 
> > Then we can look at what we have, hopefully we can reach some consensus soon.
> > 
> > > Yes, and since you're creating and destroying the vgpu here, this is
> > > where I'd expect a struct device to be created and added to an IOMMU
> > > group.  The lifecycle management should really include links between
> > > the vGPU and physical GPU, which would be much, much easier to do with
> > > struct devices create here rather than at the point where we start
> > > doing vfio "stuff".
> > 
> > In fact, to keep vfio-vgpu more generic, vgpu device creation and management
> > can be centralized and done in vfio-vgpu. That also includes adding to the IOMMU
> > group and the VFIO group.
> 
> Is this really a good idea?  The concept of a vgpu is not unique to
> vfio; we want vfio to be a driver for a vgpu, not an integral part of
> the lifecycle of a vgpu.  That certainly doesn't exclude adding
> infrastructure to make lifecycle management of a vgpu more consistent
> between drivers, but it should be done independently of vfio.  I'll go
> back to the SR-IOV model: vfio is often used with SR-IOV VFs, but vfio
> does not create the VF; that's done in coordination with the PF, making
> use of some PCI infrastructure for consistency between drivers.
> 
> It seems like we need to take more advantage of the class and driver
> core support to perhaps set up a vgpu bus and class, with vfio-vgpu just
> being a driver for those devices.
> 
> > The graphics driver can register with vfio-vgpu to get management and emulation
> > callbacks.
> > 
> > We already have struct vgpu_device in our proposal, which keeps a pointer to
> > the physical device.
> > 
> > > - vfio_pci will inject an IRQ to the guest only when a physical IRQ is
> > > generated; whereas vfio_vgpu may inject an IRQ for emulation
> > > purposes. Anyway, they can share the same injection interface;
> > 
> > The eventfd to inject the interrupt is known to vfio-vgpu; that fd should be
> > available to the graphics driver so that it can inject interrupts
> > directly when the physical device triggers an interrupt.
> > 
> > Here is the proposal we have, please review.
> > 
> > Please note the patches we have put out here are mainly for POC purposes, to
> > verify our understanding and also to reduce confusion and speed up
> > our design, although we are very happy to refine them into something that can
> > eventually be used by both parties and upstreamed.
> > 
> > Linux vGPU kernel design
> > ==================================================================================
> > 
> > Here we are proposing a generic Linux kernel module based on the VFIO framework
> > which allows different GPU vendors to plug in and provide their GPU virtualization
> > solutions on KVM. The benefits of having such a generic kernel module are:
> > 
> > 1) Reuse QEMU VFIO driver, supporting VFIO UAPI
> > 
> > 2) GPU HW agnostic management API for upper layer software such as libvirt
> > 
> > 3) No duplicated VFIO kernel logic reimplemented by different GPU driver vendors
> > 
> > 0. High level overview
> > ==================================================================================
> > 
> >  
> >   user space:
> >                                 +-----------+  VFIO IOMMU IOCTLs
> >                       +---------| QEMU VFIO |-------------------------+
> >         VFIO IOCTLs   |         +-----------+                         |
> >                       |                                               | 
> >  ---------------------|-----------------------------------------------|---------
> >                       |                                               |
> >   kernel space:       |  +--->----------->---+  (callback)            V
> >                       |  |                   v                 +------V-----+
> >   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
> >   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
> >   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+ 
> >   |          |   |          |     | (register)           ^         ||
> >   +----------+   +-------+--+     |    +-----------+     |         ||
> >                          V        +----| i915.ko   +-----+     +---VV-------+ 
> >                          |             +-----^-----+           | TYPE1      |
> >                          |  (callback)       |                 | IOMMU      |
> >                          +-->------------>---+                 +------------+
> >  access flow:
> > 
> >   Guest MMIO / PCI config access
> >   |
> >   -------------------------------------------------
> >   |
> >   +-----> KVM VM_EXITs  (kernel)
> >           |
> >   -------------------------------------------------
> >           |
> >           +-----> QEMU VFIO driver (user)
> >                   | 
> >   -------------------------------------------------
> >                   |
> >                   +---->  VGPU kernel driver (kernel)
> >                           |  
> >                           | 
> >                           +----> vendor driver callback
> > 
> > 
> > 1. VGPU management interface
> > ==================================================================================
> > 
> > This is the interface that allows upper-layer software (mostly libvirt) to query
> > and configure virtual GPU devices in a HW-agnostic fashion. Also, this management
> > interface provides the flexibility for the underlying GPU vendor to support virtual
> > device hotplug, multiple virtual devices per VM, multiple virtual devices from
> > different physical devices, etc.
> > 
> > 1.1 Under per-physical device sysfs:
> > ----------------------------------------------------------------------------------
> > 
> > vgpu_supported_types - RO, lists the currently supported virtual GPU types and
> > their VGPU_IDs. VGPU_ID - a vGPU type identifier returned from reads of
> > "vgpu_supported_types".
> >                             
> > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > 
> > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > target physical GPU
> 
> 
> I've noted in previous discussions that we need to separate user policy
> from kernel policy here; the kernel policy should not require a "VM
> UUID".  A UUID simply represents a set of one or more devices and an
> index picks the device within the set.  Whether that UUID matches a VM
> or is independently used is up to the user policy when creating the
> device.
> 
> Personally I'd also prefer to get rid of the concept of indexes within a
> UUID set of devices and instead have each device be independent.  This
> seems to be an imposition of the nvidia implementation onto the kernel
> interface design.
> 

Hi Alex,

I agree with you that we should not put the UUID concept into a kernel API. At
this point (without any prototyping), I am thinking of using a list of virtual
devices instead of a UUID.
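
A rough sketch of what UUID-free create/destroy callbacks might look like;
purely illustrative, nothing here has been prototyped:

#include <linux/types.h>

struct pci_dev;
struct vgpu_device;			/* as in the proposal below */

/* Hypothetical UUID-free variant: each vgpu is created independently,
 * identified only by the returned vgpu_device; grouping devices into a
 * VM is left entirely to userspace policy. */
struct gpu_device_ops_nouuid {
	int (*vgpu_create)(struct pci_dev *dev, uint32_t vgpu_type_id,
			   struct vgpu_device **out_vdev);
	int (*vgpu_destroy)(struct vgpu_device *vdev);
};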

> 
> > 1.3 Under vgpu class sysfs:
> > ----------------------------------------------------------------------------------
> > 
> > vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
> > interface to notify the GPU vendor driver to commit virtual GPU resources for
> > this target VM.
> > 
> > Also, vgpu_start is a synchronous call; a successful return of
> > this call indicates that all the requested vGPU resources have been fully
> > committed, and the VMM should continue.
> > 
> > vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the registration
> > interface to notify the GPU vendor driver to release the virtual GPU resources
> > of this target VM.
> > 
> > 1.4 Virtual device Hotplug
> > ----------------------------------------------------------------------------------
> > 
> > To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
> > accessed during VM runtime, and the corresponding registration callback will be
> > invoked to allow the GPU vendor to support hotplug.
> > 
> > To support hotplug, the vendor driver would take the necessary actions to handle
> > the case where a vgpu_create is done on a VM_UUID after vgpu_start; that
> > implies both create and start for that vgpu device.
> > 
> > Likewise, vgpu_destroy implies a vgpu_shutdown on a running VM, but only if the
> > vendor driver supports vgpu hotplug.
> > 
> > If hotplug is not supported and the VM is still running, the vendor driver can
> > return an error code to indicate that it is not supported.
> > 
> > Separating create from start gives the flexibility to have:
> > 
> > - multiple vgpu instances for a single VM, and
> > - the hotplug feature.
> > 
> > 2. GPU driver vendor registration interface
> > ==================================================================================
> > 
> > 2.1 Registration interface definition (include/linux/vgpu.h)
> > ----------------------------------------------------------------------------------
> > 
> > extern int vgpu_register_device(struct pci_dev *dev, 
> >                                 const struct gpu_device_ops *ops);
> > 
> > extern void vgpu_unregister_device(struct pci_dev *dev);
> > 
> > /**
> >  * struct gpu_device_ops - Structure to be registered for each physical GPU to
> >  * register the device to vgpu module.
> >  *
> >  * @owner:                      The module owner.
> >  * @vgpu_supported_config:      Called to get information about supported vgpu
> >  *                              types.
> >  *                              @dev: pci device structure of the physical GPU.
> >  *                              @config: should return a string listing the
> >  *                              supported configs
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @vgpu_create:                Called to allocate basic resources in the
> >  *                              graphics driver for a particular vgpu.
> >  *                              @dev: physical pci device structure on which
> >  *                              the vgpu should be created
> >  *                              @vm_uuid: uuid of the VM for which the vgpu is
> >  *                              intended
> >  *                              @instance: vgpu instance in that VM
> >  *                              @vgpu_id: represents the type of vgpu to be
> >  *                              created
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @vgpu_destroy:               Called to free resources in the graphics driver
> >  *                              for a vgpu instance of that VM.
> >  *                              @dev: physical pci device structure to which
> >  *                              this vgpu points.
> >  *                              @vm_uuid: uuid of the VM to which the vgpu
> >  *                              belongs.
> >  *                              @instance: vgpu instance in that VM
> >  *                              Returns integer: success (0) or error (< 0)
> >  *                              If the VM is running and vgpu_destroy is called,
> >  *                              that means the vGPU is being hot-unplugged.
> >  *                              Return an error if the VM is running and the
> >  *                              graphics driver doesn't support vgpu hotplug.
> >  * @vgpu_start:                 Called to initiate the vGPU initialization
> >  *                              process in the graphics driver when the VM
> >  *                              boots, before qemu starts.
> >  *                              @vm_uuid: VM's UUID which is booting.
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @vgpu_shutdown:              Called to tear down vGPU-related resources for
> >  *                              the VM.
> >  *                              @vm_uuid: VM's UUID which is shutting down.
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @read:                       Read emulation callback
> >  *                              @vdev: vgpu device structure
> >  *                              @buf: read buffer
> >  *                              @count: number of bytes to read
> >  *                              @address_space: specifies which address space
> >  *                              the request is for: pci_config_space, IO
> >  *                              register space or MMIO space.
> >  *                              Returns the number of bytes read on success,
> >  *                              or an error.
> >  * @write:                      Write emulation callback
> >  *                              @vdev: vgpu device structure
> >  *                              @buf: write buffer
> >  *                              @count: number of bytes to be written
> >  *                              @address_space: specifies which address space
> >  *                              the request is for: pci_config_space, IO
> >  *                              register space or MMIO space.
> >  *                              Returns the number of bytes written on
> >  *                              success, or an error.
> >  * @vgpu_set_irqs:              Called to convey the interrupt configuration
> >  *                              information that qemu has set.
> >  *                              @vdev: vgpu device structure
> >  *                              @flags, index, start, count and *data : same as
> >  *                              that of struct vfio_irq_set of
> >  *                              VFIO_DEVICE_SET_IRQS API. 
> >  *
> >  * A physical GPU that supports vGPU should be registered with the vgpu module
> >  * with a gpu_device_ops structure.
> >  */
> > 
> > struct gpu_device_ops {
> >         struct module   *owner;
> >         int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
> >         int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
> >                                uint32_t instance, uint32_t vgpu_id);
> >         int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
> >                                 uint32_t instance);
> >         int     (*vgpu_start)(uuid_le vm_uuid);
> >         int     (*vgpu_shutdown)(uuid_le vm_uuid);
> >         ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
> >                          uint32_t address_space, loff_t pos);
> >         ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
> >                          uint32_t address_space, loff_t pos);
> >         int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
> >                                  unsigned index, unsigned start, unsigned count,
> >                                  void *data);
> > 
> > };
> 
> 
> I wonder if it shouldn't be vfio-vgpu sub-drivers (ie, Intel and Nvidia)
> that register these ops with the main vfio-vgpu driver and they should
> also include a probe() function which allows us to associate a given
> vgpu device with a set of vendor ops.
> 
> 
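That could work -- for illustration, such a probe-based registration might look
vaguely like this (purely hypothetical, none of these names are in the proposal):

    /* hypothetical: vendor sub-driver registration with vfio-vgpu */
    struct vgpu_vendor_driver {
            const char                  *name;
            struct module               *owner;
            /* return 0 if this vendor driver can drive @vdev */
            int (*probe)(struct vgpu_device *vdev);
            const struct gpu_device_ops *ops;
    };

    int  vfio_vgpu_register_vendor_driver(struct vgpu_vendor_driver *drv);
    void vfio_vgpu_unregister_vendor_driver(struct vgpu_vendor_driver *drv);

vfio-vgpu would then walk the registered vendor drivers when a vgpu device
appears and bind it to the first one whose probe() succeeds.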
> > 
> > 2.2 Details for callbacks we haven't mentioned above.
> > ---------------------------------------------------------------------------------
> > 
> > vgpu_supported_config: allows the vendor driver to specify the supported vGPU
> >                        type/configuration
> > 
> > vgpu_create          : create a virtual GPU device, can be used for device hotplug.
> > 
> > vgpu_destroy         : destroy a virtual GPU device, can be used for device hotplug.
> > 
> > vgpu_start           : callback function to notify the vendor driver that the
> >                        vgpu device has come to life for a given virtual machine.
> > 
> > vgpu_shutdown        : callback function to notify the vendor driver to tear
> >                        down vGPU related resources for the VM.
> > 
> > read                 : callback to vendor driver to handle virtual device config
> >                        space or MMIO read access
> > 
> > write                : callback to vendor driver to handle virtual device config
> >                        space or MMIO write access
> > 
> > vgpu_set_irqs        : callback to vendor driver to pass along the interrupt
> >                        information for the target virtual device, then vendor
> >                        driver can inject interrupt into virtual machine for this
> >                        device.
> > 
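For reference, that (flags, index, start, count, data) tuple mirrors the
argument of the existing VFIO_DEVICE_SET_IRQS ioctl, which in the current VFIO
UAPI (include/uapi/linux/vfio.h) is:

    struct vfio_irq_set {
            __u32   argsz;
            __u32   flags;
            __u32   index;
            __u32   start;
            __u32   count;
            __u8    data[];
    };

So vgpu_set_irqs is essentially a pass-through of that ioctl to the vendor
driver.
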
> > 2.3 Potential additional virtual device configuration registration interface:
> > ---------------------------------------------------------------------------------
> > 
> > callback function to describe the MMAP behavior of the virtual GPU 
> > 
> > callback function to allow GPU vendor driver to provide PCI config space backing
> > memory.
> > 
> > 3. VGPU TYPE1 IOMMU
> > ==================================================================================
> > 
> > Here we are providing a TYPE1 IOMMU for vGPU which will basically keep track of
> > the <iova, hva, size, flag> tuples and save the QEMU mm for later reference.
> > 
> > You can find the quick/ugly implementation in the attached patch file, which is
> > actually just a simplified version of Alex's type1 IOMMU that skips the real
> > mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called.
> > 
> > We have thought about providing another vendor driver registration interface so
> > that such tracking information is sent to the vendor driver, which then uses the
> > QEMU mm to do get_user_pages / remap_pfn_range when required. After doing a
> > quick implementation within our driver, I noticed the following issues:
> > 
> > 1) It pushes OS/VFIO logic into the vendor driver, which will be a maintenance issue.
> > 
> > 2) Every driver vendor has to implement its own RB tree, instead of reusing the
> > existing common VFIO code (vfio_find/link/unlink_dma).
> > 
> > 3) IOMMU_UNMAP_DMA is expected to return the "unmapped bytes" to the caller/QEMU;
> > it is better not to have anything inside a vendor driver that the VFIO caller
> > immediately depends on.
> > 
> > Based on the above considerations, we decided to implement the DMA tracking logic
> > within the VGPU TYPE1 IOMMU code (ideally, this should be merged into the current
> > TYPE1 IOMMU code) and expose two symbols, one for MMIO mapping and one for page
> > translation and pinning.
> > 
> > Also, with a mmap MMIO interface between virtual and physical, a
> > para-virtualized guest driver can access its virtual MMIO without taking a mmap
> > fault hit, and we can support different MMIO sizes between the virtual and
> > physical device.
> > 
> > int vgpu_map_virtual_bar
> > (
> >     uint64_t virt_bar_addr,
> >     uint64_t phys_bar_addr,
> >     uint32_t len,
> >     uint32_t flags
> > )
> > 
> > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> 
> 
> Per the implementation provided, this needs to be implemented in the
> vfio device driver, not in the iommu interface.  Finding the DMA mapping
> of the device and replacing it is wrong.  It should be remapped at the
> vfio device file interface using vm_ops.
> 

So you are basically suggesting that we are going to take a mmap fault and
within that fault handler, we will go into vendor driver to look up the
"pre-registered" mapping and remap there.

Is my understanding correct?

> 
> > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > 
> > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > 
> > Still a lot needs to be added and modified, such as supporting multiple VMs and
> > multiple virtual devices, tracking the mapped / pinned regions within the VGPU
> > IOMMU kernel driver, error handling, roll-back, locked memory size per user, etc.
> 
> Particularly, handling of mapping changes is completely missing.  This
> cannot be a point in time translation, the user is free to remap
> addresses whenever they wish and device translations need to be updated
> accordingly.
> 

When you say "user", do you mean QEMU? Here, the memory for any DMA the guest
driver is going to launch will first be pinned within the VM and then
registered to QEMU (hence the IOMMU memory listener); eventually the pages
will be pinned by the GPU or DMA engine.

Since we are keeping the upper-level code the same, consider the passthrough
case, where the GPU has already put the real IOVA into its PTEs: I don't see
how QEMU can change that mapping without causing an IOMMU fault on an active
DMA device.
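
(For context, the per-mapping tracking node this scheme implies is basically
the classic type1 shape -- a simplified sketch, field names illustrative:

    /* one node per IOMMU_MAP_DMA call, kept in an RB tree keyed by iova */
    struct vgpu_dma {
            struct rb_node     node;
            dma_addr_t         iova;    /* guest IOVA */
            unsigned long      vaddr;   /* QEMU HVA */
            size_t             size;
            int                prot;    /* IOMMU_READ / IOMMU_WRITE */
            struct mm_struct  *mm;      /* saved QEMU mm */
    };

vgpu_dma_do_translate() would walk this tree to turn guest pfns into pinned
host pages on demand.)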

> 
> > 4. Modules
> > ==================================================================================
> > 
> > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> > 
> > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
> >                            TYPE1 v1 and v2 interface. 
> 
> Depending on how intrusive it is, this can possibly be done within the
> existing type1 driver.  Either that or we can split out common code for
> use by a separate module.
> 
> > vgpu.ko                  - provides the registration interface and virtual
> >                            device VFIO access.
> > 
> > 5. QEMU note
> > ==================================================================================
> > 
> > To allow us to focus on the VGPU kernel driver prototyping, we have introduced a
> > new VFIO class - vgpu - inside QEMU, so we don't have to change the existing
> > vfio/pci.c file and can use it as a reference for our implementation. It is
> > basically just a quick copy & paste from vfio/pci.c to meet our needs.
> > 
> > Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> > class, and probably the only thing required is to have a new way to discover the
> > device.
> > 
> > 6. Examples
> > ==================================================================================
> > 
> > On this server, we have two NVIDIA M60 GPUs.
> > 
> > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > 
> > After nvidia.ko gets initialized, we can query the supported vGPU types by
> > reading "vgpu_supported_types" as follows:
> > 
> > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
> > 11:GRID M60-0B
> > 12:GRID M60-0Q
> > 13:GRID M60-1B
> > 14:GRID M60-1Q
> > 15:GRID M60-2B
> > 16:GRID M60-2Q
> > 17:GRID M60-4Q
> > 18:GRID M60-8Q
> > 
> > For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> > like to create a "GRID M60-4Q" vGPU on it for that VM.
> > 
> > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> > 
> > Note: the number 0 here is the vGPU device index. So far the change has not been
> > tested with multiple vgpu devices yet, but we will support that.
> > 
> > At this moment, if you query the "vgpu_supported_types" it will still show all
> > supported virtual GPU types as no virtual GPU resource is committed yet.
> > 
> > Starting VM:
> > 
> > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> > 
> > then, the supported vGPU type query will return:
> > 
> > [root@cjia-vgx-kvm /home/cjia]$
> > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > 17:GRID M60-4Q
> > 
> > So vgpu_supported_config needs to be called whenever a new virtual device gets
> > created, as the underlying HW might limit the supported types while there are
> > existing VMs running.
> > 
> > Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown will
> > inform the GPU vendor driver to clean up resources.
> > 
> > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> > device sysfs.
> 
> 
> I'd like to hear Intel's thoughts on this interface.  Are there
> different vgpu capacities or priority classes that would necessitate
> different types of vgpus on Intel?
> 
> I think there are some gaps in translating from named vgpu types to
> indexes here, along with my previous mention of the UUID/set oddity.
> 
> Does Intel have a need for start and shutdown interfaces?
> 
> Neo, wasn't there at some point information about how many of each type
> could be supported through these interfaces?  How does a user know their
> capacity limits?
> 

Thanks for reminding me of that; I think we probably forgot to include that
*important* information in the output of "vgpu_supported_types".

Regarding capacity, we can provide the frame buffer size as part of the
"vgpu_supported_types" output as well; I would imagine those will eventually
show up in the OpenStack management interface or virt-mgr.

Basically, yes, there would be a separate column showing the number of instances
you can create for each type of VGPU on a specific physical GPU.

Thanks,
Neo


> Thanks,
> Alex
> 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 22:27                                 ` [Qemu-devel] " Alex Williamson
@ 2016-01-26 22:39                                   ` Tian, Kevin
  -1 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2016-01-26 22:39 UTC (permalink / raw)
  To: Alex Williamson, Yang Zhang, Song, Jike
  Cc: Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm,
	qemu-devel, igvt-g@lists.01.org, Neo Jia

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, January 27, 2016 6:27 AM
> 
> On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Wednesday, January 27, 2016 6:08 AM
> > >
> > > > > > >
> > > > > >
> > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > > > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > > > > path would be unnecessarily long with obvious performance impact. We
> > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > passthrough support), or just hide the region alternatively (let KVMGT
> > > > > > to handle I/O emulation itself like today).
> > > > >
> > > > > That sounds like a future optimization TBH.  There's very strict
> > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > of accelerators, there might be cases where userspace wants to see those
> > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > take shortcuts to provide such direct access.  Thanks,
> > > > >
> > > >
> > > > But we have to balance such debugging flexibility against acceptable
> > > > performance. To me the latter is more important, otherwise there'd be no real
> > > > usage of this technique, while for debugging there are other alternatives (e.g.
> > > > ftrace). Consider some extreme case with 100k traps/second and then see
> > > > how much impact a 2-3x longer emulation path can bring...
> > >
> > > Are you jumping to the conclusion that it cannot be done with proper
> > > layering in place?  Performance is important, but it's not an excuse to
> > > abandon designing interfaces between independent components.  Thanks,
> > >
> >
> > The two are not contradictory. My point is to remove the unnecessarily long
> > trip where possible. After another thought, yes, we can reuse the existing
> > read/write flags:
> > 	- KVMGT will expose a private control variable indicating whether in-kernel
> > delivery is required;
> 
> But in-kernel delivery is never *required*.  Wouldn't userspace want to
> deliver in-kernel any time it possibly could?
> 
> > 	- when the variable is true, KVMGT will register in-kernel MMIO
> > emulation callbacks then VM MMIO request will be delivered to KVMGT
> > directly;
> > 	- when the variable is false, KVMGT will not register anything.
> > VM MMIO request will then be delivered to Qemu and then ioread/write
> > will be used to finally reach KVMGT emulation logic;
> 
> No, that means the interface is entirely dependent on a backdoor through
> KVM.  Why can't userspace (QEMU) do something like register an MMIO
> region with KVM handled via a provided file descriptor and offset,
> couldn't KVM then call the file ops without a kernel exit?  Thanks,
> 

Could you elaborate on this thought? If it can achieve the purpose w/o
a kernel exit, we can definitely adapt to it. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 22:39                                   ` [Qemu-devel] " Tian, Kevin
@ 2016-01-26 22:56                                     ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-26 22:56 UTC (permalink / raw)
  To: Tian, Kevin, Yang Zhang, Song, Jike
  Cc: Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan, Shuai, kvm,
	qemu-devel, igvt-g@lists.01.org, Neo Jia

On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, January 27, 2016 6:27 AM
> > 
> > On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, January 27, 2016 6:08 AM
> > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > > > > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > > > > > path would be unnecessarily long with obvious performance impact. We
> > > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > > passthrough support), or just hide the region alternatively (let KVMGT
> > > > > > > to handle I/O emulation itself like today).
> > > > > > 
> > > > > > That sounds like a future optimization TBH.  There's very strict
> > > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > > of accelerators, there might be cases where userspace wants to see those
> > > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > > take shortcuts to provide such direct access.  Thanks,
> > > > > > 
> > > > > 
> > > > > But we have to balance such debugging flexibility against acceptable
> > > > > performance. To me the latter is more important, otherwise there'd be no real
> > > > > usage of this technique, while for debugging there are other alternatives (e.g.
> > > > > ftrace). Consider some extreme case with 100k traps/second and then see
> > > > > how much impact a 2-3x longer emulation path can bring...
> > > > 
> > > > Are you jumping to the conclusion that it cannot be done with proper
> > > > layering in place?  Performance is important, but it's not an excuse to
> > > > abandon designing interfaces between independent components.  Thanks,
> > > > 
> > > 
> > > The two are not contradictory. My point is to remove the unnecessarily long
> > > trip where possible. After another thought, yes, we can reuse the existing
> > > read/write flags:
> > >  	- KVMGT will expose a private control variable indicating whether in-kernel
> > > delivery is required;
> > 
> > But in-kernel delivery is never *required*.  Wouldn't userspace want to
> > deliver in-kernel any time it possibly could?
> > 
> > >  	- when the variable is true, KVMGT will register in-kernel MMIO
> > > emulation callbacks then VM MMIO request will be delivered to KVMGT
> > > directly;
> > >  	- when the variable is false, KVMGT will not register anything.
> > > VM MMIO request will then be delivered to Qemu and then ioread/write
> > > will be used to finally reach KVMGT emulation logic;
> > 
> > No, that means the interface is entirely dependent on a backdoor through
> > KVM.  Why can't userspace (QEMU) do something like register an MMIO
> > region with KVM handled via a provided file descriptor and offset,
> > couldn't KVM then call the file ops without a kernel exit?  Thanks,
> > 
> 
> Could you elaborate on this thought? If it can achieve the purpose w/o
> a kernel exit, we can definitely adapt to it. :-)

I only thought of it when replying to the last email and have been doing
some research, but we already do quite a bit of synchronization through
file descriptors.  The kvm-vfio pseudo device uses a group file
descriptor to ensure a user has access to a group, allowing some degree
of interaction between modules.  Eventfds and irqfds already make use of
f_ops on file descriptors to poke data.  So, if KVM had information that
an MMIO region was backed by a file descriptor for which it already has
a reference via fdget() (and verified access rights and whatnot), then
it ought to be a simple matter to get to f_ops->read/write knowing the
base offset of that MMIO region.  Perhaps it could even simply use
__vfs_read/write().  Then we've got a proper reference to the file
descriptor for ownership purposes and we've transparently jumped across
modules without any implicit knowledge of the other end.  Could it work?
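
Just to make that concrete, the registration could look vaguely like the
existing ioeventfd interface (completely hypothetical, field names made up):

    /* tell KVM that an MMIO region is backed by a file descriptor + offset */
    struct kvm_file_mmio {
            __u64   addr;    /* guest physical address of the region */
            __u64   len;     /* length of the region */
            __s32   fd;      /* e.g. a vfio device fd KVM holds via fdget() */
            __u64   offset;  /* offset of the region within that fd */
    };

On a guest access, KVM would compute the offset within the fd and call the
fd's f_ops->read/write directly, never exiting to userspace.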
Thanks,

Alex


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 22:28                       ` [Qemu-devel] " Neo Jia
@ 2016-01-26 23:30                         ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-26 23:30 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Kirti Wankhede

On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > 1.1 Under per-physical device sysfs:
> > > ----------------------------------------------------------------------------------
> > >  
> > > vgpu_supported_types - RO, list the current supported virtual GPU types and its
> > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > "vgpu_supported_types".
> > >                             
> > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > >  
> > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > > target physical GPU
> > 
> > 
> > I've noted in previous discussions that we need to separate user policy
> > from kernel policy here, the kernel policy should not require a "VM
> > UUID".  A UUID simply represents a set of one or more devices and an
> > index picks the device within the set.  Whether that UUID matches a VM
> > or is independently used is up to the user policy when creating the
> > device.
> > 
> > Personally I'd also prefer to get rid of the concept of indexes within a
> > UUID set of devices and instead have each device be independent.  This
> > seems to be an imposition of the nvidia implementation onto the kernel
> > interface design.
> > 
> 
> Hi Alex,
> 
> I agree with you that we should not put the UUID concept into a kernel API. At
> this point (without any prototyping), I am thinking of using a list of virtual
> devices instead of a UUID.

Hi Neo,

A UUID is a perfectly fine name, so long as we let it be just a UUID and
not the UUID matching some specific use case.

> > >  
> > > int vgpu_map_virtual_bar
> > > (
> > >     uint64_t virt_bar_addr,
> > >     uint64_t phys_bar_addr,
> > >     uint32_t len,
> > >     uint32_t flags
> > > )
> > >  
> > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > 
> > 
> > Per the implementation provided, this needs to be implemented in the
> > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > of the device and replacing it is wrong.  It should be remapped at the
> > vfio device file interface using vm_ops.
> > 
> 
> So you are basically suggesting that we are going to take an mmap fault and,
> within that fault handler, go into the vendor driver to look up the
> "pre-registered" mapping and remap there.
> 
> Is my understanding correct?

Essentially, hopefully the vendor driver will have already registered
the backing for the mmap prior to the fault, but either way could work.
I think the key, though, is that you want to remap it onto the vma
accessing the vfio device file, not scan it out of an IOVA mapping
that might be dynamic and do a vma lookup based on the point-in-time
mapping of the BAR.  The latter doesn't give me much confidence that
mappings couldn't change, while the former should be a one-time fault.

In case it's not clear to folks at Intel, the purpose of this is that a
vGPU may directly map a segment of the physical GPU MMIO space, but we
may not know what segment that is at setup time, when QEMU does an mmap
of the vfio device file descriptor.  The thought is that we can create
an invalid mapping when QEMU calls mmap(), knowing that it won't be
accessed until later, then we can fault in the real mmap on demand.  Do
you need anything similar?
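
For illustration only, the one-time-fault shape could look something like
this; struct vgpu_device and vgpu_bar_backing() are made-up placeholders for
whatever the vendor driver provides, and the vma is assumed to have been set
up VM_IO | VM_PFNMAP at mmap() time:

	static int vgpu_mmio_fault(struct vm_area_struct *vma,
				   struct vm_fault *vmf)
	{
		struct vgpu_device *vgpu = vma->vm_private_data;
		unsigned long pgoff = vmf->pgoff - vma->vm_pgoff;
		unsigned long base_pfn;

		/* made-up hook: pfn of the physical MMIO segment currently
		 * backing this virtual BAR */
		base_pfn = vgpu_bar_backing(vgpu);
		if (!base_pfn)
			return VM_FAULT_SIGBUS;

		if (vm_insert_pfn(vma, (unsigned long)vmf->virtual_address,
				  base_pfn + pgoff))
			return VM_FAULT_SIGBUS;

		return VM_FAULT_NOPAGE;
	}

	static const struct vm_operations_struct vgpu_mmio_vm_ops = {
		.fault = vgpu_mmio_fault,
	};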

> > 
> > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > >  
> > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > >  
> > > Still a lot to be added and modified, such as supporting multiple VMs and 
> > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
> > > kernel driver, error handling, roll-back and locked memory size per user, etc. 
> > 
> > Particularly, handling of mapping changes is completely missing.  This
> > cannot be a point in time translation, the user is free to remap
> > addresses whenever they wish and device translations need to be updated
> > accordingly.
> > 
> 
> When you say "user", do you mean QEMU?

vfio is a generic userspace driver interface, QEMU is a very, very
important user of the interface, but not the only user.  So for this
conversation, we're mostly talking about QEMU as the user, but we should
be careful about assuming QEMU is the only user.

> Here, whatever DMA the guest driver
> is going to launch will first be pinned within the VM and then
> registered to QEMU (and thus to the IOMMU memory listener); eventually the pages
> will be pinned by the GPU or DMA engine.
> 
> Since we are keeping the upper-level code the same, consider the passthru case,
> where the GPU has already put the real IOVA into its PTEs: I don't know how QEMU
> can change that mapping without causing an IOMMU fault on an active DMA device.

For the virtual BAR mapping above, it's easy to imagine that mapping a
BAR to a given address is at the guest discretion, it may be mapped and
unmapped, it may be mapped to different addresses at different points in
time, the guest BIOS may choose to map it at yet another address, etc.
So if somehow we were trying to setup a mapping for peer-to-peer, there
are lots of ways that IOVA could change.  But even with RAM, we can
support memory hotplug in a VM.  What was once a DMA target may be
removed or may now be backed by something else.  Chipset configuration
on the emulated platform may change how guest physical memory appears
and that might change between VM boots.

Currently with physical device assignment the memory listener watches
for both maps and unmaps and updates the iotlb to match.  Just like real
hardware doing these same sorts of things, we rely on the guest to stop
using memory that's going to be moved as a DMA target prior to moving
it.
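
Purely as an illustration of the bookkeeping involved (none of these names
exist), a vgpu iommu backend would need an unmap path roughly like:

	/* On a userspace DMA unmap, find everything pinned in the IOVA
	 * range, invalidate the device translations, then unpin. */
	static void vgpu_iommu_unmap(struct vgpu_iommu *iommu,
				     dma_addr_t iova, size_t size)
	{
		struct vgpu_pinned_range *range, *tmp;

		mutex_lock(&iommu->lock);
		list_for_each_entry_safe(range, tmp, &iommu->pinned, next) {
			if (range->iova >= iova + size ||
			    range->iova + range->size <= iova)
				continue;	/* no overlap */
			vgpu_invalidate_device_tlb(range);	/* made-up hooks */
			vgpu_unpin_pages(range);
			list_del(&range->next);
			kfree(range);
		}
		mutex_unlock(&iommu->lock);
	}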

> > > 4. Modules
> > > ==================================================================================
> > >  
> > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> > >  
> > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
> > >                            TYPE1 v1 and v2 interface. 
> > 
> > Depending on how intrusive it is, this can possibly by done within the
> > existing type1 driver.  Either that or we can split out common code for
> > use by a separate module.
> > 
> > > vgpu.ko                  - provide registration interface and virtual device
> > >                            VFIO access.
> > >  
> > > 5. QEMU note
> > > ==================================================================================
> > >  
> > > To allow us to focus on VGPU kernel driver prototyping, we have introduced a new VFIO
> > > class - vgpu - inside QEMU, so we don't have to change the existing vfio/pci.c file and
> > > can use it as a reference for our implementation. It is basically just a quick copy & paste
> > > from vfio/pci.c to quickly meet our needs.
> > >  
> > > Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> > > class, and probably the only thing required is to have a new way to discover the
> > > device.
> > >  
> > > 6. Examples
> > > ==================================================================================
> > >  
> > > On this server, we have two NVIDIA M60 GPUs.
> > >  
> > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > >  
> > > After nvidia.ko gets initialized, we can query the supported vGPU types by
> > > reading "vgpu_supported_types" as follows:
> > >  
> > > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
> > > 11:GRID M60-0B
> > > 12:GRID M60-0Q
> > > 13:GRID M60-1B
> > > 14:GRID M60-1Q
> > > 15:GRID M60-2B
> > > 16:GRID M60-2Q
> > > 17:GRID M60-4Q
> > > 18:GRID M60-8Q
> > >  
> > > For example, if the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818 and we would
> > > like to create a "GRID M60-4Q" VM on it:
> > >  
> > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" >
> > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> > >  
> > > Note: the number 0 here is the vGPU device index. So far the change has not been
> > > tested with multiple vgpu devices yet, but we will support that.
> > >  
> > > At this moment, if you query "vgpu_supported_types" it will still show all
> > > supported virtual GPU types, as no virtual GPU resources are committed yet.
> > >  
> > > Starting VM:
> > >  
> > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> > >  
> > > then, the supported vGPU type query will return:
> > >  
> > > [root@cjia-vgx-kvm /home/cjia]$
> > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > 17:GRID M60-4Q
> > >  
> > > So vgpu_supported_config needs to be called whenever a new virtual device gets
> > > created, as the underlying HW might limit the supported types if there are
> > > any existing VMs running.
> > >  
> > > Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown will
> > > inform the GPU vendor driver to clean up its resources.
> > >  
> > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> > > device sysfs.
> > 
> > 
> > I'd like to hear Intel's thoughts on this interface.  Are there
> > different vgpu capacities or priority classes that would necessitate
> > different types of vgpus on Intel?
> > 
> > I think there are some gaps in translating from named vgpu types to
> > indexes here, along with my previous mention of the UUID/set oddity.
> > 
> > Does Intel have a need for start and shutdown interfaces?
> > 
> > Neo, wasn't there at some point information about how many of each type
> > could be supported through these interfaces?  How does a user know their
> > capacity limits?
> > 
> 
> Thanks for reminding me of that; I think we probably forgot to put that *important*
> information in the output of "vgpu_supported_types".
> 
> Regarding the capacity, we can provide the frame buffer size as part of the
> "vgpu_supported_types" output as well; I would imagine those will eventually
> show up in the OpenStack management interface or virt-manager.
> 
> Basically, yes, there would be a separate column showing the number of instances you
> can create for each type of VGPU on a specific physical GPU.

Ok, Thanks,

Alex


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 14:05                   ` [Qemu-devel] " Yang Zhang
@ 2016-01-27  0:06                     ` Jike Song
  -1 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-27  0:06 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Alex Williamson, Tian, Kevin, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Neo Jia

On 01/26/2016 10:05 PM, Yang Zhang wrote:
> On 2016/1/26 15:41, Jike Song wrote:
>
>> We will need to extend:
>>
>> 	- VFIO_DEVICE_GET_REGION_INFO
>>
>>
>> a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
>> should be trapped instead of being mmap-ed.
> 
> I may not be in the context, but I am curious how the DONT_MAP is handled in
> the vfio driver. Since there are no real MMIO maps into the region, I
> suppose access to the region should be handled by the vgpu in the i915
> driver, but currently most of the mmio accesses are handled by Qemu.
>

Hi Yang,

MMIO accesses are supposed to be handled in kernel, without vm-exiting
to QEMU, similar to in-kernel irqchip :)
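
For reference, the existing region info already gives userspace a way to
tell the two cases apart: a region that does not advertise
VFIO_REGION_INFO_FLAG_MMAP must be accessed through read()/write() on the
device fd, which is close to what the proposed DONT_MAP would express. A
rough QEMU-side sketch:

	/* decide how to expose one vfio region, based on its info flags */
	static int setup_region(int device_fd, int idx)
	{
		struct vfio_region_info info = {
			.argsz = sizeof(info),
			.index = idx,
		};

		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
			return -errno;

		if (info.flags & VFIO_REGION_INFO_FLAG_MMAP) {
			/* fast path: map the BAR directly for the guest */
			void *p = mmap(NULL, info.size,
				       PROT_READ | PROT_WRITE, MAP_SHARED,
				       device_fd, info.offset);
			if (p == MAP_FAILED)
				return -errno;
		} else {
			/* trapped path: every access goes through
			 * read()/write() on the device fd, reaching the
			 * in-kernel vgpu emulation */
		}
		return 0;
	}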

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27  0:06                     ` [Qemu-devel] " Jike Song
@ 2016-01-27  1:34                       ` Yang Zhang
  -1 siblings, 0 replies; 118+ messages in thread
From: Yang Zhang @ 2016-01-27  1:34 UTC (permalink / raw)
  To: Jike Song
  Cc: Alex Williamson, Tian, Kevin, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Neo Jia

On 2016/1/27 8:06, Jike Song wrote:
> On 01/26/2016 10:05 PM, Yang Zhang wrote:
>> On 2016/1/26 15:41, Jike Song wrote:
>>
>>> We will need to extend:
>>>
>>> 	- VFIO_DEVICE_GET_REGION_INFO
>>>
>>>
>>> a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
>>> should be trapped instead of being mmap-ed.
>>
>> I may not be in the context, but I am curious how the DONT_MAP is handled in
>> the vfio driver. Since there are no real MMIO maps into the region, I
>> suppose access to the region should be handled by the vgpu in the i915
>> driver, but currently most of the mmio accesses are handled by Qemu.
>>
>
> Hi Yang,
>
> MMIO accesses are supposed to be handled in kernel, without vm-exiting
> to QEMU, similar to in-kernel irqchip :)

The question is that current vfio doesn't support it. The long discussion
between Alex and Kevin is what I am following to understand how KVMGT works
under the vfio framework.


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 22:56                                     ` [Qemu-devel] " Alex Williamson
@ 2016-01-27  1:47                                       ` Jike Song
  -1 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-27  1:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Yang Zhang, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Neo Jia

On 01/27/2016 06:56 AM, Alex Williamson wrote:
> On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>> Sent: Wednesday, January 27, 2016 6:27 AM
>>>  
>>> On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>> Sent: Wednesday, January 27, 2016 6:08 AM
>>>>>  
>>>>>>>>>  
>>>>>>>>  
>>>>>>>> Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
>>>>>>>> KVM, so VM MMIO access will be forwarded to KVMGT directly for
>>>>>>>> emulation in kernel. If we reuse above R/W flags, the whole emulation
>>>>>>>> path would be unnecessarily long with obvious performance impact. We
>>>>>>>> either need a new flag here to indicate in-kernel emulation (bias from
>>>>>>>> passthrough support), or just hide the region alternatively (let KVMGT
>>>>>>>> to handle I/O emulation itself like today).
>>>>>>>  
>>>>>>> That sounds like a future optimization TBH.  There's very strict
>>>>>>> layering between vfio and kvm.  Physical device assignment could make
>>>>>>> use of it as well, avoiding a round trip through userspace when an
>>>>>>> ioread/write would do.  Userspace also needs to orchestrate those kinds
>>>>>>> of accelerators, there might be cases where userspace wants to see those
>>>>>>> transactions for debugging or manipulating the device.  We can't simply
>>>>>>> take shortcuts to provide such direct access.  Thanks,
>>>>>>>  
>>>>>>  
>>>>>> But we have to balance such debugging flexibility and acceptable performance.
>>>>>> To me the latter one is more important otherwise there'd be no real usage
>>>>>> around this technique, while for debugging there are other alternative (e.g.
>>>>>> ftrace) Consider some extreme case with 100k traps/second and then see
>>>>>> how much impact a 2-3x longer emulation path can bring...
>>>>>  
>>>>> Are you jumping to the conclusion that it cannot be done with proper
>>>>> layering in place?  Performance is important, but it's not an excuse to
>>>>> abandon designing interfaces between independent components.  Thanks,
>>>>>  
>>>>  
>>>> Two are not controversial. My point is to remove unnecessary long trip
>>>> as possible. After another thought, yes we can reuse existing read/write
>>>> flags:
>>>>  	- KVMGT will expose a private control variable whether in-kernel
>>>> delivery is required;
>>>  
>>> But in-kernel delivery is never *required*.  Wouldn't userspace want to
>>> deliver in-kernel any time it possibly could?
>>>  
>>>>  	- when the variable is true, KVMGT will register in-kernel MMIO
>>>> emulation callbacks then VM MMIO request will be delivered to KVMGT
>>>> directly;
>>>>  	- when the variable is false, KVMGT will not register anything.
>>>> VM MMIO request will then be delivered to Qemu and then ioread/write
>>>> will be used to finally reach KVMGT emulation logic;
>>>  
>>> No, that means the interface is entirely dependent on a backdoor through
>>> KVM.  Why can't userspace (QEMU) do something like register an MMIO
>>> region with KVM handled via a provided file descriptor and offset,
>>> couldn't KVM then call the file ops without a kernel exit?  Thanks,
>>>  
>>  
>> Could you elaborate this thought? If it can achieve the purpose w/o
>> a kernel exit definitely we can adapt to it. :-)
> 
> I only thought of it when replying to the last email and have been doing
> some research, but we already do quite a bit of synchronization through
> file descriptors.  The kvm-vfio pseudo device uses a group file
> descriptor to ensure a user has access to a group, allowing some degree
> of interaction between modules.  Eventfds and irqfds already make use of
> f_ops on file descriptors to poke data.  So, if KVM had information that
> an MMIO region was backed by a file descriptor for which it already has
> a reference via fdget() (and verified access rights and whatnot), then
> it ought to be a simple matter to get to f_ops->read/write knowing the
> base offset of that MMIO region.  Perhaps it could even simply use
> __vfs_read/write().  Then we've got a proper reference to the file
> descriptor for ownership purposes and we've transparently jumped across
> modules without any implicit knowledge of the other end.  Could it work?

This is OK for KVMGT; going from fops to the vgpu device-model would always be simple.
The only question is, how is the KVM hypervisor supposed to get the fd on VM-exitings?

Copy-and-pasting the current implementation of vcpu_mmio_write(), it seems
nothing but the GPA and len are provided:

	static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
				   const void *v)
	{
		int handled = 0;
		int n;

		do {
			n = min(len, 8);
			if (!(vcpu->arch.apic &&
			      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
			    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
				break;
			handled += n;
			addr += n;
			len -= n;
			v += n;
		} while (len);

		return handled;
	}

If we back a GPA range with an fd, will this also be a 'backdoor'?


> Thanks,
> 
> Alex
> 

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27  1:34                       ` [Qemu-devel] " Yang Zhang
@ 2016-01-27  1:51                         ` Jike Song
  -1 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-27  1:51 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Alex Williamson, Tian, Kevin, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Neo Jia

On 01/27/2016 09:34 AM, Yang Zhang wrote:
> On 2016/1/27 8:06, Jike Song wrote:
>> On 01/26/2016 10:05 PM, Yang Zhang wrote:
>>> On 2016/1/26 15:41, Jike Song wrote:
>>>
>>>> We will need to extend:
>>>>
>>>> 	- VFIO_DEVICE_GET_REGION_INFO
>>>>
>>>>
>>>> a) adding a flag: DONT_MAP. For example, the MMIO of vgpu
>>>> should be trapped instead of being mmap-ed.
>>>
>>> I may not be in the context, but I am curious how the DONT_MAP is handled in
>>> the vfio driver. Since there are no real MMIO maps into the region, I
>>> suppose access to the region should be handled by the vgpu in the i915
>>> driver, but currently most of the mmio accesses are handled by Qemu.
>>>
>>
>> Hi Yang,
>>
>> MMIO accesses are supposed to be handled in kernel, without vm-exiting
>> to QEMU, similar to in-kernel irqchip :)
> 
> The question is that current vfio doesn't support it. The long discussion
> between Alex and Kevin is what I am following to understand how KVMGT works
> under the vfio framework.
>

Yes, it's good to expose this earlier.

Previously Kevin and I thought KVMGT was free to register an iodev,
responsible for r/w of an MMIO range, with the KVM hypervisor directly. If
this is not acceptable then we will have to figure out an alternative.

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 22:56                                     ` [Qemu-devel] " Alex Williamson
@ 2016-01-27  1:52                                       ` Yang Zhang
  -1 siblings, 0 replies; 118+ messages in thread
From: Yang Zhang @ 2016-01-27  1:52 UTC (permalink / raw)
  To: Alex Williamson, Tian, Kevin, Song, Jike
  Cc: Ruan, Shuai, Neo Jia, kvm, igvt-g@lists.01.org, qemu-devel,
	Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan

On 2016/1/27 6:56, Alex Williamson wrote:
> On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>> Sent: Wednesday, January 27, 2016 6:27 AM
>>>
>>> On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>> Sent: Wednesday, January 27, 2016 6:08 AM
>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
>>>>>>>> KVM, so VM MMIO access will be forwarded to KVMGT directly for
>>>>>>>> emulation in kernel. If we reuse above R/W flags, the whole emulation
>>>>>>>> path would be unnecessarily long with obvious performance impact. We
>>>>>>>> either need a new flag here to indicate in-kernel emulation (bias from
>>>>>>>> passthrough support), or just hide the region alternatively (let KVMGT
>>>>>>>> to handle I/O emulation itself like today).
>>>>>>>
>>>>>>> That sounds like a future optimization TBH.  There's very strict
>>>>>>> layering between vfio and kvm.  Physical device assignment could make
>>>>>>> use of it as well, avoiding a round trip through userspace when an
>>>>>>> ioread/write would do.  Userspace also needs to orchestrate those kinds
>>>>>>> of accelerators, there might be cases where userspace wants to see those
>>>>>>> transactions for debugging or manipulating the device.  We can't simply
>>>>>>> take shortcuts to provide such direct access.  Thanks,
>>>>>>>
>>>>>>
>>>>>> But we have to balance such debugging flexibility and acceptable performance.
>>>>>> To me the latter one is more important otherwise there'd be no real usage
>>>>>> around this technique, while for debugging there are other alternative (e.g.
>>>>>> ftrace) Consider some extreme case with 100k traps/second and then see
>>>>>> how much impact a 2-3x longer emulation path can bring...
>>>>>
>>>>> Are you jumping to the conclusion that it cannot be done with proper
>>>>> layering in place?  Performance is important, but it's not an excuse to
>>>>> abandon designing interfaces between independent components.  Thanks,
>>>>>
>>>>
>>>> Two are not controversial. My point is to remove unnecessary long trip
>>>> as possible. After another thought, yes we can reuse existing read/write
>>>> flags:
>>>>   	- KVMGT will expose a private control variable whether in-kernel
>>>> delivery is required;
>>>
>>> But in-kernel delivery is never *required*.  Wouldn't userspace want to
>>> deliver in-kernel any time it possibly could?
>>>
>>>>   	- when the variable is true, KVMGT will register in-kernel MMIO
>>>> emulation callbacks then VM MMIO request will be delivered to KVMGT
>>>> directly;
>>>>   	- when the variable is false, KVMGT will not register anything.
>>>> VM MMIO request will then be delivered to Qemu and then ioread/write
>>>> will be used to finally reach KVMGT emulation logic;
>>>
>>> No, that means the interface is entirely dependent on a backdoor through
>>> KVM.  Why can't userspace (QEMU) do something like register an MMIO
>>> region with KVM handled via a provided file descriptor and offset,
>>> couldn't KVM then call the file ops without a kernel exit?  Thanks,
>>>
>>
>> Could you elaborate this thought? If it can achieve the purpose w/o
>> a kernel exit definitely we can adapt to it. :-)
>
> I only thought of it when replying to the last email and have been doing
> some research, but we already do quite a bit of synchronization through
> file descriptors.  The kvm-vfio pseudo device uses a group file
> descriptor to ensure a user has access to a group, allowing some degree
> of interaction between modules.  Eventfds and irqfds already make use of
> f_ops on file descriptors to poke data.  So, if KVM had information that
> an MMIO region was backed by a file descriptor for which it already has
> a reference via fdget() (and verified access rights and whatnot), then
> it ought to be a simple matter to get to f_ops->read/write knowing the
> base offset of that MMIO region.  Perhaps it could even simply use
> __vfs_read/write().  Then we've got a proper reference to the file
> descriptor for ownership purposes and we've transparently jumped across
> modules without any implicit knowledge of the other end.  Could it work?
> Thanks,

ioeventfd is a good example.
As I know, all accesses to the MMIO of the IGD are trapped into the kernel.
Also, the pci config space is emulated by Qemu, and the same goes for VGA,
which is emulated too. I guess the interrupt is also emulated (this means we
cannot benefit from VT-d posted interrupts). Most importantly, KVMGT doesn't
require a hardware IOMMU. As we know, VFIO is for direct device assignment,
but most things for KVMGT are emulated, so why should we use VFIO for it?
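
(For reference, the ioeventfd mentioned above is the existing KVM mechanism
that lets a guest write complete in-kernel and just signal an eventfd,
without an exit to userspace. A minimal usage sketch, assuming vm_fd and a
guest-physical address gpa:)

	struct kvm_ioeventfd ioev = {
		.addr  = gpa,			/* MMIO address to watch */
		.len   = 4,
		.fd    = eventfd(0, EFD_CLOEXEC),
		.flags = 0,			/* MMIO, no datamatch */
	};

	/* guest writes to [gpa, gpa+4) now just signal the eventfd */
	if (ioctl(vm_fd, KVM_IOEVENTFD, &ioev))
		perror("KVM_IOEVENTFD");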

-- 
best regards
yang

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27  1:47                                       ` [Qemu-devel] " Jike Song
@ 2016-01-27  3:07                                         ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-27  3:07 UTC (permalink / raw)
  To: Jike Song
  Cc: Yang Zhang, Ruan, Shuai, Tian, Kevin, Neo Jia, kvm,
	igvt-g@lists.01.org, qemu-devel, Gerd Hoffmann, Paolo Bonzini,
	Lv, Zhiyuan

On Wed, 2016-01-27 at 09:47 +0800, Jike Song wrote:
> On 01/27/2016 06:56 AM, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, January 27, 2016 6:27 AM
> > > >  
> > > > On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Wednesday, January 27, 2016 6:08 AM
> > > > > >  
> > > > > > > > > >  
> > > > > > > > >  
> > > > > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > > > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > > > > > > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > > > > > > > path would be unnecessarily long with obvious performance impact. We
> > > > > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > > > > passthrough support), or just hide the region alternatively (let KVMGT
> > > > > > > > > to handle I/O emulation itself like today).
> > > > > > > >  
> > > > > > > > That sounds like a future optimization TBH.  There's very strict
> > > > > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > > > > of accelerators, there might be cases where userspace wants to see those
> > > > > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > > > > take shortcuts to provide such direct access.  Thanks,
> > > > > > > >  
> > > > > > >  
> > > > > > > But we have to balance such debugging flexibility and acceptable performance.
> > > > > > > To me the latter one is more important otherwise there'd be no real usage
> > > > > > > around this technique, while for debugging there are other alternative (e.g.
> > > > > > > ftrace) Consider some extreme case with 100k traps/second and then see
> > > > > > > how much impact a 2-3x longer emulation path can bring...
> > > > > >  
> > > > > > Are you jumping to the conclusion that it cannot be done with proper
> > > > > > layering in place?  Performance is important, but it's not an excuse to
> > > > > > abandon designing interfaces between independent components.  Thanks,
> > > > > >  
> > > > >  
> > > > > Two are not controversial. My point is to remove unnecessary long trip
> > > > > as possible. After another thought, yes we can reuse existing read/write
> > > > > flags:
> > > > >  	- KVMGT will expose a private control variable whether in-kernel
> > > > > delivery is required;
> > > >  
> > > > But in-kernel delivery is never *required*.  Wouldn't userspace want to
> > > > deliver in-kernel any time it possibly could?
> > > >  
> > > > >  	- when the variable is true, KVMGT will register in-kernel MMIO
> > > > > emulation callbacks then VM MMIO request will be delivered to KVMGT
> > > > > directly;
> > > > >  	- when the variable is false, KVMGT will not register anything.
> > > > > VM MMIO request will then be delivered to Qemu and then ioread/write
> > > > > will be used to finally reach KVMGT emulation logic;
> > > >  
> > > > No, that means the interface is entirely dependent on a backdoor through
> > > > KVM.  Why can't userspace (QEMU) do something like register an MMIO
> > > > region with KVM handled via a provided file descriptor and offset,
> > > > couldn't KVM then call the file ops without a kernel exit?  Thanks,
> > > >  
> > >  
> > > Could you elaborate this thought? If it can achieve the purpose w/o
> > > a kernel exit definitely we can adapt to it. :-)
> > 
> > I only thought of it when replying to the last email and have been doing
> > some research, but we already do quite a bit of synchronization through
> > file descriptors.  The kvm-vfio pseudo device uses a group file
> > descriptor to ensure a user has access to a group, allowing some degree
> > of interaction between modules.  Eventfds and irqfds already make use of
> > f_ops on file descriptors to poke data.  So, if KVM had information that
> > an MMIO region was backed by a file descriptor for which it already has
> > a reference via fdget() (and verified access rights and whatnot), then
> > it ought to be a simple matter to get to f_ops->read/write knowing the
> > base offset of that MMIO region.  Perhaps it could even simply use
> > __vfs_read/write().  Then we've got a proper reference to the file
> > descriptor for ownership purposes and we've transparently jumped across
> > modules without any implicit knowledge of the other end.  Could it work?
> 
> This is OK for KVMGT; going from fops to the vgpu device-model would always be simple.
> The only question is, how is the KVM hypervisor supposed to get the fd on VM-exitings?

Hi Jike,

Sorry, I don't understand "on VM-exiting".  KVM would hold a reference
to the fd via fdget(), so the vfio device wouldn't be closed until the
VM exits and KVM releases that reference.

> Copy-and-pasting the current implementation of vcpu_mmio_write(), it seems
> nothing but the GPA and len are provided:

I presume that an MMIO region is already registered with a GPA and
length; the additional information necessary would be a file descriptor
and an offset into that file descriptor for the base of the MMIO space.

> 	static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
> 				   const void *v)
> 	{
> 		int handled = 0;
> 		int n;
> 
> 		do {
> 			n = min(len, 8);
> 			if (!(vcpu->arch.apic &&
> 			      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
> 			    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
> 				break;
> 			handled += n;
> 			addr += n;
> 			len -= n;
> 			v += n;
> 		} while (len);
> 
> 		return handled;
> 	}
> 
> If we back a GPA range with a fd, this will also be a 'backdoor'?

KVM would simply be able to service the MMIO access using the provided
fd and offset.  It's not a back door because we will have created an API
for KVM to have a file descriptor and offset registered (by userspace)
to handle the access.  Also, KVM does not know the file descriptor is
handled by a VFIO device and VFIO doesn't know the read/write accesses
is initiated by KVM.  Seems like the question is whether we can fit
something like that into the existing KVM MMIO bus/device handlers
in-kernel.  Thanks,

Alex
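
As a minimal sketch of the idea (not an existing interface): assume a
hypothetical ioctl has registered a (GPA, len, fd, offset) tuple with
KVM.  struct kvm_io_device, kvm_iodevice_init(), kvm_io_bus_register_dev()
and __vfs_write() are real kernel interfaces of this era; fd_mmio_dev and
its registration path are invented for illustration.

	#include <linux/fs.h>
	#include <linux/kvm_host.h>
	#include <kvm/iodev.h>

	/* Sketch: service a guest MMIO write through a userspace-registered
	 * file descriptor, without exiting to QEMU. */
	struct fd_mmio_dev {
		struct kvm_io_device dev;	/* sits on KVM_MMIO_BUS */
		struct file *filp;		/* reference held via fdget() */
		gpa_t base_gpa;			/* guest-physical base of the region */
		loff_t fd_offset;		/* base of the region within the file */
	};

	static int fd_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
				 gpa_t addr, int len, const void *val)
	{
		struct fd_mmio_dev *d = container_of(this, struct fd_mmio_dev, dev);
		loff_t pos = d->fd_offset + (addr - d->base_gpa);

		/* __vfs_write() wants a __user pointer, so a real version would
		 * need a kernel-buffer variant; the point is only that the
		 * backing driver's f_ops are reachable directly. */
		if (__vfs_write(d->filp, (const char __user *)val, len, &pos) != len)
			return -EOPNOTSUPP;
		return 0;
	}

	static const struct kvm_io_device_ops fd_mmio_ops = {
		.write = fd_mmio_write,
	};

	/* at registration time:
	 *	kvm_iodevice_init(&d->dev, &fd_mmio_ops);
	 *	kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, gpa, len, &d->dev);
	 */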

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27  1:52                                       ` [Qemu-devel] " Yang Zhang
@ 2016-01-27  3:37                                         ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-27  3:37 UTC (permalink / raw)
  To: Yang Zhang, Tian, Kevin, Song, Jike
  Cc: Ruan, Shuai, Neo Jia, kvm, igvt-g@lists.01.org, qemu-devel,
	Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan

On Wed, 2016-01-27 at 09:52 +0800, Yang Zhang wrote:
> On 2016/1/27 6:56, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, January 27, 2016 6:27 AM
> > > > 
> > > > On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Wednesday, January 27, 2016 6:08 AM
> > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > > > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > > > > > > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > > > > > > > path would be unnecessarily long with obvious performance impact. We
> > > > > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > > > > passthrough support), or alternatively just hide the region (let KVMGT
> > > > > > > > > handle I/O emulation itself, as it does today).
> > > > > > > > 
> > > > > > > > That sounds like a future optimization TBH.  There's very strict
> > > > > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > > > > of accelerators, there might be cases where userspace wants to see those
> > > > > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > > > > take shortcuts to provide such direct access.  Thanks,
> > > > > > > > 
> > > > > > > 
> > > > > > > But we have to balance such debugging flexibility against acceptable
> > > > > > > performance. To me the latter is more important, otherwise there'd be no
> > > > > > > real usage of this technique, while for debugging there are other
> > > > > > > alternatives (e.g. ftrace). Consider an extreme case with 100k
> > > > > > > traps/second and then see how much impact a 2-3x longer emulation
> > > > > > > path can bring...
> > > > > > 
> > > > > > Are you jumping to the conclusion that it cannot be done with proper
> > > > > > layering in place?  Performance is important, but it's not an excuse to
> > > > > > abandon designing interfaces between independent components.  Thanks,
> > > > > > 
> > > > > 
> > > > > The two are not in conflict. My point is to remove unnecessarily long
> > > > > trips where possible. On further thought, yes, we can reuse the existing
> > > > > read/write flags:
> > > > >   	- KVMGT will expose a private control variable indicating whether
> > > > > in-kernel delivery is required;
> > > > 
> > > > But in-kernel delivery is never *required*.  Wouldn't userspace want to
> > > > deliver in-kernel any time it possibly could?
> > > > 
> > > > >   	- when the variable is true, KVMGT will register in-kernel MMIO
> > > > > emulation callbacks, so VM MMIO requests will be delivered to KVMGT
> > > > > directly;
> > > > >   	- when the variable is false, KVMGT will not register anything.
> > > > > VM MMIO requests will then be delivered to QEMU, and ioread/write
> > > > > will be used to finally reach the KVMGT emulation logic;
> > > > 
> > > > No, that means the interface is entirely dependent on a backdoor through
> > > > KVM.  Why can't userspace (QEMU) do something like register an MMIO
> > > > region with KVM handled via a provided file descriptor and offset?
> > > > Couldn't KVM then call the file ops without a kernel exit?  Thanks,
> > > > 
> > > 
> > > Could you elaborate on this thought? If it can achieve the purpose w/o
> > > a kernel exit, we can definitely adapt to it. :-)
> > 
> > I only thought of it when replying to the last email and have been doing
> > some research, but we already do quite a bit of synchronization through
> > file descriptors.  The kvm-vfio pseudo device uses a group file
> > descriptor to ensure a user has access to a group, allowing some degree
> > of interaction between modules.  Eventfds and irqfds already make use of
> > f_ops on file descriptors to poke data.  So, if KVM had information that
> > an MMIO region was backed by a file descriptor for which it already has
> > a reference via fdget() (and verified access rights and whatnot), then
> > it ought to be a simple matter to get to f_ops->read/write knowing the
> > base offset of that MMIO region.  Perhaps it could even simply use
> > __vfs_read/write().  Then we've got a proper reference to the file
> > descriptor for ownership purposes and we've transparently jumped across
> > modules without any implicit knowledge of the other end.  Could it work?
> > Thanks,
> 
> ioeventfd is a good example.
> As I understand, all access to the MMIO of the IGD is trapped into the
> kernel. Also, the PCI config space is emulated by QEMU; the same goes
> for VGA, which is emulated too. I guess the interrupt is also emulated
> (this means we cannot benefit from VT-d posted interrupts). Most
> importantly, KVMGT doesn't require a hardware IOMMU. As we know, VFIO
> is for direct device assignment, but most things in KVMGT are emulated,
> so why should we use VFIO for it?

What is a vGPU?  It's a PCI device exposed to QEMU that needs to support
emulated and direct MMIO paths into the kernel driver, PCI config space
emulation, and various interrupt models.  What does the VFIO API
provide?  Exactly those things.

Yes, vfio is typically used for assigning physical devices, but it has a
very modular infrastructure which allows sub-drivers to be written that
can do much more complicated and device-specific passthrough and
emulation in the kernel.  vfio typically works with a platform IOMMU,
but any device that can provide isolation and translation services will
work.  In the case of graphics cards, there's effectively already an
IOMMU on the device; in the case of vGPU, this is mediated through the
physical GPU driver.

So what's the benefit?  VFIO already has the IOMMU and device access
interfaces, is already supported by QEMU and libvirt, and re-using these
for vGPU avoids a proliferation of new vendor-specific devices, each
with its own implementation of these interfaces and each requiring
unique, device-specific knowledge in libvirt and upper-level management.
That's why.  Thanks,

Alex
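
For reference, the sub-driver modularity described above is just the
stock vfio device API.  A rough sketch of how a vfio-vgpu sub-driver
would plug in; vfio_add_group_dev() and struct vfio_device_ops are the
real interfaces of this era, while the vgpu_* callbacks are stubs
standing in for a vendor device-model:

	#include <linux/device.h>
	#include <linux/mm.h>
	#include <linux/vfio.h>

	static int vgpu_open(void *device_data)		{ return 0; }
	static void vgpu_release(void *device_data)	{ }

	/* config-space and MMIO emulation would live behind read/write */
	static ssize_t vgpu_read(void *device_data, char __user *buf,
				 size_t count, loff_t *ppos)
	{ return -EINVAL; }
	static ssize_t vgpu_write(void *device_data, const char __user *buf,
				  size_t count, loff_t *ppos)
	{ return -EINVAL; }
	static long vgpu_ioctl(void *device_data, unsigned int cmd,
			       unsigned long arg)
	{ return -ENOTTY; }
	static int vgpu_mmap(void *device_data, struct vm_area_struct *vma)
	{ return -EINVAL; }

	static const struct vfio_device_ops vfio_vgpu_dev_ops = {
		.name		= "vfio-vgpu",
		.open		= vgpu_open,
		.release	= vgpu_release,
		.read		= vgpu_read,
		.write		= vgpu_write,
		.ioctl		= vgpu_ioctl,
		.mmap		= vgpu_mmap,
	};

	/* called once the vgpu struct device exists and sits in an IOMMU
	 * group; QEMU then drives it like an assigned physical device */
	static int vgpu_vfio_attach(struct device *dev, void *vendor_data)
	{
		return vfio_add_group_dev(dev, &vfio_vgpu_dev_ops, vendor_data);
	}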

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27  3:07                                         ` [Qemu-devel] " Alex Williamson
@ 2016-01-27  5:43                                           ` Jike Song
  -1 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-27  5:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Yang Zhang, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Neo Jia

On 01/27/2016 11:07 AM, Alex Williamson wrote:
> On Wed, 2016-01-27 at 09:47 +0800, Jike Song wrote:
>> On 01/27/2016 06:56 AM, Alex Williamson wrote:
>>> On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>> Sent: Wednesday, January 27, 2016 6:27 AM
>>>>>  
>>>>> On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
>>>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>>>> Sent: Wednesday, January 27, 2016 6:08 AM
>>>>>>>  
>>>>>>>>>>>  
>>>>>>>>>>  
>>>>>>>>>> Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
>>>>>>>>>> KVM, so VM MMIO access will be forwarded to KVMGT directly for
>>>>>>>>>> emulation in kernel. If we reuse above R/W flags, the whole emulation
>>>>>>>>>> path would be unnecessarily long with obvious performance impact. We
>>>>>>>>>> either need a new flag here to indicate in-kernel emulation (bias from
>>>>>>>>>> passthrough support), or alternatively just hide the region (let KVMGT
>>>>>>>>>> handle I/O emulation itself, as it does today).
>>>>>>>>>  
>>>>>>>>> That sounds like a future optimization TBH.  There's very strict
>>>>>>>>> layering between vfio and kvm.  Physical device assignment could make
>>>>>>>>> use of it as well, avoiding a round trip through userspace when an
>>>>>>>>> ioread/write would do.  Userspace also needs to orchestrate those kinds
>>>>>>>>> of accelerators, there might be cases where userspace wants to see those
>>>>>>>>> transactions for debugging or manipulating the device.  We can't simply
>>>>>>>>> take shortcuts to provide such direct access.  Thanks,
>>>>>>>>>  
>>>>>>>>  
>>>>>>>> But we have to balance such debugging flexibility against acceptable
>>>>>>>> performance. To me the latter is more important, otherwise there'd be no
>>>>>>>> real usage of this technique, while for debugging there are other
>>>>>>>> alternatives (e.g. ftrace). Consider an extreme case with 100k
>>>>>>>> traps/second and then see how much impact a 2-3x longer emulation
>>>>>>>> path can bring...
>>>>>>>  
>>>>>>> Are you jumping to the conclusion that it cannot be done with proper
>>>>>>> layering in place?  Performance is important, but it's not an excuse to
>>>>>>> abandon designing interfaces between independent components.  Thanks,
>>>>>>>  
>>>>>>  
>>>>>> The two are not in conflict. My point is to remove unnecessarily long
>>>>>> trips where possible. On further thought, yes, we can reuse the existing
>>>>>> read/write flags:
>>>>>>  	- KVMGT will expose a private control variable indicating whether
>>>>>> in-kernel delivery is required;
>>>>>  
>>>>> But in-kernel delivery is never *required*.  Wouldn't userspace want to
>>>>> deliver in-kernel any time it possibly could?
>>>>>  
>>>>>>  	- when the variable is true, KVMGT will register in-kernel MMIO
>>>>>> emulation callbacks, so VM MMIO requests will be delivered to KVMGT
>>>>>> directly;
>>>>>>  	- when the variable is false, KVMGT will not register anything.
>>>>>> VM MMIO requests will then be delivered to QEMU, and ioread/write
>>>>>> will be used to finally reach the KVMGT emulation logic;
>>>>>  
>>>>> No, that means the interface is entirely dependent on a backdoor through
>>>>> KVM.  Why can't userspace (QEMU) do something like register an MMIO
>>>>> region with KVM handled via a provided file descriptor and offset?
>>>>> Couldn't KVM then call the file ops without a kernel exit?  Thanks,
>>>>>  
>>>>  
>>>> Could you elaborate on this thought? If it can achieve the purpose w/o
>>>> a kernel exit, we can definitely adapt to it. :-)
>>>  
>>> I only thought of it when replying to the last email and have been doing
>>> some research, but we already do quite a bit of synchronization through
>>> file descriptors.  The kvm-vfio pseudo device uses a group file
>>> descriptor to ensure a user has access to a group, allowing some degree
>>> of interaction between modules.  Eventfds and irqfds already make use of
>>> f_ops on file descriptors to poke data.  So, if KVM had information that
>>> an MMIO region was backed by a file descriptor for which it already has
>>> a reference via fdget() (and verified access rights and whatnot), then
>>> it ought to be a simple matter to get to f_ops->read/write knowing the
>>> base offset of that MMIO region.  Perhaps it could even simply use
>>> __vfs_read/write().  Then we've got a proper reference to the file
>>> descriptor for ownership purposes and we've transparently jumped across
>>> modules without any implicit knowledge of the other end.  Could it work?
>>  
>> This is OK for KVMGT; going from fops to the vgpu device-model would always be simple.
>> The only question is: how is the KVM hypervisor supposed to get the fd on VM-exitings?
> 
> Hi Jike,
> 
> Sorry, I don't understand "on VM-exiting".  KVM would hold a reference
> to the fd via fdget(), so the vfio device wouldn't be closed until the
> VM exits and KVM releases that reference.
> 

Sorry for my bad English; I meant VMEXIT, the transition from non-root mode to the KVM hypervisor.

>> Copy-and-pasting the current implementation of vcpu_mmio_write(), it seems
>> nothing but the GPA and len are provided:
> 
> I presume that an MMIO region is already registered with a GPA and
> length, the additional information necessary would be a file descriptor
> and offset into the file descriptor for the base of the MMIO space.
> 
>>  	static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
>>  				   const void *v)
>>  	{
>>  		int handled = 0;
>>  		int n;
>>  
>>  		do {
>>  			n = min(len, 8);
>>  			if (!(vcpu->arch.apic &&
>>  			      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
>>  			    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
>>  				break;
>>  			handled += n;
>>  			addr += n;
>>  			len -= n;
>>  			v += n;
>>  		} while (len);
>>  
>>  		return handled;
>>  	}
>>  
>> If we back a GPA range with a fd, this will also be a 'backdoor'?
> 
> KVM would simply be able to service the MMIO access using the provided
> fd and offset.  It's not a back door because we will have created an API
> for KVM to have a file descriptor and offset registered (by userspace)
> to handle the access.  Also, KVM does not know the file descriptor is
> handled by a VFIO device, and VFIO doesn't know the read/write accesses
> are initiated by KVM.  It seems the question is whether we can fit
> something like that into the existing KVM MMIO bus/device handlers
> in-kernel.  Thanks,
> 

Having had a look at eventfd, I would say yes, technically we are able to
achieve the goal: introduce an fd with fop->{read|write} defined in KVM
that call into the vgpu device-model, plus an iodev registered for an
MMIO GPA range to invoke the fop->{read|write}.  I just don't understand
why userspace can't register an iodev via an API directly.

Besides, this doesn't necessarily require another thread, right?
I guess it can run within the VCPU thread?

And this brings up another question: apart from the vfio bus driver and
the iommu backend (and the page_track utility used for guest memory
write-protection), is KVMGT allowed to call into kvm.ko (or modify it)?
Though we are becoming less and less willing to do that with VFIO, it's
still better to know before going wrong.

Thanks!


> Alex
>

--
Thanks,
Jike
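
For context, the page_track utility mentioned above was being added to
kvm.ko as a small notifier interface for exactly this purpose.  A sketch
of how a vgpu device-model might consume it, based on the API as it was
being proposed for KVMGT (the handler body is illustrative):

	#include <linux/kvm_host.h>
	#include <asm/kvm_page_track.h>

	/* guest wrote a write-protected GTT page: emulate the update in the
	 * device-model instead of letting the write go through */
	static void vgpu_gtt_write(struct kvm_vcpu *vcpu, gpa_t gpa,
				   const u8 *new, int bytes)
	{
		/* shadow page-table update would go here */
	}

	static struct kvm_page_track_notifier_node vgpu_track = {
		.track_write = vgpu_gtt_write,
	};

	/* at vgpu attach time:
	 *	kvm_page_track_register_notifier(kvm, &vgpu_track);
	 */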

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 20:06                     ` [Qemu-devel] " Alex Williamson
@ 2016-01-27  8:06                       ` Kirti Wankhede
  -1 siblings, 0 replies; 118+ messages in thread
From: Kirti Wankhede @ 2016-01-27  8:06 UTC (permalink / raw)
  To: Alex Williamson, Neo Jia, Tian, Kevin
  Cc: Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org



On 1/27/2016 1:36 AM, Alex Williamson wrote:
> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
>> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>   
>> Hi Alex, Kevin and Jike,
>>   
>> (Seems I shouldn't use attachment, resend it again to the list, patches are
>> inline at the end)
>>   
>> Thanks for adding me to this technical discussion, a great opportunity
>> for us to design together which can bring both Intel and NVIDIA vGPU solution to
>> KVM platform.
>>   
>> Instead of directly jumping to the proposal that we have been working on
>> recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
>> quick comments / thoughts regarding the existing discussions on this thread as
>> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
>>   
>> Then we can look at what we have, hopefully we can reach some consensus soon.
>>   
>>> Yes, and since you're creating and destroying the vgpu here, this is
>>> where I'd expect a struct device to be created and added to an IOMMU
>>> group.  The lifecycle management should really include links between
>>> the vGPU and physical GPU, which would be much, much easier to do with
>>> struct devices created here rather than at the point where we start
>>> doing vfio "stuff".
>>   
>> In fact, to keep vfio-vgpu more generic, vgpu device creation and management
>> can be centralized and done in vfio-vgpu. That also includes adding to the
>> IOMMU group and VFIO group.
> Is this really a good idea?  The concept of a vgpu is not unique to
> vfio, we want vfio to be a driver for a vgpu, not an integral part of
> the lifecycle of a vgpu.  That certainly doesn't exclude adding
> infrastructure to make lifecycle management of a vgpu more consistent
> between drivers, but it should be done independently of vfio.  I'll go
> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
> does not create the VF, that's done in coordination with the PF making
> use of some PCI infrastructure for consistency between drivers.
>
> It seems like we need to take more advantage of the class and driver
> core support to perhaps setup a vgpu bus and class with vfio-vgpu just
> being a driver for those devices.

For device passthrough or the SR-IOV model, PCI devices are created by
the PCI bus driver, and from the probe routine each device is added to a
vfio group.

For vgpu, there should be a common module that creates vgpu devices, say
a vgpu module, adds each vgpu device to an IOMMU group and then adds it
to a vfio group.  This module can handle management of vgpus.  The
advantage of keeping this a separate module, rather than doing device
creation in the vendor modules, is a generic interface for vgpu
management, for example the files /sys/class/vgpu/vgpu_start and
/sys/class/vgpu/vgpu_shutdown and the vgpu driver registration interface.

In the patch, vgpu_dev.c + vgpu_sysfs.c form such a vgpu module and
vgpu_vfio.c provides the VFIO interface.  Each vgpu device should be
added to a vfio group, so vgpu_group_init() from vgpu_vfio.c should be
called per device.  In the vgpu module, vgpu devices are created on
request, so vgpu_group_init() should be called explicitly for each vgpu
device.  That's why we had merged the two modules, vgpu + vgpu_vfio, to
form one vgpu module; vgpu_vfio would remain a logically separate
component, but be merged into the vgpu module.


Thanks,
Kirti
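
A minimal sketch of the common-module idea; class_create() and
class_create_file() are standard driver-core interfaces, while the vgpu
semantics are placeholders for the proposal above:

	#include <linux/device.h>
	#include <linux/module.h>

	static struct class *vgpu_class;

	/* writing a UUID here would commit vendor resources and hook each
	 * created vgpu device into its IOMMU and VFIO groups */
	static ssize_t vgpu_start_store(struct class *class,
					struct class_attribute *attr,
					const char *buf, size_t count)
	{
		pr_info("vgpu: start request: %.*s\n", (int)count, buf);
		return count;
	}

	static struct class_attribute vgpu_attr_start =
		__ATTR(vgpu_start, 0200, NULL, vgpu_start_store);

	static int __init vgpu_init(void)
	{
		int ret;

		vgpu_class = class_create(THIS_MODULE, "vgpu");
		if (IS_ERR(vgpu_class))
			return PTR_ERR(vgpu_class);

		ret = class_create_file(vgpu_class, &vgpu_attr_start);
		if (ret)
			class_destroy(vgpu_class);
		return ret;
	}

	static void __exit vgpu_exit(void)
	{
		class_remove_file(vgpu_class, &vgpu_attr_start);
		class_destroy(vgpu_class);
	}

	module_init(vgpu_init);
	module_exit(vgpu_exit);
	MODULE_LICENSE("GPL");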





^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-26 23:30                         ` [Qemu-devel] " Alex Williamson
@ 2016-01-27  9:14                           ` Neo Jia
  -1 siblings, 0 replies; 118+ messages in thread
From: Neo Jia @ 2016-01-27  9:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ruan, Shuai, Tian, Kevin, kvm, igvt-g@lists.01.org, Song, Jike,
	qemu-devel, Kirti Wankhede, Lv, Zhiyuan, Paolo Bonzini,
	Gerd Hoffmann

On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > 1.1 Under per-physical device sysfs:
> > > > ----------------------------------------------------------------------------------
> > > >  
> > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its
> > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > > "vgpu_supported_types".
> > > >                             
> > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > >  
> > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > > > target physical GPU
> > > 
> > > 
> > > I've noted in previous discussions that we need to separate user policy
> > > from kernel policy here, the kernel policy should not require a "VM
> > > UUID".  A UUID simply represents a set of one or more devices and an
> > > index picks the device within the set.  Whether that UUID matches a VM
> > > or is independently used is up to the user policy when creating the
> > > device.
> > > 
> > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > UUID set of devices and instead have each device be independent.  This
> > > seems to be an imposition on the nvidia implementation into the kernel
> > > interface design.
> > > 
> > 
> > Hi Alex,
> > 
> > I agree with you that we should not put UUID concept into a kernel API. At
> > this point (without any prototyping), I am thinking of using a list of virtual
> > devices instead of UUID.
> 
> Hi Neo,
> 
> A UUID is a perfectly fine name, so long as we let it be just a UUID and
> not the UUID matching some specific use case.
> 
> > > >  
> > > > int vgpu_map_virtual_bar
> > > > (
> > > >     uint64_t virt_bar_addr,
> > > >     uint64_t phys_bar_addr,
> > > >     uint32_t len,
> > > >     uint32_t flags
> > > > )
> > > >  
> > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > 
> > > 
> > > Per the implementation provided, this needs to be implemented in the
> > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > of the device and replacing it is wrong.  It should be remapped at the
> > > vfio device file interface using vm_ops.
> > > 
> > 
> > So you are basically suggesting that we are going to take a mmap fault and
> > within that fault handler, we will go into vendor driver to look up the
> > "pre-registered" mapping and remap there.
> > 
> > Is my understanding correct?
> 
> Essentially, hopefully the vendor driver will have already registered
> the backing for the mmap prior to the fault, but either way could work.
> I think the key though is that you want to remap it onto the vma
> accessing the vfio device file, not scanning it out of an IOVA mapping
> that might be dynamic and doing a vma lookup based on the point in time
> mapping of the BAR.  The latter doesn't give me much confidence that
> mappings couldn't change while the former should be a one time fault.

Hi Alex,

The fact is that the vendor driver can only prevent such an mmap fault by
looking up the <iova, hva> mapping table that we have saved from the IOMMU
memory listener when the guest region gets programmed. Also, as you
mention below, the mapping between iova and hva shouldn't change once the
SBIOS and guest OS are done with their job.

Yes, you are right that it is a one-time fault, but the GPU work is
heavily pipelined.

Probably we should just limit this interface to the guest MMIO region,
and we can crosscheck with the VFIO driver, which has monitored the
config space accesses, to make sure nothing gets moved around?
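
(As a minimal sketch of the fault-and-remap approach under discussion,
assuming the vendor driver can supply the physical pfn at fault time;
remap_pfn_range() and vm_operations_struct are real kernel interfaces,
struct vgpu_region is invented:)

	#include <linux/mm.h>

	struct vgpu_region {
		unsigned long phys_pfn;	/* base pfn of the physical MMIO segment */
	};

	static int vgpu_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		struct vgpu_region *rgn = vma->vm_private_data;

		/* first touch: back the invalid mapping created at mmap() time
		 * with the segment the vendor driver picked at runtime */
		if (remap_pfn_range(vma, vma->vm_start, rgn->phys_pfn,
				    vma->vm_end - vma->vm_start,
				    vma->vm_page_prot))
			return VM_FAULT_SIGBUS;

		return VM_FAULT_NOPAGE;
	}

	static const struct vm_operations_struct vgpu_mmio_vm_ops = {
		.fault = vgpu_mmio_fault,
	};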

> 
> In case it's not clear to folks at Intel, the purpose of this is that a
> vGPU may directly map a segment of the physical GPU MMIO space, but we
> may not know what segment that is at setup time, when QEMU does an mmap
> of the vfio device file descriptor.  The thought is that we can create
> an invalid mapping when QEMU calls mmap(), knowing that it won't be
> accessed until later, then we can fault in the real mmap on demand.  Do
> you need anything similar?
> 
> > > 
> > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > >  
> > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > >  
> > > > Still a lot to be added and modified, such as supporting multiple VMs and 
> > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
> > > > kernel driver, error handling, roll-back and locked memory size per user, etc. 
> > > 
> > > Particularly, handling of mapping changes is completely missing.  This
> > > cannot be a point in time translation, the user is free to remap
> > > addresses whenever they wish and device translations need to be updated
> > > accordingly.
> > > 
> > 
> > When you say "user", do you mean the QEMU?
> 
> vfio is a generic userspace driver interface, QEMU is a very, very
> important user of the interface, but not the only user.  So for this
> conversation, we're mostly talking about QEMU as the user, but we should
> be careful about assuming QEMU is the only user.
> 

Understood. I have to say that our focus at this moment is to support QEMU
and KVM, but I know the VFIO interface is much more than that, and that is
why I think it is right to leverage this framework so we can explore
future use cases in userland together.


> > Here, whenever the DMA that
> > the guest driver is going to launch will be first pinned within VM, and then
> > registered to QEMU, therefore the IOMMU memory listener, eventually the pages
> > will be pinned by the GPU or DMA engine.
> > 
> > Since we are keeping the upper level code same, thinking about passthru case,
> > where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU
> > can change that mapping without causing an IOMMU fault on a active DMA device.
> 
> For the virtual BAR mapping above, it's easy to imagine that mapping a
> BAR to a given address is at the guest discretion, it may be mapped and
> unmapped, it may be mapped to different addresses at different points in
> time, the guest BIOS may choose to map it at yet another address, etc.
> So if somehow we were trying to setup a mapping for peer-to-peer, there
> are lots of ways that IOVA could change.  But even with RAM, we can
> support memory hotplug in a VM.  What was once a DMA target may be
> removed or may now be backed by something else.  Chipset configuration
> on the emulated platform may change how guest physical memory appears
> and that might change between VM boots.
> 
> Currently with physical device assignment the memory listener watches
> for both maps and unmaps and updates the iotlb to match.  Just like real
> hardware doing these same sorts of things, we rely on the guest to stop
> using memory that's going to be moved as a DMA target prior to moving
> it.

Right, you can only do that when the device is quiescent.

As long as the guest is notified of this, I think we should be able to
support it, although the real implementation will depend on how the
device gets into a quiescent state.

This is definitely a very interesting feature we should explore, but I
hope we can first focus on the most basic functionality.

Thanks,
Neo

> 
> > > > 4. Modules
> > > > ==================================================================================
> > > >  
> > > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> > > >  
> > > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
> > > >                            TYPE1 v1 and v2 interface. 
> > > 
> > > Depending on how intrusive it is, this can possibly by done within the
> > > existing type1 driver.  Either that or we can split out common code for
> > > use by a separate module.
> > > 
> > > > vgpu.ko                  - provide registration interface and virtual device
> > > >                            VFIO access.
> > > >  
> > > > 5. QEMU note
> > > > ==================================================================================
> > > >  
> > > > To allow us focus on the VGPU kernel driver prototyping, we have introduced a new VFIO 
> > > > class - vgpu inside QEMU, so we don't have to change the existing vfio/pci.c file and 
> > > > use it as a reference for our implementation. It is basically just a quick c & p
> > > > from vfio/pci.c to quickly meet our needs.
> > > >  
> > > > Once this proposal is finalized, we will move to vfio/pci.c instead of a new
> > > > class, and probably the only thing required is to have a new way to discover the
> > > > device.
> > > >  
> > > > 6. Examples
> > > > ==================================================================================
> > > >  
> > > > On this server, we have two NVIDIA M60 GPUs.
> > > >  
> > > > [root@cjia-vgx-kvm ~]# lspci -d 10de:13f2
> > > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > > >  
> > > > After nvidia.ko gets initialized, we can query the supported vGPU type by
> > > > accessing the "vgpu_supported_types" like following:
> > > >  
> > > > [root@cjia-vgx-kvm ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
> > > > 11:GRID M60-0B
> > > > 12:GRID M60-0Q
> > > > 13:GRID M60-1B
> > > > 14:GRID M60-1Q
> > > > 15:GRID M60-2B
> > > > 16:GRID M60-2Q
> > > > 17:GRID M60-4Q
> > > > 18:GRID M60-8Q
> > > >  
> > > > For example the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we would
> > > > like to create "GRID M60-4Q" VM on it.
> > > >  
> > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" >
> > > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> > > >  
> > > > Note: the number 0 here is for vGPU device index. So far the change is not tested
> > > > for multiple vgpu devices yet, but we will support it.
> > > >  
> > > > At this moment, if you query the "vgpu_supported_types" it will still show all
> > > > supported virtual GPU types as no virtual GPU resource is committed yet.
> > > >  
> > > > Starting VM:
> > > >  
> > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> > > >  
> > > > then, the supported vGPU type query will return:
> > > >  
> > > > [root@cjia-vgx-kvm /home/cjia]$
> > > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > > 17:GRID M60-4Q
> > > >  
> > > > So vgpu_supported_config needs to be called whenever a new virtual device gets
> > > > created, as the underlying HW might limit the supported types if there are
> > > > any existing VMs running.
> > > >  
> > > > Then, when the VM gets shut down, writes to /sys/class/vgpu/vgpu_shutdown will
> > > > inform the GPU driver vendor to clean up resources.
> > > >  
> > > > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy under
> > > > device sysfs.
> > > 
> > > 
> > > I'd like to hear Intel's thoughts on this interface.  Are there
> > > different vgpu capacities or priority classes that would necessitate
> > > different types of vcpus on Intel?
> > > 
> > > I think there are some gaps in translating from named vgpu types to
> > > indexes here, along with my previous mention of the UUID/set oddity.
> > > 
> > > Does Intel have a need for start and shutdown interfaces?
> > > 
> > > Neo, wasn't there at some point information about how many of each type
> > > could be supported through these interfaces?  How does a user know their
> > > capacity limits?
> > > 
> > 
> > Thanks for reminding me of that; I think we probably forgot to put that *important*
> > information in the output of "vgpu_supported_types".
> > 
> > Regarding the capacity, we can provide the frame buffer size as part of the
> > "vgpu_supported_types" output as well; I would imagine those will eventually
> > show up in the OpenStack management interface or virt-manager.
> > 
> > Basically, yes, there would be a separate column showing the number of instances
> > you can create for each type of vGPU on a specific physical GPU.
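
For illustration only, the extended output could then look something like this
(a hypothetical layout; the exact columns and format were never settled in this
thread):

  <VGPU_ID>:<name>:<framebuffer>:<instances available>
  17:GRID M60-4Q:4096M:2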
> 
> Ok, Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27  8:06                       ` [Qemu-devel] " Kirti Wankhede
@ 2016-01-27 16:00                         ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-27 16:00 UTC (permalink / raw)
  To: Kirti Wankhede, Neo Jia, Tian, Kevin
  Cc: Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org

On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote:
> 
> On 1/27/2016 1:36 AM, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> > > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > >   
> > > Hi Alex, Kevin and Jike,
> > >   
> > > (Seems I shouldn't use attachments; resending to the list, patches are
> > > inline at the end)
> > >   
> > > Thanks for adding me to this technical discussion, a great opportunity
> > > for us to design together something that can bring both the Intel and NVIDIA
> > > vGPU solutions to the KVM platform.
> > >   
> > > Instead of directly jumping to the proposal that we have been working on
> > > recently for NVIDIA vGPU on KVM, I think it is better for me to put out a couple of
> > > quick comments / thoughts regarding the existing discussions on this thread, as
> > > fundamentally I think we are solving the same problems: DMA, interrupts and MMIO.
> > >   
> > > Then we can look at what we have, hopefully we can reach some consensus soon.
> > >   
> > > > Yes, and since you're creating and destroying the vgpu here, this is
> > > > where I'd expect a struct device to be created and added to an IOMMU
> > > > group.  The lifecycle management should really include links between
> > > > the vGPU and physical GPU, which would be much, much easier to do with
> > > > struct devices created here rather than at the point where we start
> > > > doing vfio "stuff".
> > >   
> > > In fact, to keep vfio-vgpu more generic, vgpu device creation and management
> > > can be centralized and done in vfio-vgpu. That also includes adding to the IOMMU
> > > group and the VFIO group.
> > Is this really a good idea?  The concept of a vgpu is not unique to
> > vfio, we want vfio to be a driver for a vgpu, not an integral part of
> > the lifecycle of a vgpu.  That certainly doesn't exclude adding
> > infrastructure to make lifecycle management of a vgpu more consistent
> > between drivers, but it should be done independently of vfio.  I'll go
> > back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
> > does not create the VF, that's done in coordination with the PF making
> > use of some PCI infrastructure for consistency between drivers.
> > 
> > It seems like we need to take more advantage of the class and driver
> > core support to perhaps setup a vgpu bus and class with vfio-vgpu just
> > being a driver for those devices.
> 
> For the device passthrough or SR-IOV model, PCI devices are created by the PCI
> bus driver, and from the probe routine each device is added to a vfio group.

An SR-IOV VF is created by the PF driver using standard interfaces
provided by the PCI core.  The IOMMU group for a VF is added by the
IOMMU driver when the device is created on the pci_bus_type.  The probe
routine of the vfio bus driver (vfio-pci) is what adds the device into
the vfio group.
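
For reference, that standard interface is essentially the following (a minimal
sketch; pf_enable_vfs() is just an illustrative wrapper and error handling is
omitted):

  #include <linux/pci.h>

  /* In the PF driver: ask the PCI core to instantiate the VFs.  The IOMMU
   * driver then assigns each VF an IOMMU group as it appears on
   * pci_bus_type, and vfio-pci's probe routine adds a bound VF to its
   * vfio group. */
  static int pf_enable_vfs(struct pci_dev *pdev, int num_vfs)
  {
          return pci_enable_sriov(pdev, num_vfs);
  }

  /* ...and pci_disable_sriov(pdev) tears the VFs back down. */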

> For vgpu, there should be a common module that creates the vgpu device, say a
> vgpu module, adds the vgpu device to an IOMMU group and then adds it to a vfio
> group.  This module can handle management of vgpus.  The advantage of keeping
> this a separate module, rather than doing device creation in vendor
> modules, is to have a generic interface for vgpu management, for example,
> the files /sys/class/vgpu/vgpu_start and /sys/class/vgpu/vgpu_shutdown and
> the vgpu driver registration interface.

But you're suggesting something very different from the SR-IOV model.
If we wanted to mimic that model, the GPU specific driver should create
the vgpu using services provided by a common interface.  For instance
i915 could call a new vgpu_device_create() which creates the device,
adds it to the vgpu class, etc.  That vgpu device should not be assumed
to be used with vfio though, that should happen via a separate probe
using a vfio-vgpu driver.  It's that vfio bus driver that will add the
device to a vfio group.
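
A rough sketch of that shape, for concreteness (only vgpu_device_create() is
named above; the struct layout and the destroy call are illustrative
assumptions, not a proposal):

  #include <linux/device.h>
  #include <linux/pci.h>

  /* hypothetical vgpu core, mirroring how the PCI core backs SR-IOV */
  struct vgpu_device {
          struct device dev;      /* registered on a vgpu bus/class */
          struct pci_dev *parent; /* the physical GPU */
          u32 type_id;            /* vendor-specific vGPU type */
  };

  /* Called by the GPU driver (e.g. i915) to create a vgpu.  The core
   * registers the struct device and adds it to the vgpu class; a
   * separate vfio-vgpu driver may later bind to it via a normal probe. */
  struct vgpu_device *vgpu_device_create(struct pci_dev *parent, u32 type_id);
  void vgpu_device_destroy(struct vgpu_device *vgpu);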

> In the patch, vgpu_dev.c + vgpu_sysfs.c form such a vgpu module, and
> vgpu_vfio.c is for the VFIO interface. Each vgpu device should be added to a
> vfio group, so vgpu_group_init() from vgpu_vfio.c should be called per
> device. In the vgpu module, vgpu devices are created on request, so
> vgpu_group_init() should be called explicitly per vgpu device.
> That's why we had merged the 2 modules, vgpu + vgpu_vfio, to form one vgpu
> module.  vgpu_vfio would remain a separate entity but be merged into the vgpu
> module.

I disagree with this design, creation of a vgpu necessarily involves the
GPU driver and should not be tied to use of the vgpu with vfio.  vfio
should be a driver for the device, maybe eventually not the only driver
for the device.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27  9:14                           ` [Qemu-devel] " Neo Jia
@ 2016-01-27 16:10                             ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-27 16:10 UTC (permalink / raw)
  To: Neo Jia
  Cc: Ruan, Shuai, Tian, Kevin, kvm, igvt-g@lists.01.org, Song, Jike,
	qemu-devel, Kirti Wankhede, Lv, Zhiyuan, Paolo Bonzini,
	Gerd Hoffmann

On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote:
> On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > > 1.1 Under per-physical device sysfs:
> > > > > ----------------------------------------------------------------------------------
> > > > >  
> > > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its
> > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of
> > > > > "vgpu_supported_types".
> > > > >                             
> > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > > >  
> > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a
> > > > > target physical GPU
> > > >  
> > > >  
> > > > I've noted in previous discussions that we need to separate user policy
> > > > from kernel policy here, the kernel policy should not require a "VM
> > > > UUID".  A UUID simply represents a set of one or more devices and an
> > > > index picks the device within the set.  Whether that UUID matches a VM
> > > > or is independently used is up to the user policy when creating the
> > > > device.
> > > >  
> > > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > > UUID set of devices and instead have each device be independent.  This
> > > > seems to be an imposition on the nvidia implementation into the kernel
> > > > interface design.
> > > >  
> > >  
> > > Hi Alex,
> > >  
> > > I agree with you that we should not put UUID concept into a kernel API. At
> > > this point (without any prototyping), I am thinking of using a list of virtual
> > > devices instead of UUID.
> > 
> > Hi Neo,
> > 
> > A UUID is a perfectly fine name, so long as we let it be just a UUID and
> > not the UUID matching some specific use case.
> > 
> > > > >  
> > > > > int vgpu_map_virtual_bar
> > > > > (
> > > > >     uint64_t virt_bar_addr,
> > > > >     uint64_t phys_bar_addr,
> > > > >     uint32_t len,
> > > > >     uint32_t flags
> > > > > )
> > > > >  
> > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > >  
> > > >  
> > > > Per the implementation provided, this needs to be implemented in the
> > > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > > of the device and replacing it is wrong.  It should be remapped at the
> > > > vfio device file interface using vm_ops.
> > > >  
> > >  
> > > So you are basically suggesting that we are going to take an mmap fault and,
> > > within that fault handler, go into the vendor driver to look up the
> > > "pre-registered" mapping and remap there.
> > >  
> > > Is my understanding correct?
> > 
> > Essentially, hopefully the vendor driver will have already registered
> > the backing for the mmap prior to the fault, but either way could work.
> > I think the key though is that you want to remap it onto the vma
> > accessing the vfio device file, not scanning it out of an IOVA mapping
> > that might be dynamic and doing a vma lookup based on the point in time
> > mapping of the BAR.  The latter doesn't give me much confidence that
> > mappings couldn't change while the former should be a one time fault.
> 
> Hi Alex,
> 
> The fact is that the vendor driver can only prevent such an mmap fault by looking
> up the <iova, hva> mapping table that we have saved from the IOMMU memory listener

Why do we need to prevent the fault?  We need to handle the fault when
it occurs.

> when the guest region gets programmed. Also, as you have mentioned below, such a
> mapping between iova and hva shouldn't change as long as the SBIOS and
> guest OS are done with their job.

But you don't know they're done with their job.

> Yes, you are right that it is a one-time fault, but the GPU work is heavily pipelined.

Why does that matter?  We're talking about the first time the VM
accesses the range of the BAR that will be direct mapped to the physical
GPU.  This isn't going to happen in the middle of a benchmark, it's
going to happen during driver initialization in the guest.

> Probably we should just limit this interface to the guest MMIO region, and we can
> have some crosscheck with the VFIO driver, which has monitored the config space
> accesses, to make sure nothing gets moved around?

No, the solution for the bar is very clear, map on fault to the vma
accessing the mmap and be done with it for the remainder of this
instance of the VM.
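
A minimal sketch of that one-time fault, assuming a current kernel's vm_ops
API and a hypothetical per-device record of which physical BAR pages back
this vgpu (both are assumptions, not the proposed implementation):

  #include <linux/mm.h>

  /* hypothetical state the vendor driver keeps per vgpu */
  struct vgpu_dev {
          unsigned long bar_base_pfn; /* backing segment of the physical BAR */
  };

  static vm_fault_t vgpu_mmio_fault(struct vm_fault *vmf)
  {
          struct vgpu_dev *vgpu = vmf->vma->vm_private_data;

          /* Install the real mapping into the vma that mmap'd the vfio
           * device file; mmap() set VM_PFNMAP and left the range
           * unpopulated, so this fires once per page. */
          return vmf_insert_pfn(vmf->vma, vmf->address,
                                vgpu->bar_base_pfn + vmf->pgoff);
  }

  static const struct vm_operations_struct vgpu_mmio_vm_ops = {
          .fault = vgpu_mmio_fault,
  };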

> > In case it's not clear to folks at Intel, the purpose of this is that a
> > vGPU may directly map a segment of the physical GPU MMIO space, but we
> > may not know what segment that is at setup time, when QEMU does an mmap
> > of the vfio device file descriptor.  The thought is that we can create
> > an invalid mapping when QEMU calls mmap(), knowing that it won't be
> > accessed until later, then we can fault in the real mmap on demand.  Do
> > you need anything similar?
> > 
> > > >  
> > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > > >  
> > > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > > >  
> > > > > Still a lot to be added and modified, such as supporting multiple VMs and 
> > > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
> > > > > kernel driver, error handling, roll-back and locked memory size per user, etc. 
> > > >  
> > > > Particularly, handling of mapping changes is completely missing.  This
> > > > cannot be a point in time translation, the user is free to remap
> > > > addresses whenever they wish and device translations need to be updated
> > > > accordingly.
> > > >  
> > >  
> > > When you say "user", do you mean the QEMU?
> > 
> > vfio is a generic userspace driver interface, QEMU is a very, very
> > important user of the interface, but not the only user.  So for this
> > conversation, we're mostly talking about QEMU as the user, but we should
> > be careful about assuming QEMU is the only user.
> > 
> 
> Understood. I have to say that our focus at this moment is to support QEMU and
> KVM, but I know the VFIO interface is much more than that, and that is why I think
> it is right to leverage this framework so we can explore future use
> cases in userland together.
> 
> 
> > > Here, the memory targeted by any DMA that
> > > the guest driver is going to launch will first be pinned within the VM, and then
> > > registered to QEMU and therefore the IOMMU memory listener; eventually the pages
> > > will be pinned by the GPU or DMA engine.
> > >  
> > > Since we are keeping the upper-level code the same, consider the passthrough case,
> > > where the GPU has already put the real IOVA into its PTEs; I don't know how QEMU
> > > can change that mapping without causing an IOMMU fault on an active DMA device.
> > 
> > For the virtual BAR mapping above, it's easy to imagine that mapping a
> > BAR to a given address is at the guest discretion, it may be mapped and
> > unmapped, it may be mapped to different addresses at different points in
> > time, the guest BIOS may choose to map it at yet another address, etc.
> > So if somehow we were trying to setup a mapping for peer-to-peer, there
> > are lots of ways that IOVA could change.  But even with RAM, we can
> > support memory hotplug in a VM.  What was once a DMA target may be
> > removed or may now be backed by something else.  Chipset configuration
> > on the emulated platform may change how guest physical memory appears
> > and that might change between VM boots.
> > 
> > Currently with physical device assignment the memory listener watches
> > for both maps and unmaps and updates the iotlb to match.  Just like real
> > hardware doing these same sorts of things, we rely on the guest to stop
> > using memory that's going to be moved as a DMA target prior to moving
> > it.
> 
> Right,  you can only do that when the device is quiescent.
> 
> As long as the guest is notified of this, I think we should be able to
> support it, although the real implementation will depend on how the device gets
> into a quiescent state.
> 
> This is definitely a very interesting feature we should explore, but I hope we
> probably can first focus on the most basic functionality.

If we only do a point-in-time translation and assume it never changes,
that's good enough for a proof of concept, but it's not a complete
solution.  I think this is a practical problem, not just an academic
problem.  There needs to be a mechanism for mappings to be invalidated based
on VM memory changes.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27  5:43                                           ` [Qemu-devel] " Jike Song
@ 2016-01-27 16:19                                             ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-27 16:19 UTC (permalink / raw)
  To: Jike Song
  Cc: Tian, Kevin, Yang Zhang, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Neo Jia

On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
> On 01/27/2016 11:07 AM, Alex Williamson wrote:
> > On Wed, 2016-01-27 at 09:47 +0800, Jike Song wrote:
> > > On 01/27/2016 06:56 AM, Alex Williamson wrote:
> > > > On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Wednesday, January 27, 2016 6:27 AM
> > > > > >  
> > > > > > On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Wednesday, January 27, 2016 6:08 AM
> > > > > > > >  
> > > > > > > > > > > >  
> > > > > > > > > > >  
> > > > > > > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > > > > > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > > > > > > > > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > > > > > > > > > path would be unnecessarily long with obvious performance impact. We
> > > > > > > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > > > > > > passthrough support), or just hide the region alternatively (let KVMGT
> > > > > > > > > > > to handle I/O emulation itself like today).
> > > > > > > > > >  
> > > > > > > > > > That sounds like a future optimization TBH.  There's very strict
> > > > > > > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > > > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > > > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > > > > > > of accelerators, there might be cases where userspace wants to see those
> > > > > > > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > > > > > > take shortcuts to provide such direct access.  Thanks,
> > > > > > > > > >  
> > > > > > > > >  
> > > > > > > > > But we have to balance such debugging flexibility against acceptable performance.
> > > > > > > > > To me the latter is more important; otherwise there'd be no real usage
> > > > > > > > > of this technique, while for debugging there are other alternatives (e.g.
> > > > > > > > > ftrace). Consider some extreme case with 100k traps/second and then see
> > > > > > > > > how much impact a 2-3x longer emulation path can bring...
> > > > > > > >  
> > > > > > > > Are you jumping to the conclusion that it cannot be done with proper
> > > > > > > > layering in place?  Performance is important, but it's not an excuse to
> > > > > > > > abandon designing interfaces between independent components.  Thanks,
> > > > > > > >  
> > > > > > >  
> > > > > > > The two are not contradictory. My point is to remove the unnecessarily long
> > > > > > > trip where possible. On further thought, yes, we can reuse the existing
> > > > > > > read/write flags:
> > > > > > >  	- KVMGT will expose a private control variable indicating whether
> > > > > > > in-kernel delivery is required;
> > > > > >  
> > > > > > But in-kernel delivery is never *required*.  Wouldn't userspace want to
> > > > > > deliver in-kernel any time it possibly could?
> > > > > >  
> > > > > > >  	- when the variable is true, KVMGT will register in-kernel MMIO
> > > > > > > emulation callbacks, and VM MMIO requests will be delivered to KVMGT
> > > > > > > directly;
> > > > > > >  	- when the variable is false, KVMGT will not register anything.
> > > > > > > VM MMIO requests will then be delivered to QEMU, and ioread/write
> > > > > > > will be used to finally reach the KVMGT emulation logic;
> > > > > >  
> > > > > > No, that means the interface is entirely dependent on a backdoor through
> > > > > > KVM.  Why can't userspace (QEMU) do something like register an MMIO
> > > > > > region with KVM handled via a provided file descriptor and offset,
> > > > > > couldn't KVM then call the file ops without a kernel exit?  Thanks,
> > > > > >  
> > > > >  
> > > > > Could you elaborate on this thought? If it can achieve the purpose w/o
> > > > > a kernel exit, we can definitely adapt to it. :-)
> > > >  
> > > > I only thought of it when replying to the last email and have been doing
> > > > some research, but we already do quite a bit of synchronization through
> > > > file descriptors.  The kvm-vfio pseudo device uses a group file
> > > > descriptor to ensure a user has access to a group, allowing some degree
> > > > of interaction between modules.  Eventfds and irqfds already make use of
> > > > f_ops on file descriptors to poke data.  So, if KVM had information that
> > > > an MMIO region was backed by a file descriptor for which it already has
> > > > a reference via fdget() (and verified access rights and whatnot), then
> > > > it ought to be a simple matter to get to f_ops->read/write knowing the
> > > > base offset of that MMIO region.  Perhaps it could even simply use
> > > > __vfs_read/write().  Then we've got a proper reference to the file
> > > > descriptor for ownership purposes and we've transparently jumped across
> > > > modules without any implicit knowledge of the other end.  Could it work?
> > >  
> > > This is OK for KVMGT; going from the fops to the vgpu device-model would always be simple.
> > > The only question is, how is the KVM hypervisor supposed to get the fd on VM-exitings?
> > 
> > Hi Jike,
> > 
> > Sorry, I don't understand "on VM-exiting".  KVM would hold a reference
> > to the fd via fdget(), so the vfio device wouldn't be closed until the
> > VM exits and KVM releases that reference.
> > 
> 
> Sorry for my bad English, I meant VMEXIT, from non-root to kvm hypervisor.
> 
> > > copy-and-paste the current implementation of vcpu_mmio_write(), seems
> > > nothing but GPA and len are provided:
> > 
> > I presume that an MMIO region is already registered with a GPA and
> > length, the additional information necessary would be a file descriptor
> > and offset into the file descriptor for the base of the MMIO space.
> > 
> > >  	static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
> > >  				   const void *v)
> > >  	{
> > >  		int handled = 0;
> > >  		int n;
> > >  
> > >  		do {
> > >  			n = min(len, 8);
> > >  			if (!(vcpu->arch.apic &&
> > >  			      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
> > >  			    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
> > >  				break;
> > >  			handled += n;
> > >  			addr += n;
> > >  			len -= n;
> > >  			v += n;
> > >  		} while (len);
> > >  
> > >  		return handled;
> > >  	}
> > >  
> > > If we back a GPA range with a fd, this will also be a 'backdoor'?
> > 
> > KVM would simply be able to service the MMIO access using the provided
> > fd and offset.  It's not a back door because we will have created an API
> > for KVM to have a file descriptor and offset registered (by userspace)
> > to handle the access.  Also, KVM does not know the file descriptor is
> > handled by a VFIO device and VFIO doesn't know the read/write accesses
> > is initiated by KVM.  Seems like the question is whether we can fit
> > something like that into the existing KVM MMIO bus/device handlers
> > in-kernel.  Thanks,
> > 
> 
> Having had a look at eventfd, I would say yes, technically we are able to
> achieve the goal: introduce an fd with fop->{read|write} defined in KVM that
> calls into the vgpu device-model, plus an iodev registered for an MMIO GPA
> range to invoke the fop->{read|write}.  I just didn't understand why
> userspace can't register an iodev via an API directly.

Please elaborate on how it would work via iodev.

> Besides, this doesn't necessarily require another thread, right?
> I guess it can be within the VCPU thread? 

I would think so too; the vcpu is blocked on the MMIO access, so we should
be able to service it in that context.  I hope.
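
A sketch of what that in-context service could look like, assuming a
hypothetical wrapper around KVM's kvm_io_device (from KVM's internal
iodev.h) holding the fd and offset registered by userspace, with today's
kernel_write() standing in for the __vfs_read/write() idea above:

  #include <linux/fs.h>

  /* hypothetical: an MMIO range registered with KVM, backed by a vfio
   * device fd + offset; KVM holds the file reference via fdget() */
  struct fd_iodev {
          struct kvm_io_device dev;
          struct file *filp;  /* the vfio device file */
          loff_t base;        /* offset of this region within the fd */
          gpa_t gpa;          /* guest-physical base of the range */
  };

  static int fd_iodev_write(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
                            gpa_t addr, int len, const void *val)
  {
          struct fd_iodev *fdev = container_of(dev, struct fd_iodev, dev);
          loff_t pos = fdev->base + (addr - fdev->gpa);

          /* Runs in the vcpu thread: no exit to userspace, and vfio never
           * learns that the access came from KVM. */
          return kernel_write(fdev->filp, val, len, &pos) == len ? 0 : -EOPNOTSUPP;
  }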

> And this brought up another question: except for the vfio bus driver and
> iommu backend (and the page_track utility used for guest memory write-protection),
> is KVMGT allowed to call into kvm.ko (or modify it)? Though we are
> becoming less and less willing to do that with VFIO, it's still better
> to know that before going wrong.

kvm and vfio are separate modules, for the most part, they know nothing
about each other and have no hard dependencies between them.  We do have
various accelerations we can use to avoid paths through userspace, but
these are all via APIs that are agnostic of the party on the other end.
For example, vfio signals interrupts through eventfds and has no concept
of whether that eventfd terminates in userspace or into an irqfd in KVM.
vfio supports direct access to device MMIO regions via mmaps, but vfio
has no idea if that mmap gets directly mapped into a VM address space.
Even with posted interrupts, we've introduced an irq bypass manager
allowing interrupt producers and consumers to register independently to
form a connection without directly knowing anything about the other
module.  That sort of proper software layering needs to continue.  It
would be wrong for a vfio bus driver to assume KVM is the user and
directly call into KVM interfaces.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
@ 2016-01-27 16:19                                             ` Alex Williamson
  0 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-27 16:19 UTC (permalink / raw)
  To: Jike Song
  Cc: Yang Zhang, Ruan, Shuai, Tian, Kevin, Neo Jia, kvm,
	igvt-g@lists.01.org, qemu-devel, Gerd Hoffmann, Paolo Bonzini,
	Lv, Zhiyuan

On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
> On 01/27/2016 11:07 AM, Alex Williamson wrote:
> > On Wed, 2016-01-27 at 09:47 +0800, Jike Song wrote:
> > > On 01/27/2016 06:56 AM, Alex Williamson wrote:
> > > > On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Wednesday, January 27, 2016 6:27 AM
> > > > > >  
> > > > > > On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Wednesday, January 27, 2016 6:08 AM
> > > > > > > >  
> > > > > > > > > > > >  
> > > > > > > > > > >  
> > > > > > > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > > > > > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > > > > > > > > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > > > > > > > > > path would be unnecessarily long with obvious performance impact. We
> > > > > > > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > > > > > > passthrough support), or just hide the region alternatively (let KVMGT
> > > > > > > > > > > to handle I/O emulation itself like today).
> > > > > > > > > >  
> > > > > > > > > > That sounds like a future optimization TBH.  There's very strict
> > > > > > > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > > > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > > > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > > > > > > of accelerators, there might be cases where userspace wants to see those
> > > > > > > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > > > > > > take shortcuts to provide such direct access.  Thanks,
> > > > > > > > > >  
> > > > > > > > >  
> > > > > > > > > But we have to balance such debugging flexibility against acceptable performance.
> > > > > > > > > To me the latter is more important; otherwise there'd be no real usage
> > > > > > > > > of this technique, while for debugging there are other alternatives (e.g.
> > > > > > > > > ftrace). Consider some extreme case with 100k traps/second and then see
> > > > > > > > > how much impact a 2-3x longer emulation path can bring...
> > > > > > > >  
> > > > > > > > Are you jumping to the conclusion that it cannot be done with proper
> > > > > > > > layering in place?  Performance is important, but it's not an excuse to
> > > > > > > > abandon designing interfaces between independent components.  Thanks,
> > > > > > > >  
> > > > > > >  
> > > > > > > The two are not contradictory. My point is to remove the unnecessarily long
> > > > > > > trip where possible. On further thought, yes, we can reuse the existing
> > > > > > > read/write flags:
> > > > > > >  	- KVMGT will expose a private control variable indicating whether
> > > > > > > in-kernel delivery is required;
> > > > > >  
> > > > > > But in-kernel delivery is never *required*.  Wouldn't userspace want to
> > > > > > deliver in-kernel any time it possibly could?
> > > > > >  
> > > > > > >  	- when the variable is true, KVMGT will register in-kernel MMIO
> > > > > > > emulation callbacks, and VM MMIO requests will be delivered to KVMGT
> > > > > > > directly;
> > > > > > >  	- when the variable is false, KVMGT will not register anything.
> > > > > > > VM MMIO requests will then be delivered to QEMU, and ioread/write
> > > > > > > will be used to finally reach the KVMGT emulation logic;
> > > > > >  
> > > > > > No, that means the interface is entirely dependent on a backdoor through
> > > > > > KVM.  Why can't userspace (QEMU) do something like register an MMIO
> > > > > > region with KVM handled via a provided file descriptor and offset,
> > > > > > couldn't KVM then call the file ops without a kernel exit?  Thanks,
> > > > > >  
> > > > >  
> > > > > Could you elaborate on this thought? If it can achieve the purpose w/o
> > > > > a kernel exit, we can definitely adapt to it. :-)
> > > >  
> > > > I only thought of it when replying to the last email and have been doing
> > > > some research, but we already do quite a bit of synchronization through
> > > > file descriptors.  The kvm-vfio pseudo device uses a group file
> > > > descriptor to ensure a user has access to a group, allowing some degree
> > > > of interaction between modules.  Eventfds and irqfds already make use of
> > > > f_ops on file descriptors to poke data.  So, if KVM had information that
> > > > an MMIO region was backed by a file descriptor for which it already has
> > > > a reference via fdget() (and verified access rights and whatnot), then
> > > > it ought to be a simple matter to get to f_ops->read/write knowing the
> > > > base offset of that MMIO region.  Perhaps it could even simply use
> > > > __vfs_read/write().  Then we've got a proper reference to the file
> > > > descriptor for ownership purposes and we've transparently jumped across
> > > > modules without any implicit knowledge of the other end.  Could it work?
> > >  
> > > This is OK for KVMGT; going from fops to the vgpu device-model would always be simple.
> > > The only question is, how is the KVM hypervisor supposed to get the fd on VM-exitings?
> > 
> > Hi Jike,
> > 
> > Sorry, I don't understand "on VM-exiting".  KVM would hold a reference
> > to the fd via fdget(), so the vfio device wouldn't be closed until the
> > VM exits and KVM releases that reference.
> > 
> 
> Sorry for my bad English, I meant VMEXIT, from non-root to kvm hypervisor.
> 
> > > Copy-and-pasting the current implementation of vcpu_mmio_write(), it seems
> > > nothing but the GPA and len are provided:
> > 
> > I presume that an MMIO region is already registered with a GPA and
> > length, the additional information necessary would be a file descriptor
> > and offset into the file descriptor for the base of the MMIO space.
> > 
> > >  	static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
> > >  				   const void *v)
> > >  	{
> > >  		int handled = 0;
> > >  		int n;
> > >  
> > >  		do {
> > >  			n = min(len, 8);
> > >  			if (!(vcpu->arch.apic &&
> > >  			      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
> > >  			    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
> > >  				break;
> > >  			handled += n;
> > >  			addr += n;
> > >  			len -= n;
> > >  			v += n;
> > >  		} while (len);
> > >  
> > >  		return handled;
> > >  	}
> > >  
> > > If we back a GPA range with an fd, would this also be a 'backdoor'?
> > 
> > KVM would simply be able to service the MMIO access using the provided
> > fd and offset.  It's not a back door because we will have created an API
> > for KVM to have a file descriptor and offset registered (by userspace)
> > to handle the access.  Also, KVM does not know the file descriptor is
> > handled by a VFIO device, and VFIO doesn't know the read/write accesses
> > are initiated by KVM.  Seems like the question is whether we can fit
> > something like that into the existing KVM MMIO bus/device handlers
> > in-kernel.  Thanks,
> > 
> 
> Having had a look at eventfd, I would say yes, technically we are able to
> achieve the goal: introduce an fd with fop->{read|write} defined in KVM to
> call into the vgpu device-model, plus an iodev registered for an MMIO GPA
> range to invoke the fop->{read|write}.  I just didn't understand why
> userspace can't register an iodev via an API directly.

Please elaborate on how it would work via iodev.

> Besides, this doesn't necessarily require another thread, right?
> I guess it can be within the VCPU thread? 

I would think so too; the vcpu is blocked on the MMIO access, so we should
be able to service it in that context.  I hope.
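
For illustration, a rough sketch of how such an fd-backed iodev might plug
into KVM's existing MMIO bus.  Everything named kvm_fd_mmio_* below is
hypothetical (no such API exists), and a real implementation would need a
kernel-buffer read helper rather than __vfs_read() on a kernel pointer:

struct kvm_fd_mmio_dev {
	struct kvm_io_device dev;
	struct file *filp;	/* reference held via fdget() for the VM lifetime */
	gpa_t base;		/* guest physical base of the region */
	loff_t offset;		/* offset of the region within filp */
};

static int kvm_fd_mmio_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
			    gpa_t addr, int len, void *val)
{
	struct kvm_fd_mmio_dev *fdev =
		container_of(this, struct kvm_fd_mmio_dev, dev);
	loff_t pos = fdev->offset + (addr - fdev->base);

	/* crosses into the backing device's f_ops without knowing it's vfio */
	return __vfs_read(fdev->filp, (char __user *)val, len, &pos) == len ?
		0 : -EOPNOTSUPP;
}

static const struct kvm_io_device_ops kvm_fd_mmio_ops = {
	.read	= kvm_fd_mmio_read,
	/* .write would mirror this using __vfs_write() */
};

/* called from a hypothetical new KVM ioctl, with kvm->slots_lock held */
static int kvm_register_fd_mmio(struct kvm *kvm, struct kvm_fd_mmio_dev *fdev,
				int len)
{
	kvm_iodevice_init(&fdev->dev, &kvm_fd_mmio_ops);
	return kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, fdev->base, len,
				       &fdev->dev);
}

With something like this in place, vcpu_mmio_write() above would find the
device on KVM_MMIO_BUS and the access would be serviced in the vcpu thread,
with no exit to userspace.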

> And this brings up another question: apart from the vfio bus driver and
> iommu backend (and the page_track utility used for guest memory write-protection),
> is KVMGT allowed to call into kvm.ko (or modify it)? Though we are
> becoming less and less willing to do that with VFIO, it's still better
> to know that before going wrong.

kvm and vfio are separate modules; for the most part, they know nothing
about each other and have no hard dependencies between them.  We do have
various accelerations we can use to avoid paths through userspace, but
these are all via APIs that are agnostic of the party on the other end.
For example, vfio signals interrupts through eventfds and has no concept
of whether that eventfd terminates in userspace or into an irqfd in KVM.
vfio supports direct access to device MMIO regions via mmaps, but vfio
has no idea if that mmap gets directly mapped into a VM address space.
Even with posted interrupts, we've introduced an irq bypass manager
allowing interrupt producers and consumers to register independently to
form a connection without directly knowing anything about the other
module.  That sort of proper software layering needs to continue.  It
would be wrong for a vfio bus driver to assume KVM is the user and
directly call into KVM interfaces.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27 16:00                         ` [Qemu-devel] " Alex Williamson
@ 2016-01-27 20:55                           ` Kirti Wankhede
  -1 siblings, 0 replies; 118+ messages in thread
From: Kirti Wankhede @ 2016-01-27 20:55 UTC (permalink / raw)
  To: Alex Williamson, Neo Jia, Tian, Kevin
  Cc: Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org



On 1/27/2016 9:30 PM, Alex Williamson wrote:
> On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote:
>>
>> On 1/27/2016 1:36 AM, Alex Williamson wrote:
>>> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
>>>> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
>>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>
>>>> Hi Alex, Kevin and Jike,
>>>>
>>>> (Seems I shouldn't use attachments; resending to the list, patches are
>>>> inline at the end)
>>>>
>>>> Thanks for adding me to this technical discussion, a great opportunity
>>>> for us to design together which can bring both Intel and NVIDIA vGPU solutions to
>>>> the KVM platform.
>>>>
>>>> Instead of directly jumping to the proposal that we have been working on
>>>> recently for NVIDIA vGPU on KVM, I think it is better for me to put out a couple of
>>>> quick comments / thoughts regarding the existing discussions on this thread as
>>>> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
>>>>
>>>> Then we can look at what we have, hopefully we can reach some consensus soon.
>>>>
>>>>> Yes, and since you're creating and destroying the vgpu here, this is
>>>>> where I'd expect a struct device to be created and added to an IOMMU
>>>>> group.  The lifecycle management should really include links between
>>>>> the vGPU and physical GPU, which would be much, much easier to do with
>>>>> struct devices created here rather than at the point where we start
>>>>> doing vfio "stuff".
>>>>
>>>> In fact, to keep vfio-vgpu more generic, vgpu device creation and management
>>>> can be centralized and done in vfio-vgpu. That also includes adding to the IOMMU
>>>> group and VFIO group.
>>> Is this really a good idea?  The concept of a vgpu is not unique to
>>> vfio, we want vfio to be a driver for a vgpu, not an integral part of
>>> the lifecycle of a vgpu.  That certainly doesn't exclude adding
>>> infrastructure to make lifecycle management of a vgpu more consistent
>>> between drivers, but it should be done independently of vfio.  I'll go
>>> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
>>> does not create the VF, that's done in coordination with the PF making
>>> use of some PCI infrastructure for consistency between drivers.
>>>
>>> It seems like we need to take more advantage of the class and driver
>>> core support to perhaps setup a vgpu bus and class with vfio-vgpu just
>>> being a driver for those devices.
>>
>> For the device passthrough or SR-IOV model, PCI devices are created by the PCI
>> bus driver, and from the probe routine each device is added to the vfio group.
>
> An SR-IOV VF is created by the PF driver using standard interfaces
> provided by the PCI core.  The IOMMU group for a VF is added by the
> IOMMU driver when the device is created on the pci_bus_type.  The probe
> routine of the vfio bus driver (vfio-pci) is what adds the device into
> the vfio group.
>
>> For vgpu, there should be a common module that creates the vgpu device, say
>> a vgpu module, adds the vgpu device to an IOMMU group and then adds it to a vfio
>> group.  This module can handle management of vgpus. The advantage of keeping
>> this a separate module, rather than doing device creation in vendor
>> modules, is having a generic interface for vgpu management, for example,
>> files /sys/class/vgpu/vgpu_start and /sys/class/vgpu/vgpu_shutdown and a
>> vgpu driver registration interface.
>
> But you're suggesting something very different from the SR-IOV model.
> If we wanted to mimic that model, the GPU specific driver should create
> the vgpu using services provided by a common interface.  For instance
> i915 could call a new vgpu_device_create() which creates the device,
> adds it to the vgpu class, etc.  That vgpu device should not be assumed
> to be used with vfio though, that should happen via a separate probe
> using a vfio-vgpu driver.  It's that vfio bus driver that will add the
> device to a vfio group.
>

In that case the vgpu module should provide a driver registration interface
to register the vfio-vgpu driver.

struct vgpu_driver {
	const char *name;
	int (*probe) (struct vgpu_device *vdev);
	void (*remove) (struct vgpu_device *vdev);
};

int vgpu_register_driver(struct vgpu_driver *driver)
{
...
}
EXPORT_SYMBOL(vgpu_register_driver);

int vgpu_unregister_driver(struct vgpu_driver *driver)
{
...
}
EXPORT_SYMBOL(vgpu_unregister_driver);

The vfio-vgpu driver registers with the vgpu module. Then, from
vgpu_device_create(), after creating the device it calls
vgpu_driver->probe(vgpu_device), and the vfio-vgpu driver adds the device to
a vfio group.

+--------------+    vgpu_register_driver()+---------------+
|     __init() +------------------------->+               |
|              |                          |               |
|              +<-------------------------+    vgpu.ko    |
| vfio_vgpu.ko |   probe()/remove()       |               |
|              |                +---------+               +---------+
+--------------+                |         +-------+-------+         |
                                 |                 ^                 |
                                 | callback        |                 |
                                 |         +-------+--------+        |
                                 |         |vgpu_register_device()   |
                                 |         |                |        |
                                 +---^-----+-----+    +-----+------+-+
                                     | nvidia.ko |    |  i915.ko   |
                                     |           |    |            |
                                     +-----------+    +------------+
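
To make the arrows above concrete, a minimal sketch of what the vgpu.ko side
of this flow could look like (single registered driver assumed; everything
inside vgpu_device_create() is invented for illustration):

/* vgpu.ko: the one registered vfio-vgpu driver */
static struct vgpu_driver *vgpu_drv;

int vgpu_register_driver(struct vgpu_driver *driver)
{
	if (vgpu_drv)
		return -EBUSY;
	vgpu_drv = driver;
	return 0;
}
EXPORT_SYMBOL(vgpu_register_driver);

/* called on behalf of i915.ko/nvidia.ko after vgpu_register_device() */
struct vgpu_device *vgpu_device_create(struct device *parent)
{
	struct vgpu_device *vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);

	if (!vdev)
		return ERR_PTR(-ENOMEM);

	/* ... device init, IOMMU group setup, sysfs ... */

	/* callback into vfio_vgpu.ko, which adds vdev to a vfio group */
	if (vgpu_drv && vgpu_drv->probe(vdev)) {
		kfree(vdev);
		return ERR_PTR(-ENODEV);
	}
	return vdev;
}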

Is my understanding correct?

Thanks,
Kirti


>> In the patch, vgpu_dev.c + vgpu_sysfs.c form such a vgpu module and
>> vgpu_vfio.c is for the VFIO interface. Each vgpu device should be added to
>> a vfio group, so vgpu_group_init() from vgpu_vfio.c should be called per
>> device. In the vgpu module, vgpu devices are created on request, so
>> vgpu_group_init() should be called explicitly for each vgpu device.
>>    That’s why I had merged the 2 modules, vgpu + vgpu_vfio, to form one vgpu
>> module.  Vgpu_vfio would remain a separate entity but merged with the vgpu
>> module.
>
> I disagree with this design; creation of a vgpu necessarily involves the
> GPU driver and should not be tied to use of the vgpu with vfio.  vfio
> should be a driver for the device, maybe eventually not the only driver
> for the device.  Thanks,
>
> Alex
>

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27 16:10                             ` [Qemu-devel] " Alex Williamson
@ 2016-01-27 21:48                               ` Neo Jia
  -1 siblings, 0 replies; 118+ messages in thread
From: Neo Jia @ 2016-01-27 21:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Kirti Wankhede

On Wed, Jan 27, 2016 at 09:10:16AM -0700, Alex Williamson wrote:
> On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> > > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > > > 1.1 Under per-physical device sysfs:
> > > > > > ----------------------------------------------------------------------------------
> > > > > >  
> > > > > > vgpu_supported_types - RO, lists the currently supported virtual GPU types and their
> > > > > > VGPU_IDs. VGPU_ID - a vGPU type identifier returned from reads of
> > > > > > "vgpu_supported_types".
> > > > > >
> > > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual
> > > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM
> > > > > >  
> > > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual gpu device on a
> > > > > > target physical GPU
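
As an aside, a sketch of how the vgpu_create syntax above might be parsed on
the kernel side; vgpu_create_internal() is an invented helper, and only the
string handling is meant to be illustrative:

static ssize_t vgpu_create_store(struct device *dev,
				 struct device_attribute *attr,
				 const char *buf, size_t count)
{
	char *str, *p, *uuid;
	unsigned int idx, vgpu_id;
	int ret = -EINVAL;

	str = kstrndup(buf, count, GFP_KERNEL);
	if (!str)
		return -ENOMEM;

	/* expected input: <VM_UUID:idx:VGPU_ID> */
	p = str;
	uuid = strsep(&p, ":");
	if (uuid && p && sscanf(p, "%u:%u", &idx, &vgpu_id) == 2)
		ret = vgpu_create_internal(dev, uuid, idx, vgpu_id);

	kfree(str);
	return ret < 0 ? ret : count;
}
static DEVICE_ATTR_WO(vgpu_create);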
> > > > >  
> > > > >  
> > > > > I've noted in previous discussions that we need to separate user policy
> > > > > from kernel policy here, the kernel policy should not require a "VM
> > > > > UUID".  A UUID simply represents a set of one or more devices and an
> > > > > index picks the device within the set.  Whether that UUID matches a VM
> > > > > or is independently used is up to the user policy when creating the
> > > > > device.
> > > > >  
> > > > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > > > UUID set of devices and instead have each device be independent.  This
> > > > > seems to be an imposition on the nvidia implementation into the kernel
> > > > > interface design.
> > > > >  
> > > >  
> > > > Hi Alex,
> > > >  
> > > > I agree with you that we should not put UUID concept into a kernel API. At
> > > > this point (without any prototyping), I am thinking of using a list of virtual
> > > > devices instead of UUID.
> > > 
> > > Hi Neo,
> > > 
> > > A UUID is a perfectly fine name, so long as we let it be just a UUID and
> > > not the UUID matching some specific use case.
> > > 
> > > > > >  
> > > > > > int vgpu_map_virtual_bar
> > > > > > (
> > > > > >     uint64_t virt_bar_addr,
> > > > > >     uint64_t phys_bar_addr,
> > > > > >     uint32_t len,
> > > > > >     uint32_t flags
> > > > > > )
> > > > > >  
> > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > > >  
> > > > >  
> > > > > Per the implementation provided, this needs to be implemented in the
> > > > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > > > of the device and replacing it is wrong.  It should be remapped at the
> > > > > vfio device file interface using vm_ops.
> > > > >  
> > > >  
> > > > So you are basically suggesting that we are going to take a mmap fault and
> > > > within that fault handler, we will go into vendor driver to look up the
> > > > "pre-registered" mapping and remap there.
> > > >  
> > > > Is my understanding correct?
> > > 
> > > Essentially, hopefully the vendor driver will have already registered
> > > the backing for the mmap prior to the fault, but either way could work.
> > > I think the key though is that you want to remap it onto the vma
> > > accessing the vfio device file, not scanning it out of an IOVA mapping
> > > that might be dynamic and doing a vma lookup based on the point in time
> > > mapping of the BAR.  The latter doesn't give me much confidence that
> > > mappings couldn't change while the former should be a one time fault.
> > 
> > Hi Alex,
> > 
> > The fact is that the vendor driver can only prevent such an mmap fault by looking
> > up the <iova, hva> mapping table that we have saved from the IOMMU memory listener
> 
> Why do we need to prevent the fault?  We need to handle the fault when
> it occurs.
> 
> > when the guest region gets programmed. Also, as you have mentioned below, such a
> > mapping between iova and hva shouldn't be changed as long as the SBIOS and
> > guest OS are done with their job.
> 
> But you don't know they're done with their job.
> 
> > Yes, you are right that it is a one-time fault, but the gpu work is heavily pipelined.
> 
> Why does that matter?  We're talking about the first time the VM
> accesses the range of the BAR that will be direct mapped to the physical
> GPU.  This isn't going to happen in the middle of a benchmark, it's
> going to happen during driver initialization in the guest.
> 
> > Probably we should just limit this interface to the guest MMIO region, and we can
> > have some crosscheck with the VFIO driver, which has monitored the config space
> > access, to make sure nothing is getting moved around?
> 
> No, the solution for the bar is very clear, map on fault to the vma
> accessing the mmap and be done with it for the remainder of this
> instance of the VM.
> 

Hi Alex,

I totally get your points; my previous comments were just trying to explain the
reasoning behind our current implementation. I think I have found a way to hide
the latency of the mmap fault, which might happen in the middle of running a
benchmark.

I will add a new registration interface to allow the driver vendor to provide a
fault handler callback, and from there pages will be installed. 
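
A minimal sketch of that fault-time mapping with the vm_ops interface of this
era; vgpu_device and its get_mmio_pfn() callback are invented, and the mmap()
path is assumed to have set VM_IO | VM_PFNMAP without installing any PTEs:

static int vgpu_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct vgpu_device *vdev = vma->vm_private_data;
	unsigned long addr = (unsigned long)vmf->virtual_address;
	unsigned long pgoff = vma->vm_pgoff +
			      ((addr - vma->vm_start) >> PAGE_SHIFT);
	unsigned long pfn;

	/* vendor callback resolves region offset -> physical BAR pfn */
	if (vdev->ops->get_mmio_pfn(vdev, pgoff, &pfn))
		return VM_FAULT_SIGBUS;

	/* one-time insertion; subsequent accesses hit the installed PTE */
	if (vm_insert_pfn(vma, addr, pfn))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct vgpu_mmio_vm_ops = {
	.fault = vgpu_mmio_fault,
};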

> > > In case it's not clear to folks at Intel, the purpose of this is that a
> > > vGPU may directly map a segment of the physical GPU MMIO space, but we
> > > may not know what segment that is at setup time, when QEMU does an mmap
> > > of the vfio device file descriptor.  The thought is that we can create
> > > an invalid mapping when QEMU calls mmap(), knowing that it won't be
> > > accessed until later, then we can fault in the real mmap on demand.  Do
> > > you need anything similar?
> > > 
> > > > >  
> > > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > > > >  
> > > > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > > > >  
> > > > > > Still a lot to be added and modified, such as supporting multiple VMs and 
> > > > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU 
> > > > > > kernel driver, error handling, roll-back and locked memory size per user, etc. 
> > > > >  
> > > > > Particularly, handling of mapping changes is completely missing.  This
> > > > > cannot be a point in time translation, the user is free to remap
> > > > > addresses whenever they wish and device translations need to be updated
> > > > > accordingly.
> > > > >  
> > > >  
> > > > When you say "user", do you mean the QEMU?
> > > 
> > > vfio is a generic userspace driver interface, QEMU is a very, very
> > > important user of the interface, but not the only user.  So for this
> > > conversation, we're mostly talking about QEMU as the user, but we should
> > > be careful about assuming QEMU is the only user.
> > > 
> > 
> > Understood. I have to say that our focus at this moment is to support QEMU and
> > KVM, but I know the VFIO interface is much more than that, and that is why I think
> > it is right to leverage this framework so we can together explore future use
> > cases in the userland.
> > 
> > 
> > > > Here, whatever DMA the
> > > > guest driver is going to launch will first be pinned within the VM and then
> > > > registered to QEMU (therefore the IOMMU memory listener); eventually the pages
> > > > will be pinned by the GPU or DMA engine.
> > > >  
> > > > Since we are keeping the upper level code the same, thinking about the passthru case,
> > > > where the GPU has already put the real IOVA into its PTEs, I don't know how QEMU
> > > > can change that mapping without causing an IOMMU fault on an active DMA device.
> > > 
> > > For the virtual BAR mapping above, it's easy to imagine that mapping a
> > > BAR to a given address is at the guest discretion, it may be mapped and
> > > unmapped, it may be mapped to different addresses at different points in
> > > time, the guest BIOS may choose to map it at yet another address, etc.
> > > So if somehow we were trying to setup a mapping for peer-to-peer, there
> > > are lots of ways that IOVA could change.  But even with RAM, we can
> > > support memory hotplug in a VM.  What was once a DMA target may be
> > > removed or may now be backed by something else.  Chipset configuration
> > > on the emulated platform may change how guest physical memory appears
> > > and that might change between VM boots.
> > > 
> > > Currently with physical device assignment the memory listener watches
> > > for both maps and unmaps and updates the iotlb to match.  Just like real
> > > hardware doing these same sorts of things, we rely on the guest to stop
> > > using memory that's going to be moved as a DMA target prior to moving
> > > it.
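
For reference, this map/unmap tracking is what QEMU's vfio memory listener
does today; roughly (heavily simplified, section filtering and error handling
dropped, container_fd assumed to be in scope):

static void vfio_listener_region_add(MemoryListener *listener,
                                     MemoryRegionSection *section)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uint64_t)(uintptr_t)(memory_region_get_ram_ptr(section->mr)
                                       + section->offset_within_region),
        .iova  = section->offset_within_address_space,
        .size  = int128_get64(section->size),
    };

    /* pin and map the range; the matching region_del callback issues
     * VFIO_IOMMU_UNMAP_DMA, so the device iotlb tracks guest memory as
     * it comes and goes */
    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}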
> > 
> > Right, you can only do that when the device is quiescent.
> > 
> > As long as this is notified to the guest, I think we should be able to
> > support it, although the real implementation will depend on how the device gets into
> > a quiescent state.
> > 
> > This is definitely a very interesting feature we should explore, but I hope we
> > probably can first focus on the most basic functionality.
> 
> If we only do a point-in-time translation and assume it never changes,
> that's good enough for a proof of concept, but it's not a complete
> solution.  I think this is a practical problem, not just an academic
> problem.  There needs to be a mechanism for mappings to be invalidated based
> on VM memory changes.  Thanks,
> 

Sorry, probably my previous comment was not very clear. I highly value your input and
the information related to the memory hotplug scenarios, and I never meant to exclude
support for such a feature. The only question is when; that is why I would like to
defer the VM memory hotplug feature to phase 2, after the initial official
launch.

Thanks,
Neo

> Alex
> 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27 20:55                           ` [Qemu-devel] " Kirti Wankhede
@ 2016-01-27 21:58                             ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-27 21:58 UTC (permalink / raw)
  To: Kirti Wankhede, Neo Jia, Tian, Kevin
  Cc: Song, Jike, Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan, Ruan,
	Shuai, kvm, qemu-devel, igvt-g@lists.01.org

On Thu, 2016-01-28 at 02:25 +0530, Kirti Wankhede wrote:
> 
> On 1/27/2016 9:30 PM, Alex Williamson wrote:
> > On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote:
> > > 
> > > On 1/27/2016 1:36 AM, Alex Williamson wrote:
> > > > On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> > > > > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > 
> > > > > Hi Alex, Kevin and Jike,
> > > > > 
> > > > > (Seems I shouldn't use attachments; resending to the list, patches are
> > > > > inline at the end)
> > > > > 
> > > > > Thanks for adding me to this technical discussion, a great opportunity
> > > > > for us to design together which can bring both Intel and NVIDIA vGPU solutions to
> > > > > the KVM platform.
> > > > > 
> > > > > Instead of directly jumping to the proposal that we have been working on
> > > > > recently for NVIDIA vGPU on KVM, I think it is better for me to put out a couple of
> > > > > quick comments / thoughts regarding the existing discussions on this thread as
> > > > > fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
> > > > > 
> > > > > Then we can look at what we have, hopefully we can reach some consensus soon.
> > > > > 
> > > > > > Yes, and since you're creating and destroying the vgpu here, this is
> > > > > > where I'd expect a struct device to be created and added to an IOMMU
> > > > > > group.  The lifecycle management should really include links between
> > > > > > the vGPU and physical GPU, which would be much, much easier to do with
> > > > > > struct devices created here rather than at the point where we start
> > > > > > doing vfio "stuff".
> > > > > 
> > > > > In fact, to keep vfio-vgpu more generic, vgpu device creation and management
> > > > > can be centralized and done in vfio-vgpu. That also includes adding to the IOMMU
> > > > > group and VFIO group.
> > > > Is this really a good idea?  The concept of a vgpu is not unique to
> > > > vfio, we want vfio to be a driver for a vgpu, not an integral part of
> > > > the lifecycle of a vgpu.  That certainly doesn't exclude adding
> > > > infrastructure to make lifecycle management of a vgpu more consistent
> > > > between drivers, but it should be done independently of vfio.  I'll go
> > > > back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
> > > > does not create the VF, that's done in coordination with the PF making
> > > > use of some PCI infrastructure for consistency between drivers.
> > > > 
> > > > It seems like we need to take more advantage of the class and driver
> > > > core support to perhaps setup a vgpu bus and class with vfio-vgpu just
> > > > being a driver for those devices.
> > > 
> > > For the device passthrough or SR-IOV model, PCI devices are created by the PCI
> > > bus driver, and from the probe routine each device is added to the vfio group.
> > 
> > An SR-IOV VF is created by the PF driver using standard interfaces
> > provided by the PCI core.  The IOMMU group for a VF is added by the
> > IOMMU driver when the device is created on the pci_bus_type.  The probe
> > routine of the vfio bus driver (vfio-pci) is what adds the device into
> > the vfio group.
> > 
> > > For vgpu, there should be a common module that creates the vgpu device, say
> > > a vgpu module, adds the vgpu device to an IOMMU group and then adds it to a vfio
> > > group.  This module can handle management of vgpus. The advantage of keeping
> > > this a separate module, rather than doing device creation in vendor
> > > modules, is having a generic interface for vgpu management, for example,
> > > files /sys/class/vgpu/vgpu_start and /sys/class/vgpu/vgpu_shutdown and a
> > > vgpu driver registration interface.
> > 
> > But you're suggesting something very different from the SR-IOV model.
> > If we wanted to mimic that model, the GPU specific driver should create
> > the vgpu using services provided by a common interface.  For instance
> > i915 could call a new vgpu_device_create() which creates the device,
> > adds it to the vgpu class, etc.  That vgpu device should not be assumed
> > to be used with vfio though, that should happen via a separate probe
> > using a vfio-vgpu driver.  It's that vfio bus driver that will add the
> > device to a vfio group.
> > 
> 
> In that case the vgpu module should provide a driver registration interface
> to register the vfio-vgpu driver.
> 
> struct vgpu_driver {
> 	const char *name;
> 	int (*probe) (struct vgpu_device *vdev);
> 	void (*remove) (struct vgpu_device *vdev);
> };
> 
> int vgpu_register_driver(struct vgpu_driver *driver)
> {
> ...
> }
> EXPORT_SYMBOL(vgpu_register_driver);
> 
> int vgpu_unregister_driver(struct vgpu_driver *driver)
> {
> ...
> }
> EXPORT_SYMBOL(vgpu_unregister_driver);
> 
> The vfio-vgpu driver registers with the vgpu module. Then, from
> vgpu_device_create(), after creating the device it calls
> vgpu_driver->probe(vgpu_device), and the vfio-vgpu driver adds the device to
> a vfio group.
> 
> +--------------+    vgpu_register_driver()+---------------+
> |     __init() +------------------------->+               |
> |              |                          |               |
> |              +<-------------------------+    vgpu.ko    |
> | vfio_vgpu.ko |   probe()/remove()       |               |
> |              |                +---------+               +---------+
> +--------------+                |         +-------+-------+         |
>                                  |                 ^                 |
>                                  | callback        |                 |
>                                  |         +-------+--------+        |
>                                  |         |vgpu_register_device()   |
>                                  |         |                |        |
>                                  +---^-----+-----+    +-----+------+-+
>                                      | nvidia.ko |    |  i915.ko   |
>                                      |           |    |            |
>                                      +-----------+    +------------+
> 
> Is my understanding correct?

We have an entire driver core subsystem in Linux for the purpose of
matching devices to drivers; I don't think we should be re-inventing
that.  That's why I'm suggesting that we should have infrastructure
which helps GPU drivers create vGPU devices in a common way,
perhaps even placing the devices on a virtual vgpu bus, and then allow a
vfio-vgpu driver to register as a driver for devices of that bus/class
and use the existing driver callbacks.  Thanks,

Alex
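
A minimal sketch of that bus/class arrangement using the stock driver core
(every identifier below is hypothetical):

/* vgpu.ko: register a trivial bus; the driver core does the matching */
static int vgpu_bus_match(struct device *dev, struct device_driver *drv)
{
	return 1;	/* any vgpu driver can bind any vgpu device */
}

struct bus_type vgpu_bus_type = {
	.name  = "vgpu",
	.match = vgpu_bus_match,
};

static int __init vgpu_init(void)
{
	return bus_register(&vgpu_bus_type);
}

/* i915.ko/nvidia.ko: place the newly created device on the vgpu bus */
static int vgpu_add_device(struct device *dev)
{
	dev->bus = &vgpu_bus_type;
	return device_register(dev);	/* driver core will call probe() */
}

/* vfio_vgpu.ko: just another driver on that bus */
static int vfio_vgpu_probe(struct device *dev)
{
	return 0;	/* add the device to a vfio group here */
}

static struct device_driver vfio_vgpu_driver = {
	.name	= "vfio-vgpu",
	.bus	= &vgpu_bus_type,
	.probe	= vfio_vgpu_probe,
};
/* registered with driver_register(&vfio_vgpu_driver) at module init */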


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27 21:58                             ` [Qemu-devel] " Alex Williamson
@ 2016-01-28  3:01                               ` Kirti Wankhede
  -1 siblings, 0 replies; 118+ messages in thread
From: Kirti Wankhede @ 2016-01-28  3:01 UTC (permalink / raw)
  To: Alex Williamson, Neo Jia, Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, kvm, igvt-g@lists.01.org, qemu-devel,
	Gerd Hoffmann, Paolo Bonzini, Lv, Zhiyuan



On 1/28/2016 3:28 AM, Alex Williamson wrote:
> On Thu, 2016-01-28 at 02:25 +0530, Kirti Wankhede wrote:
>>
>> On 1/27/2016 9:30 PM, Alex Williamson wrote:
>>> On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote:
>>>>
>>>> On 1/27/2016 1:36 AM, Alex Williamson wrote:
>>>>> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
>>>>>> On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
>>>>>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>>>>
>>>>>> Hi Alex, Kevin and Jike,
>>>>>>
>>>>>> (Seems I shouldn't use attachments; resending to the list, patches are
>>>>>> inline at the end)
>>>>>>
>>>>>> Thanks for adding me to this technical discussion, a great opportunity
>>>>>> for us to design together which can bring both Intel and NVIDIA vGPU solutions to
>>>>>> the KVM platform.
>>>>>>
>>>>>> Instead of directly jumping to the proposal that we have been working on
>>>>>> recently for NVIDIA vGPU on KVM, I think it is better for me to put out a couple of
>>>>>> quick comments / thoughts regarding the existing discussions on this thread as
>>>>>> fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.
>>>>>>
>>>>>> Then we can look at what we have, hopefully we can reach some consensus soon.
>>>>>>
>>>>>>> Yes, and since you're creating and destroying the vgpu here, this is
>>>>>>> where I'd expect a struct device to be created and added to an IOMMU
>>>>>>> group.  The lifecycle management should really include links between
>>>>>>> the vGPU and physical GPU, which would be much, much easier to do with
>>>>>>> struct devices created here rather than at the point where we start
>>>>>>> doing vfio "stuff".
>>>>>>
>>>>>> In fact, to keep vfio-vgpu more generic, vgpu device creation and management
>>>>>> can be centralized and done in vfio-vgpu. That also includes adding to the IOMMU
>>>>>> group and VFIO group.
>>>>> Is this really a good idea?  The concept of a vgpu is not unique to
>>>>> vfio, we want vfio to be a driver for a vgpu, not an integral part of
>>>>> the lifecycle of a vgpu.  That certainly doesn't exclude adding
>>>>> infrastructure to make lifecycle management of a vgpu more consistent
>>>>> between drivers, but it should be done independently of vfio.  I'll go
>>>>> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
>>>>> does not create the VF, that's done in coordination with the PF making
>>>>> use of some PCI infrastructure for consistency between drivers.
>>>>>
>>>>> It seems like we need to take more advantage of the class and driver
>>>>> core support to perhaps setup a vgpu bus and class with vfio-vgpu just
>>>>> being a driver for those devices.
>>>>
>>>> For device passthrough or the SR-IOV model, PCI devices are created by the
>>>> PCI bus driver, and from the probe routine each device is added to the vfio group.
>>>
>>> An SR-IOV VF is created by the PF driver using standard interfaces
>>> provided by the PCI core.  The IOMMU group for a VF is added by the
>>> IOMMU driver when the device is created on the pci_bus_type.  The probe
>>> routine of the vfio bus driver (vfio-pci) is what adds the device into
>>> the vfio group.
>>>
>>>> For vgpu, there should be a common module that creates vgpu devices, say a
>>>> vgpu module, adds each vgpu device to an IOMMU group and then adds it to a
>>>> vfio group.  This module can handle management of vgpus. The advantage of
>>>> keeping this a separate module, rather than doing device creation in vendor
>>>> modules, is having a generic interface for vgpu management, for example,
>>>> files /sys/class/vgpu/vgpu_start and /sys/class/vgpu/vgpu_shutdown and a
>>>> vgpu driver registration interface.
>>>
>>> But you're suggesting something very different from the SR-IOV model.
>>> If we wanted to mimic that model, the GPU-specific driver should create
>>> the vgpu using services provided by a common interface.  For instance
>>> i915 could call a new vgpu_device_create() which creates the device,
>>> adds it to the vgpu class, etc.  That vgpu device should not be assumed
>>> to be used with vfio, though; that should happen via a separate probe
>>> using a vfio-vgpu driver.  It's that vfio bus driver that will add the
>>> device to a vfio group.
>>>
>>
>> In that case the vgpu module should provide a driver registration interface
>> to register the vfio-vgpu driver.
>>
>> struct vgpu_driver {
>>   	const char *name;
>>   	int (*probe) (struct vgpu_device *vdev);
>>   	void (*remove) (struct vgpu_device *vdev);
>> };
>>
>> int vgpu_register_driver(struct vgpu_driver *driver)
>> {
>> ...
>> }
>> EXPORT_SYMBOL(vgpu_register_driver);
>>
>> int vgpu_unregister_driver(struct vgpu_driver *driver)
>> {
>> ...
>> }
>> EXPORT_SYMBOL(vgpu_unregister_driver);
>>
>> The vfio-vgpu driver registers with the vgpu module. Then from
>> vgpu_device_create(), after creating the device, the vgpu module calls
>> vgpu_driver->probe(vgpu_device), and the vfio-vgpu driver adds the device
>> to a vfio group.
>>
>> +--------------+    vgpu_register_driver()+---------------+
>> |    __init()  +------------------------->+               |
>> |              |                          |               |
>> |              +<-------------------------+    vgpu.ko    |
>> | vfio_vgpu.ko |    probe()/remove()      |               |
>> |              |                +---------+               +---------+
>> +--------------+                |         +-------+-------+         |
>>                                 |                 ^                 |
>>                                 | callback        |                 |
>>                                 |         +-------+--------+        |
>>                                 |         |vgpu_register_device()   |
>>                                 |         |                |        |
>>                                 +---^-----+-----+    +-----+------+-+
>>                                     | nvidia.ko |    |  i915.ko   |
>>                                     |           |    |            |
>>                                     +-----------+    +------------+
>>
>> Is my understanding correct?
>
> We have an entire driver core subsystem in Linux for the purpose of
> matching devices to drivers; I don't think we should be re-inventing
> that.  That's why I'm suggesting that we should have infrastructure
> which enables GPU drivers to create vGPU devices in a common way,
> perhaps even placing the devices on a virtual vgpu bus, and then allowing a
> vfio-vgpu driver to register as a driver for devices of that bus/class
> and use the existing driver callbacks.  Thanks,
>
> Alex
>

We will use the Linux driver core subsystem; my point is that we have to
introduce a vgpu module to provide such infrastructure to GPU drivers in a
common way. This module helps GPU drivers create vGPU devices and allows the
vfio-vgpu driver to register for vGPU devices.

Kirti.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-27 16:19                                             ` [Qemu-devel] " Alex Williamson
@ 2016-01-28  6:00                                               ` Jike Song
  -1 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-28  6:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yang Zhang, Ruan, Shuai, Tian, Kevin, Neo Jia, kvm,
	igvt-g@lists.01.org, qemu-devel, Gerd Hoffmann, Paolo Bonzini,
	Lv, Zhiyuan

On 01/28/2016 12:19 AM, Alex Williamson wrote:
> On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
{snip}

>> Had a look at eventfd, I would say yes, technically we are able to
>> achieve the goal: introduce a fd, with fop->{read|write} defined in KVM,
>> call into vgpu device-model, also an iodev registered for a MMIO GPA
>> range to invoke the fop->{read|write}.  I just didn't understand why
>> userspace can't register an iodev via API directly.
> 
> Please elaborate on how it would work via iodev.
>

QEMU forwards a BAR0 write to the bus driver; in the bus driver, if
it finds that the MEM bit is enabled, it registers an iodev with KVM,
with an ops:

	const struct kvm_io_device_ops trap_mmio_ops = {
		.read	= kvmgt_guest_mmio_read,
		.write	= kvmgt_guest_mmio_write,
	};
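
A minimal sketch of the registration itself, assuming KVM's internal
kvm_io_bus helpers (kvm_iodevice_init/kvm_io_bus_register_dev); untested,
just to show the shape:

	static struct kvm_io_device kvmgt_mmio_iodev;

	static int kvmgt_register_mmio_iodev(struct kvm *kvm, gpa_t bar0, int len)
	{
		int ret;

		/* bind the ops above to the iodev */
		kvm_iodevice_init(&kvmgt_mmio_iodev, &trap_mmio_ops);

		/* kvm_io_bus_register_dev() requires slots_lock */
		mutex_lock(&kvm->slots_lock);
		ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, bar0, len,
					      &kvmgt_mmio_iodev);
		mutex_unlock(&kvm->slots_lock);

		return ret;
	}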

I may not be able to illustrate it clearly with descriptions, but this
should not be a problem; thanks to your explanation, I can understand
and adopt it for KVMGT.


>> Besides, this doesn't necessarily require another thread, right?
>> I guess it can be within the VCPU thread? 
> 
> I would think so too, the vcpu is blocked on the MMIO access, we should
> be able to service it in that context.  I hope.
> 

Thanks for confirmation.

>> And this brought up another question: except for the vfio bus driver and
>> iommu backend (and the page_track utility used for guest memory write-protection),
>> is KVMGT allowed to call into kvm.ko (or to modify it)? Though we are
>> becoming less and less willing to do that with VFIO, it's still better
>> to know before going wrong.
> 
> kvm and vfio are separate modules; for the most part, they know nothing
> about each other and have no hard dependencies between them.  We do have
> various accelerations we can use to avoid paths through userspace, but
> these are all via APIs that are agnostic of the party on the other end.
> For example, vfio signals interrupts through eventfds and has no concept
> of whether that eventfd terminates in userspace or into an irqfd in KVM.
> vfio supports direct access to device MMIO regions via mmaps, but vfio
> has no idea if that mmap gets directly mapped into a VM address space.
> Even with posted interrupts, we've introduced an irq bypass manager
> allowing interrupt producers and consumers to register independently to
> form a connection without directly knowing anything about the other
> module.  That sort of proper software layering needs to continue.  It
> would be wrong for a vfio bus driver to assume KVM is the user and
> directly call into KVM interfaces.  Thanks,
> 

I understand and agree with your point; it's bad if the bus driver
assumes KVM is the user and/or calls into KVM interfaces.

However, the vgpu device-model, in Intel's case also a part of the i915
driver, will always need to call some hypervisor-specific interfaces.
For example, when a guest gfx driver submits GPU commands, the device-model
may want to scan them for security or other purposes (a sketch for the KVM
case follows below):

	- get a GPA (from GPU page tables)
	- want to read 16 bytes from that GPA
	- call hypervisor-specific read_gpa() method
		- for Xen, the GPA belongs to a foreign domain, it must find
		  a way to map & read it - beyond our scope here;
		- for KVM, the GPA can be converted to an HVA, then copy_from_user (if
		  called from the vcpu thread) or access_remote_vm (if called from
		  other threads);

Please note that this is not from the vfio bus driver, but from the vgpu
device-model; also this is not a DMA address from GPU tables, but a real GPA.
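
Concretely, for the KVM case above (a sketch for the vcpu-thread path; a
read crossing a page boundary would need to loop, or simply use the
existing kvm_read_guest() helper):

	static int kvmgt_read_gpa(struct kvm *kvm, gpa_t gpa, void *buf, int len)
	{
		unsigned long hva = gfn_to_hva(kvm, gpa_to_gfn(gpa));

		if (kvm_is_error_hva(hva))
			return -EFAULT;

		hva += offset_in_page(gpa);
		return copy_from_user(buf, (void __user *)hva, len) ? -EFAULT : 0;
	}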


> Alex
> 

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-28  6:00                                               ` [Qemu-devel] " Jike Song
@ 2016-01-28 15:23                                                 ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-28 15:23 UTC (permalink / raw)
  To: Jike Song
  Cc: Tian, Kevin, Yang Zhang, Gerd Hoffmann, Paolo Bonzini, Lv,
	Zhiyuan, Ruan, Shuai, kvm, qemu-devel, igvt-g@lists.01.org,
	Neo Jia

On Thu, 2016-01-28 at 14:00 +0800, Jike Song wrote:
> On 01/28/2016 12:19 AM, Alex Williamson wrote:
> > On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
> {snip}
> 
> > > Had a look at eventfd, I would say yes, technically we are able to
> > > achieve the goal: introduce a fd, with fop->{read|write} defined in KVM,
> > > call into vgpu device-model, also an iodev registered for a MMIO GPA
> > > range to invoke the fop->{read|write}.  I just didn't understand why
> > > userspace can't register an iodev via API directly.
> > 
> > Please elaborate on how it would work via iodev.
> > 
> 
> QEMU forwards a BAR0 write to the bus driver; in the bus driver, if
> it finds that the MEM bit is enabled, it registers an iodev with KVM,
> with an ops:
> 
> 	const struct kvm_io_device_ops trap_mmio_ops = {
> 		.read	= kvmgt_guest_mmio_read,
> 		.write	= kvmgt_guest_mmio_write,
> 	};
> 
> I may not be able to illustrate it clearly with descriptions, but this
> should not be a problem; thanks to your explanation, I can understand
> and adopt it for KVMGT.

You're still crossing modules with direct callbacks, right?  What's the
advantage versus using the file descriptor + offset approach which could
offer the same performance and improve KVM overall by creating a new
option for generically handling MMIO?

> > > Besides, this doesn't necessarily require another thread, right?
> > > I guess it can be within the VCPU thread? 
> > 
> > I would think so too, the vcpu is blocked on the MMIO access, we should
> > be able to service it in that context.  I hope.
> > 
> 
> Thanks for confirmation.
> 
> > > And this brought up another question: except for the vfio bus driver and
> > > iommu backend (and the page_track utility used for guest memory write-protection),
> > > is KVMGT allowed to call into kvm.ko (or to modify it)? Though we are
> > > becoming less and less willing to do that with VFIO, it's still better
> > > to know before going wrong.
> > 
> > kvm and vfio are separate modules; for the most part, they know nothing
> > about each other and have no hard dependencies between them.  We do have
> > various accelerations we can use to avoid paths through userspace, but
> > these are all via APIs that are agnostic of the party on the other end.
> > For example, vfio signals interrupts through eventfds and has no concept
> > of whether that eventfd terminates in userspace or into an irqfd in KVM.
> > vfio supports direct access to device MMIO regions via mmaps, but vfio
> > has no idea if that mmap gets directly mapped into a VM address space.
> > Even with posted interrupts, we've introduced an irq bypass manager
> > allowing interrupt producers and consumers to register independently to
> > form a connection without directly knowing anything about the other
> > module.  That sort of proper software layering needs to continue.  It
> > would be wrong for a vfio bus driver to assume KVM is the user and
> > directly call into KVM interfaces.  Thanks,
> > 
> 
> I understand and agree with your point; it's bad if the bus driver
> assumes KVM is the user and/or calls into KVM interfaces.
> 
> However, the vgpu device-model, in Intel's case also a part of the i915
> driver, will always need to call some hypervisor-specific interfaces.

No, think differently.

> For example, when a guest gfx driver submits GPU commands, the device-model
> may want to scan them for security or other purposes:
> 
> 	- get a GPA (from GPU page tables)
> 	- want to read 16 bytes from that GPA
> 	- call hypervisor-specific read_gpa() method
> 		- for Xen, the GPA belongs to a foreign domain, it must find
> 		  a way to map & read it - beyond our scope here;
> 		- for KVM, the GPA can be converted to an HVA, then copy_from_user (if
> 		  called from the vcpu thread) or access_remote_vm (if called from
> 		  other threads);
> 
> Please note that this is not from the vfio bus driver, but from the vgpu
> device-model; also this is not a DMA address from GPU tables, but a real GPA.

This is exactly why we're proposing that the vfio IOMMU interface be
used as a database of guest translations.  The type1 IOMMU model in QEMU
maps all of guest memory through the IOMMU; in the vGPU model type1 is
simply collecting these mappings, and they map GPA to process virtual memory.
When the GPU driver wants to get a GPA, it does so from this database.
If it wants to read from it, it could get the mm and read from the
virtual memory or pin the page for a GPA to HPA translation and read
from the HPA.  There is no reason to poke directly through to the
hypervisor here.  Let's design what you need into the vgpu version of
the type1 IOMMU instead.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-28 15:23                                                 ` [Qemu-devel] " Alex Williamson
@ 2016-01-29  7:20                                                   ` Jike Song
  -1 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-29  7:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yang Zhang, Ruan, Shuai, Tian, Kevin, Neo Jia, kvm,
	igvt-g@lists.01.org, qemu-devel, Gerd Hoffmann, Paolo Bonzini,
	Lv, Zhiyuan

This discussion is becoming a little difficult for a newbie like me :(

On 01/28/2016 11:23 PM, Alex Williamson wrote:
> On Thu, 2016-01-28 at 14:00 +0800, Jike Song wrote:
>> On 01/28/2016 12:19 AM, Alex Williamson wrote:
>>> On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
>> {snip}
>>  
>>>> Had a look at eventfd, I would say yes, technically we are able to
>>>> achieve the goal: introduce a fd, with fop->{read|write} defined in KVM,
>>>> call into vgpu device-model, also an iodev registered for a MMIO GPA
>>>> range to invoke the fop->{read|write}.  I just didn't understand why
>>>> userspace can't register an iodev via API directly.
>>>  
>>> Please elaborate on how it would work via iodev.
>>>  
>>  
>> QEMU forwards a BAR0 write to the bus driver; in the bus driver, if
>> it finds that the MEM bit is enabled, it registers an iodev with KVM,
>> with an ops:
>>  
>>  	const struct kvm_io_device_ops trap_mmio_ops = {
>>  		.read	= kvmgt_guest_mmio_read,
>>  		.write	= kvmgt_guest_mmio_write,
>>  	};
>>  
>> I may not be able to illustrate it clearly with descriptions, but this
>> should not be a problem; thanks to your explanation, I can understand
>> and adopt it for KVMGT.
> 
> You're still crossing modules with direct callbacks, right?  What's the
> advantage versus using the file descriptor + offset approach which could
> offer the same performance and improve KVM overall by creating a new
> option for generically handling MMIO?
> 

Yes, the method I gave above is the current way: calling kvm_io_device_ops
from the KVM hypervisor, and then going to the vgpu device-model directly.

From KVMGT's side this is almost the same as what you suggested; I don't
think we have a problem here now. I will adopt your suggestion.

>>>> Besides, this doesn't necessarily require another thread, right?
>>>> I guess it can be within the VCPU thread? 
>>>  
>>> I would think so too, the vcpu is blocked on the MMIO access, we should
>>> be able to service it in that context.  I hope.
>>>  
>>  
>> Thanks for confirmation.
>>  
>>>> And this brought up another question: except for the vfio bus driver and
>>>> iommu backend (and the page_track utility used for guest memory write-protection),
>>>> is KVMGT allowed to call into kvm.ko (or to modify it)? Though we are
>>>> becoming less and less willing to do that with VFIO, it's still better
>>>> to know before going wrong.
>>>  
>>> kvm and vfio are separate modules; for the most part, they know nothing
>>> about each other and have no hard dependencies between them.  We do have
>>> various accelerations we can use to avoid paths through userspace, but
>>> these are all via APIs that are agnostic of the party on the other end.
>>> For example, vfio signals interrupts through eventfds and has no concept
>>> of whether that eventfd terminates in userspace or into an irqfd in KVM.
>>> vfio supports direct access to device MMIO regions via mmaps, but vfio
>>> has no idea if that mmap gets directly mapped into a VM address space.
>>> Even with posted interrupts, we've introduced an irq bypass manager
>>> allowing interrupt producers and consumers to register independently to
>>> form a connection without directly knowing anything about the other
>>> module.  That sort of proper software layering needs to continue.  It
>>> would be wrong for a vfio bus driver to assume KVM is the user and
>>> directly call into KVM interfaces.  Thanks,
>>>  
>>  
>> I understand and agree with your point; it's bad if the bus driver
>> assumes KVM is the user and/or calls into KVM interfaces.
>>  
>> However, the vgpu device-model, in Intel's case also a part of the i915
>> driver, will always need to call some hypervisor-specific interfaces.
> 
> No, think differently.
> 
>> For example, when a guest gfx driver submits GPU commands, the device-model
>> may want to scan them for security or other purposes:
>>  
>>  	- get a GPA (from GPU page tables)
>>  	- want to read 16 bytes from that GPA
>>  	- call hypervisor-specific read_gpa() method
>>  		- for Xen, the GPA belongs to a foreign domain, it must find
>>  		  a way to map & read it - beyond our scope here;
>>  		- for KVM, the GPA can be converted to an HVA, then copy_from_user (if
>>  		  called from the vcpu thread) or access_remote_vm (if called from
>>  		  other threads);
>>  
>> Please note that this is not from the vfio bus driver, but from the vgpu
>> device-model; also this is not a DMA address from GPU tables, but a real GPA.
> 
> This is exactly why we're proposing that the vfio IOMMU interface be
> used as a database of guest translations. 
> The type1 IOMMU model in QEMU
> maps all of guest memory through the IOMMU; in the vGPU model type1 is
> simply collecting these mappings, and they map GPA to process virtual memory.

GPA to HVA mappings are maintained in KVM/QEMU, via memslots.
Do you mean making type1 duplicate the GPA <-> HVA/HPA translations from
KVM? Even if technically this could be done, how would it be synchronized
with the KVM hypervisor? e.g. what is expected if the guest hot-adds a memslot?

What's more, GPA is purely a virtualization term. When VFIO is used for
device assignment, it uses the GPA as the IOVA and maps it to the HPA, that's
true. But for KVMGT, since a vGPU doesn't have its own DMA requester ID, VFIO
won't call the IOMMU-API, but the DMA-API instead (see the sketch below).
GPAs from different guests may be identical, while the IGD can only have one
single IOMMU domain ...
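
To illustrate that last point: with the DMA-API the IOVA is allocated by
the kernel and merely returned to the caller; there is no way to ask for
"IOVA == GPA" (a sketch, not from any existing series):

	/* map one guest page for the IGD: 'iova' is whatever the DMA layer
	 * picks, it has no relation to the GPA the guest used for the page */
	static dma_addr_t kvmgt_map_guest_page(struct device *dev,
					       struct page *page)
	{
		return dma_map_page(dev, page, 0, PAGE_SIZE,
				    DMA_BIDIRECTIONAL);
	}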


> When the GPU driver wants to get a GPA, it does so from this database.
> If it wants to read from it, it could get the mm and read from the
> virtual memory or pin the page for a GPA to HPA translation and read
> from the HPA.  There is no reason to poke directly through to the
> hypervisor here.  Let's design what you need into the vgpu version of
> the type1 IOMMU instead.  Thanks,

For KVM, to access a GPA, having it translated to an HVA is enough.

IIUC this may be the only remaining problem between us: where should
a GPA be translated to an HVA, in KVM or in VFIO?
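
On the KVM side the translation is a one-liner, e.g.
hva = gfn_to_hva(kvm, gpa_to_gfn(gpa)); if VFIO were to own the table, the
vgpu flavour of type1 would have to export something like the following
(hypothetical name, only to show the shape of the interface):

	/* look up the HVA backing a guest physical address in the vgpu
	 * type1 mapping database; returns 0 if the GPA is not mapped */
	unsigned long vfio_vgpu_gpa_to_hva(struct vfio_iommu *iommu, gpa_t gpa);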


PS, I'm going to take holiday leave for ~2 weeks, with limited mail
access. I may get back to you a while later; sorry, and thanks for your
great patience!

> 
> Alex
> 

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [iGVT-g] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-29  7:20                                                   ` [Qemu-devel] " Jike Song
@ 2016-01-29  8:49                                                     ` Jike Song
  -1 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-29  8:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yang Zhang, kvm, igvt-g@lists.01.org, qemu-devel, Paolo Bonzini

On 01/29/2016 03:20 PM, Jike Song wrote:
> This discussion is becoming a little difficult for a newbie like me :(
> 
> On 01/28/2016 11:23 PM, Alex Williamson wrote:
>> On Thu, 2016-01-28 at 14:00 +0800, Jike Song wrote:
>>> On 01/28/2016 12:19 AM, Alex Williamson wrote:
>>>> On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
>>> {snip}
>>>  
>>>>> Had a look at eventfd, I would say yes, technically we are able to
>>>>> achieve the goal: introduce a fd, with fop->{read|write} defined in KVM,
>>>>> call into vgpu device-model, also an iodev registered for a MMIO GPA
>>>>> range to invoke the fop->{read|write}.  I just didn't understand why
>>>>> userspace can't register an iodev via API directly.
>>>>  
>>>> Please elaborate on how it would work via iodev.
>>>>  
>>>  
>>> QEMU forwards a BAR0 write to the bus driver; in the bus driver, if
>>> it finds that the MEM bit is enabled, it registers an iodev with KVM,
>>> with an ops:
>>>  
>>>  	const struct kvm_io_device_ops trap_mmio_ops = {
>>>  		.read	= kvmgt_guest_mmio_read,
>>>  		.write	= kvmgt_guest_mmio_write,
>>>  	};
>>>  
>>> I may not be able to illustrate it clearly with descriptions, but this
>>> should not be a problem; thanks to your explanation, I can understand
>>> and adopt it for KVMGT.
>>
>> You're still crossing modules with direct callbacks, right?  What's the
>> advantage versus using the file descriptor + offset approach which could
>> offer the same performance and improve KVM overall by creating a new
>> option for generically handling MMIO?
>>
> 
> Yes, the method I gave above is the current way: calling kvm_io_device_ops
> from the KVM hypervisor, and then going to the vgpu device-model directly.
> 
> From KVMGT's side this is almost the same as what you suggested; I don't
> think we have a problem here now. I will adopt your suggestion.
> 
>>>>> Besides, this doesn't necessarily require another thread, right?
>>>>> I guess it can be within the VCPU thread? 
>>>>  
>>>> I would think so too, the vcpu is blocked on the MMIO access, we should
>>>> be able to service it in that context.  I hope.
>>>>  
>>>  
>>> Thanks for confirmation.
>>>  
>>>>> And this brought up another question: except for the vfio bus driver and
>>>>> iommu backend (and the page_track utility used for guest memory write-protection),
>>>>> is KVMGT allowed to call into kvm.ko (or to modify it)? Though we are
>>>>> becoming less and less willing to do that with VFIO, it's still better
>>>>> to know before going wrong.
>>>>  
>>>> kvm and vfio are separate modules; for the most part, they know nothing
>>>> about each other and have no hard dependencies between them.  We do have
>>>> various accelerations we can use to avoid paths through userspace, but
>>>> these are all via APIs that are agnostic of the party on the other end.
>>>> For example, vfio signals interrupts through eventfds and has no concept
>>>> of whether that eventfd terminates in userspace or into an irqfd in KVM.
>>>> vfio supports direct access to device MMIO regions via mmaps, but vfio
>>>> has no idea if that mmap gets directly mapped into a VM address space.
>>>> Even with posted interrupts, we've introduced an irq bypass manager
>>>> allowing interrupt producers and consumers to register independently to
>>>> form a connection without directly knowing anything about the other
>>>> module.  That sort of proper software layering needs to continue.  It
>>>> would be wrong for a vfio bus driver to assume KVM is the user and
>>>> directly call into KVM interfaces.  Thanks,
>>>>  
>>>  
>>> I understand and agree with your point; it's bad if the bus driver
>>> assumes KVM is the user and/or calls into KVM interfaces.
>>>  
>>> However, the vgpu device-model, in Intel's case also a part of the i915
>>> driver, will always need to call some hypervisor-specific interfaces.
>>
>> No, think differently.
>>
>>> For example, when a guest gfx driver submits GPU commands, the device-model
>>> may want to scan them for security or other purposes:
>>>  
>>>  	- get a GPA (from GPU page tables)
>>>  	- want to read 16 bytes from that GPA
>>>  	- call hypervisor-specific read_gpa() method
>>>  		- for Xen, the GPA belongs to a foreign domain, it must find
>>>  		  a way to map & read it - beyond our scope here;
>>>  		- for KVM, the GPA can be converted to an HVA, then copy_from_user (if
>>>  		  called from the vcpu thread) or access_remote_vm (if called from
>>>  		  other threads);
>>>  
>>> Please note that this is not from the vfio bus driver, but from the vgpu
>>> device-model; also this is not a DMA address from GPU tables, but a real GPA.
>>
>> This is exactly why we're proposing that the vfio IOMMU interface be
>> used as a database of guest translations. 
>> The type1 IOMMU model in QEMU
>> maps all of guest memory through the IOMMU; in the vGPU model type1 is
>> simply collecting these mappings, and they map GPA to process virtual memory.
> 
> GPA to HVA mappings are maintained in KVM/QEMU, via memslots.
> Do you mean making type1 duplicate the GPA <-> HVA/HPA translations from
> KVM? Even if technically this could be done, how would it be synchronized
> with the KVM hypervisor? e.g. what is expected if the guest hot-adds a memslot?
> 
> What's more, GPA is purely a virtualization term. When VFIO is used for
> device assignment, it uses the GPA as the IOVA and maps it to the HPA, that's
> true. But for KVMGT, since a vGPU doesn't have its own DMA requester ID, VFIO
> won't call the IOMMU-API, but the DMA-API instead.  GPAs from different guests
> may be identical, while the IGD can only have one single IOMMU domain ...
> 
> 
>> When the GPU driver wants to get a GPA, it does so from this database.
>> If it wants to read from it, it could get the mm and read from the
>> virtual memory or pin the page for a GPA to HPA translation and read
>> from the HPA.  There is no reason to poke directly through to the
>> hypervisor here.  Let's design what you need into the vgpu version of
>> the type1 IOMMU instead.  Thanks,
> 
> For KVM, to access a GPA, having it translated to an HVA is enough.
> 
> IIUC this may be the only remaining problem between us: where should
> a GPA be translated to an HVA, in KVM or in VFIO?
> 

Unfortunately it's not the only one. Another example: the device-model
may want to write-protect a gfn (RAM). In case this request goes
to VFIO, how is it supposed to reach the KVM MMU?
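
For reference, KVMGT currently does that by calling into the page_track
series directly, roughly as below (a sketch of our usage; the interface is
still under review):

	static void kvmgt_write_protect_gfn(struct kvm *kvm, gfn_t gfn)
	{
		struct kvm_memory_slot *slot;
		int idx;

		idx = srcu_read_lock(&kvm->srcu);
		slot = gfn_to_memslot(kvm, gfn);

		spin_lock(&kvm->mmu_lock);
		kvm_slot_page_track_add_page(kvm, slot, gfn,
					     KVM_PAGE_TRACK_WRITE);
		spin_unlock(&kvm->mmu_lock);

		srcu_read_unlock(&kvm->srcu, idx);
	}

So the question stands: if such a request has to be funneled through VFIO,
some channel from the vfio vgpu layer back into the KVM MMU would be needed.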

> 
--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [Qemu-devel] [iGVT-g] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
@ 2016-01-29  8:49                                                     ` Jike Song
  0 siblings, 0 replies; 118+ messages in thread
From: Jike Song @ 2016-01-29  8:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yang Zhang, igvt-g@lists.01.org, qemu-devel, kvm, Paolo Bonzini

On 01/29/2016 03:20 PM, Jike Song wrote:
> This discussion becomes a little difficult for a newbie like me :(
> 
> On 01/28/2016 11:23 PM, Alex Williamson wrote:
>> On Thu, 2016-01-28 at 14:00 +0800, Jike Song wrote:
>>> On 01/28/2016 12:19 AM, Alex Williamson wrote:
>>>> On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
>>> {snip}
>>>  
>>>>> Had a look at eventfd, I would say yes, technically we are able to
>>>>> achieve the goal: introduce a fd, with fop->{read|write} defined in KVM,
>>>>> call into vgpu device-model, also an iodev registered for a MMIO GPA
>>>>> range to invoke the fop->{read|write}.  I just didn't understand why
>>>>> userspace can't register an iodev via API directly.
>>>>  
>>>> Please elaborate on how it would work via iodev.
>>>>  
>>>  
>>> QEMU forwards BAR0 write to the bus driver, in the bus driver, if
>>> found that MEM bit is enabled, register an iodev to KVM: with an
>>> ops:
>>>  
>>>  	const struct kvm_io_device_ops trap_mmio_ops = {
>>>  		.read	= kvmgt_guest_mmio_read,
>>>  		.write	= kvmgt_guest_mmio_write,
>>>  	};
>>>  
>>> I may not be able to illustrated it clearly with descriptions but this
>>> should not be a problem, thanks to your explanation, I can understand
>>> and adopt it for KVMGT.
>>
>> You're still crossing modules with direct callbacks, right?  What's the
>> advantage versus using the file descriptor + offset approach which could
>> offer the same performance and improve KVM overall by creating a new
>> option for generically handling MMIO?
>>
> 
> Yes, the method I gave above is the current way: calling kvm_io_device_ops
> from KVM hypervisor, and then going to vgpu device-model directly.
> 
> From KVMGT's side this is almost the same as what you suggested, I don't
> think now we have a problem here. I will adopt your suggestion.
> 
>>>>> Besides, this doesn't necessarily require another thread, right?
>>>>> I guess it can be within the VCPU thread? 
>>>>  
>>>> I would think so too, the vcpu is blocked on the MMIO access, we should
>>>> be able to service it in that context.  I hope.
>>>>  
>>>  
>>> Thanks for confirmation.
>>>  
>>>>> And this brought another question: except the vfio bus drvier and
>>>>> iommu backend (and the page_track ulitiy used for guest memory write-protection), 
>>>>> is it KVMGT allowed to call into kvm.ko (or modify)? Though we are
>>>>> becoming less and less willing to do that with VFIO, it's still better
>>>>> to know that before going wrong.
>>>>  
>>>> kvm and vfio are separate modules, for the most part, they know nothing
>>>> about each other and have no hard dependencies between them.  We do have
>>>> various accelerations we can use to avoid paths through userspace, but
>>>> these are all via APIs that are agnostic of the party on the other end.
>>>> For example, vfio signals interrups through eventfds and has no concept
>>>> of whether that eventfd terminates in userspace or into an irqfd in KVM.
>>>> vfio supports direct access to device MMIO regions via mmaps, but vfio
>>>> has no idea if that mmap gets directly mapped into a VM address space.
>>>> Even with posted interrupts, we've introduced an irq bypass manager
>>>> allowing interrupt producers and consumers to register independently to
>>>> form a connection without directly knowing anything about the other
>>>> module.  That sort or proper software layering needs to continue.  It
>>>> would be wrong for a vfio bus driver to assume KVM is the user and
>>>> directly call into KVM interfaces.  Thanks,
>>>>  
>>>  
>>> I understand and agree with your point, it's bad if the bus driver
>>> assume KVM is the user and/or call into KVM interfaces.
>>>  
>>> However, the vgpu device-model, in intel case also a part of i915 driver,
>>> will always need to call some hypervisor-specific interfaces.
>>
>> No, think differently.
>>
>>> For example, when a guest gfx driver submit GPU commands, the device-model
>>> may want to scan it for security or whatever-else purpose:
>>>  
>>>  	- get a GPA (from GPU page tables)
>>>  	- want to read 16 bytes from that GPA
>>>  	- call hypervisor-specific read_gpa() method
>>>  		- for Xen, the GPA belongs to a foreign domain, it must find
>>>  		  a way to map & read it - beyond our scope here;
>>>  		- for KVM, the GPA can converted to HVA, copy_from_user (if
>>>  		  called from vcpu thread) or access_remote_vm (if called from
>>>  		  other threads);
>>>  
>>> Please note that this is not from the vfio bus driver, but from the vgpu
>>> device-model; also this is not DMA addr from GPU talbes, but real GPA.
>>
>> This is exactly why we're proposing that the vfio IOMMU interface be
>> used as a database of guest translations. 
>> The type1 IOMMU model in QEMU
>> maps all of guest memory through the IOMMU, in the vGPU model type1 is
>> simply collecting these and they map GPA to process virtual memory.
> 
> GPA to HVA mappings are maintained in KVM/QEMU, via memslots.
> Do you mean making type1 to duplicate the GPA <-> HVA/HPA translations from
> KVM? Even technically this could be done, how to synchronize it with KVM
> hypervisor? e.g. What is expected if guest hot-add a memslot?
> 
> What's more, GPA is totally a virtualization term. When VFIO is used for
> device assignment, it uses GPA as IOVA, maps it to HPA, that's true.
> But for KVMGT, since vGPU doesn't have its own DMA requester ID, VFIO
> won't call IOMMU-API, but DMA-API instead.  GPAs from different guests
> may be identical, while IGD can only have 1 single IOMMU domain ...
> 
> 
>> When the GPU driver wants to get a GPA, it does so from this database.
>> If it wants to read from it, it could get the mm and read from the
>> virtual memory or pin the page for a GPA to HPA translation and read
>> from the HPA.  There is no reason to poke directly through to the
>> hypervisor here.  Let's design what you need into the vgpu version of
>> the type1 IOMMU instead.  Thanks,
> 
> For KVM, to access a GPA, having it translated to HVA is enough.
> 
> IIUC this may be the only remaining problem between us: where should
> a GPA be translated to HVA, KVM or VFIO?
> 

Unfortunately it's not the only one. Another example is, device-model
may want to write-protect a gfn (RAM). In case that this request goes
to VFIO .. how it is supposed to reach KVM MMU?

> 
--
Thanks,
Jike


* Re: [iGVT-g] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-29  8:49                                                     ` [Qemu-devel] " Jike Song
@ 2016-01-29 18:50                                                       ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-01-29 18:50 UTC (permalink / raw)
  To: Jike Song; +Cc: Yang Zhang, kvm, igvt-g@lists.01.org, qemu-devel, Paolo Bonzini

Hi Jike,

On Fri, 2016-01-29 at 16:49 +0800, Jike Song wrote:
> On 01/29/2016 03:20 PM, Jike Song wrote:
> > This discussion becomes a little difficult for a newbie like me :(
> > 
> > On 01/28/2016 11:23 PM, Alex Williamson wrote:
> > > On Thu, 2016-01-28 at 14:00 +0800, Jike Song wrote:
> > > > On 01/28/2016 12:19 AM, Alex Williamson wrote:
> > > > > On Wed, 2016-01-27 at 13:43 +0800, Jike Song wrote:
> > > > {snip}
> > > >  
> > > > > > Had a look at eventfd; I would say yes, technically we are able to
> > > > > > achieve the goal: introduce an fd with fop->{read|write} defined in KVM
> > > > > > that calls into the vgpu device-model, plus an iodev registered for an
> > > > > > MMIO GPA range to invoke the fop->{read|write}.  I just didn't understand
> > > > > > why userspace can't register an iodev via an API directly.
> > > > >  
> > > > > Please elaborate on how it would work via iodev.
> > > > >  
> > > >  
> > > > QEMU forwards the BAR0 write to the bus driver; in the bus driver,
> > > > if it finds that the MEM bit is enabled, it registers an iodev to
> > > > KVM with these ops:
> > > >  
> > > >  	const struct kvm_io_device_ops trap_mmio_ops = {
> > > >  		.read	= kvmgt_guest_mmio_read,
> > > >  		.write	= kvmgt_guest_mmio_write,
> > > >  	};
> > > >  
> > > > I may not be able to illustrate it clearly with descriptions, but this
> > > > should not be a problem; thanks to your explanation, I can understand
> > > > and adopt it for KVMGT.
> > > 
> > > You're still crossing modules with direct callbacks, right?  What's the
> > > advantage versus using the file descriptor + offset approach which could
> > > offer the same performance and improve KVM overall by creating a new
> > > option for generically handling MMIO?
> > > 
> > 
> > Yes, the method I gave above is the current way: calling kvm_io_device_ops
> > from the KVM hypervisor, and then going to the vgpu device-model directly.
> > 
> > From KVMGT's side this is almost the same as what you suggested; I don't
> > think we have a problem here now. I will adopt your suggestion.

Great

> > > > > > Besides, this doesn't necessarily require another thread, right?
> > > > > > I guess it can be within the VCPU thread? 
> > > > >  
> > > > > I would think so too, the vcpu is blocked on the MMIO access, we should
> > > > > be able to service it in that context.  I hope.
> > > > >  
> > > >  
> > > > Thanks for confirmation.
> > > >  
> > > > > > And this brought up another question: except for the vfio bus driver and
> > > > > > iommu backend (and the page_track utility used for guest memory write-protection),
> > > > > > is KVMGT allowed to call into kvm.ko (or modify it)? Though we are
> > > > > > becoming less and less willing to do that with VFIO, it's still better
> > > > > > to know that before going wrong.
> > > > >  
> > > > > kvm and vfio are separate modules, for the most part, they know nothing
> > > > > about each other and have no hard dependencies between them.  We do have
> > > > > various accelerations we can use to avoid paths through userspace, but
> > > > > these are all via APIs that are agnostic of the party on the other end.
> > > > > For example, vfio signals interrupts through eventfds and has no concept
> > > > > of whether that eventfd terminates in userspace or into an irqfd in KVM.
> > > > > vfio supports direct access to device MMIO regions via mmaps, but vfio
> > > > > has no idea if that mmap gets directly mapped into a VM address space.
> > > > > Even with posted interrupts, we've introduced an irq bypass manager
> > > > > allowing interrupt producers and consumers to register independently to
> > > > > form a connection without directly knowing anything about the other
> > > > > module.  That sort of proper software layering needs to continue.  It
> > > > > would be wrong for a vfio bus driver to assume KVM is the user and
> > > > > directly call into KVM interfaces.  Thanks,
> > > > >  
> > > >  
> > > > I understand and agree with your point: it's bad if the bus driver
> > > > assumes KVM is the user and/or calls into KVM interfaces.
> > > >  
> > > > However, the vgpu device-model, in Intel's case also a part of the i915
> > > > driver, will always need to call some hypervisor-specific interfaces.
> > > 
> > > No, think differently.
> > > 
> > > > For example, when a guest gfx driver submits GPU commands, the device-model
> > > > may want to scan it for security or whatever-else purpose:
> > > >  
> > > >  	- get a GPA (from GPU page tables)
> > > >  	- want to read 16 bytes from that GPA
> > > >  	- call hypervisor-specific read_gpa() method
> > > >  		- for Xen, the GPA belongs to a foreign domain, it must find
> > > >  		  a way to map & read it - beyond our scope here;
> > > >  		- for KVM, the GPA can be converted to HVA, copy_from_user (if
> > > >  		  called from vcpu thread) or access_remote_vm (if called from
> > > >  		  other threads);
> > > >  
> > > > Please note that this is not from the vfio bus driver, but from the vgpu
> > > > device-model; also this is not a DMA addr from GPU tables, but a real GPA.
> > > 
> > > This is exactly why we're proposing that the vfio IOMMU interface be
> > > used as a database of guest translations.  The type1 IOMMU model in QEMU
> > > maps all of guest memory through the IOMMU; in the vGPU model, type1
> > > simply collects these mappings, which map GPA to process virtual memory.
> > 
> > GPA to HVA mappings are maintained in KVM/QEMU, via memslots.
> > Do you mean making type1 duplicate the GPA <-> HVA/HPA translations from
> > KVM? Even if this could technically be done, how would it be synchronized
> > with the KVM hypervisor? e.g. what is expected if the guest hot-adds a memslot?

This is exactly what we do today with physical device assignment with
vfio: the vfio code in QEMU registers a MemoryListener and does DMA map
and unmap operations any time the DMA-capable memory of the VM changes.
This is where the suggestion comes in for a vgpu version of the type1
interface that records these translations and makes them available to
the vgpu driver for pinning and GPA to HPA translation.  We do need to
devise a notification scheme so that vgpu drivers can invalidate
mappings when things change, but the information is available outside of
KVM.
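
A minimal sketch of the interface being proposed here (every name below
is hypothetical, purely to make the idea concrete):

	/* Translation database entry, filled from QEMU's MemoryListener
	 * updates arriving via the usual type1 DMA map/unmap path. */
	struct vgpu_type1_mapping {
		u64	iova;	/* GPA, used by the guest as IOVA */
		u64	hva;	/* matching process (QEMU) virtual address */
		u64	size;
	};

	/* Look up a recorded translation covering @iova. */
	int vgpu_type1_lookup(struct vfio_iommu *iommu, u64 iova,
			      struct vgpu_type1_mapping *mapping);

	/* The notification scheme: the vgpu driver registers callbacks so
	 * it can invalidate stale translations when the VM changes. */
	struct vgpu_type1_notifier {
		void	(*map)(void *priv, u64 iova, u64 size);
		void	(*unmap)(void *priv, u64 iova, u64 size);
		void	*priv;
	};

	int vgpu_type1_register_notifier(struct vfio_iommu *iommu,
					 struct vgpu_type1_notifier *n);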

> > What's more, GPA is purely a virtualization term. When VFIO is used for
> > device assignment, it uses GPA as IOVA and maps it to HPA, that's true.
> > But for KVMGT, since a vGPU doesn't have its own DMA requester ID, VFIO
> > won't call the IOMMU API, but the DMA API instead.  GPAs from different
> > guests may be identical, while IGD can only have a single IOMMU domain ...

The proposal is that we maintain exactly the vfio type1 API to QEMU
where QEMU uses MemoryListeners to relay changes in the VM address space
to vfio.  While we maintain the type1 API to QEMU, the implementation is
different, the vgpu-type1 IOMMU backend does not map memory through the
IOMMU API, nor does it care about the requester ID of the device in use.
vgpus provide isolation and translation through device-specific means,
such as mediation of the device and per-process paging structures.  It's
therefore the responsibility of the GPU driver to call into this
database of VM mappings to get the translations it needs and register
those translations with the DMA API in case a physical IOMMU is present.
KVM uses the same type of listener to fill memory slots, so by doing
this, we can provide everything we need directly within the vfio
infrastructure without needing to assume we're operating with a
KVM-based VM.
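
On the GPU-driver side that could look roughly like the sketch below;
vgpu_type1_lookup() is the hypothetical call from above, while
get_user_pages_fast() and dma_map_page() are the stock kernel APIs:

	/* Sketch: resolve a guest GPA through the vgpu-type1 database,
	 * pin the backing page, and register it with the DMA API so a
	 * physical IOMMU, if present, is programmed as well. */
	static int vgpu_map_guest_page(struct vfio_iommu *iommu,
				       struct device *dev, u64 gpa,
				       struct page **page, dma_addr_t *dma)
	{
		struct vgpu_type1_mapping m;
		int ret;

		ret = vgpu_type1_lookup(iommu, gpa, &m);	/* GPA -> HVA */
		if (ret)
			return ret;

		if (get_user_pages_fast(m.hva, 1, 1, page) != 1)
			return -EFAULT;

		*dma = dma_map_page(dev, *page, 0, PAGE_SIZE,
				    DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, *dma)) {
			put_page(*page);
			return -EFAULT;
		}
		return 0;
	}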

> > > When the GPU driver wants to get a GPA, it does so from this database.
> > > If it wants to read from it, it could get the mm and read from the
> > > virtual memory or pin the page for a GPA to HPA translation and read
> > > from the HPA.  There is no reason to poke directly through to the
> > > hypervisor here.  Let's design what you need into the vgpu version of
> > > the type1 IOMMU instead.  Thanks,
> > 
> > For KVM, to access a GPA, having it translated to HVA is enough.
> > 
> > IIUC this may be the only remaining problem between us: where should
> > a GPA be translated to an HVA, in KVM or in VFIO?

Via the mechanism I describe above, the vgpu-type1 vfio backend will
implement a database of GPA to HVA translations.  The architecture we're
trying to create here should provide interfaces to get that information
and keep it current with the state of the VM.
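
Putting it together, the read-16-bytes example from earlier in the
thread could then be served entirely from this database; a sketch, again
using the hypothetical lookup above:

	/* Sketch: read guest memory with no hypervisor-specific read_gpa(). */
	static int vgpu_read_gpa(struct vfio_iommu *iommu, u64 gpa,
				 void *buf, unsigned long len)
	{
		struct vgpu_type1_mapping m;
		int ret;

		ret = vgpu_type1_lookup(iommu, gpa, &m);	/* GPA -> HVA */
		if (ret)
			return ret;

		/* The HVA is directly usable from the vcpu thread's mm;
		 * from another thread, access_remote_vm() on the stored
		 * mm would be needed instead. */
		if (copy_from_user(buf, (void __user *)(unsigned long)m.hva, len))
			return -EFAULT;
		return 0;
	}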

> Unfortunately it's not the only one. Another example: the device-model
> may want to write-protect a gfn (RAM). If this request goes to VFIO,
> how is it supposed to reach the KVM MMU?

Well, let's work through the problem.  How is the GFN related to the
device?  Is this some sort of page table for device mappings with a base
register in the vgpu hardware?  If so, then the vgpu driver can find the
HVA via the vgpu-type1 interface above.  What's required to
write-protect the page?  Can we do this at the page level, without
needing KVM?  If we wanted to write-protect a page for a user process,
how would we do it?  I think there are likely solutions to each of these
problems, but we need to start with respecting the software layering and
abstraction between various kernel components rather than calling into
them directly as our first inclination.  Thanks,
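
As a starting point, locating the vma is straightforward; what to do
with it afterwards is exactly the open question. A fragment, assuming
the vgpu driver already holds a reference to the QEMU mm and got the
HVA from the database (mm API names as of this writing):

	struct vm_area_struct *vma;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, hva);
	if (vma && hva >= vma->vm_start) {
		/* vma->vm_page_prot etc. can be inspected here; actually
		 * write-protecting the page without KVM's involvement is
		 * the part still to be worked out. */
	}
	up_read(&mm->mmap_sem);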

Alex


* Re: [iGVT-g] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-01-29 18:50                                                       ` [Qemu-devel] " Alex Williamson
@ 2016-02-01 13:10                                                         ` Gerd Hoffmann
  -1 siblings, 0 replies; 118+ messages in thread
From: Gerd Hoffmann @ 2016-02-01 13:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jike Song, Yang Zhang, igvt-g@lists.01.org, qemu-devel, kvm,
	Paolo Bonzini

  Hi,

> > Unfortunately it's not the only one. Another example: the device-model
> > may want to write-protect a gfn (RAM). If this request goes to VFIO,
> > how is it supposed to reach the KVM MMU?
> 
> Well, let's work through the problem.  How is the GFN related to the
> device?  Is this some sort of page table for device mappings with a base
> register in the vgpu hardware?

IIRC this is needed to make sure the guest can't bypass execbuffer
verification and works like this:

  (1) guest submits execbuffer.
  (2) host makes execbuffer read-only for the guest.
  (3) verify the buffer (make sure it only accesses resources owned by
      the vm).
  (4) pass on execbuffer to the hardware.
  (5) when the gpu is done with it, make the execbuffer writable again.
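
In (entirely hypothetical) C, just to make the control flow concrete;
every helper name below is made up:

	static int handle_guest_execbuffer(struct vgpu *vgpu, u64 gpa, u64 len)
	{
		int ret;

		write_protect_guest_range(vgpu, gpa, len);	/* step 2 */

		ret = verify_execbuffer(vgpu, gpa, len);	/* step 3 */
		if (!ret)
			ret = submit_to_hw(vgpu, gpa, len);	/* step 4 */

		/* step 5: the GPU completion handler later calls
		 * unprotect_guest_range(vgpu, gpa, len). */
		return ret;
	}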

cheers,
  Gerd


* Re: [iGVT-g] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-02-01 13:10                                                         ` [Qemu-devel] " Gerd Hoffmann
@ 2016-02-01 21:44                                                           ` Alex Williamson
  -1 siblings, 0 replies; 118+ messages in thread
From: Alex Williamson @ 2016-02-01 21:44 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Jike Song, Yang Zhang, igvt-g@lists.01.org, qemu-devel, kvm,
	Paolo Bonzini

On Mon, 2016-02-01 at 14:10 +0100, Gerd Hoffmann wrote:
>   Hi,
> 
> > > Unfortunately it's not the only one. Another example: the device-model
> > > may want to write-protect a gfn (RAM). If this request goes to VFIO,
> > > how is it supposed to reach the KVM MMU?
> > 
> > Well, let's work through the problem.  How is the GFN related to the
> > device?  Is this some sort of page table for device mappings with a base
> > register in the vgpu hardware?
> 
> IIRC this is needed to make sure the guest can't bypass execbuffer
> verification and works like this:
> 
>   (1) guest submits execbuffer.
>   (2) host makes execbuffer read-only for the guest.
>   (3) verify the buffer (make sure it only accesses resources owned by
>       the vm).
>   (4) pass on execbuffer to the hardware.
>   (5) when the gpu is done with it, make the execbuffer writable again.

Ok, so are there opportunities to do those page protections outside of
KVM?  We should be able to get the vma for the buffer; can we do
something with that to make it read-only?  Alternatively, can the vgpu
driver copy it to a private buffer and have the hardware execute from that?
I'm not a virtual memory expert, but it doesn't seem like an
insurmountable problem.  Thanks,
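
The private-copy variant might look like this sketch (made-up helpers;
assumes the execbuffer's HVA is already known, and simplifies freeing,
since real code would keep the copy alive until the GPU has consumed it):

	/* Sketch: verify and execute from a host-private copy so that no
	 * guest page needs protecting at all. */
	static int submit_via_private_copy(struct vgpu *vgpu,
					   void __user *hva, size_t len)
	{
		void *shadow = kmalloc(len, GFP_KERNEL);
		int ret = -EFAULT;

		if (!shadow)
			return -ENOMEM;
		if (copy_from_user(shadow, hva, len))
			goto out;

		ret = verify_execbuffer_buf(shadow, len);
		if (!ret)
			ret = submit_buf_to_hw(vgpu, shadow, len);
	out:
		kfree(shadow);
		return ret;
	}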

Alex


* Re: [iGVT-g] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-02-01 21:44                                                           ` [Qemu-devel] " Alex Williamson
@ 2016-02-02  7:28                                                             ` Gerd Hoffmann
  -1 siblings, 0 replies; 118+ messages in thread
From: Gerd Hoffmann @ 2016-02-02  7:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jike Song, Yang Zhang, igvt-g@lists.01.org, qemu-devel, kvm,
	Paolo Bonzini

  Hi,

> Alternatively, can the vgpu
> driver copy it to a private buffer and have the hardware execute from that?

Copying is an option, but given the sizes execbuffers can reach, it comes
with a noticeable performance cost.

cheers,
  Gerd

* Re: [iGVT-g] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
  2016-02-01 21:44                                                           ` [Qemu-devel] " Alex Williamson
@ 2016-02-02  7:35                                                             ` Zhiyuan Lv
  -1 siblings, 0 replies; 118+ messages in thread
From: Zhiyuan Lv @ 2016-02-02  7:35 UTC (permalink / raw)
  To: Alex Williamson, Gerd Hoffmann
  Cc: Yang Zhang, igvt-g@lists.01.org, kvm, qemu-devel, Paolo Bonzini

Hi Gerd/Alex,

On Mon, Feb 01, 2016 at 02:44:55PM -0700, Alex Williamson wrote:
> On Mon, 2016-02-01 at 14:10 +0100, Gerd Hoffmann wrote:
> >   Hi,
> > 
> > > > Unfortunately it's not the only one. Another example: the device-model
> > > > may want to write-protect a gfn (RAM). If this request goes to VFIO,
> > > > how is it supposed to reach the KVM MMU?
> > > 
> > > Well, let's work through the problem.  How is the GFN related to the
> > > device?  Is this some sort of page table for device mappings with a base
> > > register in the vgpu hardware?
> > 
> > IIRC this is needed to make sure the guest can't bypass execbuffer
> > verification and works like this:
> > 
> >   (1) guest submits execbuffer.
> >   (2) host makes execbuffer read-only for the guest.
> >   (3) verify the buffer (make sure it only accesses resources owned by
> >       the vm).
> >   (4) pass on execbuffer to the hardware.
> >   (5) when the gpu is done with it, make the execbuffer writable again.
> 
> Ok, so are there opportunities to do those page protections outside of
> KVM?  We should be able to get the vma for the buffer; can we do
> something with that to make it read-only?  Alternatively, can the vgpu
> driver copy it to a private buffer and have the hardware execute from that?
> I'm not a virtual memory expert, but it doesn't seem like an
> insurmountable problem.  Thanks,

Originally iGVT-g used write-protection for privileged execbuffers, as Gerd
described. The latest implementation has replaced write-protection with
buffer copying, since the privileged command buffers are usually small. So
that part is fine.

But we need write-protection for graphics page table shadowing as well. Once
the guest driver modifies a GPU page table, we need to know about it and
update the shadow page table accordingly; buffer copying cannot help here. Thanks!
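
For reference, such a write-protect hook could take roughly the shape
below, modeled on the page_track utility mentioned earlier in the
thread; the signatures are approximate and the helpers are made up:

	/* Called whenever the guest writes a tracked gfn, so the shadow
	 * GPU page table can be kept in sync with the guest's. */
	static void gvt_ppgtt_write(struct kvm_vcpu *vcpu, gpa_t gpa,
				    const u8 *new, int bytes)
	{
		update_shadow_ppgtt(gpa, new, bytes);	/* made up */
	}

	static struct kvm_page_track_notifier_node gvt_node = {
		.track_write = gvt_ppgtt_write,
	};

	/* One-time setup: ask KVM to call us back on tracked writes. */
	kvm_page_track_register_notifier(kvm, &gvt_node);

	/* Per page-table page: write-protect the gfn backing it. */
	kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE);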

Regards,
-Zhiyuan

> 
> Alex
> 
