* [Qemu-devel] [RFC] libvirt vGPU QEMU integration
@ 2016-08-18 16:41 Neo Jia
  2016-08-19 12:42 ` [Qemu-devel] [libvirt] " Michal Privoznik
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Neo Jia @ 2016-08-18 16:41 UTC (permalink / raw)
  To: libvir-list
  Cc: qemu-devel, alex.williamson, pbonzini, kwankhede, ACurrid,
	kevin.tian, jike.song, bjsdjshi, kraxel

Hi libvirt experts,

I am starting this email thread to discuss a potential solution / proposal for
integrating vGPU support into libvirt for QEMU.

Some quick background: NVIDIA is implementing a VFIO-based mediated device
framework that allows vendors to virtualize their devices without SR-IOV, for
example NVIDIA vGPU and Intel KVMGT. Within this framework, we reuse the VFIO
API to handle memory and interrupts just as QEMU does today with a passthrough
device.

The difference here is that, due to the devices' virtual nature, we are
introducing a set of new sysfs files for virtual device discovery and life
cycle management.

Here is a summary of those sysfs files, when they are created, and how they
should be used:

1. Discover mediated device

As part of the physical device initialization process, the vendor driver
registers its physical devices with the mediated framework; these will be used
to create virtual devices (mediated devices, aka mdevs).

Then the sysfs file "mdev_supported_types" becomes available under the
physical device sysfs directory. It indicates the supported mdev types and
their configuration for this particular physical device. The content may
change dynamically based on the system's current configuration, so libvirt
needs to query this file every time before creating an mdev.

Note: different vendors might expose their own specific configuration sysfs
files as well, if they don't have pre-defined types.

For example, here is the NVIDIA-specific configuration for a Tesla M60
registered at 86:00.0 on an idle system. To query its "mdev_supported_types":

cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
max_resolution
11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
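
To illustrate how libvirt (or any management tool) might consume this output,
here is a sketch that extracts the vgpu_type_id for a named vGPU type. The
helper name is made up, and the sample rows are copied from the Tesla M60
table above; nothing here is part of the proposed kernel interface itself.

```shell
# Hypothetical helper: given a copy of "mdev_supported_types" output and a
# vgpu_type name, print the matching vgpu_type_id (the first CSV field,
# with padding spaces stripped).
lookup_type_id() {
    # $1 = file holding mdev_supported_types content, $2 = vgpu_type name
    awk -F, -v name="\"$2\"" '$2 == name { gsub(/ /, "", $1); print $1 }' "$1"
}

# Sample rows copied from the Tesla M60 output above:
cat > /tmp/mdev_types.txt <<'EOF'
11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
EOF

lookup_type_id /tmp/mdev_types.txt "GRID M60-4Q"   # prints 17
```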

2. Create/destroy mediated device

Two sysfs files are available under the physical device sysfs path:
mdev_create and mdev_destroy.

The syntax for creating an mdev is:

    echo "$mdev_UUID:vendor_specific_argument_list" >
/sys/bus/pci/devices/.../mdev_create

The syntax for destroying an mdev is:

    echo "$mdev_UUID:vendor_specific_argument_list" >
/sys/bus/pci/devices/.../mdev_destroy

$mdev_UUID is the identifier for the mdev device to be created; it must be
unique per system.
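
For illustration, a caller could generate such a UUID from the kernel's random
UUID file; uuidgen(1) would work equally well, and nothing in the proposal
mandates a particular generator.

```shell
# Generate a per-system-unique UUID for a new mdev. The kernel returns a
# fresh random (version 4) UUID on every read of this file.
mdev_UUID=$(cat /proc/sys/kernel/random/uuid)
echo "$mdev_UUID"
```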

For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
above Tesla M60 output), and a VM UUID to be passed as
"vendor_specific_argument_list".

If no vendor-specific arguments are required, either "$mdev_UUID" or
"$mdev_UUID:" is acceptable input syntax for the above two commands.

To create an M60-4Q device (vgpu_type_id 17 in the Tesla M60 output above),
libvirt needs to do:

    echo "$mdev_UUID:vgpu_type_id=17,vm_uuid=$VM_UUID" >
/sys/bus/pci/devices/0000\:86\:00.0/mdev_create

Then a virtual device shows up at:

    /sys/bus/mdev/devices/$mdev_UUID/

For NVIDIA, when creating multiple virtual devices for one VM, all of them
have to be created upfront, before bringing any of them online.

Regarding error reporting and detection: on failure, a write() to the sysfs
file returns an error code, and writing to the sysfs file from a shell prints
the string corresponding to that error code.
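
A minimal sketch of what a create call with error checking could look like
from a shell, assuming the sysfs layout proposed above; the wrapper name and
the temporary error-file path are placeholders, not part of the proposal.

```shell
# Hypothetical wrapper around the proposed mdev_create interface. stderr of
# the write is captured first so a failed write's message can be reported.
mdev_create_checked() {
    # $1 = physical device sysfs path, $2 = "UUID[:vendor_args]" string
    local err=/tmp/mdev_err.$$
    if ! echo "$2" 2>"$err" > "$1/mdev_create"; then
        echo "mdev_create failed: $(cat "$err")" >&2
        return 1
    fi
    rm -f "$err"
}

# Example invocation (paths and IDs as in the proposal):
# mdev_create_checked /sys/bus/pci/devices/0000:86:00.0 \
#     "$mdev_UUID:vgpu_type_id=17,vm_uuid=$VM_UUID"
```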

3. Start/stop mediated device

Under the virtual device sysfs, you will see a new "online" sysfs file.

You can run "cat /sys/bus/mdev/devices/$mdev_UUID/online" to get the current
status of this virtual device (0 or 1), and to start or stop a virtual device
you can do:

    echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online

libvirt needs to query the current state before changing state.

Note: if you have multiple devices, you need to write to the "online" file
individually.

For NVIDIA, if there are multiple mdevs per VM, libvirt needs to bring all of
them "online" before starting QEMU.
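
Under those rules, libvirt's pre-start step could be sketched as the loop
below. MDEV_ROOT and the function name are illustrative; the "online" file
semantics (query first, then write 0/1) are as described above.

```shell
# Bring every mdev of a VM online, querying the current state first as the
# proposal requires. Defaults to the proposed sysfs location.
MDEV_ROOT=${MDEV_ROOT:-/sys/bus/mdev/devices}

set_online() {
    local uuid state
    for uuid in "$@"; do
        state=$(cat "$MDEV_ROOT/$uuid/online")
        if [ "$state" != "1" ]; then
            echo 1 > "$MDEV_ROOT/$uuid/online"
        fi
    done
}

# set_online "$mdev_UUID_1" "$mdev_UUID_2"   # ...then start QEMU
```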

4. Launch QEMU/VM

Pass the mdev sysfs path to QEMU as a vfio-pci device:

    -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0

5. Shutdown sequence 

libvirt needs to shut down QEMU, bring the virtual device offline, then
destroy the virtual device.
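
That ordering can be sketched as follows. The helper is hypothetical, the
QEMU shutdown itself goes through libvirt's usual machinery, and MDEV_ROOT
stands in for the proposed /sys/bus/mdev/devices location.

```shell
# Teardown order from the proposal: QEMU is assumed to be shut down already;
# then take the mdev offline, then destroy it via the physical device's
# mdev_destroy file.
MDEV_ROOT=${MDEV_ROOT:-/sys/bus/mdev/devices}

teardown_mdev() {
    # $1 = physical device sysfs path, $2 = mdev UUID
    echo 0 > "$MDEV_ROOT/$2/online"      # bring the virtual device offline
    echo "$2" > "$1/mdev_destroy"        # then destroy it
}
```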

6. VM Reset

No change or requirement for libvirt, as this will be handled via the VFIO
reset API; the QEMU process keeps running as before.

7. Hot-plug

It is optional for vendors to support hot-plug.

The syntax for creating a virtual device for hot-plug is the same.

For hot-unplug, after executing the QEMU monitor "device_del" command, libvirt
needs to write to the "mdev_destroy" sysfs file to complete the hot-unplug
process.

Since hot-plug is optional, mdev_create or mdev_destroy may return an error if
it is not supported.

Thanks, 
Neo


* Re: [Qemu-devel] [libvirt] [RFC] libvirt vGPU QEMU integration
  2016-08-18 16:41 [Qemu-devel] [RFC] libvirt vGPU QEMU integration Neo Jia
@ 2016-08-19 12:42 ` Michal Privoznik
  2016-08-22  5:40   ` Neo Jia
  2016-08-19 19:22 ` Laine Stump
  2016-08-24 22:29 ` [Qemu-devel] " Daniel P. Berrange
  2 siblings, 1 reply; 7+ messages in thread
From: Michal Privoznik @ 2016-08-19 12:42 UTC (permalink / raw)
  To: Neo Jia, libvir-list
  Cc: ACurrid, kevin.tian, qemu-devel, kwankhede, jike.song, kraxel,
	pbonzini, bjsdjshi

On 18.08.2016 18:41, Neo Jia wrote:
> Hi libvirt experts,

Hi, welcome to the list.

> 
> I am starting this email thread to discuss the potential solution / proposal of
> integrating vGPU support into libvirt for QEMU.
> 
> Some quick background, NVIDIA is implementing a VFIO based mediated device
> framework to allow people to virtualize their devices without SR-IOV, for
> example NVIDIA vGPU, and Intel KVMGT. Within this framework, we are reusing the
> VFIO API to process the memory / interrupt as what QEMU does today with passthru
> device.

So as far as I understand, this is solely NVIDIA's API. Will other vendors
(e.g. Intel) use their own, or is this a standard that others will comply
with?

> 
> The difference here is that we are introducing a set of new sysfs file for
> virtual device discovery and life cycle management due to its virtual nature.
> 
> Here is the summary of the sysfs, when they will be created and how they should
> be used:
> 
> 1. Discover mediated device
> 
> As part of physical device initialization process, vendor driver will register
> their physical devices, which will be used to create virtual device (mediated
> device, aka mdev) to the mediated framework.
> 
> Then, the sysfs file "mdev_supported_types" will be available under the physical
> device sysfs, and it will indicate the supported mdev and configuration for this 
> particular physical device, and the content may change dynamically based on the
> system's current configurations, so libvirt needs to query this file every time
> before create a mdev.

Ah, that was going to be my question, because in the example below you
used "echo '...vgpu_type_id=20...' > /sys/bus/.../mdev_create" and I was
wondering where the number 20 comes from. What I am now wondering about is how
libvirt should expose these to users, and moreover how it should let users
choose. We have a node device driver where I guess we could expose the
possible options and then require some explicit value in the domain XML (but
what value would that be? I don't think taking vgpu_type_id-s as they are
would be a great idea).

> 
> Note: different vendors might have their own specific configuration sysfs as
> well, if they don't have pre-defined types.
> 
> For example, we have a NVIDIA Tesla M60 on 86:00.0 here registered, and here is
> NVIDIA specific configuration on an idle system.
> 
> For example, to query the "mdev_supported_types" on this Tesla M60:
> 
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> max_resolution
> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> 
> 2. Create/destroy mediated device
> 
> Two sysfs files are available under the physical device sysfs path : mdev_create
> and mdev_destroy
> 
> The syntax of creating a mdev is:
> 
>     echo "$mdev_UUID:vendor_specific_argument_list" >
> /sys/bus/pci/devices/.../mdev_create
> 
> The syntax of destroying a mdev is:
> 
>     echo "$mdev_UUID:vendor_specific_argument_list" >
> /sys/bus/pci/devices/.../mdev_destroy
> 
> The $mdev_UUID is a unique identifier for this mdev device to be created, and it
> is unique per system.

Ah, so a caller (the one doing the echo - e.g. libvirt) can generate
their own UUID under which the mdev will be known? I'm asking because of
migration - we might want to preserve UUIDs when a domain is migrated to
the other side. Speaking of which, is there such a limitation, or will the
guest be able to migrate even if the UUID changed?

> 
> For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
> above Tesla M60 output), and a VM UUID to be passed as
> "vendor_specific_argument_list".

I understand the need for vgpu_type_id, but can you shed more light on
the VM UUID? Why is that required?

> 
> If there is no vendor specific arguments required, either "$mdev_UUID" or
> "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
> 
> To create a M60-4Q device, libvirt needs to do:
> 
>     echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" >
> /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
> 
> Then, you will see a virtual device shows up at:
> 
>     /sys/bus/mdev/devices/$mdev_UUID/
> 
> For NVIDIA, to create multiple virtual devices per VM, it has to be created
> upfront before bringing any of them online.
> 
> Regarding error reporting and detection, on failure, write() to sysfs using fd
> returns error code, and write to sysfs file through command prompt shows the
> string corresponding to error code.
> 
> 3. Start/stop mediated device
> 
> Under the virtual device sysfs, you will see a new "online" sysfs file.
> 
> you can do cat /sys/bus/mdev/devices/$mdev_UUID/online to get the current status
> of this virtual device (0 or 1), and to start a virtual device or stop a virtual 
> device you can do:
> 
>     echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
> 
> libvirt needs to query the current state before changing state.
> 
> Note: if you have multiple devices, you need to write to the "online" file
> individually.
> 
> For NVIDIA, if there are multiple mdev per VM, libvirt needs to bring all of
> them "online" before starting QEMU.

This is a valid requirement, indeed.

> 
> 4. Launch QEMU/VM
> 
> Pass the mdev sysfs path to QEMU as vfio-pci device:
> 
>     -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0

One question here. Libvirt allows users to run qemu under different
user:group than root:root. If that's the case, libvirt sets security
labels on all files qemu can/will touch. Are we going to need to do
something in that respect here?

> 
> 5. Shutdown sequence 
> 
> libvirt needs to shutdown the qemu, bring the virtual device offline, then destroy the
> virtual device
> 
> 6. VM Reset
> 
> No change or requirement for libvirt as this will be handled via VFIO reset API
> and QEMU process will keep running as before.
> 
> 7. Hot-plug
> 
> It optional for vendors to support hot-plug.
> 
> And it is same syntax to create a virtual device for hot-plug. 
> 
> For hot-unplug, after executing QEMU monitor "device del" command, libvirt needs
> to write to "destroy" sysfs to complete hot-unplug process.
> 
> Since hot-plug is optional, then mdev_create or mdev_destroy operations may
> return an error if it is not supported.

Thank you for the very detailed description! In general, I like the API as
it looks usable from my POV (I'm no VFIO devel though).

Michal


* Re: [Qemu-devel] [libvirt] [RFC] libvirt vGPU QEMU integration
  2016-08-18 16:41 [Qemu-devel] [RFC] libvirt vGPU QEMU integration Neo Jia
  2016-08-19 12:42 ` [Qemu-devel] [libvirt] " Michal Privoznik
@ 2016-08-19 19:22 ` Laine Stump
  2016-08-22  5:41   ` Neo Jia
  2016-08-24 22:29 ` [Qemu-devel] " Daniel P. Berrange
  2 siblings, 1 reply; 7+ messages in thread
From: Laine Stump @ 2016-08-19 19:22 UTC (permalink / raw)
  To: libvir-list
  Cc: Neo Jia, ACurrid, kevin.tian, qemu-devel, kwankhede, jike.song,
	kraxel, pbonzini, bjsdjshi

On 08/18/2016 12:41 PM, Neo Jia wrote:
> Hi libvirt experts,
>
> I am starting this email thread to discuss the potential solution / proposal of
> integrating vGPU support into libvirt for QEMU.

Thanks for the detailed description. This is very helpful.


>
> Some quick background, NVIDIA is implementing a VFIO based mediated device
> framework to allow people to virtualize their devices without SR-IOV, for
> example NVIDIA vGPU, and Intel KVMGT. Within this framework, we are reusing the
> VFIO API to process the memory / interrupt as what QEMU does today with passthru
> device.
>
> The difference here is that we are introducing a set of new sysfs file for
> virtual device discovery and life cycle management due to its virtual nature.
>
> Here is the summary of the sysfs, when they will be created and how they should
> be used:
>
> 1. Discover mediated device
>
> As part of physical device initialization process, vendor driver will register
> their physical devices, which will be used to create virtual device (mediated
> device, aka mdev) to the mediated framework.


We've discussed this question offline, but I just want to make sure I 
understood correctly - all initialization of the physical device on the 
host is already handled "elsewhere", so libvirt doesn't need to be 
concerned with any physical device lifecycle or configuration (setting 
up the number or types of vGPUs), correct? Do you think this would also 
be the case for other vendors using the same APIs? I guess this all 
comes down to whether or not the setup of the physical device is defined 
within the bounds of the common infrastructure/API, or if it's something 
that's assumed to have just magically happened somewhere else.


>
> Then, the sysfs file "mdev_supported_types" will be available under the physical
> device sysfs, and it will indicate the supported mdev and configuration for this
> particular physical device, and the content may change dynamically based on the
> system's current configurations, so libvirt needs to query this file every time
> before create a mdev.

I had originally thought that libvirt would be setting up and managing a 
pool of virtual devices, similar to what we currently do with SRIOV VFs. 
But from this it sounds like the management of this pool is completely 
handled by your drivers (especially since the contents of the pool can 
apparently completely change at any instant). In one way that makes life 
easier for libvirt, because it doesn't need to manage anything.

On the other hand, it makes things less predictable. For example, when 
libvirt defines a domain, it queries the host system to see what types 
of devices are legal in guests on this host, and expects those devices 
to be available at a later time. As I understand it (and I may be 
completely wrong), when no vGPUs are running on the hardware, there is a 
choice of several different models of vGPU (like the example you give 
below), but when the first vGPU is started up, that triggers the host 
driver to restrict the available models. If that's the case, then a 
particular vGPU could be "available" when a domain is defined, but not 
an option by the time the domain is started. That's not a show stopper, 
but I want to make sure I am understanding everything properly.

Also, is there any information about the maximum number of vGPUs that 
can be handled by a particular physical device (I think that changes 
based on which model of vGPU is being used, right?) Or maybe what is the 
current "load" on a physical device, in case there is more than one and 
libvirt (or management) wants to make a decision about which one to use?

>
> Note: different vendors might have their own specific configuration sysfs as
> well, if they don't have pre-defined types.
>
> For example, we have a NVIDIA Tesla M60 on 86:00.0 here registered, and here is
> NVIDIA specific configuration on an idle system.
>
> For example, to query the "mdev_supported_types" on this Tesla M60:
>
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> max_resolution
> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>
> 2. Create/destroy mediated device
>
> Two sysfs files are available under the physical device sysfs path : mdev_create
> and mdev_destroy
>
> The syntax of creating a mdev is:
>
>      echo "$mdev_UUID:vendor_specific_argument_list" >
> /sys/bus/pci/devices/.../mdev_create
>
> The syntax of destroying a mdev is:
>
>      echo "$mdev_UUID:vendor_specific_argument_list" >
> /sys/bus/pci/devices/.../mdev_destroy
>
> The $mdev_UUID is a unique identifier for this mdev device to be created, and it
> is unique per system.

Is there any reason to try to maintain the same UUID from one run to the 
next? Or should we completely think of this as a cookie for this time 
only (so more like a file handle, but we get to pick the value)? (Michal 
has asked about this in relation to migration, but the question also 
applies in the general situation of simply stopping and restarting a guest).

Also, is it enforced that "UUID" actually be a 128 bit UUID, or can it 
be any unique string?

>
> For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
> above Tesla M60 output), and a VM UUID to be passed as
> "vendor_specific_argument_list".
>
> If there is no vendor specific arguments required, either "$mdev_UUID" or
> "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
>
> To create a M60-4Q device, libvirt needs to do:
>
>      echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" >
> /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
>
> Then, you will see a virtual device shows up at:
>
>      /sys/bus/mdev/devices/$mdev_UUID/
>
> For NVIDIA, to create multiple virtual devices per VM, it has to be created
> upfront before bringing any of them online.
>
> Regarding error reporting and detection, on failure, write() to sysfs using fd
> returns error code, and write to sysfs file through command prompt shows the
> string corresponding to error code.
>
> 3. Start/stop mediated device
>
> Under the virtual device sysfs, you will see a new "online" sysfs file.
>
> you can do cat /sys/bus/mdev/devices/$mdev_UUID/online to get the current status
> of this virtual device (0 or 1), and to start a virtual device or stop a virtual
> device you can do:
>
>      echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
>
> libvirt needs to query the current state before changing state.
>
> Note: if you have multiple devices, you need to write to the "online" file
> individually.
>
> For NVIDIA, if there are multiple mdev per VM, libvirt needs to bring all of
> them "online" before starting QEMU.
>
> 4. Launch QEMU/VM
>
> Pass the mdev sysfs path to QEMU as vfio-pci device:
>
>      -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0

1) I have the same question as Michal - you're passing the path to the 
sysfs directory for the device to qemu, which implies that the qemu 
process will need to open/close/read/write files in that directory. 
Since libvirt is running as root, it can easily do that, but libvirt 
then runs the qemu process under a different uid and with a different 
selinux context. We need to make sure that we can change the uid/selinux 
labelling of the items in sysfs without adverse effect elsewhere.

Also it's important that qemu doesn't need to access anything else 
outside of this device-specific directory (each qemu process is running 
with different selinux labeling and potentially a different uid:gid, so 
if there is any common file/device node that must be accessed directly 
by qemu, it would need to be safely globally readable/writable).

How does this device show up in the guest? I guess it's a PCI device 
(since you're using vfio-pci :-), and all the standard options for 
setting PCI address apply. And is this device legacy PCI, or PCI 
Express? (Or perhaps it changes behavior depending on the type of slot 
used in the guest?)

>
> 5. Shutdown sequence
>
> libvirt needs to shutdown the qemu, bring the virtual device offline, then destroy the
> virtual device
>
> 6. VM Reset
>
> No change or requirement for libvirt as this will be handled via VFIO reset API
> and QEMU process will keep running as before.
>
> 7. Hot-plug
>
> It optional for vendors to support hot-plug.
>
> And it is same syntax to create a virtual device for hot-plug.
>
> For hot-unplug, after executing QEMU monitor "device del" command, libvirt needs
> to write to "destroy" sysfs to complete hot-unplug process.
>
> Since hot-plug is optional, then mdev_create or mdev_destroy operations may
> return an error if it is not supported.


From what I understand here, it sounds like what's needed from libvirt is:

1) exposing enough info in the output of nodedev-dumpxml for an 
application to use it to determine which devices are capable of creating 
vGPUs, and which models of vGPU they can create.


2) to create+start (then later stop+destroy) individual vGPUs based on 
[something] in the domain XML. So the question that remains is how to 
put it in the domain config. My first instinct was to use some variation 
of <hostdev> (since the backend of it is vfio-pci), but on the other 
hand hostdev is usually used to take one device that could be used by 
the host, take it away from the host, and give it to the guest, and 
that's not exactly what's happening here. So I wonder if there would be 
any advantage to making this another model of <video> instead.


* Re: [Qemu-devel] [libvirt] [RFC] libvirt vGPU QEMU integration
  2016-08-19 12:42 ` [Qemu-devel] [libvirt] " Michal Privoznik
@ 2016-08-22  5:40   ` Neo Jia
  0 siblings, 0 replies; 7+ messages in thread
From: Neo Jia @ 2016-08-22  5:40 UTC (permalink / raw)
  To: Michal Privoznik
  Cc: libvir-list, ACurrid, kevin.tian, qemu-devel, kwankhede,
	jike.song, kraxel, pbonzini, bjsdjshi

On Fri, Aug 19, 2016 at 02:42:27PM +0200, Michal Privoznik wrote:
> On 18.08.2016 18:41, Neo Jia wrote:
> > Hi libvirt experts,
> 
> Hi, welcome to the list.
> 
> > 
> > I am starting this email thread to discuss the potential solution / proposal of
> > integrating vGPU support into libvirt for QEMU.
> > 
> > Some quick background, NVIDIA is implementing a VFIO based mediated device
> > framework to allow people to virtualize their devices without SR-IOV, for
> > example NVIDIA vGPU, and Intel KVMGT. Within this framework, we are reusing the
> > VFIO API to process the memory / interrupt as what QEMU does today with passthru
> > device.
> 
> So as far as I understand, this is solely NVIDIA's API and other vendors
> (e.g. Intel) will use their own or is this a standard that others will
> comply to?

Hi Michal,

Based on the initial vGPU VFIO design discussion thread on the QEMU mailing
list, I believe this is what NVIDIA, Intel, and other companies will comply
with.

(People from the related parties, such as Intel and IBM, are CC'ed on this
email.)

As you know, I can't speak for Intel, so I would like to defer this question
to them, but the above is my understanding based on the QEMU/KVM community
discussions.

> 
> > 
> > The difference here is that we are introducing a set of new sysfs file for
> > virtual device discovery and life cycle management due to its virtual nature.
> > 
> > Here is the summary of the sysfs, when they will be created and how they should
> > be used:
> > 
> > 1. Discover mediated device
> > 
> > As part of physical device initialization process, vendor driver will register
> > their physical devices, which will be used to create virtual device (mediated
> > device, aka mdev) to the mediated framework.
> > 
> > Then, the sysfs file "mdev_supported_types" will be available under the physical
> > device sysfs, and it will indicate the supported mdev and configuration for this 
> > particular physical device, and the content may change dynamically based on the
> > system's current configurations, so libvirt needs to query this file every time
> > before create a mdev.
> 
> Ah, that was gonna be my question. Because in the example below, you
> used "echo '...vgpu_type_id=20...' > /sys/bus/.../mdev_create". And I
> was wondering where does the number 20 come from. Now what I am
> wondering about is how libvirt should expose these to users. Moreover,
> how it should let users to chose.
> We have a node device driver where I guess we could expose possible
> options and then require some explicit value in the domain XML (but what
> value would that be? I don't think taking vgpu_type_id-s as they are
> would be a great idea).

Right, the vgpu_type_id is just a handle for a given type of vGPU device in
the NVIDIA case. How about exposing the "vgpu_type" instead, which is a
meaningful name for vGPU end users?

Also, when you say "let users choose", does this mean exposing some virsh
command that allows users to dump their potential virtual devices and pick
one?

> 
> > 
> > Note: different vendors might have their own specific configuration sysfs as
> > well, if they don't have pre-defined types.
> > 
> > For example, we have a NVIDIA Tesla M60 on 86:00.0 here registered, and here is
> > NVIDIA specific configuration on an idle system.
> > 
> > For example, to query the "mdev_supported_types" on this Tesla M60:
> > 
> > cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> > # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> > max_resolution
> > 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> > 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> > 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> > 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> > 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> > 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> > 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> > 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> > 
> > 2. Create/destroy mediated device
> > 
> > Two sysfs files are available under the physical device sysfs path : mdev_create
> > and mdev_destroy
> > 
> > The syntax of creating a mdev is:
> > 
> >     echo "$mdev_UUID:vendor_specific_argument_list" >
> > /sys/bus/pci/devices/.../mdev_create
> > 
> > The syntax of destroying a mdev is:
> > 
> >     echo "$mdev_UUID:vendor_specific_argument_list" >
> > /sys/bus/pci/devices/.../mdev_destroy
> > 
> > The $mdev_UUID is a unique identifier for this mdev device to be created, and it
> > is unique per system.
> 
> Ah, so a caller (the one doing the echo - e.g. libvirt) can generate
> their own UUID under which the mdev will be known? I'm asking because of
> migration - we might want to preserve UUIDs when a domain is migrated to
> the other side. Speaking of which, is there such limitation or will
> guest be able to migrate even if UUID's changed?

Yes. And as long as the mdev UUID is unique per system, it should be fine
even if it changes during the migration process.

> 
> > 
> > For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
> > above Tesla M60 output), and a VM UUID to be passed as
> > "vendor_specific_argument_list".
> 
> I understand the need for vgpu_type_id, but can you shed more light on
> the VM UUID? Why is that required?

Sure. This is required by NVIDIA vGPU, especially to support multiple vGPU
devices per VM: we have a SW entity that manages all the vGPU devices of a
VM, and it also reserves special GPU resources for the multiple-vGPU-per-VM
case.

> 
> > 
> > If there is no vendor specific arguments required, either "$mdev_UUID" or
> > "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
> > 
> > To create a M60-4Q device, libvirt needs to do:
> > 
> >     echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" >
> > /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
> > 
> > Then, you will see a virtual device shows up at:
> > 
> >     /sys/bus/mdev/devices/$mdev_UUID/
> > 
> > For NVIDIA, to create multiple virtual devices per VM, it has to be created
> > upfront before bringing any of them online.
> > 
> > Regarding error reporting and detection, on failure, write() to sysfs using fd
> > returns error code, and write to sysfs file through command prompt shows the
> > string corresponding to error code.
> > 
> > 3. Start/stop mediated device
> > 
> > Under the virtual device sysfs, you will see a new "online" sysfs file.
> > 
> > you can do cat /sys/bus/mdev/devices/$mdev_UUID/online to get the current status
> > of this virtual device (0 or 1), and to start a virtual device or stop a virtual 
> > device you can do:
> > 
> >     echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
> > 
> > libvirt needs to query the current state before changing state.
> > 
> > Note: if you have multiple devices, you need to write to the "online" file
> > individually.
> > 
> > For NVIDIA, if there are multiple mdev per VM, libvirt needs to bring all of
> > them "online" before starting QEMU.
> 
> This is a valid requirement, indeed.

Thanks!

> 
> > 
> > 4. Launch QEMU/VM
> > 
> > Pass the mdev sysfs path to QEMU as vfio-pci device:
> > 
> >     -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0
> 
> One question here. Libvirt allows users to run qemu under different
> user:group than root:root. If that's the case, libvirt sets security
> labels on all files qemu can/will touch. Are we going to need to do
> something in that respect here?

As long as QEMU uses the VFIO API and doesn't do anything extra for any
particular vendor, there shouldn't be any problem on the QEMU side, so I don't
see any issues here.

But I would like to test it out with the proper settings for the NVIDIA vGPU
case. Currently all our testing uses sysfs and launches QEMU directly; if I
just mimic how libvirt launches QEMU for a normal VFIO passthrough device,
will that cover the SELinux label concerns?

Thanks,
Neo

> 
> > 
> > 5. Shutdown sequence 
> > 
> > libvirt needs to shut down QEMU, bring the virtual device offline, then
> > destroy the virtual device.
> > 
> > 6. VM Reset
> > 
> > No change or requirement for libvirt as this will be handled via VFIO reset API
> > and QEMU process will keep running as before.
> > 
> > 7. Hot-plug
> > 
> > It is optional for vendors to support hot-plug.
> > 
> > And the same syntax is used to create a virtual device for hot-plug.
> > 
> > For hot-unplug, after executing the QEMU monitor "device_del" command, libvirt
> > needs to write to the "destroy" sysfs file to complete the hot-unplug process.
> > 
> > Since hot-plug is optional, the mdev_create or mdev_destroy operations may
> > return an error if it is not supported.
> 
> Thank you for very detailed description! In general, I like the API as
> it looks usable from my POV (I'm no VFIO devel though).
> 
> Michal

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [libvirt] [RFC] libvirt vGPU QEMU integration
  2016-08-19 19:22 ` Laine Stump
@ 2016-08-22  5:41   ` Neo Jia
  0 siblings, 0 replies; 7+ messages in thread
From: Neo Jia @ 2016-08-22  5:41 UTC (permalink / raw)
  To: Laine Stump
  Cc: libvir-list, ACurrid, jike.song, kevin.tian, qemu-devel,
	kwankhede, kraxel, pbonzini, bjsdjshi

On Fri, Aug 19, 2016 at 03:22:48PM -0400, Laine Stump wrote:
> On 08/18/2016 12:41 PM, Neo Jia wrote:
> > Hi libvirt experts,
> > 
> > I am starting this email thread to discuss the potential solution / proposal of
> > integrating vGPU support into libvirt for QEMU.
> 
> Thanks for the detailed description. This is very helpful.
> 
> 
> > 
> > Some quick background, NVIDIA is implementing a VFIO based mediated device
> > framework to allow people to virtualize their devices without SR-IOV, for
> > example NVIDIA vGPU, and Intel KVMGT. Within this framework, we are reusing the
> > VFIO API to process the memory / interrupt as what QEMU does today with passthru
> > device.
> > 
> > The difference here is that we are introducing a set of new sysfs file for
> > virtual device discovery and life cycle management due to its virtual nature.
> > 
> > Here is the summary of the sysfs, when they will be created and how they should
> > be used:
> > 
> > 1. Discover mediated device
> > 
> > As part of physical device initialization process, vendor driver will register
> > their physical devices, which will be used to create virtual device (mediated
> > device, aka mdev) to the mediated framework.
> 
> 
> We've discussed this question offline, but I just want to make sure I
> understood correctly - all initialization of the physical device on the host
> is already handled "elsewhere", so libvirt doesn't need to be concerned with
> any physical device lifecycle or configuration (setting up the number or
> types of vGPUs), correct? 

Hi Laine,

Yes, that is right, at least for NVIDIA vGPU.

> Do you think this would also be the case for other
> vendors using the same APIs? I guess this all comes down to whether or not
> the setup of the physical device is defined within the bounds of the common
> infrastructure/API, or if it's something that's assumed to have just
> magically happened somewhere else.

I would assume that is the case for other vendors as well, although this common
infrastructure doesn't put any restrictions on the physical device setup or
initialization, so vendors actually have the option to defer some of it until
the point when the virtual device gets created.

But if we just look at it from the level of the API exposed to libvirt, it is
the vendor driver's responsibility to ensure that the virtual device will be
available in a reasonable amount of time after the "online" sysfs file is set to
1; where the HW setup happens is not enforced by this common API.

In NVIDIA's case, once our kernel driver registers the physical devices that it
owns with the "common infrastructure", all the physical devices are already
fully initialized and ready for virtual device creation.

> 
> 
> > 
> > Then, the sysfs file "mdev_supported_types" will be available under the physical
> > device sysfs, and it will indicate the supported mdev and configuration for this
> > particular physical device, and the content may change dynamically based on the
> > system's current configurations, so libvirt needs to query this file every time
> > before create a mdev.
> 
> I had originally thought that libvirt would be setting up and managing a
> pool of virtual devices, similar to what we currently do with SRIOV VFs. But
> from this it sounds like the management of this pool is completely handled
> by your drivers (especially since the contents of the pool can apparently
> completely change at any instant). In one way that makes life easier for
> libvirt, because it doesn't need to manage anything.

The pool (vGPU type availability) is only subject to change when virtual
devices get created or destroyed, as for now we don't support heterogeneous
vGPU types on the same physical GPU. Even if we add such support in the future,
the point of change remains the same.

> 
> On the other hand, it makes thing less predictable. For example, when
> libvirt defines a domain, it queries the host system to see what types of
> devices are legal in guests on this host, and expects those devices to be
> available at a later time. As I understand it (and I may be completely
> wrong), when no vGPUs are running on the hardware, there is a choice of
> several different models of vGPU (like the example you give below), but when
> the first vGPU is started up, that triggers the host driver to restrict the
> available models. If that's the case, then a particular vGPU could be
> "available" when a domain is defined, but not an option by the time the
> domain is started. That's not a show stopper, but I want to make sure I am
> understanding everything properly.

Yes, your understanding is correct, since, as I mentioned, there is no
heterogeneous vGPU support yet. But this opens up another interesting question
of vGPU placement policy that libvirt might need to consider.

> 
> Also, is there any information about the maximum number of vGPUs that can be
> handled by a particular physical device (I think that changes based on which
> model of vGPU is being used, right?) 

Yes, that is the "max_instance" in the example.
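
A concrete way libvirt or a management layer might use that field — again
assuming the file layout from the M60 example above rather than any stable
ABI — is to look up max_instance for a requested type:

```shell
# Sketch: answer "how many instances of a given vGPU type can this physical
# device support?" by pulling max_instance out of mdev_supported_types.
# The column layout is taken from the M60 example; treat it as an
# assumption, not a documented format.
max_instance_for() {   # $1 = sysfs file, $2 = vgpu_type name ("GRID M60-4Q")
    awk -F',' -v want="$2" '
        /^#/ { next }
        {
            name = $2; gsub(/^[ \t]+|[ \t]+$|"/, "", name)
            max  = $3; gsub(/[ \t]/, "", max)
            if (name == want) { print max; exit }
        }' "$1"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
# vgpu_type_id, vgpu_type, max_instance, ...
17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
EOF
max_instance_for "$sample" "GRID M60-4Q"   # prints: 2
rm -f "$sample"
```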

> Or maybe what is the current "load" on
> a physical device, in case there is more than one and libvirt (or
> management) wants to make a decision about which one to use?

If by "load" you mean "physical GPU utilization", we do have a tool that lets
you find out such information, but it is not exposed via this mdev sysfs.

Here is the link of NVIDIA NVML high level overview:

https://developer.nvidia.com/nvidia-management-library-nvml

If you want to know more details about the integration with NVML, I am very
happy to talk to you and connect you with our NVML experts for vGPU-related
topics.

> 
> > 
> > Note: different vendors might have their own specific configuration sysfs as
> > well, if they don't have pre-defined types.
> > 
> > For example, we have a NVIDIA Tesla M60 on 86:00.0 here registered, and here is
> > NVIDIA specific configuration on an idle system.
> > 
> > For example, to query the "mdev_supported_types" on this Tesla M60:
> > 
> > cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> > # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> > max_resolution
> > 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> > 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> > 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> > 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> > 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> > 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> > 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> > 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> > 
> > 2. Create/destroy mediated device
> > 
> > Two sysfs files are available under the physical device sysfs path : mdev_create
> > and mdev_destroy
> > 
> > The syntax of creating a mdev is:
> > 
> >      echo "$mdev_UUID:vendor_specific_argument_list" >
> > /sys/bus/pci/devices/.../mdev_create
> > 
> > The syntax of destroying a mdev is:
> > 
> >      echo "$mdev_UUID:vendor_specific_argument_list" >
> > /sys/bus/pci/devices/.../mdev_destroy
> > 
> > The $mdev_UUID is a unique identifier for this mdev device to be created, and it
> > is unique per system.
> 
> Is there any reason to try to maintain the same UUID from one run to the
> next? Or should we completely think of this as a cookie for this time only
> (so more like a file handle, but we get to pick the value)? (Michal has
> asked about this in relation to migration, but the question also applies in
> the general situation of simply stopping and restarting a guest).

You don't have to maintain the same UUID from one run to the next. Yes, it is
more like a file handle, and you get to pick the value.

> 
> Also, is it enforced that "UUID" actually be a 128 bit UUID, or can it be
> any unique string?

Yes, the framework enforces that the UUID is in 128-bit UUID format.
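
Since the UUID is per-run and only its canonical 8-4-4-4-12 format matters, a
caller could simply generate a fresh one each time. A small sketch, assuming
uuidgen (util-linux) or the Linux kernel's random UUID file is available:

```shell
# Sketch: generate a fresh per-run mdev UUID and check the canonical
# 8-4-4-4-12 hex format the framework reportedly enforces.
is_uuid() {
    echo "$1" | grep -Eq \
      '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
}

# No need to reuse the same value across runs; it is just a handle.
mdev_uuid=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
is_uuid "$mdev_uuid" && echo "ok: $mdev_uuid"
```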

> 
> > 
> > For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
> > above Tesla M60 output), and a VM UUID to be passed as
> > "vendor_specific_argument_list".
> > 
> > If there is no vendor specific arguments required, either "$mdev_UUID" or
> > "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
> > 
> > To create a M60-4Q device, libvirt needs to do:
> > 
> >      echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" >
> > /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
> > 
> > Then, you will see a virtual device show up at:
> > 
> >      /sys/bus/mdev/devices/$mdev_UUID/
> > 
> > For NVIDIA, to create multiple virtual devices per VM, they have to be created
> > upfront before bringing any of them online.
> > 
> > Regarding error reporting and detection, on failure, write() to sysfs using fd
> > returns error code, and write to sysfs file through command prompt shows the
> > string corresponding to error code.
> > 
> > 3. Start/stop mediated device
> > 
> > Under the virtual device sysfs, you will see a new "online" sysfs file.
> > 
> > you can do cat /sys/bus/mdev/devices/$mdev_UUID/online to get the current status
> > of this virtual device (0 or 1), and to start a virtual device or stop a virtual
> > device you can do:
> > 
> >      echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
> > 
> > libvirt needs to query the current state before changing state.
> > 
> > Note: if you have multiple devices, you need to write to the "online" file
> > individually.
> > 
> > For NVIDIA, if there are multiple mdev per VM, libvirt needs to bring all of
> > them "online" before starting QEMU.
> > 
> > 4. Launch QEMU/VM
> > 
> > Pass the mdev sysfs path to QEMU as vfio-pci device:
> > 
> >      -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0
> 
> 1) I have the same question as Michal - you're passing the path to the sysfs
> directory for the device to qemu, which implies that the qemu process will
> need to open/close/read/write files in that directory. Since libvirt is
> running as root, it can easily do that, but libvirt then runs the qemu
> process under a different uid and with a different selinux context. We need
> to make sure that we can change the uid/selinux labelling of the items in
> sysfs without adverse effect elsewhere.
> 
> Also it's important that qemu doesn't need to access anything else outside
> of this device-specific directory (each qemu process is running with
> different selinux labeling and potentially a different uid:gid, so if there
> is any common file/device node that must be accessed directly by qemu, it
> would need to be safely globally readable/writable).

Similar response to Michal here:

As long as QEMU uses the VFIO API and doesn't do anything extra for any
particular vendor, there shouldn't be any problem on the QEMU side, so I don't
see any issues here.

But I would like to test it out with the proper settings for the NVIDIA vGPU
case. Currently all our testing uses sysfs and launches QEMU directly; if I
just mimic how libvirt launches QEMU for a normal VFIO passthrough device,
will that cover the SELinux label concerns?

> 
> How does this device show up in the guest? I guess it's a PCI device (since
> you're using vfio-pci :-), and all the standard options for setting PCI
> address apply. And is this device legacy PCI, or PCI Express? (Or perhaps it
> changes behavior depending on the type of slot used in the guest?)

It depends on how the vendor driver emulates capabilities in config space. For
NVIDIA vGPU we are defining it as a PCI device, but another vendor could define
PCIe capabilities in config space, and that would show up as a PCIe device in
the guest. For the IBM solution it's not even a PCI device but a channel I/O
device. It all depends on how the vendor driver simulates the device.

> > 
> > 5. Shutdown sequence
> > 
> > libvirt needs to shut down QEMU, bring the virtual device offline, then
> > destroy the virtual device.
> > 
> > 6. VM Reset
> > 
> > No change or requirement for libvirt as this will be handled via VFIO reset API
> > and QEMU process will keep running as before.
> > 
> > 7. Hot-plug
> > 
> > It is optional for vendors to support hot-plug.
> > 
> > And the same syntax is used to create a virtual device for hot-plug.
> > 
> > For hot-unplug, after executing the QEMU monitor "device_del" command, libvirt
> > needs to write to the "destroy" sysfs file to complete the hot-unplug process.
> > 
> > Since hot-plug is optional, the mdev_create or mdev_destroy operations may
> > return an error if it is not supported.
> 
> 
> From what I understand here, it sounds like what's needed from libvirt is
> 
> 1) exposing enough info in the output of nodedev-dumpxml for an application
> to use it to determine which devices are capable of creating vGPUs, and
> which models of vGPU they can create.
> 
> 
>  2) to create+start (then later stop+destroy) individual vGPUs based on
> [something] in the domain XML. So the question that remains is how to put it
> in the domain config. My first instinct was to use some variation of
> <hostdev> (since the backend of it is vfio-pci), but on the other hand
> hostdev is usually used to take one device that could be used by the host,
> take it away from the host, and give it to the guest, and that's not exactly
> what's happening here. So I wonder if there would be any advantage to making
> this another model of <video> instead.

hostdev can be a sysfs path now, right?

Thanks,
Neo

> 


* Re: [Qemu-devel] [RFC] libvirt vGPU QEMU integration
  2016-08-18 16:41 [Qemu-devel] [RFC] libvirt vGPU QEMU integration Neo Jia
  2016-08-19 12:42 ` [Qemu-devel] [libvirt] " Michal Privoznik
  2016-08-19 19:22 ` Laine Stump
@ 2016-08-24 22:29 ` Daniel P. Berrange
  2016-08-25 15:18   ` [Qemu-devel] [libvirt] " Laine Stump
  2 siblings, 1 reply; 7+ messages in thread
From: Daniel P. Berrange @ 2016-08-24 22:29 UTC (permalink / raw)
  To: Neo Jia
  Cc: libvir-list, ACurrid, kevin.tian, qemu-devel, kwankhede,
	jike.song, alex.williamson, kraxel, pbonzini, bjsdjshi

On Thu, Aug 18, 2016 at 09:41:59AM -0700, Neo Jia wrote:
> Hi libvirt experts,
> 
> I am starting this email thread to discuss the potential solution / proposal of
> integrating vGPU support into libvirt for QEMU.
> 
> Some quick background, NVIDIA is implementing a VFIO based mediated device
> framework to allow people to virtualize their devices without SR-IOV, for
> example NVIDIA vGPU, and Intel KVMGT. Within this framework, we are reusing the
> VFIO API to process the memory / interrupt as what QEMU does today with passthru
> device.
> 
> The difference here is that we are introducing a set of new sysfs file for
> virtual device discovery and life cycle management due to its virtual nature.
> 
> Here is the summary of the sysfs, when they will be created and how they should
> be used:
> 
> 1. Discover mediated device
> 
> As part of physical device initialization process, vendor driver will register
> their physical devices, which will be used to create virtual device (mediated
> device, aka mdev) to the mediated framework.
> 
> Then, the sysfs file "mdev_supported_types" will be available under the physical
> device sysfs, and it will indicate the supported mdev and configuration for this 
> particular physical device, and the content may change dynamically based on the
> system's current configurations, so libvirt needs to query this file every time
> before create a mdev.
> 
> Note: different vendors might have their own specific configuration sysfs as
> well, if they don't have pre-defined types.
> 
> For example, we have a NVIDIA Tesla M60 on 86:00.0 here registered, and here is
> NVIDIA specific configuration on an idle system.
> 
> For example, to query the "mdev_supported_types" on this Tesla M60:
> 
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> max_resolution
> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160

I'm unclear on the requirements for the data format of this file.
Looking at the docs:

  http://www.spinics.net/lists/kvm/msg136476.html

the format is completely unspecified.

> 
> 2. Create/destroy mediated device
> 
> Two sysfs files are available under the physical device sysfs path : mdev_create
> and mdev_destroy
> 
> The syntax of creating a mdev is:
> 
>     echo "$mdev_UUID:vendor_specific_argument_list" >
> /sys/bus/pci/devices/.../mdev_create

I'm not really a fan of the idea of having to provide arbitrary vendor
specific arguments to the mdev_create call, as I don't really want to
have to create vendor specific code for each vendor's vGPU hardware in
libvirt.

What is the relationship between the mdev_supported_types data and
the vendor_specific_argument_list requirements ?


> The syntax of destroying a mdev is:
> 
>     echo "$mdev_UUID:vendor_specific_argument_list" >
> /sys/bus/pci/devices/.../mdev_destroy
> 
> The $mdev_UUID is a unique identifier for this mdev device to be created, and it
> is unique per system.
> 
> For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
> above Tesla M60 output), and a VM UUID to be passed as
> "vendor_specific_argument_list".
> 
> If there is no vendor specific arguments required, either "$mdev_UUID" or
> "$mdev_UUID:" will be acceptable as input syntax for the above two commands.

This raises the question of how an application discovers what
vendor specific arguments are required or not, and what they
might mean.

> To create a M60-4Q device, libvirt needs to do:
> 
>     echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" >
> /sys/bus/pci/devices/0000\:86\:00.0/mdev_create

Overall it doesn't seem like the proposed kernel interfaces provide
enough vendor abstraction to be able to use this functionality without
having to create vendor specific code in libvirt, which is something
I want to avoid us doing.



Ignoring the details though, in terms of libvirt integration, I think I'd
see us primarily doing work in the node device APIs / XML. Specifically
for physical devices, we'd have to report whether they support the
mediated device feature and some way to enumerate the valid device
types that can be created. The node device creation API would have to
support creation/deletion of the devices (mapping to mdev_create/destroy).


When configuring a guest VM, we'd use the <hostdev> XML to point to one
or more mediated devices that have been created via the node device APIs
previously. When starting the guest, we'd set those mediated devices
online.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


* Re: [Qemu-devel] [libvirt]  [RFC] libvirt vGPU QEMU integration
  2016-08-24 22:29 ` [Qemu-devel] " Daniel P. Berrange
@ 2016-08-25 15:18   ` Laine Stump
  0 siblings, 0 replies; 7+ messages in thread
From: Laine Stump @ 2016-08-25 15:18 UTC (permalink / raw)
  To: libvir-list
  Cc: Daniel P. Berrange, Neo Jia, ACurrid, kevin.tian, qemu-devel,
	jike.song, kwankhede, kraxel, pbonzini, bjsdjshi

On 08/24/2016 06:29 PM, Daniel P. Berrange wrote:
> On Thu, Aug 18, 2016 at 09:41:59AM -0700, Neo Jia wrote:
>> Hi libvirt experts,
>>
>> I am starting this email thread to discuss the potential solution / proposal of
>> integrating vGPU support into libvirt for QEMU.
>>
>> Some quick background, NVIDIA is implementing a VFIO based mediated device
>> framework to allow people to virtualize their devices without SR-IOV, for
>> example NVIDIA vGPU, and Intel KVMGT. Within this framework, we are reusing the
>> VFIO API to process the memory / interrupt as what QEMU does today with passthru
>> device.
>>
>> The difference here is that we are introducing a set of new sysfs file for
>> virtual device discovery and life cycle management due to its virtual nature.
>>
>> Here is the summary of the sysfs, when they will be created and how they should
>> be used:
>>
>> 1. Discover mediated device
>>
>> As part of physical device initialization process, vendor driver will register
>> their physical devices, which will be used to create virtual device (mediated
>> device, aka mdev) to the mediated framework.
>>
>> Then, the sysfs file "mdev_supported_types" will be available under the physical
>> device sysfs, and it will indicate the supported mdev and configuration for this
>> particular physical device, and the content may change dynamically based on the
>> system's current configurations, so libvirt needs to query this file every time
>> before create a mdev.
>>
>> Note: different vendors might have their own specific configuration sysfs as
>> well, if they don't have pre-defined types.
>>
>> For example, we have a NVIDIA Tesla M60 on 86:00.0 here registered, and here is
>> NVIDIA specific configuration on an idle system.
>>
>> For example, to query the "mdev_supported_types" on this Tesla M60:
>>
>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
>> max_resolution
>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> I'm unclear on the requirements about data format for this file.
> Looking at the docs:
>
>    http://www.spinics.net/lists/kvm/msg136476.html
>
> the format is completely unspecified.
>
>> 2. Create/destroy mediated device
>>
>> Two sysfs files are available under the physical device sysfs path : mdev_create
>> and mdev_destroy
>>
>> The syntax of creating a mdev is:
>>
>>      echo "$mdev_UUID:vendor_specific_argument_list" >
>> /sys/bus/pci/devices/.../mdev_create
> I'm not really a fan of the idea of having to provide arbitrary vendor
> specific arguments to the mdev_create call, as I don't really want to
> have to create vendor specific code for each vendor's vGPU hardware in
> libvirt.
>
> What is the relationship between the mdev_supported_types data and
> the vendor_specific_argument_list requirements ?
>
>
>> The syntax of destroying a mdev is:
>>
>>      echo "$mdev_UUID:vendor_specific_argument_list" >
>> /sys/bus/pci/devices/.../mdev_destroy
>>
>> The $mdev_UUID is a unique identifier for this mdev device to be created, and it
>> is unique per system.
>>
>> For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
>> above Tesla M60 output), and a VM UUID to be passed as
>> "vendor_specific_argument_list".
>>
>> If there is no vendor specific arguments required, either "$mdev_UUID" or
>> "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
> This raises the question of how an application discovers what
> vendor specific arguments are required or not, and what they
> might mean.
>
>> To create a M60-4Q device, libvirt needs to do:
>>
>>      echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" >
>> /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
> Overall it doesn't seem like the proposed kernel interfaces provide
> enough vendor abstraction to be able to use this functionality without
> having to create vendor specific code in libvirt, which is something
> I want to avoid us doing.
>
>
>
> Ignoring the details though, in terms of libvirt integration, I think I'd
> see us primarily doing work in the node device APIs / XML. Specifically
> for physical devices, we'd have to report whether they support the
> mediated device feature and some way to enumerate the valid device
> types that can be created. The node device creation API would have to
> support creation/deletion of the devices (mapping to mdev_create/destroy).
>
>
> When configuring a guest VM, we'd use the <hostdev> XML to point to one
> or more mediated devices that have been created via the node device APIs
> previously.

I'd originally thought of this as having two separate points of support 
in libvirt as well:

In the node device driver:

   * reporting of mdev capabilities in the nodedev-dumpxml output of any 
physdev (would this be adequate for discovery?  It would, after all, 
require doing a nodedev-list of all devices, then nodedev-dumpxml of 
every PCI device to search the XML for presence of this capability)

  * new APIs to start a pool of mdevs and destroy a pool of mdevs ( 
would virNodeDeviceCreateXML()/virNodeDeviceDestroy() be adequate for 
this? They create/destroy just a single device, so would need to be 
called multiple times, once for each mdev, which seems a bit ugly, 
although accurate)

  * the addition of persistent config objects in the node device driver 
that can be started/destroyed/set to autostart [*]

In the qemu driver:

  * some additional attributes in <hostdev> to point to a particular 
pool of mdevs managed by the node device driver

  * code to turn those new hostdev attributes into the proper mdev 
start/stop sequence, and qemu commandline option or QMP command

After learning that the GPU driver on the host was already doing all 
systemwide initialization, I began thinking that maybe (as Neo suggests) 
we could get by without the 2nd and 3rd items in the list for the node 
device driver - instead doing something more analogous to <hostdev 
managed='yes'>, where the mdev creation happens on demand (just like 
binding of a PCI device to the vfio-pci driver happens on demand).
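
The on-demand sequence being weighed here, i.e. steps 2 through 5 of the
proposal, can be sketched end to end. This runs against a scratch directory
standing in for /sys so it is safe to execute anywhere; vgpu_type_id=20 and
the qemu command line are placeholders taken from the thread, not verified
values:

```shell
# Sketch of the on-demand mdev lifecycle (create -> online -> launch ->
# offline -> destroy) against a scratch directory standing in for /sys.
SYS=$(mktemp -d)
PHYS="$SYS/bus/pci/devices/0000:86:00.0"
mkdir -p "$PHYS"
MDEV_UUID=$(cat /proc/sys/kernel/random/uuid 2>/dev/null || uuidgen)
VM_UUID=$(cat /proc/sys/kernel/random/uuid 2>/dev/null || uuidgen)

# 2. create the mdev (on a real host the kernel creates the device dir)
echo "$MDEV_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" > "$PHYS/mdev_create"
MDEV="$SYS/bus/mdev/devices/$MDEV_UUID"
mkdir -p "$MDEV"

# 3. bring it online (repeated per mdev if the VM has several)
echo 1 > "$MDEV/online"

# 4. launch QEMU pointing at the mdev sysfs path (dry run here)
echo "would exec: qemu-system-x86_64 -device vfio-pci,sysfsdev=$MDEV,id=vgpu0"

# 5. shutdown sequence: offline first, then destroy
echo 0 > "$MDEV/online"
echo "$MDEV_UUID" > "$PHYS/mdev_destroy"
```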

I still have an uneasy feeling about creating mdevs on demand at domain 
startup though because, as I pointed out in my previous email in this 
thread, one problem is that while a GPU may be *potentially* capable of 
supporting several different models of vGPU, once the first vGPU is 
created, all subsequent vGPUs are restricted to being the same model as 
the first, which could lead to unexpected surprises.

On the other hand, on-demand creation could be seen as more flexible, 
especially if the GPU driver were to ever gain the ability to have 
heterogeneous collections of vGPUs. I also wonder how much of a resource 
burden it is to have a bunch of unused mdevs sitting around - is there 
any performance (or memory usage) disadvantage to having e.g. 16 vGPUs 
created vs. 2, if only 2 are currently in use?

========

[*] Currently the node device driver has virNodeDeviceCreateXML() and 
virNodeDeviceDestroy(), but those are so far only used to tell udev to 
create Fibre Channel "vports", and there is no persistent config stored 
in libvirt for this - (does udev store persistent config for it? Or must 
it be redone at each host system reboot?). There is no place to define a 
set of devices that should be automatically created at boot time / 
libvirtd start  (i.e. analogous to virNetworkDefineFlags() + setting 
autostart for a network). This is what would be needed - 
virNodeDeviceDefineFlags() (and accompanying persistent object storage), 
virNodeDeviceSetAutostart(), and virNodeDeviceGetAutostart().

(NB: this could also be useful for setting the max. # of VFs for an 
SRIOV PF, although it's unclear exactly *how* - the old method of doing 
that (kernel driver module commandline arguments) is non-standard and 
deprecated, and there is no standard location for persistent config to 
set it up using the new method (sysfs)). In this case, the device 
already exists (the SRIOV PF), and it just needs one of its sysfs 
entries modified (it wouldn't really make sense to nodedev-define each 
VF separately, because the kernel API just doesn't work that way - you 
don't add each VF separately, you just set sriov_numvfs in the PF's 
sysfs to 0, then set it to the number of VFs that are desired. So I'm 
not sure how to shoehorn that into the idea of "creating a new node device")
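
The sriov_numvfs behavior described above can be sketched directly; the
reset-to-zero-first step reflects how the kernel's sysfs interface works,
while the scratch directory merely stands in for the PF's real directory
under /sys/bus/pci/devices:

```shell
# Sketch of the sysfs method for setting the VF count on an SR-IOV PF:
# you don't add VFs one by one, you write 0 to sriov_numvfs and then
# write the total you want.
set_numvfs() {   # $1 = PF sysfs dir, $2 = desired VF count
    echo 0    > "$1/sriov_numvfs"   # kernel requires dropping to zero first
    echo "$2" > "$1/sriov_numvfs"
}

PF=$(mktemp -d)                     # stand-in for /sys/bus/pci/devices/<addr>
: > "$PF/sriov_numvfs"
set_numvfs "$PF" 4
cat "$PF/sriov_numvfs"              # prints: 4
```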


end of thread, other threads:[~2016-08-25 15:19 UTC | newest]

Thread overview: 7+ messages
2016-08-18 16:41 [Qemu-devel] [RFC] libvirt vGPU QEMU integration Neo Jia
2016-08-19 12:42 ` [Qemu-devel] [libvirt] " Michal Privoznik
2016-08-22  5:40   ` Neo Jia
2016-08-19 19:22 ` Laine Stump
2016-08-22  5:41   ` Neo Jia
2016-08-24 22:29 ` [Qemu-devel] " Daniel P. Berrange
2016-08-25 15:18   ` [Qemu-devel] [libvirt] " Laine Stump
