linux-kernel.vger.kernel.org archive mirror
* [RFC]Add new mdev interface for QoS
@ 2017-07-26 13:16 Gao, Ping A
  2017-07-26 16:43 ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Gao, Ping A @ 2017-07-26 13:16 UTC (permalink / raw)
  To: kwankhede, alex.williamson, kvm, linux-kernel
  Cc: Tian, Kevin, Zhenyu Wang, Jike Song

The vfio-mdev framework provides the capability to let different guests
share the same physical device through mediated sharing. As a result, it
brings a requirement about how to control the device sharing: we need a
QoS-related interface for mdev to manage virtual device resources.

E.g. in practical use, vGPUs assigned to different guests usually have
different performance requirements: some guests may need higher priority
for real-time usage, others may need a larger portion of the GPU
resource to get higher 3D performance. Correspondingly, we can define
interfaces like weight/cap for overall budget control and priority for
single-submission control.

So I suggest adding some common, vendor-agnostic attributes to the mdev
core sysfs for QoS purposes.

-Ping


* Re: [RFC]Add new mdev interface for QoS
  2017-07-26 13:16 [RFC]Add new mdev interface for QoS Gao, Ping A
@ 2017-07-26 16:43 ` Alex Williamson
  2017-07-27 16:00   ` Gao, Ping A
  2017-07-27 16:17   ` [libvirt] " Daniel P. Berrange
  0 siblings, 2 replies; 18+ messages in thread
From: Alex Williamson @ 2017-07-26 16:43 UTC (permalink / raw)
  To: Gao, Ping A
  Cc: kwankhede, kvm, linux-kernel, Tian, Kevin, Zhenyu Wang,
	Jike Song, libvir-list

[cc +libvir-list]

On Wed, 26 Jul 2017 21:16:59 +0800
"Gao, Ping A" <ping.a.gao@intel.com> wrote:

> The vfio-mdev provide the capability to let different guest share the
> same physical device through mediate sharing, as result it bring a
> requirement about how to control the device sharing, we need a QoS
> related interface for mdev to management virtual device resource.
> 
> E.g. In practical use, vGPUs assigned to different quests almost has
> different performance requirements, some guests may need higher priority
> for real time usage, some other may need more portion of the GPU
> resource to get higher 3D performance, corresponding we can define some
> interfaces like weight/cap for overall budget control, priority for
> single submission control.
> 
> So I suggest to add some common attributes which are vendor agnostic in
> mdev core sysfs for QoS purpose.

I think what you're asking for is just some standardization of a QoS
attribute_group which a vendor can optionally include within the
existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
transparently enable this, but it really only provides the standard;
all of the support code is left to the vendor.  I'm fine with that,
but of course the trouble with any sort of standardization is arriving
at an agreed upon standard.  Are there QoS knobs that are generic
across any mdev device type?  Are there others that are more specific
to vGPU?  Are there existing examples of this whose specification we
can steal?
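
To make that concrete, here's a rough sketch of the sort of hook I mean.
Everything in it is made up for illustration: the "qos" group name, the
qos_cap attribute and the my_vendor_*() helpers are not an agreed
standard, only the mdev_parent_ops.mdev_attr_groups plumbing is real.

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/mdev.h>
#include <linux/module.h>

/* hypothetical vendor helpers, not real APIs */
extern unsigned int my_vendor_get_cap(struct device *dev);
extern void my_vendor_set_cap(struct device *dev, unsigned int cap);

static ssize_t qos_cap_show(struct device *dev,
                            struct device_attribute *attr, char *buf)
{
        /* vendor driver reports the current cap for this mdev device */
        return sprintf(buf, "%u\n", my_vendor_get_cap(dev));
}

static ssize_t qos_cap_store(struct device *dev,
                             struct device_attribute *attr,
                             const char *buf, size_t count)
{
        unsigned int cap;

        if (kstrtouint(buf, 10, &cap) || cap > 100)
                return -EINVAL;
        my_vendor_set_cap(dev, cap);    /* vendor-specific enforcement */
        return count;
}
static DEVICE_ATTR_RW(qos_cap);

static struct attribute *qos_attrs[] = {
        &dev_attr_qos_cap.attr,
        NULL,
};

static const struct attribute_group qos_attr_group = {
        .name  = "qos",                 /* shows up as a "qos/" subdirectory */
        .attrs = qos_attrs,
};

static const struct attribute_group *my_mdev_groups[] = {
        &qos_attr_group,
        NULL,
};

static const struct mdev_parent_ops my_parent_ops = {
        .owner            = THIS_MODULE,
        .mdev_attr_groups = my_mdev_groups,
        /* .create, .remove, ... as usual */
};

Weight or priority attributes would follow the same pattern; the mdev
core only creates the files, all of the enforcement stays in the vendor
driver.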

Also, mdev devices are not necessarily the exclusive users of the
hardware, we can have a native user such as a local X client.  They're
not an mdev user, so we can't support them via the mdev_attr_group.
Does there need to be a per mdev parent QoS attribute_group standard
for somehow defining the QoS of all the child mdev devices, or perhaps
representing the remaining host QoS attributes?

Ultimately libvirt and upper-level management tools would be the
consumers of these control knobs, so let's immediately get libvirt
involved in the discussion.  Thanks,

Alex


* Re: [RFC]Add new mdev interface for QoS
  2017-07-26 16:43 ` Alex Williamson
@ 2017-07-27 16:00   ` Gao, Ping A
  2017-08-01  5:54     ` Gao, Ping A
  2017-07-27 16:17   ` [libvirt] " Daniel P. Berrange
  1 sibling, 1 reply; 18+ messages in thread
From: Gao, Ping A @ 2017-07-27 16:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, kvm, linux-kernel, Tian, Kevin, Zhenyu Wang,
	Jike Song, libvir-list


On 2017/7/27 0:43, Alex Williamson wrote:
> [cc +libvir-list]
>
> On Wed, 26 Jul 2017 21:16:59 +0800
> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>
>> The vfio-mdev provide the capability to let different guest share the
>> same physical device through mediate sharing, as result it bring a
>> requirement about how to control the device sharing, we need a QoS
>> related interface for mdev to management virtual device resource.
>>
>> E.g. In practical use, vGPUs assigned to different quests almost has
>> different performance requirements, some guests may need higher priority
>> for real time usage, some other may need more portion of the GPU
>> resource to get higher 3D performance, corresponding we can define some
>> interfaces like weight/cap for overall budget control, priority for
>> single submission control.
>>
>> So I suggest to add some common attributes which are vendor agnostic in
>> mdev core sysfs for QoS purpose.
> I think what you're asking for is just some standardization of a QoS
> attribute_group which a vendor can optionally include within the
> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
> transparently enable this, but it really only provides the standard,
> all of the support code is left for the vendor.  I'm fine with that,
> but of course the trouble with and sort of standardization is arriving
> at an agreed upon standard.  Are there QoS knobs that are generic
> across any mdev device type?  Are there others that are more specific
> to vGPU?  Are there existing examples of this that we can steal their
> specification?

Yes, you are right, standardized QoS knobs are exactly what I want.
Only when they become part of the mdev framework and libvirt can such a
critical feature as QoS be leveraged in cloud usage. HW vendors then only
need to focus on implementing the corresponding QoS algorithm in their
back-end driver.

The vfio-mdev framework provides the capability to share a device that
lacks HW virtualization support among guests, no matter the device type.
Mediated sharing is essentially a time-sharing multiplexing method; from
this point of view, QoS can be taken as a generic way to control how much
time a virtual mdev device may occupy the HW. As a result, we can define
QoS knobs that are generic across any device type. Even if the HW has
some kind of built-in QoS support, I think it's not a problem for the
back-end driver to translate the standard mdev QoS definition into its
own specification and reach the same performance expectation. There seem
to be no existing examples for us to follow, so we need to define it from
scratch.

I propose universal QoS control interfaces like the following:

Cap: The cap limits the maximum percentage of time an mdev device can
own the physical device. E.g. cap=60 means the mdev device cannot take
more than 60% of the total physical resource.

Weight: The weight defines proportional control of mdev device
resources between guests; it's orthogonal to Cap and targets load
balancing. E.g. if guest 1 should get double the mdev device resources
compared with guest 2, set the weight ratio to 2:1.

Priority: The guest with higher priority gets execution first, targeting
real-time usage and faster interactive response.

The QoS interfaces above cover both overall budget control and
single-submission control. I will send out a detailed design later once
we get aligned.
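
As a rough illustration of how these could surface per mdev device in
sysfs (the "qos" directory and file names below are only placeholders,
not part of any design yet):

  /sys/bus/mdev/devices/<uuid>/qos/cap       (0-100, max share of device time)
  /sys/bus/mdev/devices/<uuid>/qos/weight    (relative share vs. other mdev devices)
  /sys/bus/mdev/devices/<uuid>/qos/priority  (scheduling priority hint)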

> Also, mdev devices are not necessarily the exclusive users of the
> hardware, we can have a native user such as a local X client.  They're
> not an mdev user, so we can't support them via the mdev_attr_group.
> Does there need to be a per mdev parent QoS attribute_group standard
> for somehow defining the QoS of all the child mdev devices, or perhaps
> representing the remaining host QoS attributes?

That's really an open question. If we don't take the host workload into
consideration for cloud usage, it's not a problem any more; however, such
an assumption is not reasonable. Anyway, if we take mdev devices as
clients of the host driver, and the host driver provides the capability
to carve out a portion of the HW resource for mdev devices, then we only
need to take care of the resource that the host assigned to mdev devices.
Following this approach, QoS for mdev focuses on the relationship between
mdev devices and does not need to take care of the host workload.

-Ping

> Ultimately libvirt and upper level management tools would be the
> consumer of these control knobs, so let's immediately get libvirt
> involved in the discussion.  Thanks,
>
> Alex


* Re: [libvirt] [RFC]Add new mdev interface for QoS
  2017-07-26 16:43 ` Alex Williamson
  2017-07-27 16:00   ` Gao, Ping A
@ 2017-07-27 16:17   ` Daniel P. Berrange
  2017-07-27 18:01     ` Alex Williamson
  1 sibling, 1 reply; 18+ messages in thread
From: Daniel P. Berrange @ 2017-07-27 16:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Gao, Ping A, Tian, Kevin, kvm, libvir-list, Jike Song,
	Zhenyu Wang, linux-kernel, kwankhede

On Wed, Jul 26, 2017 at 10:43:43AM -0600, Alex Williamson wrote:
> [cc +libvir-list]
> 
> On Wed, 26 Jul 2017 21:16:59 +0800
> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> 
> > The vfio-mdev provide the capability to let different guest share the
> > same physical device through mediate sharing, as result it bring a
> > requirement about how to control the device sharing, we need a QoS
> > related interface for mdev to management virtual device resource.
> > 
> > E.g. In practical use, vGPUs assigned to different quests almost has
> > different performance requirements, some guests may need higher priority
> > for real time usage, some other may need more portion of the GPU
> > resource to get higher 3D performance, corresponding we can define some
> > interfaces like weight/cap for overall budget control, priority for
> > single submission control.
> > 
> > So I suggest to add some common attributes which are vendor agnostic in
> > mdev core sysfs for QoS purpose.
> 
> I think what you're asking for is just some standardization of a QoS
> attribute_group which a vendor can optionally include within the
> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
> transparently enable this, but it really only provides the standard,
> all of the support code is left for the vendor.  I'm fine with that,
> but of course the trouble with and sort of standardization is arriving
> at an agreed upon standard.  Are there QoS knobs that are generic
> across any mdev device type?  Are there others that are more specific
> to vGPU?  Are there existing examples of this that we can steal their
> specification?
> 
> Also, mdev devices are not necessarily the exclusive users of the
> hardware, we can have a native user such as a local X client.  They're
> not an mdev user, so we can't support them via the mdev_attr_group.
> Does there need to be a per mdev parent QoS attribute_group standard
> for somehow defining the QoS of all the child mdev devices, or perhaps
> representing the remaining host QoS attributes?
> 
> Ultimately libvirt and upper level management tools would be the
> consumer of these control knobs, so let's immediately get libvirt
> involved in the discussion.  Thanks,

My view on this from libvirt side is pretty much unchanged since the
last time we discussed this.

We would like the kernel maintainers to define standard sets of properties
for mdevs, whether global to all mdevs, or scoped to certain classes of
mdev (e.g. a class=gpu). These properties would be exported in sysfs, with
one file per property.

Libvirt can then explicitly map each standardized property into a suitable
XML element, to report on which properties are available to use when creating
an mdev. It would then allow them to be set at time of creation, and/or
changed on the fly for existing mdevs.

Specifically we would like to avoid generic passthrough of arbitrary
vendor specific properties.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [libvirt] [RFC]Add new mdev interface for QoS
  2017-07-27 16:17   ` [libvirt] " Daniel P. Berrange
@ 2017-07-27 18:01     ` Alex Williamson
  2017-07-28  8:10       ` Daniel P. Berrange
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2017-07-27 18:01 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Gao, Ping A, Tian, Kevin, kvm, libvir-list, Jike Song,
	Zhenyu Wang, linux-kernel, kwankhede

On Thu, 27 Jul 2017 17:17:48 +0100
"Daniel P. Berrange" <berrange@redhat.com> wrote:

> On Wed, Jul 26, 2017 at 10:43:43AM -0600, Alex Williamson wrote:
> > [cc +libvir-list]
> > 
> > On Wed, 26 Jul 2017 21:16:59 +0800
> > "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> >   
> > > The vfio-mdev provide the capability to let different guest share the
> > > same physical device through mediate sharing, as result it bring a
> > > requirement about how to control the device sharing, we need a QoS
> > > related interface for mdev to management virtual device resource.
> > > 
> > > E.g. In practical use, vGPUs assigned to different quests almost has
> > > different performance requirements, some guests may need higher priority
> > > for real time usage, some other may need more portion of the GPU
> > > resource to get higher 3D performance, corresponding we can define some
> > > interfaces like weight/cap for overall budget control, priority for
> > > single submission control.
> > > 
> > > So I suggest to add some common attributes which are vendor agnostic in
> > > mdev core sysfs for QoS purpose.  
> > 
> > I think what you're asking for is just some standardization of a QoS
> > attribute_group which a vendor can optionally include within the
> > existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
> > transparently enable this, but it really only provides the standard,
> > all of the support code is left for the vendor.  I'm fine with that,
> > but of course the trouble with and sort of standardization is arriving
> > at an agreed upon standard.  Are there QoS knobs that are generic
> > across any mdev device type?  Are there others that are more specific
> > to vGPU?  Are there existing examples of this that we can steal their
> > specification?
> > 
> > Also, mdev devices are not necessarily the exclusive users of the
> > hardware, we can have a native user such as a local X client.  They're
> > not an mdev user, so we can't support them via the mdev_attr_group.
> > Does there need to be a per mdev parent QoS attribute_group standard
> > for somehow defining the QoS of all the child mdev devices, or perhaps
> > representing the remaining host QoS attributes?
> > 
> > Ultimately libvirt and upper level management tools would be the
> > consumer of these control knobs, so let's immediately get libvirt
> > involved in the discussion.  Thanks,  
> 
> My view on this from libvirt side is pretty much unchanged since the
> last time we discussed this.
> 
> We would like the kernel maintainers to define standard sets of properties
> for mdevs, whether global to all mdevs, or scoped to certain classes of
> mdev (eg a class=gpu). These properties would be exported in sysfs, with
> one file per property.

Yes, I think that much of the mechanics are obvious (standardized
sysfs layout, one property per file, properties under the device node
in sysfs, etc).  Are you saying that you don't want to be consulted on
which properties are exposed and how they operate and therefore won't
complain regardless of what we implement in the kernel? ;)

I'm hoping that libvirt folks have some experience managing basic
scheduling level QoS attributes and might have some input as to what
sorts of things work well vs what seems like a good idea, but falls
apart or isn't useful in practice.

> Libvirt can then explicitly map each standardized property into a suitable
> XML element, to report on which properties are available to use when creating
> an mdev. It would then allow them to be set at time of creation, and/or
> changed on the fly for existing mdevs.
> 
> Specifically we would like to avoid generic passthrough of arbitrary
> vendor specific properties.

Of course, I think that's the intent here.  Thanks,

Alex


* Re: [libvirt] [RFC]Add new mdev interface for QoS
  2017-07-27 18:01     ` Alex Williamson
@ 2017-07-28  8:10       ` Daniel P. Berrange
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel P. Berrange @ 2017-07-28  8:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Gao, Ping A, Tian, Kevin, kvm, libvir-list, Jike Song,
	Zhenyu Wang, linux-kernel, kwankhede

On Thu, Jul 27, 2017 at 12:01:58PM -0600, Alex Williamson wrote:
> On Thu, 27 Jul 2017 17:17:48 +0100
> "Daniel P. Berrange" <berrange@redhat.com> wrote:
> 
> > On Wed, Jul 26, 2017 at 10:43:43AM -0600, Alex Williamson wrote:
> > > [cc +libvir-list]
> > > 
> > > On Wed, 26 Jul 2017 21:16:59 +0800
> > > "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> > >   
> > > > The vfio-mdev provide the capability to let different guest share the
> > > > same physical device through mediate sharing, as result it bring a
> > > > requirement about how to control the device sharing, we need a QoS
> > > > related interface for mdev to management virtual device resource.
> > > > 
> > > > E.g. In practical use, vGPUs assigned to different quests almost has
> > > > different performance requirements, some guests may need higher priority
> > > > for real time usage, some other may need more portion of the GPU
> > > > resource to get higher 3D performance, corresponding we can define some
> > > > interfaces like weight/cap for overall budget control, priority for
> > > > single submission control.
> > > > 
> > > > So I suggest to add some common attributes which are vendor agnostic in
> > > > mdev core sysfs for QoS purpose.  
> > > 
> > > I think what you're asking for is just some standardization of a QoS
> > > attribute_group which a vendor can optionally include within the
> > > existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
> > > transparently enable this, but it really only provides the standard,
> > > all of the support code is left for the vendor.  I'm fine with that,
> > > but of course the trouble with and sort of standardization is arriving
> > > at an agreed upon standard.  Are there QoS knobs that are generic
> > > across any mdev device type?  Are there others that are more specific
> > > to vGPU?  Are there existing examples of this that we can steal their
> > > specification?
> > > 
> > > Also, mdev devices are not necessarily the exclusive users of the
> > > hardware, we can have a native user such as a local X client.  They're
> > > not an mdev user, so we can't support them via the mdev_attr_group.
> > > Does there need to be a per mdev parent QoS attribute_group standard
> > > for somehow defining the QoS of all the child mdev devices, or perhaps
> > > representing the remaining host QoS attributes?
> > > 
> > > Ultimately libvirt and upper level management tools would be the
> > > consumer of these control knobs, so let's immediately get libvirt
> > > involved in the discussion.  Thanks,  
> > 
> > My view on this from libvirt side is pretty much unchanged since the
> > last time we discussed this.
> > 
> > We would like the kernel maintainers to define standard sets of properties
> > for mdevs, whether global to all mdevs, or scoped to certain classes of
> > mdev (eg a class=gpu). These properties would be exported in sysfs, with
> > one file per property.
> 
> Yes, I think that much of the mechanics are obvious (standardized
> sysfs layout, one property per file, properties under the device node
> in sysfs, etc).  Are you saying that you don't want to be consulted on
> which properties are exposed and how they operate and therefore won't
> complain regardless of what we implement in the kernel? ;)

Well, ultimately the kernel maintainers know what is possible from the
hardware / driver POV, so yeah, I think we can mostly leave it up to
you what individual things need to be exposed - not a very different
scenario from all the knobs we already expose for physical devices.


> I'm hoping that libvirt folks have some experience managing basic
> scheduling level QoS attributes and might have some input as to what
> sorts of things work well vs what seems like a good idea, but falls
> apart or isn't useful in practice.

Sure, happy to give feedback where desired.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [RFC]Add new mdev interface for QoS
  2017-07-27 16:00   ` Gao, Ping A
@ 2017-08-01  5:54     ` Gao, Ping A
  2017-08-01 22:26       ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Gao, Ping A @ 2017-08-01  5:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kwankhede, kvm, linux-kernel, Tian, Kevin, Zhenyu Wang,
	Jike Song, libvir-list, zhi.a.wang


On 2017/7/28 0:00, Gao, Ping A wrote:
> On 2017/7/27 0:43, Alex Williamson wrote:
>> [cc +libvir-list]
>>
>> On Wed, 26 Jul 2017 21:16:59 +0800
>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>
>>> The vfio-mdev provide the capability to let different guest share the
>>> same physical device through mediate sharing, as result it bring a
>>> requirement about how to control the device sharing, we need a QoS
>>> related interface for mdev to management virtual device resource.
>>>
>>> E.g. In practical use, vGPUs assigned to different quests almost has
>>> different performance requirements, some guests may need higher priority
>>> for real time usage, some other may need more portion of the GPU
>>> resource to get higher 3D performance, corresponding we can define some
>>> interfaces like weight/cap for overall budget control, priority for
>>> single submission control.
>>>
>>> So I suggest to add some common attributes which are vendor agnostic in
>>> mdev core sysfs for QoS purpose.
>> I think what you're asking for is just some standardization of a QoS
>> attribute_group which a vendor can optionally include within the
>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
>> transparently enable this, but it really only provides the standard,
>> all of the support code is left for the vendor.  I'm fine with that,
>> but of course the trouble with and sort of standardization is arriving
>> at an agreed upon standard.  Are there QoS knobs that are generic
>> across any mdev device type?  Are there others that are more specific
>> to vGPU?  Are there existing examples of this that we can steal their
>> specification?
> Yes, you are right, standardization QoS knobs are exactly what I wanted.
> Only when it become a part of the mdev framework and libvirt, then QoS
> such critical feature can be leveraged by cloud usage. HW vendor only
> need to focus on the implementation of the corresponding QoS algorithm
> in their back-end driver.
>
> Vfio-mdev framework provide the capability to share the device that lack
> of HW virtualization support to guests, no matter the device type,
> mediated sharing actually is a time sharing multiplex method, from this
> point of view, QoS can be take as a generic way about how to control the
> time assignment for virtual mdev device that occupy HW. As result we can
> define QoS knob generic across any device type by this way. Even if HW
> has build in with some kind of QoS support, I think it's not a problem
> for back-end driver to convert mdev standard QoS definition to their
> specification to reach the same performance expectation. Seems there are
> no examples for us to follow, we need define it from scratch.
>
> I proposal universal QoS control interfaces like below:
>
> Cap: The cap limits the maximum percentage of time a mdev device can own
> physical device. e.g. cap=60, means mdev device cannot take over 60% of
> total physical resource.
>
> Weight: The weight define proportional control of the mdev device
> resource between guests, it’s orthogonal with Cap, to target load
> balancing. E.g. if guest 1 should take double mdev device resource
> compare with guest 2, need set weight ratio to 2:1.
>
> Priority: The guest who has higher priority will get execution first,
> target to some real time usage and speeding interactive response.
>
> Above QoS interfaces cover both overall budget control and single
> submission control. I will sent out detail design later once get aligned.

Hi Alex,
Any comments about the interface mentioned above?

>> Also, mdev devices are not necessarily the exclusive users of the
>> hardware, we can have a native user such as a local X client.  They're
>> not an mdev user, so we can't support them via the mdev_attr_group.
>> Does there need to be a per mdev parent QoS attribute_group standard
>> for somehow defining the QoS of all the child mdev devices, or perhaps
>> representing the remaining host QoS attributes?
> That's really an open, if we don't take host workload into consideration
> for cloud usage, it's not a problem any more, however such assumption is
> not reasonable. Any way if we take mdev devices as clients of host
> driver, and host driver provide the capability to divide out a portion
> HW resource to mdev devices, then it's only need to take care about the
> resource that host assigned for mdev devices. Follow this way QoS for
> mdev focus on the relationship between mdev devices no need to take care
> the host workload.
>
> -Ping
>
>> Ultimately libvirt and upper level management tools would be the
>> consumer of these control knobs, so let's immediately get libvirt
>> involved in the discussion.  Thanks,
>>
>> Alex


* Re: [RFC]Add new mdev interface for QoS
  2017-08-01  5:54     ` Gao, Ping A
@ 2017-08-01 22:26       ` Alex Williamson
  2017-08-02  2:50         ` Tian, Kevin
  2017-08-02 10:19         ` Kirti Wankhede
  0 siblings, 2 replies; 18+ messages in thread
From: Alex Williamson @ 2017-08-01 22:26 UTC (permalink / raw)
  To: Gao, Ping A
  Cc: kwankhede, kvm, linux-kernel, Tian, Kevin, Zhenyu Wang,
	Jike Song, libvir-list, zhi.a.wang

On Tue, 1 Aug 2017 13:54:27 +0800
"Gao, Ping A" <ping.a.gao@intel.com> wrote:

> On 2017/7/28 0:00, Gao, Ping A wrote:
> > On 2017/7/27 0:43, Alex Williamson wrote:  
> >> [cc +libvir-list]
> >>
> >> On Wed, 26 Jul 2017 21:16:59 +0800
> >> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> >>  
> >>> The vfio-mdev provide the capability to let different guest share the
> >>> same physical device through mediate sharing, as result it bring a
> >>> requirement about how to control the device sharing, we need a QoS
> >>> related interface for mdev to management virtual device resource.
> >>>
> >>> E.g. In practical use, vGPUs assigned to different quests almost has
> >>> different performance requirements, some guests may need higher priority
> >>> for real time usage, some other may need more portion of the GPU
> >>> resource to get higher 3D performance, corresponding we can define some
> >>> interfaces like weight/cap for overall budget control, priority for
> >>> single submission control.
> >>>
> >>> So I suggest to add some common attributes which are vendor agnostic in
> >>> mdev core sysfs for QoS purpose.  
> >> I think what you're asking for is just some standardization of a QoS
> >> attribute_group which a vendor can optionally include within the
> >> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
> >> transparently enable this, but it really only provides the standard,
> >> all of the support code is left for the vendor.  I'm fine with that,
> >> but of course the trouble with and sort of standardization is arriving
> >> at an agreed upon standard.  Are there QoS knobs that are generic
> >> across any mdev device type?  Are there others that are more specific
> >> to vGPU?  Are there existing examples of this that we can steal their
> >> specification?  
> > Yes, you are right, standardization QoS knobs are exactly what I wanted.
> > Only when it become a part of the mdev framework and libvirt, then QoS
> > such critical feature can be leveraged by cloud usage. HW vendor only
> > need to focus on the implementation of the corresponding QoS algorithm
> > in their back-end driver.
> >
> > Vfio-mdev framework provide the capability to share the device that lack
> > of HW virtualization support to guests, no matter the device type,
> > mediated sharing actually is a time sharing multiplex method, from this
> > point of view, QoS can be take as a generic way about how to control the
> > time assignment for virtual mdev device that occupy HW. As result we can
> > define QoS knob generic across any device type by this way. Even if HW
> > has build in with some kind of QoS support, I think it's not a problem
> > for back-end driver to convert mdev standard QoS definition to their
> > specification to reach the same performance expectation. Seems there are
> > no examples for us to follow, we need define it from scratch.
> >
> > I proposal universal QoS control interfaces like below:
> >
> > Cap: The cap limits the maximum percentage of time a mdev device can own
> > physical device. e.g. cap=60, means mdev device cannot take over 60% of
> > total physical resource.
> >
> > Weight: The weight define proportional control of the mdev device
> > resource between guests, it’s orthogonal with Cap, to target load
> > balancing. E.g. if guest 1 should take double mdev device resource
> > compare with guest 2, need set weight ratio to 2:1.
> >
> > Priority: The guest who has higher priority will get execution first,
> > target to some real time usage and speeding interactive response.
> >
> > Above QoS interfaces cover both overall budget control and single
> > submission control. I will sent out detail design later once get aligned.  
> 
> Hi Alex,
> Any comments about the interface mentioned above?

Not really.

Kirti, are there any QoS knobs that would be interesting
for NVIDIA devices?

Implementing libvirt support at the same time might be an interesting
exercise if we don't have a second user in the kernel to validate
against.  We could at least have two communities reviewing the feature
then.  Thanks,

Alex


* RE: [RFC]Add new mdev interface for QoS
  2017-08-01 22:26       ` Alex Williamson
@ 2017-08-02  2:50         ` Tian, Kevin
  2017-08-02 10:19         ` Kirti Wankhede
  1 sibling, 0 replies; 18+ messages in thread
From: Tian, Kevin @ 2017-08-02  2:50 UTC (permalink / raw)
  To: Alex Williamson, Gao, Ping A
  Cc: kwankhede, kvm, linux-kernel, Zhenyu Wang, Song, Jike,
	libvir-list, Wang, Zhi A

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, August 2, 2017 6:26 AM
> 
> On Tue, 1 Aug 2017 13:54:27 +0800
> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> 
> > On 2017/7/28 0:00, Gao, Ping A wrote:
> > > On 2017/7/27 0:43, Alex Williamson wrote:
> > >> [cc +libvir-list]
> > >>
> > >> On Wed, 26 Jul 2017 21:16:59 +0800
> > >> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> > >>
> > >>> The vfio-mdev provide the capability to let different guest share the
> > >>> same physical device through mediate sharing, as result it bring a
> > >>> requirement about how to control the device sharing, we need a QoS
> > >>> related interface for mdev to management virtual device resource.
> > >>>
> > >>> E.g. In practical use, vGPUs assigned to different quests almost has
> > >>> different performance requirements, some guests may need higher
> priority
> > >>> for real time usage, some other may need more portion of the GPU
> > >>> resource to get higher 3D performance, corresponding we can define
> some
> > >>> interfaces like weight/cap for overall budget control, priority for
> > >>> single submission control.
> > >>>
> > >>> So I suggest to add some common attributes which are vendor agnostic
> in
> > >>> mdev core sysfs for QoS purpose.
> > >> I think what you're asking for is just some standardization of a QoS
> > >> attribute_group which a vendor can optionally include within the
> > >> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
> > >> transparently enable this, but it really only provides the standard,
> > >> all of the support code is left for the vendor.  I'm fine with that,
> > >> but of course the trouble with and sort of standardization is arriving
> > >> at an agreed upon standard.  Are there QoS knobs that are generic
> > >> across any mdev device type?  Are there others that are more specific
> > >> to vGPU?  Are there existing examples of this that we can steal their
> > >> specification?
> > > Yes, you are right, standardization QoS knobs are exactly what I wanted.
> > > Only when it become a part of the mdev framework and libvirt, then QoS
> > > such critical feature can be leveraged by cloud usage. HW vendor only
> > > need to focus on the implementation of the corresponding QoS algorithm
> > > in their back-end driver.
> > >
> > > Vfio-mdev framework provide the capability to share the device that lack
> > > of HW virtualization support to guests, no matter the device type,
> > > mediated sharing actually is a time sharing multiplex method, from this
> > > point of view, QoS can be take as a generic way about how to control the
> > > time assignment for virtual mdev device that occupy HW. As result we can
> > > define QoS knob generic across any device type by this way. Even if HW
> > > has build in with some kind of QoS support, I think it's not a problem
> > > for back-end driver to convert mdev standard QoS definition to their
> > > specification to reach the same performance expectation. Seems there
> are
> > > no examples for us to follow, we need define it from scratch.
> > >
> > > I proposal universal QoS control interfaces like below:
> > >
> > > Cap: The cap limits the maximum percentage of time a mdev device can
> own
> > > physical device. e.g. cap=60, means mdev device cannot take over 60% of
> > > total physical resource.
> > >
> > > Weight: The weight define proportional control of the mdev device
> > > resource between guests, it’s orthogonal with Cap, to target load
> > > balancing. E.g. if guest 1 should take double mdev device resource
> > > compare with guest 2, need set weight ratio to 2:1.
> > >
> > > Priority: The guest who has higher priority will get execution first,
> > > target to some real time usage and speeding interactive response.
> > >
> > > Above QoS interfaces cover both overall budget control and single
> > > submission control. I will sent out detail design later once get aligned.
> >
> > Hi Alex,
> > Any comments about the interface mentioned above?
> 
> Not really.
> 
> Kirti, are there any QoS knobs that would be interesting
> for NVIDIA devices?
> 
> Implementing libvirt support at the same time might be an interesting
> exercise if we don't have a second user in the kernel to validate
> against.  We could at least have two communities reviewing the feature
> then.  Thanks,
> 

We planned to introduce new vdev types to indirectly validate
some features (e.g. weight and cap) in our device model, which,
however, will not exercise the to-be-proposed sysfs interface.
Yes, we can check/extend libvirt simultaneously to draw a
whole picture of all the required changes in the stack...

Thanks
Kevin


* Re: [RFC]Add new mdev interface for QoS
  2017-08-01 22:26       ` Alex Williamson
  2017-08-02  2:50         ` Tian, Kevin
@ 2017-08-02 10:19         ` Kirti Wankhede
  2017-08-02 12:59           ` Gao, Ping A
  1 sibling, 1 reply; 18+ messages in thread
From: Kirti Wankhede @ 2017-08-02 10:19 UTC (permalink / raw)
  To: Alex Williamson, Gao, Ping A
  Cc: kvm, linux-kernel, Tian, Kevin, Zhenyu Wang, Jike Song,
	libvir-list, zhi.a.wang



On 8/2/2017 3:56 AM, Alex Williamson wrote:
> On Tue, 1 Aug 2017 13:54:27 +0800
> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> 
>> On 2017/7/28 0:00, Gao, Ping A wrote:
>>> On 2017/7/27 0:43, Alex Williamson wrote:  
>>>> [cc +libvir-list]
>>>>
>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>  
>>>>> The vfio-mdev provide the capability to let different guest share the
>>>>> same physical device through mediate sharing, as result it bring a
>>>>> requirement about how to control the device sharing, we need a QoS
>>>>> related interface for mdev to management virtual device resource.
>>>>>
>>>>> E.g. In practical use, vGPUs assigned to different quests almost has
>>>>> different performance requirements, some guests may need higher priority
>>>>> for real time usage, some other may need more portion of the GPU
>>>>> resource to get higher 3D performance, corresponding we can define some
>>>>> interfaces like weight/cap for overall budget control, priority for
>>>>> single submission control.
>>>>>
>>>>> So I suggest to add some common attributes which are vendor agnostic in
>>>>> mdev core sysfs for QoS purpose.  
>>>> I think what you're asking for is just some standardization of a QoS
>>>> attribute_group which a vendor can optionally include within the
>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
>>>> transparently enable this, but it really only provides the standard,
>>>> all of the support code is left for the vendor.  I'm fine with that,
>>>> but of course the trouble with and sort of standardization is arriving
>>>> at an agreed upon standard.  Are there QoS knobs that are generic
>>>> across any mdev device type?  Are there others that are more specific
>>>> to vGPU?  Are there existing examples of this that we can steal their
>>>> specification?  
>>> Yes, you are right, standardization QoS knobs are exactly what I wanted.
>>> Only when it become a part of the mdev framework and libvirt, then QoS
>>> such critical feature can be leveraged by cloud usage. HW vendor only
>>> need to focus on the implementation of the corresponding QoS algorithm
>>> in their back-end driver.
>>>
>>> Vfio-mdev framework provide the capability to share the device that lack
>>> of HW virtualization support to guests, no matter the device type,
>>> mediated sharing actually is a time sharing multiplex method, from this
>>> point of view, QoS can be take as a generic way about how to control the
>>> time assignment for virtual mdev device that occupy HW. As result we can
>>> define QoS knob generic across any device type by this way. Even if HW
>>> has build in with some kind of QoS support, I think it's not a problem
>>> for back-end driver to convert mdev standard QoS definition to their
>>> specification to reach the same performance expectation. Seems there are
>>> no examples for us to follow, we need define it from scratch.
>>>
>>> I proposal universal QoS control interfaces like below:
>>>
>>> Cap: The cap limits the maximum percentage of time a mdev device can own
>>> physical device. e.g. cap=60, means mdev device cannot take over 60% of
>>> total physical resource.
>>>
>>> Weight: The weight define proportional control of the mdev device
>>> resource between guests, it’s orthogonal with Cap, to target load
>>> balancing. E.g. if guest 1 should take double mdev device resource
>>> compare with guest 2, need set weight ratio to 2:1.
>>>
>>> Priority: The guest who has higher priority will get execution first,
>>> target to some real time usage and speeding interactive response.
>>>
>>> Above QoS interfaces cover both overall budget control and single
>>> submission control. I will sent out detail design later once get aligned.  
>>
>> Hi Alex,
>> Any comments about the interface mentioned above?
> 
> Not really.
> 
> Kirti, are there any QoS knobs that would be interesting
> for NVIDIA devices?
> 

We have different types of vGPU for different QoS factors.

When an mdev device is created, its resources are allocated irrespective
of which VM/userspace app is going to use that mdev device. Any
parameter we add here should be tied to a particular mdev device and not
to the guest/app that is going to use it. 'Cap' and 'Priority' are
along that line. Not all mdev devices may need/use these parameters, so
these can be made optional interfaces.

In the above proposal, I'm not sure how 'Weight' would work for mdev
devices on the same physical device.

In the above example, "if guest 1 should take double mdev device
resource compare with guest 2", but what if guest 2 never boots? How
will you calculate resources then?

If libvirt/another toolstack decides to do smart allocation based on the
type name without taking the physical host device as input, guest 1 and
guest 2 might get mdev devices created on different physical devices.
Would the weighting matter then?

Thanks,
Kirti


> Implementing libvirt support at the same time might be an interesting
> exercise if we don't have a second user in the kernel to validate
> against.  We could at least have two communities reviewing the feature
> then.  Thanks,
> 
> Alex
> 


* Re: [RFC]Add new mdev interface for QoS
  2017-08-02 10:19         ` Kirti Wankhede
@ 2017-08-02 12:59           ` Gao, Ping A
  2017-08-02 15:46             ` Kirti Wankhede
  0 siblings, 1 reply; 18+ messages in thread
From: Gao, Ping A @ 2017-08-02 12:59 UTC (permalink / raw)
  To: Kirti Wankhede, Alex Williamson
  Cc: kvm, linux-kernel, Tian, Kevin, Zhenyu Wang, Jike Song,
	libvir-list, zhi.a.wang


On 2017/8/2 18:19, Kirti Wankhede wrote:
>
> On 8/2/2017 3:56 AM, Alex Williamson wrote:
>> On Tue, 1 Aug 2017 13:54:27 +0800
>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>
>>> On 2017/7/28 0:00, Gao, Ping A wrote:
>>>> On 2017/7/27 0:43, Alex Williamson wrote:  
>>>>> [cc +libvir-list]
>>>>>
>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>  
>>>>>> The vfio-mdev provide the capability to let different guest share the
>>>>>> same physical device through mediate sharing, as result it bring a
>>>>>> requirement about how to control the device sharing, we need a QoS
>>>>>> related interface for mdev to management virtual device resource.
>>>>>>
>>>>>> E.g. In practical use, vGPUs assigned to different quests almost has
>>>>>> different performance requirements, some guests may need higher priority
>>>>>> for real time usage, some other may need more portion of the GPU
>>>>>> resource to get higher 3D performance, corresponding we can define some
>>>>>> interfaces like weight/cap for overall budget control, priority for
>>>>>> single submission control.
>>>>>>
>>>>>> So I suggest to add some common attributes which are vendor agnostic in
>>>>>> mdev core sysfs for QoS purpose.  
>>>>> I think what you're asking for is just some standardization of a QoS
>>>>> attribute_group which a vendor can optionally include within the
>>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
>>>>> transparently enable this, but it really only provides the standard,
>>>>> all of the support code is left for the vendor.  I'm fine with that,
>>>>> but of course the trouble with and sort of standardization is arriving
>>>>> at an agreed upon standard.  Are there QoS knobs that are generic
>>>>> across any mdev device type?  Are there others that are more specific
>>>>> to vGPU?  Are there existing examples of this that we can steal their
>>>>> specification?  
>>>> Yes, you are right, standardization QoS knobs are exactly what I wanted.
>>>> Only when it become a part of the mdev framework and libvirt, then QoS
>>>> such critical feature can be leveraged by cloud usage. HW vendor only
>>>> need to focus on the implementation of the corresponding QoS algorithm
>>>> in their back-end driver.
>>>>
>>>> Vfio-mdev framework provide the capability to share the device that lack
>>>> of HW virtualization support to guests, no matter the device type,
>>>> mediated sharing actually is a time sharing multiplex method, from this
>>>> point of view, QoS can be take as a generic way about how to control the
>>>> time assignment for virtual mdev device that occupy HW. As result we can
>>>> define QoS knob generic across any device type by this way. Even if HW
>>>> has build in with some kind of QoS support, I think it's not a problem
>>>> for back-end driver to convert mdev standard QoS definition to their
>>>> specification to reach the same performance expectation. Seems there are
>>>> no examples for us to follow, we need define it from scratch.
>>>>
>>>> I proposal universal QoS control interfaces like below:
>>>>
>>>> Cap: The cap limits the maximum percentage of time a mdev device can own
>>>> physical device. e.g. cap=60, means mdev device cannot take over 60% of
>>>> total physical resource.
>>>>
>>>> Weight: The weight define proportional control of the mdev device
>>>> resource between guests, it’s orthogonal with Cap, to target load
>>>> balancing. E.g. if guest 1 should take double mdev device resource
>>>> compare with guest 2, need set weight ratio to 2:1.
>>>>
>>>> Priority: The guest who has higher priority will get execution first,
>>>> target to some real time usage and speeding interactive response.
>>>>
>>>> Above QoS interfaces cover both overall budget control and single
>>>> submission control. I will sent out detail design later once get aligned.  
>>> Hi Alex,
>>> Any comments about the interface mentioned above?
>> Not really.
>>
>> Kirti, are there any QoS knobs that would be interesting
>> for NVIDIA devices?
>>
> We have different types of vGPU for different QoS factors.
>
> When mdev devices are created, its resources are allocated irrespective
> of which VM/userspace app is going to use that mdev device. Any
> parameter we add here should be tied to particular mdev device and not
> to the guest/app that are going to use it. 'Cap' and 'Priority' are
> along that line. All mdev device might not need/use these parameters,
> these can be made optional interfaces.

We also define some QoS parameters in the Intel vGPU types, but that
only provides a default, fixed way. We still need a flexible approach
that gives users the ability to change QoS parameters freely and
dynamically according to their requirements, not restricted to the
current limited and static vGPU types.

> In the above proposal, I'm not sure how 'Weight' would work for mdev
> devices on same physical device.
>
> In the above example, "if guest 1 should take double mdev device
> resource compare with guest 2" but what if guest 2 never booted, how
> will you calculate resources?

Cap tries to limit the max physical GPU resource a vGPU can use; it's a
vertical limitation, while weight is a horizontal limitation that defines
the GPU resource consumption ratio between vGPUs. Cap is easy to
understand as it's just a percentage. For weight, for example, if we
define the max weight as 16, a vGPU_1 that gets weight 8 should be
assigned double the GPU resources compared to vGPU_2 whose weight is 4;
we can translate that into this formula:  resource_of_vGPU_1 = 8 / (8+4) *
total_physical_GPU_resource.
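
In code form the same calculation is simply (illustrative only, the
helper name is made up):

  /* percentage share of the physical GPU a vGPU gets from its weight,
   * relative to the sum of the weights of all active vGPUs */
  static unsigned int weight_to_share_pct(unsigned int weight,
                                          unsigned int total_weight)
  {
          return 100 * weight / total_weight;  /* e.g. 100 * 8 / (8 + 4) = 66 */
  }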

If only one guest exists, then there is no target to compare against;
weight becomes meaningless and the single guest enjoys the whole physical
GPU.

> If libvirt/other toolstack decides to do smart allocation based on type
> name without taking physical host device as input, guest 1 and guest 2
> might get mdev devices created on different physical device. Then would
> weightage matter here?

If you mean the case where two discrete GPU cards exist and the vGPU
types can be freely allocated on either of them, IMO the back-end driver
should handle such a case, as the number of physical devices is
transparent to the toolstack, e.g. by presenting multiple physical
devices as one logical device to mdev.

> Thanks,
> Kirti
>
>
>> Implementing libvirt support at the same time might be an interesting
>> exercise if we don't have a second user in the kernel to validate
>> against.  We could at least have two communities reviewing the feature
>> then.  Thanks,
>>
>> Alex
>>


* Re: [RFC]Add new mdev interface for QoS
  2017-08-02 12:59           ` Gao, Ping A
@ 2017-08-02 15:46             ` Kirti Wankhede
  2017-08-02 16:58               ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Kirti Wankhede @ 2017-08-02 15:46 UTC (permalink / raw)
  To: Gao, Ping A, Alex Williamson
  Cc: kvm, linux-kernel, Tian, Kevin, Zhenyu Wang, Jike Song,
	libvir-list, zhi.a.wang



On 8/2/2017 6:29 PM, Gao, Ping A wrote:
> 
> On 2017/8/2 18:19, Kirti Wankhede wrote:
>>
>> On 8/2/2017 3:56 AM, Alex Williamson wrote:
>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>
>>>> On 2017/7/28 0:00, Gao, Ping A wrote:
>>>>> On 2017/7/27 0:43, Alex Williamson wrote:  
>>>>>> [cc +libvir-list]
>>>>>>
>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>  
>>>>>>> The vfio-mdev provide the capability to let different guest share the
>>>>>>> same physical device through mediate sharing, as result it bring a
>>>>>>> requirement about how to control the device sharing, we need a QoS
>>>>>>> related interface for mdev to management virtual device resource.
>>>>>>>
>>>>>>> E.g. In practical use, vGPUs assigned to different quests almost has
>>>>>>> different performance requirements, some guests may need higher priority
>>>>>>> for real time usage, some other may need more portion of the GPU
>>>>>>> resource to get higher 3D performance, corresponding we can define some
>>>>>>> interfaces like weight/cap for overall budget control, priority for
>>>>>>> single submission control.
>>>>>>>
>>>>>>> So I suggest to add some common attributes which are vendor agnostic in
>>>>>>> mdev core sysfs for QoS purpose.  
>>>>>> I think what you're asking for is just some standardization of a QoS
>>>>>> attribute_group which a vendor can optionally include within the
>>>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
>>>>>> transparently enable this, but it really only provides the standard,
>>>>>> all of the support code is left for the vendor.  I'm fine with that,
>>>>>> but of course the trouble with and sort of standardization is arriving
>>>>>> at an agreed upon standard.  Are there QoS knobs that are generic
>>>>>> across any mdev device type?  Are there others that are more specific
>>>>>> to vGPU?  Are there existing examples of this that we can steal their
>>>>>> specification?  
>>>>> Yes, you are right, standardization QoS knobs are exactly what I wanted.
>>>>> Only when it become a part of the mdev framework and libvirt, then QoS
>>>>> such critical feature can be leveraged by cloud usage. HW vendor only
>>>>> need to focus on the implementation of the corresponding QoS algorithm
>>>>> in their back-end driver.
>>>>>
>>>>> Vfio-mdev framework provide the capability to share the device that lack
>>>>> of HW virtualization support to guests, no matter the device type,
>>>>> mediated sharing actually is a time sharing multiplex method, from this
>>>>> point of view, QoS can be take as a generic way about how to control the
>>>>> time assignment for virtual mdev device that occupy HW. As result we can
>>>>> define QoS knob generic across any device type by this way. Even if HW
>>>>> has build in with some kind of QoS support, I think it's not a problem
>>>>> for back-end driver to convert mdev standard QoS definition to their
>>>>> specification to reach the same performance expectation. Seems there are
>>>>> no examples for us to follow, we need define it from scratch.
>>>>>
>>>>> I proposal universal QoS control interfaces like below:
>>>>>
>>>>> Cap: The cap limits the maximum percentage of time a mdev device can own
>>>>> physical device. e.g. cap=60, means mdev device cannot take over 60% of
>>>>> total physical resource.
>>>>>
>>>>> Weight: The weight define proportional control of the mdev device
>>>>> resource between guests, it’s orthogonal with Cap, to target load
>>>>> balancing. E.g. if guest 1 should take double mdev device resource
>>>>> compare with guest 2, need set weight ratio to 2:1.
>>>>>
>>>>> Priority: The guest who has higher priority will get execution first,
>>>>> target to some real time usage and speeding interactive response.
>>>>>
>>>>> Above QoS interfaces cover both overall budget control and single
>>>>> submission control. I will sent out detail design later once get aligned.  
>>>> Hi Alex,
>>>> Any comments about the interface mentioned above?
>>> Not really.
>>>
>>> Kirti, are there any QoS knobs that would be interesting
>>> for NVIDIA devices?
>>>
>> We have different types of vGPU for different QoS factors.
>>
>> When mdev devices are created, its resources are allocated irrespective
>> of which VM/userspace app is going to use that mdev device. Any
>> parameter we add here should be tied to particular mdev device and not
>> to the guest/app that are going to use it. 'Cap' and 'Priority' are
>> along that line. All mdev device might not need/use these parameters,
>> these can be made optional interfaces.
> 
> We also define some QoS parameters in Intel vGPU types, but it only
> provided a default fool-style way. We still need a flexible approach
> that give user the ability to change QoS parameters freely and
> dynamically according to their requirement , not restrict to the current
> limited and static vGPU types.
> 
>> In the above proposal, I'm not sure how 'Weight' would work for mdev
>> devices on same physical device.
>>
>> In the above example, "if guest 1 should take double mdev device
>> resource compare with guest 2" but what if guest 2 never booted, how
>> will you calculate resources?
> 
> Cap is try to limit the max physical GPU resource for vGPU, it's a
> vertical limitation, but weight is a horizontal limitation that define
> the GPU resource consumption ratio between vGPUs. Cap is easy to
> understand as it's just a percentage. For weight. for example, if we
> define the max weight is 16, the vGPU_1 who get weight 8 should been
> assigned double GPU resources compared to the vGPU_2 whose weight is 4,
> we can translate it to this formula:  resource_of_vGPU_1 = 8 / (8+4) *
> total_physical_GPU_resource.
> 

How will the vendor driver provide the max weight to the userspace
application/libvirt? The max weight will be per physical device, right?

How would such resource allocation be reflected in 'available_instances'?
Suppose, in the above example, vGPU_1 has 1G FB with weight 8, vGPU_2 has
1G FB with weight 4 and vGPU_3 has 1G FB with weight 4. Now you have 1G
of FB free but you have reached the max weight, so will you make
available_instances = 0 for all types on that physical GPU?

> If there is only one guest exist, then there is no target to compare, 
> weight become meaningless and the single guest enjoy the whole physical GPU.
> 

If a single VM, say the one with vGPU_1, runs alone for a long time,
i.e. it enjoys the whole GPU, but then another VM boots with weight 4,
will you cut down the resources of vGPU_1 at runtime? Wouldn't that show
performance degradation for the VM with vGPU_1 at runtime?


>> If libvirt/other toolstack decides to do smart allocation based on type
>> name without taking physical host device as input, guest 1 and guest 2
>> might get mdev devices created on different physical device. Then would
>> weightage matter here?
> 
> What your mean if it's the case that there are two discrete GPU cards
> exist and the vGPU types can be freely allocated on them, IMO the
> back-end driver should handle such case, as the number of physical
> device is transparent to tool stack. e.g. present multi-physical device
> as a logic one to mdev.
> 

No, generally the toolstack is aware of the available physical devices
and it could have smart logic to decide on which physical device an mdev
device should be created, i.e. to fill one physical device first or to
distribute the load across physical devices as mdev devices are created.
Libvirt doesn't have such logic now, but having such logic in libvirt was
discussed earlier.
In that case, as I said above, wouldn't that show perf degradation on
running VMs at runtime?

Thanks,
Kirti

>> Thanks,
>> Kirti
>>
>>
>>> Implementing libvirt support at the same time might be an interesting
>>> exercise if we don't have a second user in the kernel to validate
>>> against.  We could at least have two communities reviewing the feature
>>> then.  Thanks,
>>>
>>> Alex
>>>
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC]Add new mdev interface for QoS
  2017-08-02 15:46             ` Kirti Wankhede
@ 2017-08-02 16:58               ` Alex Williamson
  2017-08-03 12:26                 ` Gao, Ping A
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2017-08-02 16:58 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Gao, Ping A, kvm, linux-kernel, Tian, Kevin, Zhenyu Wang,
	Jike Song, libvir-list, zhi.a.wang

On Wed, 2 Aug 2017 21:16:28 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 8/2/2017 6:29 PM, Gao, Ping A wrote:
> > 
> > On 2017/8/2 18:19, Kirti Wankhede wrote:  
> >>
> >> On 8/2/2017 3:56 AM, Alex Williamson wrote:  
> >>> On Tue, 1 Aug 2017 13:54:27 +0800
> >>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> >>>  
> >>>> On 2017/7/28 0:00, Gao, Ping A wrote:  
> >>>>> On 2017/7/27 0:43, Alex Williamson wrote:    
> >>>>>> [cc +libvir-list]
> >>>>>>
> >>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
> >>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> >>>>>>    
> >>>>>>> The vfio-mdev provide the capability to let different guest share the
> >>>>>>> same physical device through mediate sharing, as result it bring a
> >>>>>>> requirement about how to control the device sharing, we need a QoS
> >>>>>>> related interface for mdev to management virtual device resource.
> >>>>>>>
> >>>>>>> E.g. In practical use, vGPUs assigned to different quests almost has
> >>>>>>> different performance requirements, some guests may need higher priority
> >>>>>>> for real time usage, some other may need more portion of the GPU
> >>>>>>> resource to get higher 3D performance, corresponding we can define some
> >>>>>>> interfaces like weight/cap for overall budget control, priority for
> >>>>>>> single submission control.
> >>>>>>>
> >>>>>>> So I suggest to add some common attributes which are vendor agnostic in
> >>>>>>> mdev core sysfs for QoS purpose.    
> >>>>>> I think what you're asking for is just some standardization of a QoS
> >>>>>> attribute_group which a vendor can optionally include within the
> >>>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
> >>>>>> transparently enable this, but it really only provides the standard,
> >>>>>> all of the support code is left for the vendor.  I'm fine with that,
> >>>>>> but of course the trouble with and sort of standardization is arriving
> >>>>>> at an agreed upon standard.  Are there QoS knobs that are generic
> >>>>>> across any mdev device type?  Are there others that are more specific
> >>>>>> to vGPU?  Are there existing examples of this that we can steal their
> >>>>>> specification?    
> >>>>> Yes, you are right, standardization QoS knobs are exactly what I wanted.
> >>>>> Only when it become a part of the mdev framework and libvirt, then QoS
> >>>>> such critical feature can be leveraged by cloud usage. HW vendor only
> >>>>> need to focus on the implementation of the corresponding QoS algorithm
> >>>>> in their back-end driver.
> >>>>>
> >>>>> Vfio-mdev framework provide the capability to share the device that lack
> >>>>> of HW virtualization support to guests, no matter the device type,
> >>>>> mediated sharing actually is a time sharing multiplex method, from this
> >>>>> point of view, QoS can be take as a generic way about how to control the
> >>>>> time assignment for virtual mdev device that occupy HW. As result we can
> >>>>> define QoS knob generic across any device type by this way. Even if HW
> >>>>> has build in with some kind of QoS support, I think it's not a problem
> >>>>> for back-end driver to convert mdev standard QoS definition to their
> >>>>> specification to reach the same performance expectation. Seems there are
> >>>>> no examples for us to follow, we need define it from scratch.
> >>>>>
> >>>>> I proposal universal QoS control interfaces like below:
> >>>>>
> >>>>> Cap: The cap limits the maximum percentage of time a mdev device can own
> >>>>> physical device. e.g. cap=60, means mdev device cannot take over 60% of
> >>>>> total physical resource.
> >>>>>
> >>>>> Weight: The weight define proportional control of the mdev device
> >>>>> resource between guests, it’s orthogonal with Cap, to target load
> >>>>> balancing. E.g. if guest 1 should take double mdev device resource
> >>>>> compare with guest 2, need set weight ratio to 2:1.
> >>>>>
> >>>>> Priority: The guest who has higher priority will get execution first,
> >>>>> target to some real time usage and speeding interactive response.
> >>>>>
> >>>>> Above QoS interfaces cover both overall budget control and single
> >>>>> submission control. I will sent out detail design later once get aligned.    
> >>>> Hi Alex,
> >>>> Any comments about the interface mentioned above?  
> >>> Not really.
> >>>
> >>> Kirti, are there any QoS knobs that would be interesting
> >>> for NVIDIA devices?
> >>>  
> >> We have different types of vGPU for different QoS factors.
> >>
> >> When mdev devices are created, its resources are allocated irrespective
> >> of which VM/userspace app is going to use that mdev device. Any
> >> parameter we add here should be tied to particular mdev device and not
> >> to the guest/app that are going to use it. 'Cap' and 'Priority' are
> >> along that line. All mdev device might not need/use these parameters,
> >> these can be made optional interfaces.  
> > 
> > We also define some QoS parameters in Intel vGPU types, but it only
> > provided a default fool-style way. We still need a flexible approach
> > that give user the ability to change QoS parameters freely and
> > dynamically according to their requirement , not restrict to the current
> > limited and static vGPU types.
> >   
> >> In the above proposal, I'm not sure how 'Weight' would work for mdev
> >> devices on same physical device.
> >>
> >> In the above example, "if guest 1 should take double mdev device
> >> resource compare with guest 2" but what if guest 2 never booted, how
> >> will you calculate resources?  
> > 
> > Cap is try to limit the max physical GPU resource for vGPU, it's a
> > vertical limitation, but weight is a horizontal limitation that define
> > the GPU resource consumption ratio between vGPUs. Cap is easy to
> > understand as it's just a percentage. For weight. for example, if we
> > define the max weight is 16, the vGPU_1 who get weight 8 should been
> > assigned double GPU resources compared to the vGPU_2 whose weight is 4,
> > we can translate it to this formula:  resource_of_vGPU_1 = 8 / (8+4) *
> > total_physical_GPU_resource.
> >   
> 
> How will vendor driver provide max weight to userspace
> application/libvirt? Max weight will be per physical device, right?
> 
> How would such resource allocation reflect in 'available_instances'?
> Suppose in above example, vGPU_1 is of 1G FB with weight 8, vGPU_2 with
> 1G FB with weight 4 and vGPU_3 with 1G FB with weight 4. Now you have 1G
> FB free but you have reached max weight, so will you make
> available_instances = 0 for all types on that physical GPU?

No, per the algorithm above, the available scheduling for the remaining
mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
we'd need to define or make the range discoverable, 16 seems rather
arbitrary).  We can always add new scheduling participants.  AIUI,
Intel uses round-robin scheduling now, where you could consider all
mdev devices to have the same weight.  Whether we consider that to be a
weight of 16 or zero or 8 doesn't really matter.
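
To make that concrete, here is a rough userspace sketch of the share
calculation (the numbers and the compute_share() helper are purely
illustrative; this is not part of any existing mdev interface):

#include <stdio.h>

/* Hypothetical helper: share of total device time for one scheduling
 * participant, given its weight and the sum of all participants' weights. */
static double compute_share(unsigned int weight, unsigned int weight_sum)
{
        return (double)weight / weight_sum;
}

int main(void)
{
        /* three existing vGPUs (8, 4, 4) plus a new one with N = 16 */
        unsigned int weights[] = { 8, 4, 4, 16 };
        unsigned int sum = 0, i;

        for (i = 0; i < 4; i++)
                sum += weights[i];

        for (i = 0; i < 4; i++)
                printf("vGPU_%u share = %.2f%%\n", i + 1,
                       100.0 * compute_share(weights[i], sum));
        return 0;
}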

> > If there is only one guest exist, then there is no target to compare, 
> > weight become meaningless and the single guest enjoy the whole physical GPU.
> >   
> 
> If single VM is running for long time say vGPU_1, i.e. it enjoy whole
> GPU, but then other VM boots with weight 4, so you will cut down
> resources of vGPU_1 at runtime? Doesn't that would show performance
> degradation for VM with vGPU_1 at runtime?

Yes.  We have this already though, vGPU_1 may enjoy the whole GPU
simply because the other vGPUs are idle, that can change at any time
and may reduce the resources available to vGPU_1.  Do we want a QoS
knob for fixed scheduling slices?  With only cap, weight, and priority,
how could I provide an SLA for no less than 40% of the GPU?  I guess we
can get that with careful use of weight, but I wonder if we could make
it more simple for users.

> >> If libvirt/other toolstack decides to do smart allocation based on type
> >> name without taking physical host device as input, guest 1 and guest 2
> >> might get mdev devices created on different physical device. Then would
> >> weightage matter here?  
> > 
> > What your mean if it's the case that there are two discrete GPU cards
> > exist and the vGPU types can be freely allocated on them, IMO the
> > back-end driver should handle such case, as the number of physical
> > device is transparent to tool stack. e.g. present multi-physical device
> > as a logic one to mdev.
> >   
> 
> No, generally toolstack is aware of available physical devices and it
> could have smart logic to decide on which physical device mdev device
> should be created, i.e. to load one physical device first or to
> distribute the load across physical devices when mdev devices are
> created. Libvirt don't have such logic now, but it was discussed earlier
> about having such logic in libvirt.
> Then in that case as I said above doesn't that would show perf
> degradation on running VMs at runtime?

It seems that the proposed cap, weight, and priority only handle QoS
within a single parent device.  All the knobs are relative to other
scheduling participants on that parent device.  The same QoS parameters
for mdev devices on separate parent devices could have wildly different
performance characteristics depending on the load the other mdev
devices are inflicting.  If there's only one such parent device on the
system, this works.  libvirt has already effectively rejected the idea
of automating mdev placement and perhaps this is another similar case
where we simply require some higher level management tool to have a
global view of the system.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC]Add new mdev interface for QoS
  2017-08-02 16:58               ` Alex Williamson
@ 2017-08-03 12:26                 ` Gao, Ping A
  2017-08-03 21:11                   ` Alex Williamson
  0 siblings, 1 reply; 18+ messages in thread
From: Gao, Ping A @ 2017-08-03 12:26 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: kvm, linux-kernel, Tian, Kevin, Zhenyu Wang, Jike Song,
	libvir-list, zhi.a.wang


On 2017/8/3 0:58, Alex Williamson wrote:
> On Wed, 2 Aug 2017 21:16:28 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>
>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:
>>> On 2017/8/2 18:19, Kirti Wankhede wrote:  
>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:  
>>>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>  
>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:  
>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:    
>>>>>>>> [cc +libvir-list]
>>>>>>>>
>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>>>    
>>>>>>>>> The vfio-mdev provide the capability to let different guest share the
>>>>>>>>> same physical device through mediate sharing, as result it bring a
>>>>>>>>> requirement about how to control the device sharing, we need a QoS
>>>>>>>>> related interface for mdev to management virtual device resource.
>>>>>>>>>
>>>>>>>>> E.g. In practical use, vGPUs assigned to different quests almost has
>>>>>>>>> different performance requirements, some guests may need higher priority
>>>>>>>>> for real time usage, some other may need more portion of the GPU
>>>>>>>>> resource to get higher 3D performance, corresponding we can define some
>>>>>>>>> interfaces like weight/cap for overall budget control, priority for
>>>>>>>>> single submission control.
>>>>>>>>>
>>>>>>>>> So I suggest to add some common attributes which are vendor agnostic in
>>>>>>>>> mdev core sysfs for QoS purpose.    
>>>>>>>> I think what you're asking for is just some standardization of a QoS
>>>>>>>> attribute_group which a vendor can optionally include within the
>>>>>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
>>>>>>>> transparently enable this, but it really only provides the standard,
>>>>>>>> all of the support code is left for the vendor.  I'm fine with that,
>>>>>>>> but of course the trouble with and sort of standardization is arriving
>>>>>>>> at an agreed upon standard.  Are there QoS knobs that are generic
>>>>>>>> across any mdev device type?  Are there others that are more specific
>>>>>>>> to vGPU?  Are there existing examples of this that we can steal their
>>>>>>>> specification?    
>>>>>>> Yes, you are right, standardization QoS knobs are exactly what I wanted.
>>>>>>> Only when it become a part of the mdev framework and libvirt, then QoS
>>>>>>> such critical feature can be leveraged by cloud usage. HW vendor only
>>>>>>> need to focus on the implementation of the corresponding QoS algorithm
>>>>>>> in their back-end driver.
>>>>>>>
>>>>>>> Vfio-mdev framework provide the capability to share the device that lack
>>>>>>> of HW virtualization support to guests, no matter the device type,
>>>>>>> mediated sharing actually is a time sharing multiplex method, from this
>>>>>>> point of view, QoS can be take as a generic way about how to control the
>>>>>>> time assignment for virtual mdev device that occupy HW. As result we can
>>>>>>> define QoS knob generic across any device type by this way. Even if HW
>>>>>>> has build in with some kind of QoS support, I think it's not a problem
>>>>>>> for back-end driver to convert mdev standard QoS definition to their
>>>>>>> specification to reach the same performance expectation. Seems there are
>>>>>>> no examples for us to follow, we need define it from scratch.
>>>>>>>
>>>>>>> I proposal universal QoS control interfaces like below:
>>>>>>>
>>>>>>> Cap: The cap limits the maximum percentage of time a mdev device can own
>>>>>>> physical device. e.g. cap=60, means mdev device cannot take over 60% of
>>>>>>> total physical resource.
>>>>>>>
>>>>>>> Weight: The weight define proportional control of the mdev device
>>>>>>> resource between guests, it’s orthogonal with Cap, to target load
>>>>>>> balancing. E.g. if guest 1 should take double mdev device resource
>>>>>>> compare with guest 2, need set weight ratio to 2:1.
>>>>>>>
>>>>>>> Priority: The guest who has higher priority will get execution first,
>>>>>>> target to some real time usage and speeding interactive response.
>>>>>>>
>>>>>>> Above QoS interfaces cover both overall budget control and single
>>>>>>> submission control. I will sent out detail design later once get aligned.    
>>>>>> Hi Alex,
>>>>>> Any comments about the interface mentioned above?  
>>>>> Not really.
>>>>>
>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>> for NVIDIA devices?
>>>>>  
>>>> We have different types of vGPU for different QoS factors.
>>>>
>>>> When mdev devices are created, its resources are allocated irrespective
>>>> of which VM/userspace app is going to use that mdev device. Any
>>>> parameter we add here should be tied to particular mdev device and not
>>>> to the guest/app that are going to use it. 'Cap' and 'Priority' are
>>>> along that line. All mdev device might not need/use these parameters,
>>>> these can be made optional interfaces.  
>>> We also define some QoS parameters in Intel vGPU types, but it only
>>> provided a default fool-style way. We still need a flexible approach
>>> that give user the ability to change QoS parameters freely and
>>> dynamically according to their requirement , not restrict to the current
>>> limited and static vGPU types.
>>>   
>>>> In the above proposal, I'm not sure how 'Weight' would work for mdev
>>>> devices on same physical device.
>>>>
>>>> In the above example, "if guest 1 should take double mdev device
>>>> resource compare with guest 2" but what if guest 2 never booted, how
>>>> will you calculate resources?  
>>> Cap is try to limit the max physical GPU resource for vGPU, it's a
>>> vertical limitation, but weight is a horizontal limitation that define
>>> the GPU resource consumption ratio between vGPUs. Cap is easy to
>>> understand as it's just a percentage. For weight. for example, if we
>>> define the max weight is 16, the vGPU_1 who get weight 8 should been
>>> assigned double GPU resources compared to the vGPU_2 whose weight is 4,
>>> we can translate it to this formula:  resource_of_vGPU_1 = 8 / (8+4) *
>>> total_physical_GPU_resource.
>>>   
>> How will vendor driver provide max weight to userspace
>> application/libvirt? Max weight will be per physical device, right?
>>
>> How would such resource allocation reflect in 'available_instances'?
>> Suppose in above example, vGPU_1 is of 1G FB with weight 8, vGPU_2 with
>> 1G FB with weight 4 and vGPU_3 with 1G FB with weight 4. Now you have 1G
>> FB free but you have reached max weight, so will you make
>> available_instances = 0 for all types on that physical GPU?
> No, per the algorithm above, the available scheduling for the remaining
> mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
> we'd need to define or make the range discoverable, 16 seems rather
> arbitrary).  We can always add new scheduling participants.  AIUI,
> Intel uses round-robin scheduling now, where you could consider all
> mdev devices to have the same weight.  Whether we consider that to be a
> weight of 16 or zero or 8 doesn't really matter.

QoS is meant to control the device's processing capability, like GPU
rendering/computing, which can be time-multiplexed; it is not used to
control dedicated partitioned resources like FB, so there is no impact
on 'available_instances'.

if vGPU_1 weight=8, vGPU_2 weight=4;
then vGPU_1_res = 8 / (8 + 4) * total,  vGPU_2_res = 4 / (8 + 4) * total;
if vGPU_3 created with weight 2;
then vGPU_1_res = 8 /(8 + 4 + 2) * total, vGPU_2_res = 4 / (8 + 4 + 2) *
total, vGPU_3_res = 2 / (8 + 4 + 2) * total.

The resource allocation of vGPU_1 and vGPU_2 changes dynamically after
vGPU_3 is created; that is exactly what weight does, as it defines the
relationship between all the vGPUs, so the performance degradation is
expected. The end-user should know about such behavior.

However, the argument about weight prompted some self-reflection: does
the end-user really need weight? Is there an actual application
requirement for it? Maybe cap and priority are enough?

>>> If there is only one guest exist, then there is no target to compare, 
>>> weight become meaningless and the single guest enjoy the whole physical GPU.
>>>   
>> If single VM is running for long time say vGPU_1, i.e. it enjoy whole
>> GPU, but then other VM boots with weight 4, so you will cut down
>> resources of vGPU_1 at runtime? Doesn't that would show performance
>> degradation for VM with vGPU_1 at runtime?
> Yes.  We have this already though, vGPU_1 may enjoy the whole GPU
> simply because the other vGPUs are idle, that can change at any time
> and may reduce the resources available to vGPU_1.  Do we want a QoS
> knob for fixed scheduling slices?  With only cap, weight, and priority,
> how could I provide an SLA for no less than 40% of the GPU?  I guess we
> can get that with careful use of weight, but I wonder if we could make
> it more simple for users.
>
>>>> If libvirt/other toolstack decides to do smart allocation based on type
>>>> name without taking physical host device as input, guest 1 and guest 2
>>>> might get mdev devices created on different physical device. Then would
>>>> weightage matter here?  
>>> What your mean if it's the case that there are two discrete GPU cards
>>> exist and the vGPU types can be freely allocated on them, IMO the
>>> back-end driver should handle such case, as the number of physical
>>> device is transparent to tool stack. e.g. present multi-physical device
>>> as a logic one to mdev.
>>>   
>> No, generally toolstack is aware of available physical devices and it
>> could have smart logic to decide on which physical device mdev device
>> should be created, i.e. to load one physical device first or to
>> distribute the load across physical devices when mdev devices are
>> created. Libvirt don't have such logic now, but it was discussed earlier
>> about having such logic in libvirt.
>> Then in that case as I said above doesn't that would show perf
>> degradation on running VMs at runtime?
> It seems that the proposed cap, weight, and priority only handle QoS
> within a single parent device.  All the knobs are relative to other
> scheduling participants on that parent device.  The same QoS parameters
> for mdev devices on separate parent devices could have wildly different
> performance characteristics depending on the load the other mdev
> devices are inflicting.  If there's only one such parent device on the
> system, this works.  libvirt has already effectively rejected the idea
> of automating mdev placement and perhaps this is another similar case
> where we simply require some higher level management tool to have a
> global view of the system.  Thanks,

Yeah, QoS only tries to handle a single parent device. For the
multi-device case we need to define the management at a higher level.

Thanks,
Ping

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC]Add new mdev interface for QoS
  2017-08-03 12:26                 ` Gao, Ping A
@ 2017-08-03 21:11                   ` Alex Williamson
  2017-08-07  7:41                     ` Gao, Ping A
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2017-08-03 21:11 UTC (permalink / raw)
  To: Gao, Ping A
  Cc: Kirti Wankhede, kvm, linux-kernel, Tian, Kevin, Zhenyu Wang,
	Jike Song, libvir-list, zhi.a.wang

On Thu, 3 Aug 2017 20:26:14 +0800
"Gao, Ping A" <ping.a.gao@intel.com> wrote:

> On 2017/8/3 0:58, Alex Williamson wrote:
> > On Wed, 2 Aug 2017 21:16:28 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >  
> >> On 8/2/2017 6:29 PM, Gao, Ping A wrote:  
> >>> On 2017/8/2 18:19, Kirti Wankhede wrote:    
> >>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:    
> >>>>> On Tue, 1 Aug 2017 13:54:27 +0800
> >>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> >>>>>    
> >>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:    
> >>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:      
> >>>>>>>> [cc +libvir-list]
> >>>>>>>>
> >>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
> >>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
> >>>>>>>>      
> >>>>>>>>> The vfio-mdev provide the capability to let different guest share the
> >>>>>>>>> same physical device through mediate sharing, as result it bring a
> >>>>>>>>> requirement about how to control the device sharing, we need a QoS
> >>>>>>>>> related interface for mdev to management virtual device resource.
> >>>>>>>>>
> >>>>>>>>> E.g. In practical use, vGPUs assigned to different quests almost has
> >>>>>>>>> different performance requirements, some guests may need higher priority
> >>>>>>>>> for real time usage, some other may need more portion of the GPU
> >>>>>>>>> resource to get higher 3D performance, corresponding we can define some
> >>>>>>>>> interfaces like weight/cap for overall budget control, priority for
> >>>>>>>>> single submission control.
> >>>>>>>>>
> >>>>>>>>> So I suggest to add some common attributes which are vendor agnostic in
> >>>>>>>>> mdev core sysfs for QoS purpose.      
> >>>>>>>> I think what you're asking for is just some standardization of a QoS
> >>>>>>>> attribute_group which a vendor can optionally include within the
> >>>>>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
> >>>>>>>> transparently enable this, but it really only provides the standard,
> >>>>>>>> all of the support code is left for the vendor.  I'm fine with that,
> >>>>>>>> but of course the trouble with and sort of standardization is arriving
> >>>>>>>> at an agreed upon standard.  Are there QoS knobs that are generic
> >>>>>>>> across any mdev device type?  Are there others that are more specific
> >>>>>>>> to vGPU?  Are there existing examples of this that we can steal their
> >>>>>>>> specification?      
> >>>>>>> Yes, you are right, standardization QoS knobs are exactly what I wanted.
> >>>>>>> Only when it become a part of the mdev framework and libvirt, then QoS
> >>>>>>> such critical feature can be leveraged by cloud usage. HW vendor only
> >>>>>>> need to focus on the implementation of the corresponding QoS algorithm
> >>>>>>> in their back-end driver.
> >>>>>>>
> >>>>>>> Vfio-mdev framework provide the capability to share the device that lack
> >>>>>>> of HW virtualization support to guests, no matter the device type,
> >>>>>>> mediated sharing actually is a time sharing multiplex method, from this
> >>>>>>> point of view, QoS can be take as a generic way about how to control the
> >>>>>>> time assignment for virtual mdev device that occupy HW. As result we can
> >>>>>>> define QoS knob generic across any device type by this way. Even if HW
> >>>>>>> has build in with some kind of QoS support, I think it's not a problem
> >>>>>>> for back-end driver to convert mdev standard QoS definition to their
> >>>>>>> specification to reach the same performance expectation. Seems there are
> >>>>>>> no examples for us to follow, we need define it from scratch.
> >>>>>>>
> >>>>>>> I proposal universal QoS control interfaces like below:
> >>>>>>>
> >>>>>>> Cap: The cap limits the maximum percentage of time a mdev device can own
> >>>>>>> physical device. e.g. cap=60, means mdev device cannot take over 60% of
> >>>>>>> total physical resource.
> >>>>>>>
> >>>>>>> Weight: The weight define proportional control of the mdev device
> >>>>>>> resource between guests, it’s orthogonal with Cap, to target load
> >>>>>>> balancing. E.g. if guest 1 should take double mdev device resource
> >>>>>>> compare with guest 2, need set weight ratio to 2:1.
> >>>>>>>
> >>>>>>> Priority: The guest who has higher priority will get execution first,
> >>>>>>> target to some real time usage and speeding interactive response.
> >>>>>>>
> >>>>>>> Above QoS interfaces cover both overall budget control and single
> >>>>>>> submission control. I will sent out detail design later once get aligned.      
> >>>>>> Hi Alex,
> >>>>>> Any comments about the interface mentioned above?    
> >>>>> Not really.
> >>>>>
> >>>>> Kirti, are there any QoS knobs that would be interesting
> >>>>> for NVIDIA devices?
> >>>>>    
> >>>> We have different types of vGPU for different QoS factors.
> >>>>
> >>>> When mdev devices are created, its resources are allocated irrespective
> >>>> of which VM/userspace app is going to use that mdev device. Any
> >>>> parameter we add here should be tied to particular mdev device and not
> >>>> to the guest/app that are going to use it. 'Cap' and 'Priority' are
> >>>> along that line. All mdev device might not need/use these parameters,
> >>>> these can be made optional interfaces.    
> >>> We also define some QoS parameters in Intel vGPU types, but it only
> >>> provided a default fool-style way. We still need a flexible approach
> >>> that give user the ability to change QoS parameters freely and
> >>> dynamically according to their requirement , not restrict to the current
> >>> limited and static vGPU types.
> >>>     
> >>>> In the above proposal, I'm not sure how 'Weight' would work for mdev
> >>>> devices on same physical device.
> >>>>
> >>>> In the above example, "if guest 1 should take double mdev device
> >>>> resource compare with guest 2" but what if guest 2 never booted, how
> >>>> will you calculate resources?    
> >>> Cap is try to limit the max physical GPU resource for vGPU, it's a
> >>> vertical limitation, but weight is a horizontal limitation that define
> >>> the GPU resource consumption ratio between vGPUs. Cap is easy to
> >>> understand as it's just a percentage. For weight. for example, if we
> >>> define the max weight is 16, the vGPU_1 who get weight 8 should been
> >>> assigned double GPU resources compared to the vGPU_2 whose weight is 4,
> >>> we can translate it to this formula:  resource_of_vGPU_1 = 8 / (8+4) *
> >>> total_physical_GPU_resource.
> >>>     
> >> How will vendor driver provide max weight to userspace
> >> application/libvirt? Max weight will be per physical device, right?
> >>
> >> How would such resource allocation reflect in 'available_instances'?
> >> Suppose in above example, vGPU_1 is of 1G FB with weight 8, vGPU_2 with
> >> 1G FB with weight 4 and vGPU_3 with 1G FB with weight 4. Now you have 1G
> >> FB free but you have reached max weight, so will you make
> >> available_instances = 0 for all types on that physical GPU?  
> > No, per the algorithm above, the available scheduling for the remaining
> > mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
> > we'd need to define or make the range discoverable, 16 seems rather
> > arbitrary).  We can always add new scheduling participants.  AIUI,
> > Intel uses round-robin scheduling now, where you could consider all
> > mdev devices to have the same weight.  Whether we consider that to be a
> > weight of 16 or zero or 8 doesn't really matter.  
> 
> QoS is to control the device's process capability like GPU
> rendering/computing that can be time multiplexing, not used to control
> the dedicated partition resources like FB, so there is no impact on
> 'available_instances'.
> 
> if vGPU_1 weight=8, vGPU_2 weight=4;
> then vGPU_1_res = 8 / (8 + 4) * total,  vGPU_2_res = 4 / (8 + 4) * total;
> if vGPU_3 created with weight 2;
> then vGPU_1_res = 8 /(8 + 4 + 2) * total, vGPU_2_res = 4 / (8 + 4 + 2) *
> total, vGPU_3_res = 2 / (8 + 4 + 2) * total.
> 
> The resource allocation of vGPU_1 and vGPU_2 have been dynamically
> changed after vGPU_3 creating, that's weight doing as it's to define the
> relationship of all the vGPUs, the performance degradation is meet
> expectation. The end-user should know about such behavior.
> 
> However the argument on weight let me has some self-reflection, does the
> end-user real need weight? does weight has actually application
> requirement?  Maybe the cap and priority are enough?

What sort of SLAs do you want to be able to offer?  For instance if I
want to be able to offer a GPU in 1/4 increments, how does that work?
I might sell customers A & B 1/4 increment each and customer C a 1/2
increment.  If weight is removed, can we do better than capping A & B
at 25% each and C at 50%?  That has the downside that nobody gets to
use the unused capacity of the other clients.  The SLA is some sort of
"up to X% (and no more)" model.  With weighting it's as simple as making
sure customer C's vGPU has twice the weight of that given to A or B.
Then you get an "at least X%" SLA model and any customer can use up to
100% if the others are idle.  Combining weight and cap, we can do "at
least X%, but no more than Y%".
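
Purely as an illustrative sketch (none of these numbers or names come
from an existing interface), the weight gives the guaranteed floor
relative to the other participants and the cap clips the ceiling:

#include <stdio.h>

struct vgpu_qos {
        const char *name;
        unsigned int weight;    /* relative share vs. other participants */
        unsigned int cap;       /* hard ceiling, percent of the device */
};

int main(void)
{
        /* illustrative numbers: A and B sold as 1/4, C as 1/2 */
        struct vgpu_qos v[] = {
                { "A", 1, 40 },
                { "B", 1, 40 },
                { "C", 2, 80 },
        };
        unsigned int sum = 0, i;

        for (i = 0; i < 3; i++)
                sum += v[i].weight;

        for (i = 0; i < 3; i++)
                printf("%s: at least %.0f%%, no more than %u%%\n",
                       v[i].name, 100.0 * v[i].weight / sum, v[i].cap);
        return 0;
}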

All of this feels really similar to how cpusets must work since we're
just dealing with QoS relative to scheduling and we should not try to
reinvent scheduling QoS.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC]Add new mdev interface for QoS
  2017-08-03 21:11                   ` Alex Williamson
@ 2017-08-07  7:41                     ` Gao, Ping A
  2017-08-08  6:42                       ` Kirti Wankhede
  0 siblings, 1 reply; 18+ messages in thread
From: Gao, Ping A @ 2017-08-07  7:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, kvm, linux-kernel, Tian, Kevin, Zhenyu Wang,
	Jike Song, libvir-list, zhi.a.wang


On 2017/8/4 5:11, Alex Williamson wrote:
> On Thu, 3 Aug 2017 20:26:14 +0800
> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>
>> On 2017/8/3 0:58, Alex Williamson wrote:
>>> On Wed, 2 Aug 2017 21:16:28 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>  
>>>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:  
>>>>> On 2017/8/2 18:19, Kirti Wankhede wrote:    
>>>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:    
>>>>>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>>    
>>>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:    
>>>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:      
>>>>>>>>>> [cc +libvir-list]
>>>>>>>>>>
>>>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>>>>>      
>>>>>>>>>>> The vfio-mdev provide the capability to let different guest share the
>>>>>>>>>>> same physical device through mediate sharing, as result it bring a
>>>>>>>>>>> requirement about how to control the device sharing, we need a QoS
>>>>>>>>>>> related interface for mdev to management virtual device resource.
>>>>>>>>>>>
>>>>>>>>>>> E.g. In practical use, vGPUs assigned to different quests almost has
>>>>>>>>>>> different performance requirements, some guests may need higher priority
>>>>>>>>>>> for real time usage, some other may need more portion of the GPU
>>>>>>>>>>> resource to get higher 3D performance, corresponding we can define some
>>>>>>>>>>> interfaces like weight/cap for overall budget control, priority for
>>>>>>>>>>> single submission control.
>>>>>>>>>>>
>>>>>>>>>>> So I suggest to add some common attributes which are vendor agnostic in
>>>>>>>>>>> mdev core sysfs for QoS purpose.      
>>>>>>>>>> I think what you're asking for is just some standardization of a QoS
>>>>>>>>>> attribute_group which a vendor can optionally include within the
>>>>>>>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
>>>>>>>>>> transparently enable this, but it really only provides the standard,
>>>>>>>>>> all of the support code is left for the vendor.  I'm fine with that,
>>>>>>>>>> but of course the trouble with and sort of standardization is arriving
>>>>>>>>>> at an agreed upon standard.  Are there QoS knobs that are generic
>>>>>>>>>> across any mdev device type?  Are there others that are more specific
>>>>>>>>>> to vGPU?  Are there existing examples of this that we can steal their
>>>>>>>>>> specification?      
>>>>>>>>> Yes, you are right, standardization QoS knobs are exactly what I wanted.
>>>>>>>>> Only when it become a part of the mdev framework and libvirt, then QoS
>>>>>>>>> such critical feature can be leveraged by cloud usage. HW vendor only
>>>>>>>>> need to focus on the implementation of the corresponding QoS algorithm
>>>>>>>>> in their back-end driver.
>>>>>>>>>
>>>>>>>>> Vfio-mdev framework provide the capability to share the device that lack
>>>>>>>>> of HW virtualization support to guests, no matter the device type,
>>>>>>>>> mediated sharing actually is a time sharing multiplex method, from this
>>>>>>>>> point of view, QoS can be take as a generic way about how to control the
>>>>>>>>> time assignment for virtual mdev device that occupy HW. As result we can
>>>>>>>>> define QoS knob generic across any device type by this way. Even if HW
>>>>>>>>> has build in with some kind of QoS support, I think it's not a problem
>>>>>>>>> for back-end driver to convert mdev standard QoS definition to their
>>>>>>>>> specification to reach the same performance expectation. Seems there are
>>>>>>>>> no examples for us to follow, we need define it from scratch.
>>>>>>>>>
>>>>>>>>> I proposal universal QoS control interfaces like below:
>>>>>>>>>
>>>>>>>>> Cap: The cap limits the maximum percentage of time a mdev device can own
>>>>>>>>> physical device. e.g. cap=60, means mdev device cannot take over 60% of
>>>>>>>>> total physical resource.
>>>>>>>>>
>>>>>>>>> Weight: The weight define proportional control of the mdev device
>>>>>>>>> resource between guests, it’s orthogonal with Cap, to target load
>>>>>>>>> balancing. E.g. if guest 1 should take double mdev device resource
>>>>>>>>> compare with guest 2, need set weight ratio to 2:1.
>>>>>>>>>
>>>>>>>>> Priority: The guest who has higher priority will get execution first,
>>>>>>>>> target to some real time usage and speeding interactive response.
>>>>>>>>>
>>>>>>>>> Above QoS interfaces cover both overall budget control and single
>>>>>>>>> submission control. I will sent out detail design later once get aligned.      
>>>>>>>> Hi Alex,
>>>>>>>> Any comments about the interface mentioned above?    
>>>>>>> Not really.
>>>>>>>
>>>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>>>> for NVIDIA devices?
>>>>>>>    
>>>>>> We have different types of vGPU for different QoS factors.
>>>>>>
>>>>>> When mdev devices are created, its resources are allocated irrespective
>>>>>> of which VM/userspace app is going to use that mdev device. Any
>>>>>> parameter we add here should be tied to particular mdev device and not
>>>>>> to the guest/app that are going to use it. 'Cap' and 'Priority' are
>>>>>> along that line. All mdev device might not need/use these parameters,
>>>>>> these can be made optional interfaces.    
>>>>> We also define some QoS parameters in Intel vGPU types, but it only
>>>>> provided a default fool-style way. We still need a flexible approach
>>>>> that give user the ability to change QoS parameters freely and
>>>>> dynamically according to their requirement , not restrict to the current
>>>>> limited and static vGPU types.
>>>>>     
>>>>>> In the above proposal, I'm not sure how 'Weight' would work for mdev
>>>>>> devices on same physical device.
>>>>>>
>>>>>> In the above example, "if guest 1 should take double mdev device
>>>>>> resource compare with guest 2" but what if guest 2 never booted, how
>>>>>> will you calculate resources?    
>>>>> Cap is try to limit the max physical GPU resource for vGPU, it's a
>>>>> vertical limitation, but weight is a horizontal limitation that define
>>>>> the GPU resource consumption ratio between vGPUs. Cap is easy to
>>>>> understand as it's just a percentage. For weight. for example, if we
>>>>> define the max weight is 16, the vGPU_1 who get weight 8 should been
>>>>> assigned double GPU resources compared to the vGPU_2 whose weight is 4,
>>>>> we can translate it to this formula:  resource_of_vGPU_1 = 8 / (8+4) *
>>>>> total_physical_GPU_resource.
>>>>>     
>>>> How will vendor driver provide max weight to userspace
>>>> application/libvirt? Max weight will be per physical device, right?
>>>>
>>>> How would such resource allocation reflect in 'available_instances'?
>>>> Suppose in above example, vGPU_1 is of 1G FB with weight 8, vGPU_2 with
>>>> 1G FB with weight 4 and vGPU_3 with 1G FB with weight 4. Now you have 1G
>>>> FB free but you have reached max weight, so will you make
>>>> available_instances = 0 for all types on that physical GPU?  
>>> No, per the algorithm above, the available scheduling for the remaining
>>> mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
>>> we'd need to define or make the range discoverable, 16 seems rather
>>> arbitrary).  We can always add new scheduling participants.  AIUI,
>>> Intel uses round-robin scheduling now, where you could consider all
>>> mdev devices to have the same weight.  Whether we consider that to be a
>>> weight of 16 or zero or 8 doesn't really matter.  
>> QoS is to control the device's process capability like GPU
>> rendering/computing that can be time multiplexing, not used to control
>> the dedicated partition resources like FB, so there is no impact on
>> 'available_instances'.
>>
>> if vGPU_1 weight=8, vGPU_2 weight=4;
>> then vGPU_1_res = 8 / (8 + 4) * total,  vGPU_2_res = 4 / (8 + 4) * total;
>> if vGPU_3 created with weight 2;
>> then vGPU_1_res = 8 /(8 + 4 + 2) * total, vGPU_2_res = 4 / (8 + 4 + 2) *
>> total, vGPU_3_res = 2 / (8 + 4 + 2) * total.
>>
>> The resource allocation of vGPU_1 and vGPU_2 have been dynamically
>> changed after vGPU_3 creating, that's weight doing as it's to define the
>> relationship of all the vGPUs, the performance degradation is meet
>> expectation. The end-user should know about such behavior.
>>
>> However the argument on weight let me has some self-reflection, does the
>> end-user real need weight? does weight has actually application
>> requirement?  Maybe the cap and priority are enough?
> What sort of SLAs do you want to be able to offer?  For instance if I
> want to be able to offer a GPU in 1/4 increments, how does that work?
> I might sell customers A & B 1/4 increment each and customer C a 1/2
> increment.  If weight is removed, can we do better than capping A & B
> at 25% each and C at 50%?  That has the downside that nobody gets to
> use the unused capacity of the other clients.  The SLA is some sort of
> "up to X% (and no more)" model.  With weighting it's as simple as making
> sure customer C's vGPU has twice the weight of that given to A or B.
> Then you get an "at least X%" SLA model and any customer can use up to
> 100% if the others are idle.  Combining weight and cap, we can do "at
> least X%, but no more than Y%".
>
> All of this feels really similar to how cpusets must work since we're
> just dealing with QoS relative to scheduling and we should not try to
> reinvent scheduling QoS.  Thanks,
>

Yeah, those were also my original thoughts.
Since we are aligned on the basic QoS definition, I'm going to prepare
the code on the kernel side. How about the corresponding part in
libvirt? Should it be implemented separately after the kernel interface
is finalized?
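
For discussion, here is a very rough sketch of what a per-mdev sysfs
knob could look like; the attribute/group names, the my_vgpu_qos struct
and the drvdata-based lookup are all assumptions on my side, not the
actual patch:

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>

/* Sketch only: names and the per-vGPU state below are placeholders. */
struct my_vgpu_qos {
        unsigned int cap;       /* max percentage of device time, 0-100 */
        unsigned int weight;    /* relative share vs. other mdevs */
        unsigned int priority;  /* scheduling priority */
};

static struct my_vgpu_qos *qos_from_dev(struct device *dev)
{
        /* Assumption: vendor driver keeps its per-vGPU state in drvdata. */
        return dev_get_drvdata(dev);
}

static ssize_t cap_show(struct device *dev, struct device_attribute *attr,
                        char *buf)
{
        return sprintf(buf, "%u\n", qos_from_dev(dev)->cap);
}

static ssize_t cap_store(struct device *dev, struct device_attribute *attr,
                         const char *buf, size_t count)
{
        unsigned int val;

        if (kstrtouint(buf, 0, &val) || val > 100)
                return -EINVAL;
        qos_from_dev(dev)->cap = val;   /* vendor reprograms its scheduler */
        return count;
}
static DEVICE_ATTR_RW(cap);

/* weight/priority would follow the same pattern. */
static struct attribute *qos_attrs[] = {
        &dev_attr_cap.attr,
        NULL,
};

static const struct attribute_group qos_attr_group = {
        .name = "qos",          /* e.g. <mdev>/qos/cap; naming is open */
        .attrs = qos_attrs,
};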

Thanks,
Ping

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC]Add new mdev interface for QoS
  2017-08-07  7:41                     ` Gao, Ping A
@ 2017-08-08  6:42                       ` Kirti Wankhede
  2017-08-08 12:48                         ` Gao, Ping A
  0 siblings, 1 reply; 18+ messages in thread
From: Kirti Wankhede @ 2017-08-08  6:42 UTC (permalink / raw)
  To: Gao, Ping A, Alex Williamson
  Cc: kvm, linux-kernel, Tian, Kevin, Zhenyu Wang, Jike Song,
	libvir-list, zhi.a.wang



On 8/7/2017 1:11 PM, Gao, Ping A wrote:
> 
> On 2017/8/4 5:11, Alex Williamson wrote:
>> On Thu, 3 Aug 2017 20:26:14 +0800
>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>
>>> On 2017/8/3 0:58, Alex Williamson wrote:
>>>> On Wed, 2 Aug 2017 21:16:28 +0530
>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>  
>>>>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:  
>>>>>> On 2017/8/2 18:19, Kirti Wankhede wrote:    
>>>>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:    
>>>>>>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>>>    
>>>>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:    
>>>>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:      
>>>>>>>>>>> [cc +libvir-list]
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>>>>>>      
>>>>>>>>>>>> The vfio-mdev provide the capability to let different guest share the
>>>>>>>>>>>> same physical device through mediate sharing, as result it bring a
>>>>>>>>>>>> requirement about how to control the device sharing, we need a QoS
>>>>>>>>>>>> related interface for mdev to management virtual device resource.
>>>>>>>>>>>>
>>>>>>>>>>>> E.g. In practical use, vGPUs assigned to different quests almost has
>>>>>>>>>>>> different performance requirements, some guests may need higher priority
>>>>>>>>>>>> for real time usage, some other may need more portion of the GPU
>>>>>>>>>>>> resource to get higher 3D performance, corresponding we can define some
>>>>>>>>>>>> interfaces like weight/cap for overall budget control, priority for
>>>>>>>>>>>> single submission control.
>>>>>>>>>>>>
>>>>>>>>>>>> So I suggest to add some common attributes which are vendor agnostic in
>>>>>>>>>>>> mdev core sysfs for QoS purpose.      
>>>>>>>>>>> I think what you're asking for is just some standardization of a QoS
>>>>>>>>>>> attribute_group which a vendor can optionally include within the
>>>>>>>>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
>>>>>>>>>>> transparently enable this, but it really only provides the standard,
>>>>>>>>>>> all of the support code is left for the vendor.  I'm fine with that,
>>>>>>>>>>> but of course the trouble with and sort of standardization is arriving
>>>>>>>>>>> at an agreed upon standard.  Are there QoS knobs that are generic
>>>>>>>>>>> across any mdev device type?  Are there others that are more specific
>>>>>>>>>>> to vGPU?  Are there existing examples of this that we can steal their
>>>>>>>>>>> specification?      
>>>>>>>>>> Yes, you are right, standardization QoS knobs are exactly what I wanted.
>>>>>>>>>> Only when it become a part of the mdev framework and libvirt, then QoS
>>>>>>>>>> such critical feature can be leveraged by cloud usage. HW vendor only
>>>>>>>>>> need to focus on the implementation of the corresponding QoS algorithm
>>>>>>>>>> in their back-end driver.
>>>>>>>>>>
>>>>>>>>>> Vfio-mdev framework provide the capability to share the device that lack
>>>>>>>>>> of HW virtualization support to guests, no matter the device type,
>>>>>>>>>> mediated sharing actually is a time sharing multiplex method, from this
>>>>>>>>>> point of view, QoS can be take as a generic way about how to control the
>>>>>>>>>> time assignment for virtual mdev device that occupy HW. As result we can
>>>>>>>>>> define QoS knob generic across any device type by this way. Even if HW
>>>>>>>>>> has build in with some kind of QoS support, I think it's not a problem
>>>>>>>>>> for back-end driver to convert mdev standard QoS definition to their
>>>>>>>>>> specification to reach the same performance expectation. Seems there are
>>>>>>>>>> no examples for us to follow, we need define it from scratch.
>>>>>>>>>>
>>>>>>>>>> I proposal universal QoS control interfaces like below:
>>>>>>>>>>
>>>>>>>>>> Cap: The cap limits the maximum percentage of time a mdev device can own
>>>>>>>>>> physical device. e.g. cap=60, means mdev device cannot take over 60% of
>>>>>>>>>> total physical resource.
>>>>>>>>>>
>>>>>>>>>> Weight: The weight define proportional control of the mdev device
>>>>>>>>>> resource between guests, it’s orthogonal with Cap, to target load
>>>>>>>>>> balancing. E.g. if guest 1 should take double mdev device resource
>>>>>>>>>> compare with guest 2, need set weight ratio to 2:1.
>>>>>>>>>>
>>>>>>>>>> Priority: The guest who has higher priority will get execution first,
>>>>>>>>>> target to some real time usage and speeding interactive response.
>>>>>>>>>>
>>>>>>>>>> Above QoS interfaces cover both overall budget control and single
>>>>>>>>>> submission control. I will sent out detail design later once get aligned.      
>>>>>>>>> Hi Alex,
>>>>>>>>> Any comments about the interface mentioned above?    
>>>>>>>> Not really.
>>>>>>>>
>>>>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>>>>> for NVIDIA devices?
>>>>>>>>    
>>>>>>> We have different types of vGPU for different QoS factors.
>>>>>>>
>>>>>>> When mdev devices are created, its resources are allocated irrespective
>>>>>>> of which VM/userspace app is going to use that mdev device. Any
>>>>>>> parameter we add here should be tied to particular mdev device and not
>>>>>>> to the guest/app that are going to use it. 'Cap' and 'Priority' are
>>>>>>> along that line. All mdev device might not need/use these parameters,
>>>>>>> these can be made optional interfaces.    
>>>>>> We also define some QoS parameters in Intel vGPU types, but it only
>>>>>> provided a default fool-style way. We still need a flexible approach
>>>>>> that give user the ability to change QoS parameters freely and
>>>>>> dynamically according to their requirement , not restrict to the current
>>>>>> limited and static vGPU types.
>>>>>>     
>>>>>>> In the above proposal, I'm not sure how 'Weight' would work for mdev
>>>>>>> devices on same physical device.
>>>>>>>
>>>>>>> In the above example, "if guest 1 should take double mdev device
>>>>>>> resource compare with guest 2" but what if guest 2 never booted, how
>>>>>>> will you calculate resources?    
>>>>>> Cap is try to limit the max physical GPU resource for vGPU, it's a
>>>>>> vertical limitation, but weight is a horizontal limitation that define
>>>>>> the GPU resource consumption ratio between vGPUs. Cap is easy to
>>>>>> understand as it's just a percentage. For weight. for example, if we
>>>>>> define the max weight is 16, the vGPU_1 who get weight 8 should been
>>>>>> assigned double GPU resources compared to the vGPU_2 whose weight is 4,
>>>>>> we can translate it to this formula:  resource_of_vGPU_1 = 8 / (8+4) *
>>>>>> total_physical_GPU_resource.
>>>>>>     
>>>>> How will vendor driver provide max weight to userspace
>>>>> application/libvirt? Max weight will be per physical device, right?
>>>>>
>>>>> How would such resource allocation reflect in 'available_instances'?
>>>>> Suppose in above example, vGPU_1 is of 1G FB with weight 8, vGPU_2 with
>>>>> 1G FB with weight 4 and vGPU_3 with 1G FB with weight 4. Now you have 1G
>>>>> FB free but you have reached max weight, so will you make
>>>>> available_instances = 0 for all types on that physical GPU?  
>>>> No, per the algorithm above, the available scheduling for the remaining
>>>> mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
>>>> we'd need to define or make the range discoverable, 16 seems rather
>>>> arbitrary).  We can always add new scheduling participants.  AIUI,
>>>> Intel uses round-robin scheduling now, where you could consider all
>>>> mdev devices to have the same weight.  Whether we consider that to be a
>>>> weight of 16 or zero or 8 doesn't really matter.  
>>> QoS is to control the device's process capability like GPU
>>> rendering/computing that can be time multiplexing, not used to control
>>> the dedicated partition resources like FB, so there is no impact on
>>> 'available_instances'.
>>>
>>> if vGPU_1 weight=8, vGPU_2 weight=4;
>>> then vGPU_1_res = 8 / (8 + 4) * total,  vGPU_2_res = 4 / (8 + 4) * total;
>>> if vGPU_3 created with weight 2;
>>> then vGPU_1_res = 8 /(8 + 4 + 2) * total, vGPU_2_res = 4 / (8 + 4 + 2) *
>>> total, vGPU_3_res = 2 / (8 + 4 + 2) * total.
>>>
>>> The resource allocation of vGPU_1 and vGPU_2 have been dynamically
>>> changed after vGPU_3 creating, that's weight doing as it's to define the
>>> relationship of all the vGPUs, the performance degradation is meet
>>> expectation. The end-user should know about such behavior.
>>>
>>> However the argument on weight let me has some self-reflection, does the
>>> end-user real need weight? does weight has actually application
>>> requirement?  Maybe the cap and priority are enough?
>> What sort of SLAs do you want to be able to offer?  For instance if I
>> want to be able to offer a GPU in 1/4 increments, how does that work?
>> I might sell customers A & B 1/4 increment each and customer C a 1/2
>> increment.  If weight is removed, can we do better than capping A & B
>> at 25% each and C at 50%?  That has the downside that nobody gets to
>> use the unused capacity of the other clients.  The SLA is some sort of
>> "up to X% (and no more)" model.  With weighting it's as simple as making
>> sure customer C's vGPU has twice the weight of that given to A or B.
>> Then you get an "at least X%" SLA model and any customer can use up to
>> 100% if the others are idle.  Combining weight and cap, we can do "at
>> least X%, but no more than Y%".
>>
>> All of this feels really similar to how cpusets must work since we're
>> just dealing with QoS relative to scheduling and we should not try to
>> reinvent scheduling QoS.  Thanks,
>>
> 
> Yeah, that's also my original thoughts.
> Since we get aligned about the QoS basic definition, I'm going to
> prepare the code in kernel side. How about the corresponding part in
> libvirt? Implemented separately after the kernel interface finalizing?
> 

Ok. These interfaces should be optional since not all vendor drivers of
mdev may support such QoS.
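
As a sketch of how that optionality might look, a vendor that
implements QoS would just list the (standardized) group in its existing
mdev_attr_groups and other vendors would leave it out; my_qos_attr_group,
my_type_groups, my_create and my_remove are placeholders here, not real
symbols:

/* Sketch only: every "my_*" symbol below is a placeholder. */
static const struct attribute_group *my_mdev_attr_groups[] = {
        &my_qos_attr_group,     /* include only if the vendor supports QoS */
        NULL,
};

static const struct mdev_parent_ops my_parent_ops = {
        .owner                  = THIS_MODULE,
        .mdev_attr_groups       = my_mdev_attr_groups,  /* optional */
        .supported_type_groups  = my_type_groups,
        .create                 = my_create,
        .remove                 = my_remove,
};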

Thanks,
Kirti.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC]Add new mdev interface for QoS
  2017-08-08  6:42                       ` Kirti Wankhede
@ 2017-08-08 12:48                         ` Gao, Ping A
  0 siblings, 0 replies; 18+ messages in thread
From: Gao, Ping A @ 2017-08-08 12:48 UTC (permalink / raw)
  To: Kirti Wankhede, Alex Williamson
  Cc: kvm, linux-kernel, Tian, Kevin, Zhenyu Wang, Jike Song,
	libvir-list, zhi.a.wang


On 2017/8/8 14:42, Kirti Wankhede wrote:
>
> On 8/7/2017 1:11 PM, Gao, Ping A wrote:
>> On 2017/8/4 5:11, Alex Williamson wrote:
>>> On Thu, 3 Aug 2017 20:26:14 +0800
>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>
>>>> On 2017/8/3 0:58, Alex Williamson wrote:
>>>>> On Wed, 2 Aug 2017 21:16:28 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>  
>>>>>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:  
>>>>>>> On 2017/8/2 18:19, Kirti Wankhede wrote:    
>>>>>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:    
>>>>>>>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>>>>    
>>>>>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:    
>>>>>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:      
>>>>>>>>>>>> [cc +libvir-list]
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>>>>>>>      
>>>>>>>>>>>>> The vfio-mdev provide the capability to let different guest share the
>>>>>>>>>>>>> same physical device through mediate sharing, as result it bring a
>>>>>>>>>>>>> requirement about how to control the device sharing, we need a QoS
>>>>>>>>>>>>> related interface for mdev to management virtual device resource.
>>>>>>>>>>>>>
>>>>>>>>>>>>> E.g. In practical use, vGPUs assigned to different quests almost has
>>>>>>>>>>>>> different performance requirements, some guests may need higher priority
>>>>>>>>>>>>> for real time usage, some other may need more portion of the GPU
>>>>>>>>>>>>> resource to get higher 3D performance, corresponding we can define some
>>>>>>>>>>>>> interfaces like weight/cap for overall budget control, priority for
>>>>>>>>>>>>> single submission control.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I suggest to add some common attributes which are vendor agnostic in
>>>>>>>>>>>>> mdev core sysfs for QoS purpose.      
>>>>>>>>>>>> I think what you're asking for is just some standardization of a QoS
>>>>>>>>>>>> attribute_group which a vendor can optionally include within the
>>>>>>>>>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
>>>>>>>>>>>> transparently enable this, but it really only provides the standard,
>>>>>>>>>>>> all of the support code is left for the vendor.  I'm fine with that,
>>>>>>>>>>>> but of course the trouble with and sort of standardization is arriving
>>>>>>>>>>>> at an agreed upon standard.  Are there QoS knobs that are generic
>>>>>>>>>>>> across any mdev device type?  Are there others that are more specific
>>>>>>>>>>>> to vGPU?  Are there existing examples of this that we can steal their
>>>>>>>>>>>> specification?      
>>>>>>>>>>> Yes, you are right, standardization QoS knobs are exactly what I wanted.
>>>>>>>>>>> Only when it become a part of the mdev framework and libvirt, then QoS
>>>>>>>>>>> such critical feature can be leveraged by cloud usage. HW vendor only
>>>>>>>>>>> need to focus on the implementation of the corresponding QoS algorithm
>>>>>>>>>>> in their back-end driver.
>>>>>>>>>>>
>>>>>>>>>>> Vfio-mdev framework provide the capability to share the device that lack
>>>>>>>>>>> of HW virtualization support to guests, no matter the device type,
>>>>>>>>>>> mediated sharing actually is a time sharing multiplex method, from this
>>>>>>>>>>> point of view, QoS can be take as a generic way about how to control the
>>>>>>>>>>> time assignment for virtual mdev device that occupy HW. As result we can
>>>>>>>>>>> define QoS knob generic across any device type by this way. Even if HW
>>>>>>>>>>> has build in with some kind of QoS support, I think it's not a problem
>>>>>>>>>>> for back-end driver to convert mdev standard QoS definition to their
>>>>>>>>>>> specification to reach the same performance expectation. Seems there are
>>>>>>>>>>> no examples for us to follow, we need define it from scratch.
>>>>>>>>>>>
>>>>>>>>>>> I propose universal QoS control interfaces like the below:
>>>>>>>>>>>
>>>>>>>>>>> Cap: The cap limits the maximum percentage of time a mdev device
>>>>>>>>>>> can own the physical device, e.g. cap=60 means the mdev device
>>>>>>>>>>> cannot take more than 60% of the total physical resource.
>>>>>>>>>>>
>>>>>>>>>>> Weight: The weight defines proportional control of the mdev device
>>>>>>>>>>> resource between guests; it's orthogonal to Cap and targets load
>>>>>>>>>>> balancing. E.g. if guest 1 should take double the mdev device
>>>>>>>>>>> resource compared with guest 2, set the weight ratio to 2:1.
>>>>>>>>>>>
>>>>>>>>>>> Priority: The guest with higher priority gets execution first,
>>>>>>>>>>> targeting real-time usage and speeding up interactive response.
>>>>>>>>>>>
>>>>>>>>>>> The above QoS interfaces cover both overall budget control and
>>>>>>>>>>> single submission control. I will send out a detailed design later
>>>>>>>>>>> once we get aligned.
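[As a rough illustration of how such knobs could be surfaced, the sketch below
shows a vendor driver publishing an optional "qos" sysfs attribute group and
hooking it into mdev_parent_ops.mdev_attr_groups, along the lines suggested
earlier in the thread. The struct my_vgpu, its fields, the value ranges, and
the use of drvdata for the per-device lookup are assumptions made for the
example, not part of any existing driver; only the cap attribute is spelled
out, weight and priority would follow the same pattern.]

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>
#include <linux/mdev.h>

/* Hypothetical per-mdev state kept by the vendor driver. */
struct my_vgpu {
	unsigned int cap;	/* max % of device time, 0-100 */
	unsigned int weight;	/* relative share */
	unsigned int priority;	/* higher value is scheduled first */
};

static ssize_t cap_show(struct device *dev, struct device_attribute *attr,
			char *buf)
{
	struct my_vgpu *vgpu = dev_get_drvdata(dev);	/* assumed stashed here */

	return sprintf(buf, "%u\n", vgpu->cap);
}

static ssize_t cap_store(struct device *dev, struct device_attribute *attr,
			 const char *buf, size_t count)
{
	struct my_vgpu *vgpu = dev_get_drvdata(dev);
	unsigned int val;

	if (kstrtouint(buf, 0, &val) || val > 100)
		return -EINVAL;
	vgpu->cap = val;	/* a real driver would also update its scheduler */
	return count;
}
static DEVICE_ATTR_RW(cap);

/* weight and priority attributes would be defined the same way */

static struct attribute *qos_attrs[] = {
	&dev_attr_cap.attr,
	NULL,
};

static const struct attribute_group qos_attr_group = {
	.name  = "qos",		/* appears as qos/cap under the mdev device */
	.attrs = qos_attrs,
};

static const struct attribute_group *my_mdev_groups[] = {
	&qos_attr_group,
	NULL,
};

/* ... and in the vendor's mdev_parent_ops:
 *	.mdev_attr_groups = my_mdev_groups,
 */

[Userspace or libvirt would then be able to tune a device with something like
"echo 60 > /sys/bus/mdev/devices/<uuid>/qos/cap", with the enforcement left
entirely to the vendor driver.]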
>>>>>>>>>> Hi Alex,
>>>>>>>>>> Any comments about the interface mentioned above?    
>>>>>>>>> Not really.
>>>>>>>>>
>>>>>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>>>>>> for NVIDIA devices?
>>>>>>>>>    
>>>>>>>> We have different types of vGPU for different QoS factors.
>>>>>>>>
>>>>>>>> When mdev devices are created, their resources are allocated
>>>>>>>> irrespective of which VM/userspace app is going to use that mdev
>>>>>>>> device. Any parameter we add here should be tied to a particular
>>>>>>>> mdev device and not to the guest/app that is going to use it. 'Cap'
>>>>>>>> and 'Priority' are along that line. Not all mdev devices might
>>>>>>>> need/use these parameters, so they can be made optional interfaces.
>>>>>>> We also define some QoS parameters in the Intel vGPU types, but that
>>>>>>> only provides a simplistic default. We still need a flexible approach
>>>>>>> that gives the user the ability to change QoS parameters freely and
>>>>>>> dynamically according to their requirements, not restricted to the
>>>>>>> current limited and static vGPU types.
>>>>>>>     
>>>>>>>> In the above proposal, I'm not sure how 'Weight' would work for mdev
>>>>>>>> devices on the same physical device.
>>>>>>>>
>>>>>>>> In the above example, "if guest 1 should take double the mdev device
>>>>>>>> resource compared with guest 2", but what if guest 2 is never booted,
>>>>>>>> how will you calculate resources?
>>>>>>> Cap tries to limit the max physical GPU resource a vGPU can use; it's
>>>>>>> a vertical limitation. Weight is a horizontal limitation that defines
>>>>>>> the GPU resource consumption ratio between vGPUs. Cap is easy to
>>>>>>> understand as it's just a percentage. For weight, for example, if we
>>>>>>> define the max weight as 16, vGPU_1 with weight 8 should be assigned
>>>>>>> double the GPU resources compared to vGPU_2 whose weight is 4; we can
>>>>>>> translate that into this formula:  resource_of_vGPU_1 = 8 / (8 + 4) *
>>>>>>> total_physical_GPU_resource.
>>>>>>>     
>>>>>> How will the vendor driver provide the max weight to the userspace
>>>>>> application/libvirt? The max weight will be per physical device, right?
>>>>>>
>>>>>> How would such resource allocation be reflected in 'available_instances'?
>>>>>> Suppose in the above example, vGPU_1 has 1G of FB with weight 8, vGPU_2
>>>>>> has 1G of FB with weight 4 and vGPU_3 has 1G of FB with weight 4. Now
>>>>>> you have 1G of FB free but you have reached the max weight, so will you
>>>>>> make available_instances = 0 for all types on that physical GPU?
>>>>> No, per the algorithm above, the available scheduling for the remaining
>>>>> mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
>>>>> we'd need to define or make the range discoverable, 16 seems rather
>>>>> arbitrary).  We can always add new scheduling participants.  AIUI,
>>>>> Intel uses round-robin scheduling now, where you could consider all
>>>>> mdev devices to have the same weight.  Whether we consider that to be a
>>>>> weight of 16 or zero or 8 doesn't really matter.  
>>>> QoS is meant to control the device's processing capability, like GPU
>>>> rendering/computing, which can be time-multiplexed; it is not used to
>>>> control dedicated partitioned resources like FB, so there is no impact
>>>> on 'available_instances'.
>>>>
>>>> if vGPU_1 weight=8, vGPU_2 weight=4;
>>>> then vGPU_1_res = 8 / (8 + 4) * total,  vGPU_2_res = 4 / (8 + 4) * total;
>>>> if vGPU_3 created with weight 2;
>>>> then vGPU_1_res = 8 /(8 + 4 + 2) * total, vGPU_2_res = 4 / (8 + 4 + 2) *
>>>> total, vGPU_3_res = 2 / (8 + 4 + 2) * total.
>>>>
>>>> The resource allocation of vGPU_1 and vGPU_2 changes dynamically after
>>>> vGPU_3 is created; that is what weight does, since it defines the
>>>> relationship between all the vGPUs, and the performance degradation
>>>> meets expectation. The end-user should know about such behavior.
>>>>
>>>> However, the argument about weight gives me some pause for
>>>> self-reflection: does the end-user really need weight? Is there an
>>>> actual application requirement for it? Maybe cap and priority are
>>>> enough?
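[To make the redistribution described above concrete, a small stand-alone C
sketch (illustrative only, with made-up weights) that recomputes each vGPU's
share as weight_i / sum(weights) when a new vGPU joins:]

#include <stdio.h>

static void print_shares(const unsigned int *w, int n)
{
	unsigned int sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += w[i];
	for (i = 0; i < n; i++)
		printf("vGPU_%d: %.1f%% of the device\n",
		       i + 1, 100.0 * w[i] / sum);
}

int main(void)
{
	unsigned int weights[] = { 8, 4, 2 };

	print_shares(weights, 2);	/* vGPU_1 66.7%, vGPU_2 33.3% */
	print_shares(weights, 3);	/* 57.1%, 28.6%, 14.3% once vGPU_3 joins */
	return 0;
}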
>>> What sort of SLAs do you want to be able to offer?  For instance if I
>>> want to be able to offer a GPU in 1/4 increments, how does that work?
>>> I might sell customers A & B 1/4 increment each and customer C a 1/2
>>> increment.  If weight is removed, can we do better than capping A & B
>>> at 25% each and C at 50%?  That has the downside that nobody gets to
>>> use the unused capacity of the other clients.  The SLA is some sort of
>>> "up to X% (and no more)" model.  With weighting it's as simple as making
>>> sure customer C's vGPU has twice the weight of that given to A or B.
>>> Then you get an "at least X%" SLA model and any customer can use up to
>>> 100% if the others are idle.  Combining weight and cap, we can do "at
>>> least X%, but no more than Y%".
>>>
>>> All of this feels really similar to how cpusets must work since we're
>>> just dealing with QoS relative to scheduling and we should not try to
>>> reinvent scheduling QoS.  Thanks,
>>>
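[For the SLA example above, the guaranteed "at least" share comes from the
weights while the cap gives the "no more than" ceiling. A throwaway C sketch
with made-up numbers for customers A, B and C, illustrating the combined
bounds rather than any actual scheduler:]

#include <stdio.h>

struct vgpu {
	const char *name;
	unsigned int weight;	/* relative share */
	unsigned int cap;	/* hard ceiling, percent */
};

int main(void)
{
	struct vgpu v[] = {
		{ "A", 4, 40 },		/* sold as a 1/4 GPU, capped at 40% */
		{ "B", 4, 40 },
		{ "C", 8, 100 },	/* sold as a 1/2 GPU, uncapped */
	};
	unsigned int sum = 0;
	int i;

	for (i = 0; i < 3; i++)
		sum += v[i].weight;
	for (i = 0; i < 3; i++)
		printf("%s: at least %.1f%% when all are busy, at most %u%%\n",
		       v[i].name, 100.0 * v[i].weight / sum, v[i].cap);
	return 0;
}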
>> Yeah, those are also my original thoughts.
>> Since we are aligned on the basic QoS definition, I'm going to prepare
>> the code on the kernel side. How about the corresponding part in
>> libvirt? Should it be implemented separately after the kernel interface
>> is finalized?
>>
> OK. These interfaces should be optional, since not all vendor drivers
> for mdev may support such QoS.
>

Sure, all of them are optional; one is free to choose them or not.

Thanks,
Ping

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-08-08 12:49 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-26 13:16 [RFC]Add new mdev interface for QoS Gao, Ping A
2017-07-26 16:43 ` Alex Williamson
2017-07-27 16:00   ` Gao, Ping A
2017-08-01  5:54     ` Gao, Ping A
2017-08-01 22:26       ` Alex Williamson
2017-08-02  2:50         ` Tian, Kevin
2017-08-02 10:19         ` Kirti Wankhede
2017-08-02 12:59           ` Gao, Ping A
2017-08-02 15:46             ` Kirti Wankhede
2017-08-02 16:58               ` Alex Williamson
2017-08-03 12:26                 ` Gao, Ping A
2017-08-03 21:11                   ` Alex Williamson
2017-08-07  7:41                     ` Gao, Ping A
2017-08-08  6:42                       ` Kirti Wankhede
2017-08-08 12:48                         ` Gao, Ping A
2017-07-27 16:17   ` [libvirt] " Daniel P. Berrange
2017-07-27 18:01     ` Alex Williamson
2017-07-28  8:10       ` Daniel P. Berrange
