Re: [RFC]Add new mdev interface for QoS

From: "Gao, Ping A" <ping.a.gao@intel.com>
To: Alex Williamson <alex.williamson@redhat.com>,
	Kirti Wankhede <kwankhede@nvidia.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, "Tian,
	Kevin" <kevin.tian@intel.com>,
	Zhenyu Wang <zhenyuw@linux.intel.com>,
	Jike Song <jike.song@intel.com>,
	libvir-list@redhat.com, zhi.a.wang@intel.com
Subject: Re: [RFC]Add new mdev interface for QoS
Date: Thu, 3 Aug 2017 20:26:14 +0800
Message-ID: <ebebd457-cae1-61e2-7a84-20d07029a78f@intel.com>
In-Reply-To: <20170802105845.717ecf5f@w520.home>

On 2017/8/3 0:58, Alex Williamson wrote:
> On Wed, 2 Aug 2017 21:16:28 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>
>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:
>>> On 2017/8/2 18:19, Kirti Wankhede wrote:  
>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:  
>>>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>  
>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:  
>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:    
>>>>>>>> [cc +libvir-list]
>>>>>>>>
>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>> "Gao, Ping A" <ping.a.gao@intel.com> wrote:
>>>>>>>>    
>>>>>>>>> vfio-mdev provides the capability to let different guests share the
>>>>>>>>> same physical device through mediated sharing. As a result, it brings
>>>>>>>>> a requirement to control the device sharing: we need a QoS-related
>>>>>>>>> interface for mdev to manage virtual device resources.
>>>>>>>>>
>>>>>>>>> E.g. in practical use, vGPUs assigned to different guests almost
>>>>>>>>> always have different performance requirements. Some guests may need
>>>>>>>>> higher priority for real-time usage, while others may need a larger
>>>>>>>>> portion of the GPU resource to get higher 3D performance.
>>>>>>>>> Correspondingly, we can define some interfaces like weight/cap for
>>>>>>>>> overall budget control and priority for single submission control.
>>>>>>>>>
>>>>>>>>> So I suggest adding some common, vendor-agnostic attributes to the
>>>>>>>>> mdev core sysfs for QoS purposes.
>>>>>>>> I think what you're asking for is just some standardization of a QoS
>>>>>>>> attribute_group which a vendor can optionally include within the
>>>>>>>> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
>>>>>>>> transparently enable this, but it really only provides the standard,
>>>>>>>> all of the support code is left for the vendor.  I'm fine with that,
>>>>>>>> but of course the trouble with any sort of standardization is arriving
>>>>>>>> at an agreed upon standard.  Are there QoS knobs that are generic
>>>>>>>> across any mdev device type?  Are there others that are more specific
>>>>>>>> to vGPU?  Are there existing examples of this that we can steal their
>>>>>>>> specification?    
>>>>>>> Yes, you are right, standardized QoS knobs are exactly what I wanted.
>>>>>>> Only when QoS becomes part of the mdev framework and libvirt can such
>>>>>>> a critical feature be leveraged by cloud usage. HW vendors then only
>>>>>>> need to focus on implementing the corresponding QoS algorithm in
>>>>>>> their back-end drivers.
>>>>>>>
>>>>>>> The vfio-mdev framework provides the capability to share a device that
>>>>>>> lacks HW virtualization support among guests. No matter the device
>>>>>>> type, mediated sharing is essentially a time-sharing multiplexing
>>>>>>> method. From this point of view, QoS can be taken as a generic way to
>>>>>>> control the time assigned to each virtual mdev device that occupies
>>>>>>> the HW. As a result, we can define QoS knobs that are generic across
>>>>>>> any device type. Even if the HW has some kind of built-in QoS support,
>>>>>>> I think it's not a problem for the back-end driver to convert the
>>>>>>> mdev standard QoS definition to its own specification to reach the
>>>>>>> same performance expectation. There seem to be no examples for us to
>>>>>>> follow, so we need to define it from scratch.
>>>>>>>
>>>>>>> I propose universal QoS control interfaces like below:
>>>>>>>
>>>>>>> Cap: The cap limits the maximum percentage of time an mdev device can
>>>>>>> own the physical device. E.g. cap=60 means the mdev device cannot take
>>>>>>> more than 60% of the total physical resource.
>>>>>>>
>>>>>>> Weight: The weight defines proportional control of the mdev device
>>>>>>> resource between guests. It's orthogonal to Cap and targets load
>>>>>>> balancing. E.g. if guest 1 should get double the mdev device resource
>>>>>>> compared with guest 2, set the weight ratio to 2:1.
>>>>>>>
>>>>>>> Priority: The guest with higher priority gets executed first. This
>>>>>>> targets real-time usage and faster interactive response.
>>>>>>>
>>>>>>> The above QoS interfaces cover both overall budget control and single
>>>>>>> submission control. I will send out a detailed design later once we
>>>>>>> are aligned.
>>>>>> Hi Alex,
>>>>>> Any comments about the interface mentioned above?  
>>>>> Not really.
>>>>>
>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>> for NVIDIA devices?
>>>>>  
>>>> We have different types of vGPU for different QoS factors.
>>>>
>>>> When mdev devices are created, their resources are allocated
>>>> irrespective of which VM/userspace app is going to use that mdev
>>>> device. Any parameter we add here should be tied to a particular mdev
>>>> device and not to the guest/app that is going to use it. 'Cap' and
>>>> 'Priority' are along that line. Not all mdev devices might need/use
>>>> these parameters, so these can be made optional interfaces.
>>> We also define some QoS parameters in Intel vGPU types, but they only
>>> provide a simple, fixed default. We still need a flexible approach
>>> that gives users the ability to change QoS parameters freely and
>>> dynamically according to their requirements, not restricted to the
>>> current limited and static vGPU types.
>>>   
>>>> In the above proposal, I'm not sure how 'Weight' would work for mdev
>>>> devices on same physical device.
>>>>
>>>> In the above example, "if guest 1 should take double mdev device
>>>> resource compare with guest 2" but what if guest 2 never booted, how
>>>> will you calculate resources?  
>>> Cap tries to limit the max physical GPU resource for a vGPU; it's a
>>> vertical limitation. Weight is a horizontal limitation that defines
>>> the GPU resource consumption ratio between vGPUs. Cap is easy to
>>> understand as it's just a percentage. For weight, for example, if we
>>> define the max weight as 16, vGPU_1 with weight 8 should be assigned
>>> double the GPU resources of vGPU_2, whose weight is 4. We can
>>> translate it to this formula:
>>>   resource_of_vGPU_1 = 8 / (8+4) * total_physical_GPU_resource.
>>>   
>> How will vendor driver provide max weight to userspace
>> application/libvirt? Max weight will be per physical device, right?
>>
>> How would such resource allocation reflect in 'available_instances'?
>> Suppose in above example, vGPU_1 is of 1G FB with weight 8, vGPU_2 with
>> 1G FB with weight 4 and vGPU_3 with 1G FB with weight 4. Now you have 1G
>> FB free but you have reached max weight, so will you make
>> available_instances = 0 for all types on that physical GPU?
> No, per the algorithm above, the available scheduling for the remaining
> mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
> we'd need to define or make the range discoverable, 16 seems rather
> arbitrary).  We can always add new scheduling participants.  AIUI,
> Intel uses round-robin scheduling now, where you could consider all
> mdev devices to have the same weight.  Whether we consider that to be a
> weight of 16 or zero or 8 doesn't really matter.

QoS controls the device's processing capability, such as GPU
rendering/computing, which can be time-multiplexed. It is not used to
control dedicated, partitioned resources like FB, so there is no impact on
'available_instances'.

If vGPU_1 has weight=8 and vGPU_2 has weight=4, then:
  vGPU_1_res = 8 / (8 + 4) * total
  vGPU_2_res = 4 / (8 + 4) * total
If vGPU_3 is then created with weight=2:
  vGPU_1_res = 8 / (8 + 4 + 2) * total
  vGPU_2_res = 4 / (8 + 4 + 2) * total
  vGPU_3_res = 2 / (8 + 4 + 2) * total

The resource allocation of vGPU_1 and vGPU_2 changes dynamically once
vGPU_3 is created. That is what weight does, since it defines the
relationship among all the vGPUs, so the performance degradation is
expected. The end user should be aware of this behavior.

However, the discussion about weight made me reflect: does the end user
really need weight? Is there an actual application requirement for it?
Maybe cap and priority are enough?

>>> If only one guest exists, there is no target to compare against, so
>>> weight becomes meaningless and the single guest enjoys the whole
>>> physical GPU.
>>>   
>> If a single VM, say with vGPU_1, runs for a long time, i.e. it enjoys the
>> whole GPU, and then another VM boots with weight 4, will you cut down the
>> resources of vGPU_1 at runtime? Wouldn't that show performance
>> degradation for the VM with vGPU_1 at runtime?
> Yes.  We have this already though, vGPU_1 may enjoy the whole GPU
> simply because the other vGPUs are idle, that can change at any time
> and may reduce the resources available to vGPU_1.  Do we want a QoS
> knob for fixed scheduling slices?  With only cap, weight, and priority,
> how could I provide an SLA for no less than 40% of the GPU?  I guess we
> can get that with careful use of weight, but I wonder if we could make
> it more simple for users.
>
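
On the "no less than 40%" question, a rough sketch of what "careful use of
weight" could look like, assuming shares are purely weight / sum(weights).
min_weight_for_share() is a hypothetical helper, and no max weight (such as
16) is enforced here.

```python
from fractions import Fraction
from math import ceil

def min_weight_for_share(target_share, other_weights_total):
    """Smallest integer w with w / (w + others) >= target_share.

    Rearranging gives w >= target_share * others / (1 - target_share).
    Hypothetical helper; assumes pure weight-proportional scheduling.
    """
    s = Fraction(target_share)
    if not 0 < s < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    if other_weights_total < 0:
        raise ValueError("other_weights_total must not be negative")
    return max(1, ceil(s * other_weights_total / (1 - s)))

# Other vGPUs on the same parent have weights 4 and 4 (total 8).
needed = min_weight_for_share(Fraction(2, 5), 4 + 4)
print(needed, Fraction(needed, needed + 8))
```

The result only holds while the other weights stay fixed. Any new
scheduling participant lowers the share again, so this is not a fixed
guarantee.
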
>>>> If libvirt/other toolstack decides to do smart allocation based on type
>>>> name without taking physical host device as input, guest 1 and guest 2
>>>> might get mdev devices created on different physical devices. Then would
>>>> weightage matter here?  
>>> I think you mean the case where two discrete GPU cards exist and the
>>> vGPU types can be freely allocated on them. IMO the back-end driver
>>> should handle such a case, as the number of physical devices is
>>> transparent to the tool stack, e.g. by presenting multiple physical
>>> devices as a single logical one to mdev.
>>>   
>> No, generally toolstack is aware of available physical devices and it
>> could have smart logic to decide on which physical device mdev device
>> should be created, i.e. to load one physical device first or to
>> distribute the load across physical devices when mdev devices are
>> created. Libvirt doesn't have such logic now, but having such logic in
>> libvirt was discussed earlier.
>> Then in that case, as I said above, wouldn't that show perf
>> degradation on running VMs at runtime?
> It seems that the proposed cap, weight, and priority only handle QoS
> within a single parent device.  All the knobs are relative to other
> scheduling participants on that parent device.  The same QoS parameters
> for mdev devices on separate parent devices could have wildly different
> performance characteristics depending on the load the other mdev
> devices are inflicting.  If there's only one such parent device on the
> system, this works.  libvirt has already effectively rejected the idea
> of automating mdev placement and perhaps this is another similar case
> where we simply require some higher level management tool to have a
> global view of the system.  Thanks,

Yeah, QoS only tries to handle a single parent device. For the multi-device
case we need to define the management at a higher level.
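
A small sketch of why the same weight means different things on different
parent devices, again assuming shares are weight / sum(weights) computed
per parent. shares_by_parent() and the device names are made up for
illustration.

```python
from collections import defaultdict
from fractions import Fraction

def shares_by_parent(mdevs):
    """mdevs: list of (mdev_name, parent_name, weight) tuples.

    Returns mdev_name -> share of its own parent device's time.
    Hypothetical helper; weights are only compared within one parent.
    """
    totals = defaultdict(int)
    for _, parent, weight in mdevs:
        totals[parent] += weight
    return {name: Fraction(weight, totals[parent])
            for name, parent, weight in mdevs}

shares = shares_by_parent([
    ("a1", "gpu_A", 8), ("a2", "gpu_A", 8),
    ("b1", "gpu_B", 8), ("b2", "gpu_B", 8),
    ("b3", "gpu_B", 8), ("b4", "gpu_B", 8),
])
print(shares["a1"], shares["b1"])
```

Both a1 and b1 have weight 8, but they end up with different shares
because their parents carry different loads. Balancing across parents is
what the higher-level management would have to do.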

Thanks,
Ping