From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kirti Wankhede
Subject: Re: [RFC]Add new mdev interface for QoS
Date: Tue, 8 Aug 2017 12:12:22 +0530
Message-ID:
References: <9951f9cf-89dd-afa4-a9f7-9a795e4c01af@intel.com>
 <20170726104343.5bfa51d5@w520.home>
 <9607b33d-7b3a-1bcf-1ad9-4b554100e68a@intel.com>
 <20170801162625.6264dbd6@w520.home>
 <0f637a9b-8b74-8b50-6611-2eb2557a80d6@nvidia.com>
 <461872b1-1086-5151-1473-734223b050d0@intel.com>
 <20170802105845.717ecf5f@w520.home>
 <20170803151155.35c650cb@w520.home>
 <09229dca-1083-4970-a27d-ec82d06f0b28@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8BIT
Cc: , , "Tian, Kevin" , Zhenyu Wang , Jike Song , ,
To: "Gao, Ping A" , Alex Williamson
Return-path:
In-Reply-To: <09229dca-1083-4970-a27d-ec82d06f0b28@intel.com>
Content-Language: en-US
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On 8/7/2017 1:11 PM, Gao, Ping A wrote:
>
> On 2017/8/4 5:11, Alex Williamson wrote:
>> On Thu, 3 Aug 2017 20:26:14 +0800
>> "Gao, Ping A" wrote:
>>
>>> On 2017/8/3 0:58, Alex Williamson wrote:
>>>> On Wed, 2 Aug 2017 21:16:28 +0530
>>>> Kirti Wankhede wrote:
>>>>
>>>>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:
>>>>>> On 2017/8/2 18:19, Kirti Wankhede wrote:
>>>>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:
>>>>>>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>>>>>>> "Gao, Ping A" wrote:
>>>>>>>>
>>>>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:
>>>>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:
>>>>>>>>>>> [cc +libvir-list]
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>>>>> "Gao, Ping A" wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The vfio-mdev framework provides the capability to let
>>>>>>>>>>>> different guests share the same physical device through
>>>>>>>>>>>> mediated sharing; as a result, it brings a requirement about
>>>>>>>>>>>> how to control the device sharing: we need a QoS-related
>>>>>>>>>>>> interface for mdev to manage virtual device resources.
>>>>>>>>>>>>
>>>>>>>>>>>> E.g.
>>>>>>>>>>>> In practical use, vGPUs assigned to different guests often
>>>>>>>>>>>> have different performance requirements: some guests may need
>>>>>>>>>>>> higher priority for real-time usage, while others may need a
>>>>>>>>>>>> larger portion of the GPU resource to get higher 3D
>>>>>>>>>>>> performance.  Correspondingly, we can define interfaces like
>>>>>>>>>>>> weight/cap for overall budget control and priority for
>>>>>>>>>>>> single-submission control.
>>>>>>>>>>>>
>>>>>>>>>>>> So I suggest adding some common, vendor-agnostic attributes to
>>>>>>>>>>>> the mdev core sysfs for QoS purposes.
>>>>>>>>>>> I think what you're asking for is just some standardization of
>>>>>>>>>>> a QoS attribute_group which a vendor can optionally include
>>>>>>>>>>> within the existing mdev_parent_ops.mdev_attr_groups.  The mdev
>>>>>>>>>>> core will transparently enable this, but it really only
>>>>>>>>>>> provides the standard; all of the support code is left to the
>>>>>>>>>>> vendor.  I'm fine with that, but of course the trouble with any
>>>>>>>>>>> sort of standardization is arriving at an agreed upon standard.
>>>>>>>>>>> Are there QoS knobs that are generic across any mdev device
>>>>>>>>>>> type?  Are there others that are more specific to vGPU?  Are
>>>>>>>>>>> there existing examples of this whose specification we can
>>>>>>>>>>> steal?
>>>>>>>>>> Yes, you are right, standardized QoS knobs are exactly what I
>>>>>>>>>> wanted.  Only when they become part of the mdev framework and
>>>>>>>>>> libvirt can a critical feature like QoS be leveraged for cloud
>>>>>>>>>> usage.  HW vendors then only need to focus on implementing the
>>>>>>>>>> corresponding QoS algorithm in their back-end driver.
>>>>>>>>>>
>>>>>>>>>> The vfio-mdev framework provides the capability to share devices
>>>>>>>>>> that lack HW virtualization support with guests.  No matter the
>>>>>>>>>> device type, mediated sharing is essentially a time-sharing
>>>>>>>>>> multiplexing method; from this point of view, QoS can be taken
>>>>>>>>>> as a generic way to control the time assignment for the virtual
>>>>>>>>>> mdev devices that occupy the HW.  As a result, we can define QoS
>>>>>>>>>> knobs that are generic across any device type in this way.  Even
>>>>>>>>>> if the HW has some kind of built-in QoS support, I think it's
>>>>>>>>>> not a problem for the back-end driver to convert the standard
>>>>>>>>>> mdev QoS definition to its own specification and reach the same
>>>>>>>>>> performance expectation.  There seem to be no existing examples
>>>>>>>>>> for us to follow, so we need to define it from scratch.
>>>>>>>>>>
>>>>>>>>>> I propose universal QoS control interfaces like the ones below:
>>>>>>>>>>
>>>>>>>>>> Cap: The cap limits the maximum percentage of time an mdev
>>>>>>>>>> device can own the physical device; e.g. cap=60 means the mdev
>>>>>>>>>> device cannot take more than 60% of the total physical resource.
>>>>>>>>>>
>>>>>>>>>> Weight: The weight defines proportional control of the mdev
>>>>>>>>>> device resource between guests; it is orthogonal to Cap and
>>>>>>>>>> targets load balancing.  E.g. if guest 1 should take double the
>>>>>>>>>> mdev device resource compared with guest 2, set the weight
>>>>>>>>>> ratio to 2:1.
>>>>>>>>>>
>>>>>>>>>> Priority: The guest with higher priority gets execution first,
>>>>>>>>>> targeting real-time usage and speeding up interactive response.
>>>>>>>>>>
>>>>>>>>>> The above QoS interfaces cover both overall budget control and
>>>>>>>>>> single-submission control.  I will send out the detailed design
>>>>>>>>>> later once we get aligned.
>>>>>>>>> Hi Alex,
>>>>>>>>> Any comments about the interface mentioned above?
>>>>>>>> Not really.
>>>>>>>>
>>>>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>>>>> for NVIDIA devices?
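[The weight semantics proposed above reduce to simple proportional
arithmetic.  A minimal sketch, for illustration only -- not driver code,
and the device names are made up:]

```python
def weighted_shares(weights):
    """Split total device time in proportion to each mdev's weight."""
    total = sum(weights.values())
    return {dev: w / total for dev, w in weights.items()}

# A 2:1 weight ratio yields a 2:1 split of scheduling time:
# vGPU_1 gets 8/12 of the device, vGPU_2 gets 4/12.
print(weighted_shares({"vGPU_1": 8, "vGPU_2": 4}))
```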
>>>>>>>>
>>>>>>> We have different types of vGPU for different QoS factors.
>>>>>>>
>>>>>>> When mdev devices are created, their resources are allocated
>>>>>>> irrespective of which VM/userspace app is going to use that mdev
>>>>>>> device.  Any parameter we add here should be tied to a particular
>>>>>>> mdev device and not to the guest/app that is going to use it.
>>>>>>> 'Cap' and 'Priority' are along that line.  Not all mdev devices
>>>>>>> might need/use these parameters, so these can be made optional
>>>>>>> interfaces.
>>>>>> We also define some QoS parameters in the Intel vGPU types, but
>>>>>> they only provide a fixed, default way.  We still need a flexible
>>>>>> approach that gives the user the ability to change QoS parameters
>>>>>> freely and dynamically according to their requirements, not
>>>>>> restricted to the current limited and static vGPU types.
>>>>>>
>>>>>>> In the above proposal, I'm not sure how 'Weight' would work for
>>>>>>> mdev devices on the same physical device.
>>>>>>>
>>>>>>> In the above example, "if guest 1 should take double mdev device
>>>>>>> resource compared with guest 2" -- but what if guest 2 never
>>>>>>> booted?  How will you calculate resources?
>>>>>> Cap tries to limit the max physical GPU resource for a vGPU; it's
>>>>>> a vertical limitation, while weight is a horizontal limitation
>>>>>> that defines the GPU resource consumption ratio between vGPUs.
>>>>>> Cap is easy to understand as it's just a percentage.  For weight,
>>>>>> for example, if we define the max weight as 16, then vGPU_1 with
>>>>>> weight 8 should be assigned double the GPU resources compared to
>>>>>> vGPU_2 whose weight is 4; we can translate that to this formula:
>>>>>> resource_of_vGPU_1 = 8 / (8 + 4) * total_physical_GPU_resource.
>>>>>>
>>>>> How will the vendor driver provide the max weight to a userspace
>>>>> application/libvirt?  Max weight will be per physical device, right?
>>>>>
>>>>> How would such resource allocation be reflected in
>>>>> 'available_instances'?
>>>>> Suppose in the above example vGPU_1 is of 1G FB with weight 8,
>>>>> vGPU_2 of 1G FB with weight 4, and vGPU_3 of 1G FB with weight 4.
>>>>> Now you have 1G FB free but you have reached the max weight, so
>>>>> will you make available_instances = 0 for all types on that
>>>>> physical GPU?
>>>> No, per the algorithm above, the available scheduling for the
>>>> remaining mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or
>>>> maybe 0-16; we'd need to define that or make the range discoverable,
>>>> 16 seems rather arbitrary).  We can always add new scheduling
>>>> participants.  AIUI, Intel uses round-robin scheduling now, where
>>>> you could consider all mdev devices to have the same weight.
>>>> Whether we consider that to be a weight of 16 or zero or 8 doesn't
>>>> really matter.
>>> QoS is meant to control the device's processing capability, like GPU
>>> rendering/computing, which can be time-multiplexed; it is not used
>>> to control dedicated partitioned resources like FB, so there is no
>>> impact on 'available_instances'.
>>>
>>> if vGPU_1 weight=8, vGPU_2 weight=4;
>>> then vGPU_1_res = 8 / (8 + 4) * total, vGPU_2_res = 4 / (8 + 4) * total;
>>> if vGPU_3 is created with weight 2;
>>> then vGPU_1_res = 8 / (8 + 4 + 2) * total, vGPU_2_res = 4 / (8 + 4 + 2)
>>> * total, vGPU_3_res = 2 / (8 + 4 + 2) * total.
>>>
>>> The resource allocation of vGPU_1 and vGPU_2 changes dynamically
>>> after vGPU_3 is created; that is what weight does, as it defines the
>>> relationship among all the vGPUs, and the performance degradation
>>> meets expectations.  The end-user should know about such behavior.
>>>
>>> However, the argument about weight led me to some self-reflection:
>>> does the end-user really need weight?  Is there an actual application
>>> requirement for it?  Maybe cap and priority are enough?
>> What sort of SLAs do you want to be able to offer?  For instance, if
>> I want to be able to offer a GPU in 1/4 increments, how does that work?
>> I might sell customers A & B a 1/4 increment each and customer C a
>> 1/2 increment.  If weight is removed, can we do better than capping
>> A & B at 25% each and C at 50%?  That has the downside that nobody
>> gets to use the unused capacity of the other clients.  The SLA is
>> some sort of "up to X% (and no more)" model.  With weighting it's as
>> simple as making sure customer C's vGPU has twice the weight of that
>> given to A or B.  Then you get an "at least X%" SLA model and any
>> customer can use up to 100% if the others are idle.  Combining weight
>> and cap, we can do "at least X%, but no more than Y%".
>>
>> All of this feels really similar to how cpusets must work, since
>> we're just dealing with QoS relative to scheduling, and we should not
>> try to reinvent scheduling QoS.  Thanks,
>>
>
> Yeah, those were also my original thoughts.
> Since we are aligned on the basic QoS definition, I'm going to prepare
> the code on the kernel side.  How about the corresponding part in
> libvirt?  Will it be implemented separately after the kernel interface
> is finalized?
>

OK.  These interfaces should be optional, since not all mdev vendor
drivers may support such QoS.

Thanks,
Kirti
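[The combined "at least X%, but no more than Y%" model Alex describes
can be sketched as weight-proportional sharing with per-device caps,
redistributing any capped surplus to the uncapped devices.  A minimal
illustration under assumed semantics -- the function, device names, and
the "no cap means 100%" default are all made up for this sketch:]

```python
def effective_shares(weights, caps, active):
    """Split device time among active mdevs in proportion to weight,
    clamp each share at its cap (default 1.0, i.e. no cap), and
    redistribute any capped surplus among the uncapped devices."""
    shares = {d: 0.0 for d in active}
    remaining = 1.0
    pool = set(active)
    while pool and remaining > 1e-9:
        wsum = sum(weights[d] for d in pool)
        surplus = remaining
        capped = set()
        for d in pool:
            # Weight-proportional grant from the remaining time...
            want = shares[d] + remaining * weights[d] / wsum
            # ...clamped at the device's cap.
            cap = caps.get(d, 1.0)
            new = min(want, cap)
            surplus -= new - shares[d]
            shares[d] = new
            if new >= cap:
                capped.add(d)
        remaining = surplus      # time the capped devices could not use
        pool -= capped           # only uncapped devices keep competing
        if not capped:
            break
    return shares

# "Up to Y%" when everyone is busy: A/B/C get 25%/25%/50%.
print(effective_shares({"A": 1, "B": 1, "C": 2},
                       {"A": 0.25, "B": 0.25, "C": 0.5},
                       ["A", "B", "C"]))
# "At least X%" with weight only: C alone can use the whole device.
print(effective_shares({"A": 1, "B": 1, "C": 2}, {}, ["C"]))
```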