From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kirti Wankhede
Subject: Re: [RFC]Add new mdev interface for QoS
Date: Tue, 8 Aug 2017 12:12:22 +0530
Message-ID:
References: <9951f9cf-89dd-afa4-a9f7-9a795e4c01af@intel.com>
 <20170726104343.5bfa51d5@w520.home>
 <9607b33d-7b3a-1bcf-1ad9-4b554100e68a@intel.com>
 <20170801162625.6264dbd6@w520.home>
 <0f637a9b-8b74-8b50-6611-2eb2557a80d6@nvidia.com>
 <461872b1-1086-5151-1473-734223b050d0@intel.com>
 <20170802105845.717ecf5f@w520.home>
 <20170803151155.35c650cb@w520.home>
 <09229dca-1083-4970-a27d-ec82d06f0b28@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8BIT
Cc: , , "Tian, Kevin" , Zhenyu Wang , Jike Song , ,
To: "Gao, Ping A" , Alex Williamson
Return-path:
In-Reply-To: <09229dca-1083-4970-a27d-ec82d06f0b28@intel.com>
Content-Language: en-US
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On 8/7/2017 1:11 PM, Gao, Ping A wrote:
>
> On 2017/8/4 5:11, Alex Williamson wrote:
>> On Thu, 3 Aug 2017 20:26:14 +0800
>> "Gao, Ping A" wrote:
>>
>>> On 2017/8/3 0:58, Alex Williamson wrote:
>>>> On Wed, 2 Aug 2017 21:16:28 +0530
>>>> Kirti Wankhede wrote:
>>>>
>>>>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:
>>>>>> On 2017/8/2 18:19, Kirti Wankhede wrote:
>>>>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:
>>>>>>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>>>>>>> "Gao, Ping A" wrote:
>>>>>>>>
>>>>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:
>>>>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:
>>>>>>>>>>> [cc +libvir-list]
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>>>>> "Gao, Ping A" wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The vfio-mdev framework provides the capability to let
>>>>>>>>>>>> different guests share the same physical device through
>>>>>>>>>>>> mediated sharing; as a result, it brings a requirement about
>>>>>>>>>>>> how to control the device sharing: we need a QoS-related
>>>>>>>>>>>> interface for mdev to manage virtual device resources.
>>>>>>>>>>>>
>>>>>>>>>>>> E.g.
>>>>>>>>>>>> In practical use, vGPUs assigned to different guests often
>>>>>>>>>>>> have different performance requirements: some guests may need
>>>>>>>>>>>> higher priority for real-time usage, while others may need a
>>>>>>>>>>>> larger portion of the GPU resource to get higher 3D
>>>>>>>>>>>> performance.  Correspondingly, we can define interfaces like
>>>>>>>>>>>> weight/cap for overall budget control and priority for
>>>>>>>>>>>> single-submission control.
>>>>>>>>>>>>
>>>>>>>>>>>> So I suggest adding some common, vendor-agnostic attributes to
>>>>>>>>>>>> the mdev core sysfs for QoS purposes.
>>>>>>>>>>> I think what you're asking for is just some standardization of
>>>>>>>>>>> a QoS attribute_group which a vendor can optionally include
>>>>>>>>>>> within the existing mdev_parent_ops.mdev_attr_groups.  The mdev
>>>>>>>>>>> core will transparently enable this, but it really only
>>>>>>>>>>> provides the standard; all of the support code is left to the
>>>>>>>>>>> vendor.  I'm fine with that, but of course the trouble with any
>>>>>>>>>>> sort of standardization is arriving at an agreed upon standard.
>>>>>>>>>>> Are there QoS knobs that are generic across any mdev device
>>>>>>>>>>> type?  Are there others that are more specific to vGPU?  Are
>>>>>>>>>>> there existing examples of this whose specification we can
>>>>>>>>>>> steal?
>>>>>>>>>> Yes, you are right, standardized QoS knobs are exactly what I
>>>>>>>>>> wanted.  Only when they become part of the mdev framework and
>>>>>>>>>> libvirt can a critical feature like QoS be leveraged for cloud
>>>>>>>>>> usage.  HW vendors then only need to focus on implementing the
>>>>>>>>>> corresponding QoS algorithm in their back-end driver.
>>>>>>>>>>
>>>>>>>>>> The vfio-mdev framework provides the capability to share devices
>>>>>>>>>> that lack HW virtualization support with guests.  No matter the
>>>>>>>>>> device type, mediated sharing is essentially a time-sharing
>>>>>>>>>> multiplexing method; from this point of view, QoS can be taken
>>>>>>>>>> as a generic way to control the time assignment for the virtual
>>>>>>>>>> mdev devices that occupy the HW.  As a result, we can define QoS
>>>>>>>>>> knobs that are generic across any device type in this way.  Even
>>>>>>>>>> if the HW has some kind of built-in QoS support, I think it's
>>>>>>>>>> not a problem for the back-end driver to convert the standard
>>>>>>>>>> mdev QoS definition to its own specification and reach the same
>>>>>>>>>> performance expectation.  There seem to be no existing examples
>>>>>>>>>> for us to follow, so we need to define it from scratch.
>>>>>>>>>>
>>>>>>>>>> I propose universal QoS control interfaces like the ones below:
>>>>>>>>>>
>>>>>>>>>> Cap: The cap limits the maximum percentage of time an mdev
>>>>>>>>>> device can own the physical device; e.g. cap=60 means the mdev
>>>>>>>>>> device cannot take more than 60% of the total physical resource.
>>>>>>>>>>
>>>>>>>>>> Weight: The weight defines proportional control of the mdev
>>>>>>>>>> device resource between guests; it is orthogonal to Cap and
>>>>>>>>>> targets load balancing.  E.g. if guest 1 should take double the
>>>>>>>>>> mdev device resource compared with guest 2, set the weight
>>>>>>>>>> ratio to 2:1.
>>>>>>>>>>
>>>>>>>>>> Priority: The guest with higher priority gets execution first,
>>>>>>>>>> targeting real-time usage and speeding up interactive response.
>>>>>>>>>>
>>>>>>>>>> The above QoS interfaces cover both overall budget control and
>>>>>>>>>> single-submission control.  I will send out the detailed design
>>>>>>>>>> later once we get aligned.
>>>>>>>>> Hi Alex,
>>>>>>>>> Any comments about the interface mentioned above?
>>>>>>>> Not really.
>>>>>>>>
>>>>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>>>>> for NVIDIA devices?
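[The weight semantics proposed above reduce to simple proportional
arithmetic.  A minimal sketch, for illustration only -- not driver code,
and the device names are made up:]

```python
def weighted_shares(weights):
    """Split total device time in proportion to each mdev's weight."""
    total = sum(weights.values())
    return {dev: w / total for dev, w in weights.items()}

# A 2:1 weight ratio yields a 2:1 split of scheduling time:
# vGPU_1 gets 8/12 of the device, vGPU_2 gets 4/12.
print(weighted_shares({"vGPU_1": 8, "vGPU_2": 4}))
```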
>>>>>>>>
>>>>>>> We have different types of vGPU for different QoS factors.
>>>>>>>
>>>>>>> When mdev devices are created, their resources are allocated
>>>>>>> irrespective of which VM/userspace app is going to use that mdev
>>>>>>> device.  Any parameter we add here should be tied to a particular
>>>>>>> mdev device and not to the guest/app that is going to use it.
>>>>>>> 'Cap' and 'Priority' are along that line.  Not all mdev devices
>>>>>>> might need/use these parameters, so these can be made optional
>>>>>>> interfaces.
>>>>>> We also define some QoS parameters in the Intel vGPU types, but
>>>>>> they only provide a fixed, default way.  We still need a flexible
>>>>>> approach that gives the user the ability to change QoS parameters
>>>>>> freely and dynamically according to their requirements, not
>>>>>> restricted to the current limited and static vGPU types.
>>>>>>
>>>>>>> In the above proposal, I'm not sure how 'Weight' would work for
>>>>>>> mdev devices on the same physical device.
>>>>>>>
>>>>>>> In the above example, "if guest 1 should take double mdev device
>>>>>>> resource compared with guest 2" -- but what if guest 2 never
>>>>>>> booted?  How will you calculate resources?
>>>>>> Cap tries to limit the max physical GPU resource for a vGPU; it's
>>>>>> a vertical limitation, while weight is a horizontal limitation
>>>>>> that defines the GPU resource consumption ratio between vGPUs.
>>>>>> Cap is easy to understand as it's just a percentage.  For weight,
>>>>>> for example, if we define the max weight as 16, then vGPU_1 with
>>>>>> weight 8 should be assigned double the GPU resources compared to
>>>>>> vGPU_2 whose weight is 4; we can translate that to this formula:
>>>>>> resource_of_vGPU_1 = 8 / (8 + 4) * total_physical_GPU_resource.
>>>>>>
>>>>> How will the vendor driver provide the max weight to a userspace
>>>>> application/libvirt?  Max weight will be per physical device, right?
>>>>>
>>>>> How would such resource allocation be reflected in
>>>>> 'available_instances'?
>>>>> Suppose in the above example vGPU_1 is of 1G FB with weight 8,
>>>>> vGPU_2 of 1G FB with weight 4, and vGPU_3 of 1G FB with weight 4.
>>>>> Now you have 1G FB free but you have reached the max weight, so
>>>>> will you make available_instances = 0 for all types on that
>>>>> physical GPU?
>>>> No, per the algorithm above, the available scheduling for the
>>>> remaining mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or
>>>> maybe 0-16; we'd need to define that or make the range discoverable,
>>>> 16 seems rather arbitrary).  We can always add new scheduling
>>>> participants.  AIUI, Intel uses round-robin scheduling now, where
>>>> you could consider all mdev devices to have the same weight.
>>>> Whether we consider that to be a weight of 16 or zero or 8 doesn't
>>>> really matter.
>>> QoS is meant to control the device's processing capability, like GPU
>>> rendering/computing, which can be time-multiplexed; it is not used
>>> to control dedicated partitioned resources like FB, so there is no
>>> impact on 'available_instances'.
>>>
>>> if vGPU_1 weight=8, vGPU_2 weight=4;
>>> then vGPU_1_res = 8 / (8 + 4) * total, vGPU_2_res = 4 / (8 + 4) * total;
>>> if vGPU_3 is created with weight 2;
>>> then vGPU_1_res = 8 / (8 + 4 + 2) * total, vGPU_2_res = 4 / (8 + 4 + 2)
>>> * total, vGPU_3_res = 2 / (8 + 4 + 2) * total.
>>>
>>> The resource allocation of vGPU_1 and vGPU_2 changes dynamically
>>> after vGPU_3 is created; that is what weight does, as it defines the
>>> relationship among all the vGPUs, and the performance degradation
>>> meets expectations.  The end-user should know about such behavior.
>>>
>>> However, the argument about weight led me to some self-reflection:
>>> does the end-user really need weight?  Is there an actual application
>>> requirement for it?  Maybe cap and priority are enough?
>> What sort of SLAs do you want to be able to offer?  For instance, if
>> I want to be able to offer a GPU in 1/4 increments, how does that work?
>> I might sell customers A & B a 1/4 increment each and customer C a
>> 1/2 increment.  If weight is removed, can we do better than capping
>> A & B at 25% each and C at 50%?  That has the downside that nobody
>> gets to use the unused capacity of the other clients.  The SLA is
>> some sort of "up to X% (and no more)" model.  With weighting it's as
>> simple as making sure customer C's vGPU has twice the weight of that
>> given to A or B.  Then you get an "at least X%" SLA model and any
>> customer can use up to 100% if the others are idle.  Combining weight
>> and cap, we can do "at least X%, but no more than Y%".
>>
>> All of this feels really similar to how cpusets must work, since
>> we're just dealing with QoS relative to scheduling, and we should not
>> try to reinvent scheduling QoS.  Thanks,
>>
>
> Yeah, those were also my original thoughts.
> Since we are aligned on the basic QoS definition, I'm going to prepare
> the code on the kernel side.  How about the corresponding part in
> libvirt?  Will it be implemented separately after the kernel interface
> is finalized?
>

OK.  These interfaces should be optional, since not all mdev vendor
drivers may support such QoS.

Thanks,
Kirti
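[The combined "at least X%, but no more than Y%" model Alex describes
can be sketched as weight-proportional sharing with per-device caps,
redistributing any capped surplus to the uncapped devices.  A minimal
illustration under assumed semantics -- the function, device names, and
the "no cap means 100%" default are all made up for this sketch:]

```python
def effective_shares(weights, caps, active):
    """Split device time among active mdevs in proportion to weight,
    clamp each share at its cap (default 1.0, i.e. no cap), and
    redistribute any capped surplus among the uncapped devices."""
    shares = {d: 0.0 for d in active}
    remaining = 1.0
    pool = set(active)
    while pool and remaining > 1e-9:
        wsum = sum(weights[d] for d in pool)
        surplus = remaining
        capped = set()
        for d in pool:
            # Weight-proportional grant from the remaining time...
            want = shares[d] + remaining * weights[d] / wsum
            # ...clamped at the device's cap.
            cap = caps.get(d, 1.0)
            new = min(want, cap)
            surplus -= new - shares[d]
            shares[d] = new
            if new >= cap:
                capped.add(d)
        remaining = surplus      # time the capped devices could not use
        pool -= capped           # only uncapped devices keep competing
        if not capped:
            break
    return shares

# "Up to Y%" when everyone is busy: A/B/C get 25%/25%/50%.
print(effective_shares({"A": 1, "B": 1, "C": 2},
                       {"A": 0.25, "B": 0.25, "C": 0.5},
                       ["A", "B", "C"]))
# "At least X%" with weight only: C alone can use the whole device.
print(effective_shares({"A": 1, "B": 1, "C": 2}, {}, ["C"]))
```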