Re: [PATCH 1/4] xen: credit2: implement utilization cap

From: Anshul Makkar <anshul.makkar@citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>,
	xen-devel@lists.xenproject.org
Cc: George Dunlap <george.dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Wei Liu <wei.liu2@citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	Jan Beulich <jbeulich@suse.com>
Subject: Re: [PATCH 1/4] xen: credit2: implement utilization cap
Date: Tue, 13 Jun 2017 17:07:49 +0100
Message-ID: <142e982a-f07e-8e24-9a5e-7d4eed213dd1@citrix.com>
In-Reply-To: <1497273564.26212.18.camel@citrix.com>

On 12/06/2017 14:19, Dario Faggioli wrote:
> Hey,

>>> Budget is burned by the domain's vCPUs, in a similar way to how
>>> credits are.
>>>
>>> When a domain runs out of budget, its vCPUs can't run any longer.
>>
>> if the vcpus of a domain have credit and if budget has run out, will
>> the
>> vcpus won't be scheduled.
>>
> Is this a question? Assuming it is, what do you mean with "domain have
> credit"? Domains always have credits, and they never run out of them.
> There's no such thing as a domain not being runnable because it does
> not have credits.
>
I didn't say "domain have credit"; the statement was about whether the
"vcpus of a domain have credit".

> About budget, a domain with <= 0 budget means all its vcpus are not
> runnable, and hence won't be scheduled, no matter their credits.
You answered the question here.
>
>>> @@ -92,6 +92,82 @@
>>>   */
>>>
>>>  /*
>>> + * Utilization cap:
>>> + *
>>> + * Setting a pCPU utilization cap for a domain means the following:
>>> + *
>>> + * - a domain can have a cap, expressed in terms of % of physical
>>> + * For implementing this, we use the following approach:
>>> + *
>>> + * - each domain is given a 'budget', and each domain has a timer, which
>>> + *   replenishes the domain's budget periodically. The budget is the amount
>>> + *   of time the vCPUs of the domain can use every 'period';
>>> + *
>>> + * - the period is CSCHED2_BDGT_REPL_PERIOD, and is the same for all domains
>>> + *   (but each domain has its own timer; so they all are periodic by the same
>>> + *   period, but replenishment of the budgets of the various domains, at
>>> + *   periods boundaries, are not synchronous);
>>> + *
>>> + * - when vCPUs run, they consume budget. When they don't run, they don't
>>> + *   consume budget. If there is no budget left for the domain, no vCPU of
>>> + *   that domain can run. If a vCPU tries to run and finds that there is no
>>> + *   budget, it blocks.
>>> + *   Budget never expires, so at whatever time a vCPU wants to run, it can
>>> + *   check the domain's budget, and if there is some, it can use it.
>>> + *
>>> + * - budget is replenished to the top of the capacity for the domain once
>>> + *   per period. Even if there was some leftover budget from previous period,
>>> + *   though, the budget after a replenishment will always be at most equal
>>> + *   to the total capacity of the domain ('tot_budget');
>>> + *
>>
>> budget is replenished but credits not available ?
>>
> Still not getting this.
What I want to ask is: if the budget of the domain is replenished, but
credit for the vcpus of that domain is not available, what happens?
I believe the vcpus won't be scheduled (even if they have budget_quota)
until their credit is replenished.
>
>> budget is finished but not vcpu has not reached the rate limit
>> boundary ?
>>
> Budget takes precedence over ratelimiting. This is important to keep
> cap working "regularly", rather than in some kind of permanent "trying-
> to-keep-up-with-overruns-in-previous-periods" state.
>
> And, ideally, a vcpu cap and ratelimiting should be set in such a way
> that they don't step on each other's toes (or do that only rarely). I can
> see about trying to print a warning when I detect potential tricky
> values (but it's not easy, considering budget is per-domain, so I can't
> be sure about how much each vcpu will actually get, and whether or not
Why can't you be sure? The scheduler knows the domain budget and the number
of vcpus per domain, so we can calculate the budget_quota and translate it
into a cpu slot duration.
Similarly, the value of the rate limit is also known. We can compare the two
and warn the user if the budget_quota is less than the rate limit.

This is very important for the user to know: if wrongly chosen, these values
can adversely affect the system's performance through frequent context
switches (the problem we are aware of).
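
To make the suggestion concrete, here is a minimal standalone sketch of such
a check. It is not code from the patch: quota_below_ratelimit() and its
parameters are hypothetical names, all times are assumed to be in
microseconds, and a rate limit of 0 is assumed to mean "rate limiting
disabled" (as the prv->ratelimit_us test in runq_candidate suggests).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): returns true when the even
 * per-vCPU share of the domain budget is smaller than the rate limit,
 * i.e. when a warning to the user might be worthwhile.
 */
static bool quota_below_ratelimit(int64_t tot_budget_us,
                                  unsigned int nr_vcpus,
                                  int64_t ratelimit_us)
{
    int64_t quota_us;

    assert(nr_vcpus > 0);

    /* Assumption: 0 means rate limiting is off, so nothing to warn about. */
    if ( ratelimit_us == 0 )
        return false;

    /* Even split of the domain budget among its vCPUs. */
    quota_us = tot_budget_us / nr_vcpus;

    return quota_us < ratelimit_us;
}
```

For example, a 2000us domain budget split over 4 vcpus gives 500us each,
which is below a 1000us rate limit, so the check would fire. As you point
out, this only compares against the even split; how much each vcpu really
gets at runtime may differ.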

> that will reveal to be significantly less than ratelimiting the most of
> the times).
>
>>> + * - when a budget replenishment occurs, if there are vCPUs that had been
>>> + *   blocked because of lack of budget, they'll be unblocked, and they will
>>> + *   (potentially) be able to run again.
>>> + *
>>> + * Finally, some even more implementation related detail:
>>> + *
>>> + * - budget is stored in a domain-wide pool. vCPUs of the domain that want
>>> + *   to run go to such pool, and grab some. When they do so, the amount
>>> + *   they grabbed is _immediately_ removed from the pool. This happens in
>>> + *   vcpu_try_to_get_budget();
>>> + *
>>> + * - when vCPUs stop running, if they've not consumed all the budget they
>>> + *   took, the leftover is put back in the pool. This happens in
>>> + *   vcpu_give_budget_back();
>>
>> 200% budget, 4 vcpus to run on 4 pcpus each allowed only 50% of
>> budget.
>> This is a static allocation .
>>
> Err... again, are you telling or asking?
Giving an example to show that it's a static allocation.
>
>>  for eg. 2 vcpus running on 2 pcpus at 20%
>> budgeted time, if vcpu3 wants to execute some cpu intensive task, then
>> it won't be allowed to use more than 50% of the pcpus.
>>
> With what parameters? You mean with these ones you cite here (i.e., a
> 200% budget)? If the VM has 200%, and vcpu1 and vcpu2 consumes
> 20%+20%=40%, there's 160% free for vcpu3 and vcpu4.
>
>> I checked the implementation below and I believe we can allow for this
>> type of dynamic budget_quota allocation per vcpu. Not for initial
>> version, but certainly we can consider it for future versions.
>>
> But... it's already totally dynamic.
csched2_dom_cntl()
{
    svc->budget_quota = max(sdom->tot_budget / sdom->nr_vcpus,
                            CSCHED2_MIN_TIMER);
}
If sdom->tot_budget = 200 and nr_vcpus is 4, then each vcpu gets 50%.
How is this dynamic allocation? We are not considering the utilization
of the other vcpus of the domain before allocating budget_quota to a vcpu.

Let me know if my understanding is wrong.
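
To illustrate what I mean by static, here is a standalone sketch of that
computation, outside of Xen. The names budget_quota_for() and MIN_TIMER_US
are made up for this mail; MIN_TIMER_US is only a stand-in for
CSCHED2_MIN_TIMER with an arbitrary value, and budgets are assumed to be
expressed in microseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for CSCHED2_MIN_TIMER; the value here is an assumption. */
#define MIN_TIMER_US 500

/*
 * Hypothetical mirror of the quota computation quoted above: an even
 * split of the domain budget, clamped from below by the minimum timer.
 */
static int64_t budget_quota_for(int64_t tot_budget_us, unsigned int nr_vcpus)
{
    int64_t share;

    assert(nr_vcpus > 0);

    share = tot_budget_us / nr_vcpus;

    return share > MIN_TIMER_US ? share : MIN_TIMER_US;
}
```

With a 20000us domain budget and 4 vcpus, every vcpu gets a 5000us quota.
The result depends only on tot_budget and nr_vcpus; nothing in it looks at
how much the other vcpus of the domain are actually using.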
>
>>> @@ -408,6 +505,10 @@ struct csched2_vcpu {
>>>      unsigned int residual;
>>>
>>>      int credit;
>>> +
>>> +    s_time_t budget;
>>
>> it's confusing, please can we have different member names for budget
>> in
>> domain and vcpu structure.
>>
> Mmm.. I don't think it is. It's "how much budget this _thing_ have",
> where "thing" can be the domain or a vcpu, and you can tell that by
> looking at the containing structure. Most of the time, that's rather
> explicit, the former being sdom->budget, the latter being svc->budget.
>
> What different names did you have in mind?
>
> The only alternative that I can come up with would be something like:
>
> struct csched2_dom {
>  ...
>  dom_budget;
>  ...
> };
> struct csched2_vcpu {
>  ...
>  vcpu_budget;
>  ...
> };
>
> Which I don't like (because of the repetition).
>
Just shared a thought, as I experienced the confusion while reading
the code for the first time. If you don't agree, it's fine.
>>> @@ -1354,7 +1469,16 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
>>>           * that the credit it has spent so far get accounted.
>>>           */
>>>          if ( svc->vcpu == curr_on_cpu(svc_cpu) )
>>> +        {
>>>              burn_credits(rqd, svc, now);
>>> +            /*
>>> +             * And, similarly, in case it has run out of budget, as a
>>> +             * consequence of this round of accounting, we also must inform
>>> +             * its pCPU that it's time to park it, and pick up someone else.
>>> +             */
>>> +            if ( unlikely(svc->budget <= 0) )
>>
>> use of unlikely here is not saving much of cpu cycles.
>>
> Well, considering that not all domains will have a cap, and that we
> don't expect, even for the domains with caps, all their vcpus to
> exhaust their budget at every reset event, I think annotating this as
> an unlikely event makes sense.
From what I understand, I considered it to be a very likely event.

>
> It's not a super big deal, though, and I can get rid of it, if people
> don't like/are not convinced about it. :-)
Yes, it's fine, we can leave it for now.

>
>>> @@ -1410,27 +1534,35 @@ void burn_credits(struct
>>> +    sdom->budget += svc->budget;
>>> +
>>> +    if ( sdom->budget > 0 )
>>> +    {
>>> +        svc->budget = sdom->budget;
>>
>> why are you assigning the remaining sdom->budget to only this svc? svc
>> should be assigned a proportionate budget.
>> Each vcpu is assigned a %age of the domain budget based on the cap and
>> number of vcpus.
>> There is difference in the code that's here and the code in branch
>> git://xenbits.xen.org/people/dariof/xen.git (fetch)
>> rel/sched/credti2-caps branch. Logic in the branch code looks fine where
>> you are taking svc->budget_quota into consideration.
>>
> Yeah... maybe look at patch 3/4. :-P
Yeah, got it in third patch. :)
>
>> In runq candidate we have a code base
>> /*
>>   * Return the current vcpu if it has executed for less than ratelimit.
>>   * Adjustment for the selected vcpu's credit and decision
>>   * for how long it will run will be taken in csched2_runtime.
>>   *
>>   * Note that, if scurr is yielding, we don't let rate limiting kick in.
>>   * In fact, it may be the case that scurr is about to spin, and there's
>>   * no point forcing it to do so until rate limiting expires.
>>   */
>>   if ( !yield && prv->ratelimit_us && !is_idle_vcpu(scurr->vcpu) &&
>>        vcpu_runnable(scurr->vcpu) &&
>>       (now - scurr->vcpu->runstate.state_entry_time) <
>>         MICROSECS(prv->ratelimit_us) )
>> In this codeblock we return scurr. Here there is no check for
>> vcpu->budget.
>> Even if the scurr vcpu has executed for less than rate limit and scurr
>> is not yielding, we need to check for its budget before returning scurr.
>>
> But we check vcpu_runnable(scurr). And we've already called, in
> csched2_schedule(), vcpu_try_to_get_budget(scurr). And if scurr could
> not get any budget, we called park_vcpu(scurr), which sets scurr up in
> such a way that vcpu_runnable(scurr) is false.
Yes, got your point, but then the call to vcpu_try_to_get_budget() should
move to the code block in runq_candidate() that returns scurr; otherwise
the condition looks incomplete and makes the logic ambiguous.

We call runq_candidate() to find the next runnable candidate. If we want
to return scurr as the current runnable candidate, then it should have
gone through all the checks, including budget_quota, and all these checks
should be in one place.
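
Roughly what I have in mind, as a standalone sketch rather than actual Xen
code: struct vcpu_state, its fields and keep_current() are all hypothetical,
times are assumed to be in microseconds, and I am assuming that budget only
matters for vcpus of capped domains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified view of the state consulted for scurr. */
struct vcpu_state {
    bool idle;          /* is this the idle vcpu? */
    bool runnable;      /* what vcpu_runnable() would report */
    bool yielding;      /* scurr asked to yield */
    bool capped;        /* assumption: domain has a utilization cap */
    int64_t budget_us;  /* budget currently held by the vcpu */
    int64_t ran_for_us; /* time since it was last scheduled */
};

/*
 * All conditions for "keep running scurr because of rate limiting",
 * with the budget check spelled out next to the others.
 */
static bool keep_current(const struct vcpu_state *scurr, int64_t ratelimit_us)
{
    assert(scurr != NULL);

    return !scurr->yielding &&
           ratelimit_us != 0 &&
           !scurr->idle &&
           scurr->runnable &&
           (!scurr->capped || scurr->budget_us > 0) &&
           scurr->ran_for_us < ratelimit_us;
}
```

This does not change the behaviour you describe (a parked vcpu is already
not runnable); it only makes the budget condition visible where the
decision to return scurr is taken.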
>
> Thanks and Regards,
> Dario
>

Anshul
