From: George Dunlap <george.dunlap@citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>,
xen-devel@lists.xenproject.org
Cc: Wei Liu <wei.liu2@citrix.com>,
George Dunlap <george.dunlap@eu.citrix.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
Anshul Makkar <anshul.makkar@citrix.com>,
Ian Jackson <ian.jackson@eu.citrix.com>,
Jan Beulich <jbeulich@suse.com>
Subject: Re: [PATCH 1/4] xen: credit2: implement utilization cap
Date: Tue, 25 Jul 2017 16:08:44 +0100 [thread overview]
Message-ID: <18caea27-2d72-6e93-96ca-8a4a0a4ee614@citrix.com> (raw)
In-Reply-To: <149692372627.9605.8252407697848997058.stgit@Solace.fritz.box>
On 06/08/2017 01:08 PM, Dario Faggioli wrote:
> This commit implements the Xen part of the cap mechanism for
> Credit2.
>
> A cap is how much, in terms of % of physical CPU time, a domain
> can execute at most.
>
> For instance, a domain that must not use more than 1/4 of one
> physical CPU, must have a cap of 25%; one that must not use more
> than 1+1/2 of physical CPU time, must be given a cap of 150%.
>
> Caps are per domain, so it is all of a domain's vCPUs, cumulatively,
> that will be forced to execute no more than the decided amount.
>
> This is implemented by giving each domain a 'budget', and using
> a (per-domain again) periodic timer. Values of budget and 'period'
> are chosen so that budget/period is equal to the cap itself.
>
> Budget is burned by the domain's vCPUs, in a similar way to how
> credits are.
>
> When a domain runs out of budget, its vCPUs can't run any longer.
> They can run again once the budget is replenished by the timer,
> which happens once every period.
>
> Blocking the vCPUs because of lack of budget happens by
> means of a new (_VPF_parked) pause flag, so that, e.g.,
> vcpu_runnable() still works. This is similar to what is
> done in sched_rtds.c, as opposed to what happens in
> sched_credit.c, where vcpu_pause() and vcpu_unpause() are
> used (which means, among other things, more overhead).
>
> Note that xenalyze and tools/xentrace/formats are also modified,
> to keep them in sync with the one modified event.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> ---
> tools/xentrace/formats | 2
> tools/xentrace/xenalyze.c | 10 +
> xen/common/sched_credit2.c | 470 +++++++++++++++++++++++++++++++++++++++++---
> xen/include/xen/sched.h | 3
> 4 files changed, 445 insertions(+), 40 deletions(-)
>
> diff --git a/tools/xentrace/formats b/tools/xentrace/formats
> index 8b31780..142b0cf 100644
> --- a/tools/xentrace/formats
> +++ b/tools/xentrace/formats
> @@ -51,7 +51,7 @@
>
> 0x00022201 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tick
> 0x00022202 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_pos [ dom:vcpu = 0x%(1)08x, pos = %(2)d]
> -0x00022203 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit burn [ dom:vcpu = 0x%(1)08x, credit = %(2)d, delta = %(3)d ]
> +0x00022203 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit burn [ dom:vcpu = 0x%(1)08x, credit = %(2)d, budget = %(3)d, delta = %(4)d ]
> 0x00022204 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit_add
> 0x00022205 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tickle_check [ dom:vcpu = 0x%(1)08x, credit = %(2)d ]
> 0x00022206 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tickle [ cpu = %(1)d ]
> diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
> index fa608ad..c16c02d 100644
> --- a/tools/xentrace/xenalyze.c
> +++ b/tools/xentrace/xenalyze.c
> @@ -7680,12 +7680,14 @@ void sched_process(struct pcpu_info *p)
> if(opt.dump_all) {
> struct {
> unsigned int vcpuid:16, domid:16;
> - int credit, delta;
> + int credit, budget, delta;
> } *r = (typeof(r))ri->d;
>
> - printf(" %s csched2:burn_credits d%uv%u, credit = %d, delta = %d\n",
> - ri->dump_header, r->domid, r->vcpuid,
> - r->credit, r->delta);
> + printf(" %s csched2:burn_credits d%uv%u, credit = %d, ",
> + ri->dump_header, r->domid, r->vcpuid, r->credit);
> + if ( r->budget != INT_MIN )
> + printf("budget = %d, ", r->budget);
> + printf("delta = %d\n", r->delta);
> }
> break;
> case TRC_SCHED_CLASS_EVT(CSCHED2, 5): /* TICKLE_CHECK */
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 126417c..ba4bf4b 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -92,6 +92,82 @@
> */
>
> /*
> + * Utilization cap:
> + *
> + * Setting a pCPU utilization cap for a domain means the following:
> + *
> + * - a domain can have a cap, expressed in terms of % of physical CPU time.
> + * A domain that must not use more than 1/4 of _one_ physical CPU, will
> + * be given a cap of 25%; a domain that must not use more than 1+1/2 of
> + * physical CPU time, will be given a cap of 150%;
> + *
> + * - caps are per-domain (not per-vCPU). If a domain has only 1 vCPU, and
> + * a 40% cap, that one vCPU will use 40% of one pCPU. If a domain has 4
> + * vCPUs, and a 200% cap, all its 4 vCPUs are allowed to run for (the
> + * equivalent of) 100% time on 2 pCPUs. How much each of the 4
> + * vCPUs will get is unspecified (it will depend on various aspects:
> + * workload, system load, etc.).
> + * system load, etc.).
> + *
> + * For implementing this, we use the following approach:
> + *
> + * - each domain is given a 'budget', and each domain has a timer, which
> + * replenishes the domain's budget periodically. The budget is the amount
> + * of time the vCPUs of the domain can use every 'period';
> + *
> + * - the period is CSCHED2_BDGT_REPL_PERIOD, and is the same for all domains
> + * (but each domain has its own timer; so they are all periodic with the
> + * same period, but replenishment of the budgets of the various domains,
> + * at period boundaries, is not synchronous);
> + *
> + * - when vCPUs run, they consume budget. When they don't run, they don't
> + * consume budget. If there is no budget left for the domain, no vCPU of
> + * that domain can run. If a vCPU tries to run and finds that there is no
> + * budget, it blocks.
> + * Budget never expires, so at whatever time a vCPU wants to run, it can
> + * check the domain's budget, and if there is some, it can use it.
> + *
> + * - budget is replenished to the top of the capacity for the domain once
> + * per period. Even if there was some leftover budget from the previous
> + * period, the budget after a replenishment will always be at most equal
> + * to the total capacity of the domain ('tot_budget');
> + *
> + * - when a budget replenishment occurs, if there are vCPUs that had been
> + * blocked because of lack of budget, they'll be unblocked, and they will
> + * (potentially) be able to run again.
> + *
> + * Finally, some more implementation-related details:
> + *
> + * - budget is stored in a domain-wide pool. vCPUs of the domain that want
> + * to run go to that pool, and grab some. When they do so, the amount
> + * they grabbed is _immediately_ removed from the pool. This happens in
> + * vcpu_try_to_get_budget();
> + *
> + * - when vCPUs stop running, if they've not consumed all the budget they
> + * took, the leftover is put back in the pool. This happens in
> + * vcpu_give_budget_back();
> + *
> + * - the above means that a vCPU can find out that there is no budget and
> + * block, not only if the cap has actually been reached (for this period),
> + * but also if some other vCPUs, in order to run, have grabbed a certain
> + * quota of budget, no matter whether they've already used it all or not.
> + * A vCPU blocking because (any form of) lack of budget is said to be
> + * "parked", and such blocking happens in park_vcpu();
> + *
> + * - when a vCPU stops running, and puts back some budget in the domain pool,
> + * we need to check whether any vCPU that has been parked can now
> + * be unparked. This happens in unpark_parked_vcpus(), called from
> + * csched2_context_saved();
> + *
> + * - of course, unparking happens also as a consequence of the domain's budget
> + * being replenished by the periodic timer. This also occurs by means of
> + * calling csched2_context_saved() (but from repl_sdom_budget());
> + *
> + * - parked vCPUs of a domain are kept in a (per-domain) list, called
> + * 'parked_vcpus'. Manipulation of the list and of the domain-wide budget
> + * pool must occur only when holding the 'budget_lock'.
> + */
> +
> +/*
> * Locking:
> *
> * - runqueue lock
> @@ -112,18 +188,29 @@
> * runqueue each cpu is;
> * + serializes the operation of changing the weights of domains;
> *
> + * - Budget lock
> + * + it is per-domain;
> + * + protects, in domains that have a utilization cap:
> + * * manipulation of the total budget of the domain (as it is shared
> + * among all vCPUs of the domain),
> + * * manipulation of the list of vCPUs that are blocked waiting for
> + * some budget to be available.
> + *
> * - Type:
> * + runqueue locks are 'regular' spinlocks;
> * + the private scheduler lock can be an rwlock. In fact, data
> * it protects is modified only during initialization, cpupool
> * manipulation and when changing weights, and read in all
> - * other cases (e.g., during load balancing).
> + * other cases (e.g., during load balancing);
> + * + budget locks are 'regular' spinlocks.
> *
> * Ordering:
> * + tylock must be used when wanting to take a runqueue lock,
> * if we already hold another one;
> * + if taking both a runqueue lock and the private scheduler
> - * lock is, the latter must always be taken for first.
> + * lock, the latter must always be taken first;
> + * + if taking both a runqueue lock and a budget lock, the former
> + * must always be taken first.
> */
>
> /*
> @@ -164,6 +251,8 @@
> #define CSCHED2_CREDIT_RESET 0
> /* Max timer: Maximum time a guest can be run for. */
> #define CSCHED2_MAX_TIMER CSCHED2_CREDIT_INIT
> +/* Period of the cap replenishment timer. */
> +#define CSCHED2_BDGT_REPL_PERIOD ((opt_cap_period)*MILLISECS(1))
>
> /*
> * Flags
> @@ -293,6 +382,14 @@ static int __read_mostly opt_underload_balance_tolerance = 0;
> integer_param("credit2_balance_under", opt_underload_balance_tolerance);
> static int __read_mostly opt_overload_balance_tolerance = -3;
> integer_param("credit2_balance_over", opt_overload_balance_tolerance);
> +/*
> + * Domains subject to a cap receive a replenishment of their runtime budget
> + * once every opt_cap_period interval. Default is 10 ms. The amount of budget
> + * they receive depends on their cap. For instance, a domain with a 50% cap
> + * will receive 50% of 10 ms, so 5 ms.
> + */
> +static unsigned int __read_mostly opt_cap_period = 10; /* ms */
> +integer_param("credit2_cap_period_ms", opt_cap_period);
>
> /*
> * Runqueue organization.
> @@ -408,6 +505,10 @@ struct csched2_vcpu {
> unsigned int residual;
>
> int credit;
> +
> + s_time_t budget;
> + struct list_head parked_elem; /* On the parked_vcpus list */
> +
> s_time_t start_time; /* When we were scheduled (used for credit) */
> unsigned flags; /* 16 bits doesn't seem to play well with clear_bit() */
> int tickled_cpu; /* cpu tickled for picking us up (-1 if none) */
> @@ -425,7 +526,15 @@ struct csched2_vcpu {
> struct csched2_dom {
> struct list_head sdom_elem;
> struct domain *dom;
> +
> + spinlock_t budget_lock;
> + struct timer repl_timer;
> + s_time_t next_repl;
> + s_time_t budget, tot_budget;
> + struct list_head parked_vcpus;
> +
> uint16_t weight;
> + uint16_t cap;
Hmm, this needs to be rebased on the structure layout patches I checked
in last week. :-)
-George
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel