From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dario Faggioli <dario.faggioli@citrix.com>
Subject: Re: [PATCH 1/4] xen: credit2: implement utilization cap
Date: Fri, 23 Jun 2017 18:19:27 +0200
Message-ID: <1498234767.7405.46.camel@citrix.com>
References: <149692186557.9605.11625777539060264052.stgit@Solace.fritz.box>
 <149692372627.9605.8252407697848997058.stgit@Solace.fritz.box>
 <2db5b8c2-eb6b-3926-806e-9bcf2e46b4a1@citrix.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============1203626488277207734=="
Return-path: <xen-devel-bounces@lists.xen.org>
Received: from mail6.bemta6.messagelabs.com ([193.109.254.103])
 by lists.xenproject.org with esmtp (Exim 4.84_2)
 (envelope-from <prvs=3407d24b6=dario.faggioli@citrix.com>)
 id 1dORJB-0007Ft-Mh
 for xen-devel@lists.xenproject.org; Fri, 23 Jun 2017 16:19:45 +0000
In-Reply-To: <2db5b8c2-eb6b-3926-806e-9bcf2e46b4a1@citrix.com>
List-Unsubscribe: <https://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
 <mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <https://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
 <mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Errors-To: xen-devel-bounces@lists.xen.org
Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>
To: George Dunlap <george.dunlap@citrix.com>, xen-devel@lists.xenproject.org
Cc: Wei Liu <wei.liu2@citrix.com>, George Dunlap <george.dunlap@eu.citrix.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Anshul Makkar <anshul.makkar@citrix.com>, Ian Jackson <ian.jackson@eu.citrix.com>, Jan Beulich <jbeulich@suse.com>
List-Id: xen-devel@lists.xenproject.org

--===============1203626488277207734==
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature"; boundary="=-a0OFOiK+hDT8ujhJa4hK"

--=-a0OFOiK+hDT8ujhJa4hK
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Thu, 2017-06-22 at 17:55 +0100, George Dunlap wrote:
> On 08/06/17 13:08, Dario Faggioli wrote:
> > This commit implements the Xen part of the cap mechanism for
> > Credit2.
> >=20
> > A cap is how much, in terms of % of physical CPU time, a domain
> > can execute at most.
> >=20
> > For instance, a domain that must not use more than 1/4 of one
> > physical CPU, must have a cap of 25%; one that must not use more
> > than 1+1/2 of physical CPU time, must be given a cap of 150%.
> >=20
> > Caps are per domain, so it is all a domain's vCPUs, cumulatively,
> > that will be forced to execute no more than the decided amount.
> >=20
> > This is implemented by giving each domain a 'budget', and using
> > a (per-domain again) periodic timer. Values of budget and 'period'
> > are chosen so that budget/period is equal to the cap itself.
> >=20
> > Budget is burned by the domain's vCPUs, in a similar way to how
> > credits are.
> >=20
> > When a domain runs out of budget, its vCPUs can't run any longer.
> > They can gain, when the budget is replenishment by the timer, which
> > event happens once every period.
> >=20
> > Blocking the vCPUs because of lack of budget happens by
> > means of a new (_VPF_parked) pause flag, so that, e.g.,
> > vcpu_runnable() still works. This is similar to what is
> > done in sched_rtds.c, as opposed to what happens in
> > sched_credit.c, where vcpu_pause() and vcpu_unpause()
> > (which means, among other things, more overhead).
> >=20
> > Note that xenalyze and tools/xentrace/format are also modified,
> > to keep them updated with one modified event.
> >=20
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>=20
> Looks really good overall, Dario!=C2=A0=C2=A0Just a few relatively minor
> comments.
>=20
Cool! Thanks for having a look so quickly. :-)

> > --- a/xen/common/sched_credit2.c
> > +++ b/xen/common/sched_credit2.c
> > @@ -92,6 +92,82 @@
> > + *=C2=A0=C2=A0=C2=A0Budget never expires, so at whatever time a vCPU w=
ants to
> > run, it can
> > + *=C2=A0=C2=A0=C2=A0check the domain's budget, and if there is some, i=
t can use
> > it.
>=20
> I'm not sure what this paragraph is trying to say. Saying budget
> "never
> expires" makes it sound like you continue to accumulate it, such that
> if
> you don't run at all for several periods, you could "save it up" and
> run
> at 100% for one full period.
>=20
Yes, I see what you mean, and I agree that it's unclear. The point is
that there exists algorithm where the budget of a replenishment has an
expiry time, before the next replenishment, i.e., it has to be used
right away after the replenishment, or (usually), shortly after, or it
will be thrown away, and we'll have to wait next replenishment.

But again, yes, saying "never expires", here, and in the way I did it,
and without any reference and comparison to those other algorithms,
makes people think what you just said.

I'll rephrase (or cut the paragraph entirely, it's probably not super
informative/useful anyway). :-)

> > + * - when a budget replenishment occurs, if there are vCPUs that
> > had been
> > + *=C2=A0=C2=A0=C2=A0blocked because of lack of budget, they'll be unbl=
ocked, and
> > they will
> > + *=C2=A0=C2=A0=C2=A0(potentially) be able to run again.
> > + *
> > + * Finally, some even more implementation related detail:
> > + *
> > + * - budget is stored in a domain-wide pool. vCPUs of the domain
> > that want
> > + *=C2=A0=C2=A0=C2=A0to run go to such pool, and grub some. When they d=
o so, the
> > amount
> > + *=C2=A0=C2=A0=C2=A0they grabbed is _immediately_ removed from the poo=
l. This
> > happens in
> > + *=C2=A0=C2=A0=C2=A0vcpu_try_to_get_budget();
>=20
> This sounds like a good solution to the "greedy vcpu" problem. :-)
>=20
I like it too. It works because it's coupled with the fact that
runqueue is ordered by credits. E.g., doing the same in Credit1, would
probably still not work, because of the round-robin queues (within same
priority), and because of the boosting on wakeup.

> > @@ -1438,6 +1570,217 @@ void burn_credits(struct
> > +static bool vcpu_try_to_get_budget(struct csched2_vcpu *svc)
>=20
> This name is OK, but it's a bit long.=C2=A0=C2=A0What about
> "vcpu_grab_budget()"?
>=20
Ok (and the other renaming suggestions too).

> > +{
> > +=C2=A0=C2=A0=C2=A0=C2=A0struct csched2_dom *sdom =3D svc->sdom;
> > +=C2=A0=C2=A0=C2=A0=C2=A0unsigned int cpu =3D svc->vcpu->processor;
> > +
> > +=C2=A0=C2=A0=C2=A0=C2=A0ASSERT(spin_is_locked(per_cpu(schedule_data,
> > cpu).schedule_lock));
> > +
> > +=C2=A0=C2=A0=C2=A0=C2=A0if ( svc->budget > 0 )
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0return true;
> > +
> > +=C2=A0=C2=A0=C2=A0=C2=A0/* budget_lock nests inside runqueue lock. */
> > +=C2=A0=C2=A0=C2=A0=C2=A0spin_lock(&sdom->budget_lock);
> > +
> > +=C2=A0=C2=A0=C2=A0=C2=A0/*
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* Here, svc->budget is <=3D 0 (as, if it=
 was > 0, we'd have
> > taken the if
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* above!). That basically means the vCPU=
 has overrun a bit --
> > because of
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* various reasons-- and we want to take =
that into account.
> > With the +=3D,
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* we are actually subtracting the amount=
 of budget the vCPU
> > has
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* overconsumed, from the total domain bu=
dget.
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*/
> > +=C2=A0=C2=A0=C2=A0=C2=A0sdom->budget +=3D svc->budget;
> > +
> > +=C2=A0=C2=A0=C2=A0=C2=A0if ( sdom->budget > 0 )
> > +=C2=A0=C2=A0=C2=A0=C2=A0{
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0svc->budget =3D sdom->=
budget;
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0sdom->budget =3D 0;
>=20
> Just a minor comment here -- I didn't see anything in this
> description,
> the cover letter, or this patch indicating that you were planning on
> changing this in a follow-up patch.=C2=A0=C2=A0A heads-up not to worry ab=
out
> apparently allowing only one vcpu to run at a time might have been
> nice. :-)
>=20
Right. I'll explain that in the cover, and in a comment here (which
will then go away).

> > +{
> > +=C2=A0=C2=A0=C2=A0=C2=A0struct csched2_dom *sdom =3D data;
> > +=C2=A0=C2=A0=C2=A0=C2=A0unsigned long flags;
> > +=C2=A0=C2=A0=C2=A0=C2=A0s_time_t now;
> > +=C2=A0=C2=A0=C2=A0=C2=A0LIST_HEAD(parked);
> > +
> > +=C2=A0=C2=A0=C2=A0=C2=A0spin_lock_irqsave(&sdom->budget_lock, flags);
> > +
> > +=C2=A0=C2=A0=C2=A0=C2=A0/*
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* It is possible that the domain overrun=
, and that the budget
> > hence went
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* below 0 (reasons may be system overboo=
king, issues in or
> > too coarse
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* runtime accounting, etc.). In particul=
ar, if we overrun by
> > more than
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* tot_budget, then budget+tot_budget wou=
ld still be < 0,
> > which in turn
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* means that, despite replenishment, the=
re's still no budget
> > for unarking
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* and running vCPUs.
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* It is also possible that we are handli=
ng the replenishment
> > much later
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* than expected (reasons may again be ov=
erbooking, or issues
> > with timers).
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* If we are more than CSCHED2_BDGT_REPL_=
PERIOD late, this
> > means we have
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* basically skipped (at least) one reple=
nishment.
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* We deal with both the issues here, by,=
 basically, doing
> > more than just
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* one replenishment. Note, however, that=
 every time we add
> > tot_budget
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* to the budget, we also move next_repl =
away by
> > CSCHED2_BDGT_REPL_PERIOD.
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* This guarantees we always respect the =
cap.
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*/
> > +=C2=A0=C2=A0=C2=A0=C2=A0now =3D NOW();
> > +=C2=A0=C2=A0=C2=A0=C2=A0do
> > +=C2=A0=C2=A0=C2=A0=C2=A0{
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0sdom->next_repl +=3D C=
SCHED2_BDGT_REPL_PERIOD;
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0sdom->budget +=3D sdom=
->tot_budget;
> > +=C2=A0=C2=A0=C2=A0=C2=A0}
> > +=C2=A0=C2=A0=C2=A0=C2=A0while ( sdom->next_repl <=3D now || sdom->budg=
et <=3D 0 );
>=20
> The first clause ("oops, accidentally missed a replenishment period")
> I
> agree with;=20
>
Ok.

> but I'm going back and forth a bit on the second one.=C2=A0=C2=A0It
> means essentially that the scheduler made a mistake and allowed the
> VM
> to run for one full budget *more* than its allocated time (perhaps
> accumulated over several periods).
>=20
No, the budget does not accumulate. Or at least, it does, but only up
to the original tot_budget.

So, basically, the reason why budget may still be <0, after a
replenishment of tot_budget, is that something went wrong, and we let
the vcpu overrun for more than tot_budget.

It really should never happen (I may actually add a WARN()), unless the
accounting is very coarse, or the budget is really small (i.e., the
budget is small compared to the resolution we can achieve for the
accounting).

> On the one hand, it's the scheduler that made a mistake, so we
> shouldn't
> "punish" a domain by giving it a full period with no budget.
>=20
Yes, I think I understand what you mean. However, I would not
necessarily call this "punishing" the domain. We're just making sure
that cap is enforced, even during (hopefully sporadic and only
transient) tricky circumstances where the scheduler got something
wrong, and for those domains that have (perhaps not maliciously, but
still) already taken advantage of such mistake.

In fact, assume you have a domain that wants to execute W amount of
work every T time, but has a cap that results in it having a budget of
C<<W every T. Under normal circumstances, it executes for C between 0
and T, for C between T and 2T, for C between 2T and 3T, etc., until it
reaches W. So, after 3T, it will have executed for 3C.

In presence of an accounting/enforcing error, it may happen that it
executes for C between 0 and T, for 2C between T and 2T, for 0 between
2T and 3T, etc. So, after 3T, it will also have executed for 3C, as
above.

What I'm trying to say is that, yes, we're making it skip a period, but
that's only because we gave it more before, and it actually used this
additional capacity, we granted by mistake. So, it's not really a
punishment, it's... well.. recognising that we gave it too big of a
reward. And if it had spent that already, we don't go blame it, we just
skip the next one, to make things even.

> On the other hand, if there's a reliable way to trick the scheduler
> into
> allowing a guest to over-run its budget, this allows a rogue guest to
> essentially work around the cap.
>=20
Exactly. There really should not be a way of triggering this behavior
but, if someone finds it, this is a safety/robustness measure for
always having cap enforced.

And remember that this only applies if the vcpu has actually overrun.
As in, if the scheduler gets confused and come to do the budget
enforcement late, during the T-2T period, but the vcpu has only
executed for C it (the vcpu) does not have to pay any price for the
scheduler's mistake (that, indeed, would be unfair).

> > @@ -2423,14 +2785,22 @@ csched2_runtime(const struct scheduler
> > *ops, int cpu,
> > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* credit values of MIN,MAX per vcpu=
, since each vcpu burns
> > credit
> > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* at a different rate.
> > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*/
> > -=C2=A0=C2=A0=C2=A0=C2=A0if (rt_credit > 0)
> > +=C2=A0=C2=A0=C2=A0=C2=A0if ( rt_credit > 0 )
> > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0time =3D c2t(rqd,=
 rt_credit, snext);
> > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0else
> > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0time =3D 0;
> > =C2=A0
> > -=C2=A0=C2=A0=C2=A0=C2=A0/* 3) But never run longer than MAX_TIMER or l=
ess than
> > MIN_TIMER or
> > -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* the rate_limit time. */
> > -=C2=A0=C2=A0=C2=A0=C2=A0if ( time < min_time)
> > +=C2=A0=C2=A0=C2=A0=C2=A0/*
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* 3) But, if capped, never run more than=
 our budget.
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*/
> > +=C2=A0=C2=A0=C2=A0=C2=A0if ( unlikely(has_cap(snext)) )
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0time =3D snext->budget=
 < time ? snext->budget : time;
> > +
> > +=C2=A0=C2=A0=C2=A0=C2=A0/*
> > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* 4) But never run longer than MAX_TIMER=
 or less than
> > MIN_TIMER or
>=20
> Two "buts" in a row doesn't work.=C2=A0=C2=A0I'd change this to "and", wh=
ich
> sort
> of follows on from the last "but".
>=20
Ok, will do.

Thanks again and Regards,
Dario
--=20
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
--=-a0OFOiK+hDT8ujhJa4hK
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAABCAAGBQJZTT+QAAoJEBZCeImluHPuTREP/RcIEBIn2S4kePPOM6tGmZA3
9MIw9lxSimyDK9HDNydhkZMuhhvlJTSJ8tlx2TYGdYTQmeKpSnSPlLSVXSQ1CoXL
H8KeP8bQjgkWk7H9dQbAUFt/WHRozhk5QXRf+nPDStBmM5gIGs/DNKZwzYq+2txP
QyNF8+hdcN/JOiMGD7RaeYVjUZVfia/c3HBUFfZ9XbDlMVMyehKGqiORZ6CYF0yl
jvkVPwIkattzJo8C7dbKiroKG3PZgntt4QFIj37Y4ihSiLEK95OTmTjklmybuoIZ
xRXe3Z5AHOz6iTYtle4XVxNBKCgzu8Ak+vCefpVl0KKy/7FS/D/wi2wMD5UGLAYA
Sk586MRNxHnFszmoeUrzbtsF/h2FKYH07OmDSEZevChC5LdjF4JH7tCFPgBhE3Xm
TggCypuYfsBAQ0X7ApshTpjL9DhBWq48+ID4En0IDO8hxDi+38E35Gu1fWsOV+ec
X+mbGTOwj7Ru0t1cnXs7mX/YZe/lXe27pQz0gLeeFUW0uy+wfInWGmsPzxJmRKJf
j/rksP2H44yIL8mG6eVvUzzM/ju0vWTAjN/nMdwyWBPWrh5TK0sPw64olCFw/iYs
DztoyyZV9JA99OxJL4mqMHZBX4eYmAcM9pE89mOLQCL/+w49dUqm0BcqiwhT4eP/
piFwQDca0AQLKW3rW50V
=dn+M
-----END PGP SIGNATURE-----

--=-a0OFOiK+hDT8ujhJa4hK--


--===============1203626488277207734==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: inline

X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KWGVuLWRldmVs
IG1haWxpbmcgbGlzdApYZW4tZGV2ZWxAbGlzdHMueGVuLm9yZwpodHRwczovL2xpc3RzLnhlbi5v
cmcveGVuLWRldmVsCg==

--===============1203626488277207734==--