From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dario Faggioli Subject: Re: [PATCH 1/4] xen: credit2: implement utilization cap Date: Fri, 23 Jun 2017 18:19:27 +0200 Message-ID: <1498234767.7405.46.camel@citrix.com> References: <149692186557.9605.11625777539060264052.stgit@Solace.fritz.box> <149692372627.9605.8252407697848997058.stgit@Solace.fritz.box> <2db5b8c2-eb6b-3926-806e-9bcf2e46b4a1@citrix.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============1203626488277207734==" Return-path: Received: from mail6.bemta6.messagelabs.com ([193.109.254.103]) by lists.xenproject.org with esmtp (Exim 4.84_2) (envelope-from ) id 1dORJB-0007Ft-Mh for xen-devel@lists.xenproject.org; Fri, 23 Jun 2017 16:19:45 +0000 In-Reply-To: <2db5b8c2-eb6b-3926-806e-9bcf2e46b4a1@citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" To: George Dunlap , xen-devel@lists.xenproject.org Cc: Wei Liu , George Dunlap , Andrew Cooper , Anshul Makkar , Ian Jackson , Jan Beulich List-Id: xen-devel@lists.xenproject.org --===============1203626488277207734== Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="=-a0OFOiK+hDT8ujhJa4hK" --=-a0OFOiK+hDT8ujhJa4hK Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Thu, 2017-06-22 at 17:55 +0100, George Dunlap wrote: > On 08/06/17 13:08, Dario Faggioli wrote: > > This commit implements the Xen part of the cap mechanism for > > Credit2. > >=20 > > A cap is how much, in terms of % of physical CPU time, a domain > > can execute at most. > >=20 > > For instance, a domain that must not use more than 1/4 of one > > physical CPU, must have a cap of 25%; one that must not use more > > than 1+1/2 of physical CPU time, must be given a cap of 150%. > >=20 > > Caps are per domain, so it is all a domain's vCPUs, cumulatively, > > that will be forced to execute no more than the decided amount. > >=20 > > This is implemented by giving each domain a 'budget', and using > > a (per-domain again) periodic timer. Values of budget and 'period' > > are chosen so that budget/period is equal to the cap itself. > >=20 > > Budget is burned by the domain's vCPUs, in a similar way to how > > credits are. > >=20 > > When a domain runs out of budget, its vCPUs can't run any longer. > > They can gain, when the budget is replenishment by the timer, which > > event happens once every period. > >=20 > > Blocking the vCPUs because of lack of budget happens by > > means of a new (_VPF_parked) pause flag, so that, e.g., > > vcpu_runnable() still works. This is similar to what is > > done in sched_rtds.c, as opposed to what happens in > > sched_credit.c, where vcpu_pause() and vcpu_unpause() > > (which means, among other things, more overhead). > >=20 > > Note that xenalyze and tools/xentrace/format are also modified, > > to keep them updated with one modified event. > >=20 > > Signed-off-by: Dario Faggioli >=20 > Looks really good overall, Dario!=C2=A0=C2=A0Just a few relatively minor > comments. >=20 Cool! Thanks for having a look so quickly. :-) > > --- a/xen/common/sched_credit2.c > > +++ b/xen/common/sched_credit2.c > > @@ -92,6 +92,82 @@ > > + *=C2=A0=C2=A0=C2=A0Budget never expires, so at whatever time a vCPU w= ants to > > run, it can > > + *=C2=A0=C2=A0=C2=A0check the domain's budget, and if there is some, i= t can use > > it. >=20 > I'm not sure what this paragraph is trying to say. Saying budget > "never > expires" makes it sound like you continue to accumulate it, such that > if > you don't run at all for several periods, you could "save it up" and > run > at 100% for one full period. >=20 Yes, I see what you mean, and I agree that it's unclear. The point is that there exists algorithm where the budget of a replenishment has an expiry time, before the next replenishment, i.e., it has to be used right away after the replenishment, or (usually), shortly after, or it will be thrown away, and we'll have to wait next replenishment. But again, yes, saying "never expires", here, and in the way I did it, and without any reference and comparison to those other algorithms, makes people think what you just said. I'll rephrase (or cut the paragraph entirely, it's probably not super informative/useful anyway). :-) > > + * - when a budget replenishment occurs, if there are vCPUs that > > had been > > + *=C2=A0=C2=A0=C2=A0blocked because of lack of budget, they'll be unbl= ocked, and > > they will > > + *=C2=A0=C2=A0=C2=A0(potentially) be able to run again. > > + * > > + * Finally, some even more implementation related detail: > > + * > > + * - budget is stored in a domain-wide pool. vCPUs of the domain > > that want > > + *=C2=A0=C2=A0=C2=A0to run go to such pool, and grub some. When they d= o so, the > > amount > > + *=C2=A0=C2=A0=C2=A0they grabbed is _immediately_ removed from the poo= l. This > > happens in > > + *=C2=A0=C2=A0=C2=A0vcpu_try_to_get_budget(); >=20 > This sounds like a good solution to the "greedy vcpu" problem. :-) >=20 I like it too. It works because it's coupled with the fact that runqueue is ordered by credits. E.g., doing the same in Credit1, would probably still not work, because of the round-robin queues (within same priority), and because of the boosting on wakeup. > > @@ -1438,6 +1570,217 @@ void burn_credits(struct > > +static bool vcpu_try_to_get_budget(struct csched2_vcpu *svc) >=20 > This name is OK, but it's a bit long.=C2=A0=C2=A0What about > "vcpu_grab_budget()"? >=20 Ok (and the other renaming suggestions too). > > +{ > > +=C2=A0=C2=A0=C2=A0=C2=A0struct csched2_dom *sdom =3D svc->sdom; > > +=C2=A0=C2=A0=C2=A0=C2=A0unsigned int cpu =3D svc->vcpu->processor; > > + > > +=C2=A0=C2=A0=C2=A0=C2=A0ASSERT(spin_is_locked(per_cpu(schedule_data, > > cpu).schedule_lock)); > > + > > +=C2=A0=C2=A0=C2=A0=C2=A0if ( svc->budget > 0 ) > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0return true; > > + > > +=C2=A0=C2=A0=C2=A0=C2=A0/* budget_lock nests inside runqueue lock. */ > > +=C2=A0=C2=A0=C2=A0=C2=A0spin_lock(&sdom->budget_lock); > > + > > +=C2=A0=C2=A0=C2=A0=C2=A0/* > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* Here, svc->budget is <=3D 0 (as, if it= was > 0, we'd have > > taken the if > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* above!). That basically means the vCPU= has overrun a bit -- > > because of > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* various reasons-- and we want to take = that into account. > > With the +=3D, > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* we are actually subtracting the amount= of budget the vCPU > > has > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* overconsumed, from the total domain bu= dget. > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*/ > > +=C2=A0=C2=A0=C2=A0=C2=A0sdom->budget +=3D svc->budget; > > + > > +=C2=A0=C2=A0=C2=A0=C2=A0if ( sdom->budget > 0 ) > > +=C2=A0=C2=A0=C2=A0=C2=A0{ > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0svc->budget =3D sdom->= budget; > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0sdom->budget =3D 0; >=20 > Just a minor comment here -- I didn't see anything in this > description, > the cover letter, or this patch indicating that you were planning on > changing this in a follow-up patch.=C2=A0=C2=A0A heads-up not to worry ab= out > apparently allowing only one vcpu to run at a time might have been > nice. :-) >=20 Right. I'll explain that in the cover, and in a comment here (which will then go away). > > +{ > > +=C2=A0=C2=A0=C2=A0=C2=A0struct csched2_dom *sdom =3D data; > > +=C2=A0=C2=A0=C2=A0=C2=A0unsigned long flags; > > +=C2=A0=C2=A0=C2=A0=C2=A0s_time_t now; > > +=C2=A0=C2=A0=C2=A0=C2=A0LIST_HEAD(parked); > > + > > +=C2=A0=C2=A0=C2=A0=C2=A0spin_lock_irqsave(&sdom->budget_lock, flags); > > + > > +=C2=A0=C2=A0=C2=A0=C2=A0/* > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* It is possible that the domain overrun= , and that the budget > > hence went > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* below 0 (reasons may be system overboo= king, issues in or > > too coarse > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* runtime accounting, etc.). In particul= ar, if we overrun by > > more than > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* tot_budget, then budget+tot_budget wou= ld still be < 0, > > which in turn > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* means that, despite replenishment, the= re's still no budget > > for unarking > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* and running vCPUs. > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* It is also possible that we are handli= ng the replenishment > > much later > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* than expected (reasons may again be ov= erbooking, or issues > > with timers). > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* If we are more than CSCHED2_BDGT_REPL_= PERIOD late, this > > means we have > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* basically skipped (at least) one reple= nishment. > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* We deal with both the issues here, by,= basically, doing > > more than just > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* one replenishment. Note, however, that= every time we add > > tot_budget > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* to the budget, we also move next_repl = away by > > CSCHED2_BDGT_REPL_PERIOD. > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* This guarantees we always respect the = cap. > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*/ > > +=C2=A0=C2=A0=C2=A0=C2=A0now =3D NOW(); > > +=C2=A0=C2=A0=C2=A0=C2=A0do > > +=C2=A0=C2=A0=C2=A0=C2=A0{ > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0sdom->next_repl +=3D C= SCHED2_BDGT_REPL_PERIOD; > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0sdom->budget +=3D sdom= ->tot_budget; > > +=C2=A0=C2=A0=C2=A0=C2=A0} > > +=C2=A0=C2=A0=C2=A0=C2=A0while ( sdom->next_repl <=3D now || sdom->budg= et <=3D 0 ); >=20 > The first clause ("oops, accidentally missed a replenishment period") > I > agree with;=20 > Ok. > but I'm going back and forth a bit on the second one.=C2=A0=C2=A0It > means essentially that the scheduler made a mistake and allowed the > VM > to run for one full budget *more* than its allocated time (perhaps > accumulated over several periods). >=20 No, the budget does not accumulate. Or at least, it does, but only up to the original tot_budget. So, basically, the reason why budget may still be <0, after a replenishment of tot_budget, is that something went wrong, and we let the vcpu overrun for more than tot_budget. It really should never happen (I may actually add a WARN()), unless the accounting is very coarse, or the budget is really small (i.e., the budget is small compared to the resolution we can achieve for the accounting). > On the one hand, it's the scheduler that made a mistake, so we > shouldn't > "punish" a domain by giving it a full period with no budget. >=20 Yes, I think I understand what you mean. However, I would not necessarily call this "punishing" the domain. We're just making sure that cap is enforced, even during (hopefully sporadic and only transient) tricky circumstances where the scheduler got something wrong, and for those domains that have (perhaps not maliciously, but still) already taken advantage of such mistake. In fact, assume you have a domain that wants to execute W amount of work every T time, but has a cap that results in it having a budget of C< On the other hand, if there's a reliable way to trick the scheduler > into > allowing a guest to over-run its budget, this allows a rogue guest to > essentially work around the cap. >=20 Exactly. There really should not be a way of triggering this behavior but, if someone finds it, this is a safety/robustness measure for always having cap enforced. And remember that this only applies if the vcpu has actually overrun. As in, if the scheduler gets confused and come to do the budget enforcement late, during the T-2T period, but the vcpu has only executed for C it (the vcpu) does not have to pay any price for the scheduler's mistake (that, indeed, would be unfair). > > @@ -2423,14 +2785,22 @@ csched2_runtime(const struct scheduler > > *ops, int cpu, > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* credit values of MIN,MAX per vcpu= , since each vcpu burns > > credit > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* at a different rate. > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*/ > > -=C2=A0=C2=A0=C2=A0=C2=A0if (rt_credit > 0) > > +=C2=A0=C2=A0=C2=A0=C2=A0if ( rt_credit > 0 ) > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0time =3D c2t(rqd,= rt_credit, snext); > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0else > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0time =3D 0; > > =C2=A0 > > -=C2=A0=C2=A0=C2=A0=C2=A0/* 3) But never run longer than MAX_TIMER or l= ess than > > MIN_TIMER or > > -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* the rate_limit time. */ > > -=C2=A0=C2=A0=C2=A0=C2=A0if ( time < min_time) > > +=C2=A0=C2=A0=C2=A0=C2=A0/* > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* 3) But, if capped, never run more than= our budget. > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0*/ > > +=C2=A0=C2=A0=C2=A0=C2=A0if ( unlikely(has_cap(snext)) ) > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0time =3D snext->budget= < time ? snext->budget : time; > > + > > +=C2=A0=C2=A0=C2=A0=C2=A0/* > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0* 4) But never run longer than MAX_TIMER= or less than > > MIN_TIMER or >=20 > Two "buts" in a row doesn't work.=C2=A0=C2=A0I'd change this to "and", wh= ich > sort > of follows on from the last "but". >=20 Ok, will do. Thanks again and Regards, Dario --=20 <> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) --=-a0OFOiK+hDT8ujhJa4hK Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJZTT+QAAoJEBZCeImluHPuTREP/RcIEBIn2S4kePPOM6tGmZA3 9MIw9lxSimyDK9HDNydhkZMuhhvlJTSJ8tlx2TYGdYTQmeKpSnSPlLSVXSQ1CoXL H8KeP8bQjgkWk7H9dQbAUFt/WHRozhk5QXRf+nPDStBmM5gIGs/DNKZwzYq+2txP QyNF8+hdcN/JOiMGD7RaeYVjUZVfia/c3HBUFfZ9XbDlMVMyehKGqiORZ6CYF0yl jvkVPwIkattzJo8C7dbKiroKG3PZgntt4QFIj37Y4ihSiLEK95OTmTjklmybuoIZ xRXe3Z5AHOz6iTYtle4XVxNBKCgzu8Ak+vCefpVl0KKy/7FS/D/wi2wMD5UGLAYA Sk586MRNxHnFszmoeUrzbtsF/h2FKYH07OmDSEZevChC5LdjF4JH7tCFPgBhE3Xm TggCypuYfsBAQ0X7ApshTpjL9DhBWq48+ID4En0IDO8hxDi+38E35Gu1fWsOV+ec X+mbGTOwj7Ru0t1cnXs7mX/YZe/lXe27pQz0gLeeFUW0uy+wfInWGmsPzxJmRKJf j/rksP2H44yIL8mG6eVvUzzM/ju0vWTAjN/nMdwyWBPWrh5TK0sPw64olCFw/iYs DztoyyZV9JA99OxJL4mqMHZBX4eYmAcM9pE89mOLQCL/+w49dUqm0BcqiwhT4eP/ piFwQDca0AQLKW3rW50V =dn+M -----END PGP SIGNATURE----- --=-a0OFOiK+hDT8ujhJa4hK-- --===============1203626488277207734== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Content-Disposition: inline X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KWGVuLWRldmVs IG1haWxpbmcgbGlzdApYZW4tZGV2ZWxAbGlzdHMueGVuLm9yZwpodHRwczovL2xpc3RzLnhlbi5v cmcveGVuLWRldmVsCg== --===============1203626488277207734==--