From: Meng Xu
Subject: Re: [PATCH v4] xen: sched: convert RTDS from time to event driven model
Date: Wed, 3 Feb 2016 23:03:16 -0500
To: Dario Faggioli
Cc: "xen-devel@lists.xenproject.org", Tianyang Chen, George Dunlap, Dagaen Golomb, Meng Xu

On Wed, Feb 3, 2016 at 7:39 AM, Dario Faggioli wrote:
>
> On Tue, 2016-02-02 at 13:09 -0500, Tianyang Chen wrote:
> > On 2/2/2016 10:08 AM, Dario Faggioli wrote:
> > > On Sun, 2016-01-31 at 23:32 -0500, Tianyang Chen wrote:
> > > >
> > > > +    struct rt_private *prv = rt_priv(ops);
> > > > +    struct list_head *replq = rt_replq(ops);
> > > > +    struct list_head *iter;
> > > > +
> > > > +    ASSERT( spin_is_locked(&prv->lock) );
> > > > +
> > > Is this really useful? Considering where this is called, I don't
> > > think it is.
> > >
> > This basically comes from __q_insert().
> >
> I see. Well, it's still not necessary, IMO, so don't add it. At some
> point, we may want to remove it from __runq_insert() too.
>
> The bottom line is: prv->lock is, currently, the runqueue lock (it's
> the lock for everything! :-D). It is pretty obvious that you need the
> runq lock to manipulate the runqueue.
>
> It is not the runqueue that you are manipulating here, it is the
> replenishment queue, but as a matter of fact, prv->lock serializes that
> too. So, if anything, it would be more useful to add a comment
> somewhere, noting (reinforcing, I should probably say) the point that
> this new replenishment queue is also serialized by prv->lock, than an
> assert like this one.
>
> > I have been searching for the definitions of vcpu_schedule_lock_irq()
> > and pcpu_schedule_lock_irq(). Do you know where they are defined? I
> > did a name search in the entire xen directory but still found nothing.
> >
> They are auto-generated. Look in xen/include/xen/sched-if.h.
>
> > > > +    ASSERT( !__vcpu_on_p(svc) );
> > > > +
> > > > +    list_for_each(iter, replq)
> > > > +    {
> > > > +        struct rt_vcpu * iter_svc = __p_elem(iter);
> > > > +        if ( svc->cur_deadline <= iter_svc->cur_deadline )
> > > > +            break;
> > > > +    }
> > > > +
> > > > +    list_add_tail(&svc->p_elem, iter);
> > > > +}
> > > This looks ok. The list_for_each()+list_add_tail() part would be
> > > basically identical to what we have inside __runq_insert(), though,
> > > wouldn't it?
> > >
> > > It may make sense to define an _ordered_queue_insert() (or
> > > _deadline_queue_insert(), since it's always the deadline that is
> > > compared) utility function that we can then call in both cases.
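
Something like the helper below is what I understand is being suggested
(just a sketch with made-up names, to check we are on the same page):
the sorted-insert loop lives in one place, and the caller says which
queue, and which list_head inside the vcpu, to use.

/*
 * Sketch only: insert svc into queue, keeping the queue sorted by
 * increasing cur_deadline.  qelem() maps a list_head back to its
 * rt_vcpu (e.g. __q_elem() for the runq, __p_elem() for the
 * replenishment queue); elem is the corresponding list_head inside
 * svc (&svc->q_elem or &svc->p_elem).
 */
static void
deadline_queue_insert(struct rt_vcpu * (*qelem)(struct list_head *),
                      struct rt_vcpu *svc, struct list_head *elem,
                      struct list_head *queue)
{
    struct list_head *iter;

    list_for_each ( iter, queue )
    {
        struct rt_vcpu * iter_svc = qelem(iter);

        if ( svc->cur_deadline <= iter_svc->cur_deadline )
            break;
    }

    list_add_tail(elem, iter);
}

The runq branch of __runq_insert() could then call
deadline_queue_insert(__q_elem, svc, &svc->q_elem, runq), and
__replq_insert() could call
deadline_queue_insert(__p_elem, svc, &svc->p_elem, replq).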
> > > > > > > @@ -449,6 +507,7 @@ rt_init(struct scheduler *ops) > > > > INIT_LIST_HEAD(&prv->sdom); > > > > INIT_LIST_HEAD(&prv->runq); > > > > INIT_LIST_HEAD(&prv->depletedq); > > > > + INIT_LIST_HEAD(&prv->replq); > > > > > > > Is there a reason why you don't do at least the allocation of the > > > timer > > > here? Because, to me, this looks the best place for that. > > > > > > > cpumask_clear(&prv->tickled); > > > > > > > > @@ -473,6 +532,9 @@ rt_deinit(const struct scheduler *ops) > > > > xfree(_cpumask_scratch); > > > > _cpumask_scratch = NULL; > > > > } > > > > + > > > > + kill_timer(prv->repl_timer); > > > > + > > > > xfree(prv); > > > > > > > And here, you're freeing prv, without having freed prv->timer, > > > which is > > > therefore leaked. > > > > > > > @@ -632,6 +699,17 @@ rt_vcpu_insert(const struct scheduler *ops, > > > > struct vcpu *vc) > > > > vcpu_schedule_unlock_irq(lock, vc); > > > > > > > > SCHED_STAT_CRANK(vcpu_insert); > > > > + > > > > + if( prv->repl_timer == NULL ) > > > > + { > > > > + /* vc->processor has been set in schedule.c */ > > > > + cpu = vc->processor; > > > > + > > > > + repl_timer = xzalloc(struct timer); > > > > + > > > > + prv->repl_timer = repl_timer; > > > > + init_timer(repl_timer, repl_handler, (void *)ops, cpu); > > > > + } > > > > } > > > > > > > This belong to rt_init, IMO. > > > > > > > When a new pool is created, the corresponding scheduler is also > > initialized. But there is no pcpu assigned to that pool. > > > Sure, and in fact, above, when commenting on rt_init() I said "Is there > a reason why you don't do at least the allocation of the timer here?" > > But you're right, here I put it like all should belong to rt_init(), I > should have been more clear. > > > If init_timer() > > happens in rt_init, how does it know which cpu to pick? > > When move_domain() is called, there must be some pcpus in the > > destination pool and then vcpu_insert happens. > > > Allocating memory for the rt_priv copy of this particular scheduler > instance, definitely belong in rt_init(). > > Then, in this case, we could do the allocation in rt_init() and do the > initialization later, perhaps here, or the first time that the timer > needs to be started, or when the first pCPU is assigned to this > scheduler (by being assigned to the cpupool this scheduler services). > > I honestly prefer the latter. That is to say, in rt_alloc_pdata(), you > check whether prv->repl_timer is NULL and you allocate and initialize > it for the pCPU being brought up. I also would put an explicit 'prv- > >repl_timer=NULL', with a comment that proper allocation and > initialization will happen later, and the reason why that is the case, > in rt_init(). > > What do you think? > > > > > @@ -841,7 +881,6 @@ rt_schedule(const struct scheduler *ops, > > > > s_time_t > > > > now, bool_t tasklet_work_sched > > > > /* burn_budget would return for IDLE VCPU */ > > > > burn_budget(ops, scurr, now); > > > > > > > > - __repl_update(ops, now); > > > > > > > __repl_update() going away is of course fine. But we need to enact > > > the > > > budget enforcement logic somewhere in here, don't we? > > > > > > > Sorry about that, I meant to add it in this version. > > > More on this later. But then I'm curios. With this patch applied, it > you set b=400/p=1000 for a domain, was the 40% utilization being > enforced properly? 
> > > > @@ -1058,25 +1103,39 @@ rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> > > >      if ( unlikely(svc->flags & RTDS_scheduled) )
> > > >      {
> > > >          set_bit(__RTDS_delayed_runq_add, &svc->flags);
> > > > -        return;
> > > > +        goto out;
> > > >      }
> > > >
> > > In this case, I'm less sure. I think this should just return as
> > > well, but it may depend on how other stuff is done (like budget
> > > enforcement), so it's hard to think about it right now, but consider
> > > this when working on the next version.
> > >
> > OK. For the second case, I guess there is nothing preventing a vcpu
> > that sleeps for a long time on the runq from being picked before other
> > vcpus.
> >
> By "the second case" you mean this one about __RTDS_delayed_runq_add?
> If yes, I'm not following what you mean by "a vcpu sleeps for a long
> time on the runq etc."
>
> This is what we are talking about here. A vCPU is selected to be
> preempted, during rt_schedule(), but it is still runnable, so we need
> to put it back to the runqueue. You can't just do that (put it back in
> the runqueue), because, in RTDS, like in Credit2, the runqueue is
> shared between multiple pCPUs (see commit be9a0806e197f4fd3fa6).
>
> So what we do is raise the *_delayed_runq_add flag, continue, and
> leave the vcpu alone. This happens, for instance, when there is a race
> between scheduling out a vcpu that is going to sleep, and the vcpu in
> question being woken back up already! It's not frequent, but we need to
> take care of that. At some point (sooner rather than later, most likely)
> the context switch will actually happen, and the vcpu that is waking up
> is put in the runqueue.
>
> > And also, repl_handler removes vcpus that are not runnable, and if
> > they are not added back here, where should we start tracking them?
> > This goes back to whether it's necessary to remove a non-runnable
> > vcpu at all.
> >
> I also don't get what you mean by "tracking" (here and in other
> places).
>
> Anyway, what I meant when I said that what to do in the "delayed add"
> case depends on other things is, for instance, this: the vcpu is waking
> up right now, but it is going to be inserted in the runqueue later.
> When (as per "where in the code", "in what function") do you want to
> program the timer: here in rt_vcpu_wake(), or by changing
> __replq_insert(), as said somewhere else, or in rt_context_saved(),
> where the vcpu is actually added to the runq? This kind of thing...
>
> All this being said, taking a bit of a step back, and considering not
> only the replenishment event and/or timer (re)programming issue, but
> also the deadline that needs updating, I think you actually need to
> find a way to check for both, in this patch.
>
> In fact, if we aim at simplifying rt_context_saved() too --and we
> certainly are-- in this case where a vcpu is actually being woken up
> but is not put back in the runqueue until context switch, when is it
> that we check whether its deadline has passed, and hence refresh its
> scheduling parameters?
>
> So, big recap, I think a possible variant of rt_vcpu_wake() would look
> as follows:
>
>  - if vcpu is running or in a q      ==> /* nothing. */
>
>  - else if vcpu is delayed_runq_add  ==> if (now >= deadline?) {
>                                            update_deadline();
>                                            if ( __vcpu_on_p(svc) )
>                                              _replq_remove();
>                                          }
>                                          _replq_insert();
>
>  - else                              ==> if (now >= deadline?)
>                                          {
>                                            update_deadline();
>                                            if ( __vcpu_on_p(svc) )
>                                              _replq_remove();
>                                          }
>                                          _runq_insert();
>                                          _replq_insert();
>                                          runq_tickle();
>
> This would mean something like this:
>
> static void
> rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> {
>     struct rt_vcpu * const svc = rt_vcpu(vc);
>     s_time_t now = NOW();
>     struct rt_private *prv = rt_priv(ops);
>
>     BUG_ON( is_idle_vcpu(vc) );
>
>     if ( unlikely(curr_on_cpu(vc->processor) == vc) )
>     {
>         SCHED_STAT_CRANK(vcpu_wake_running);
>         return;
>     }
>
>     /* on RunQ/DepletedQ, just update info is ok */
>     if ( unlikely(__vcpu_on_q(svc)) )
>     {
>         SCHED_STAT_CRANK(vcpu_wake_onrunq);
>         return;
>     }
>
>     if ( likely(vcpu_runnable(vc)) )
>         SCHED_STAT_CRANK(vcpu_wake_runnable);
>     else
>         SCHED_STAT_CRANK(vcpu_wake_not_runnable);
>
>     if ( now >= svc->cur_deadline )
>     {
>         rt_update_deadline(now, svc);
>         __replq_remove(svc);
>     }
>
>     __replq_insert(svc);
>
>     /* If context hasn't been saved for this vcpu yet, we can't put it on
>      * the Runqueue/DepletedQ. Instead, we set a flag so that it will be
>      * put on the Runqueue/DepletedQ after the context has been saved.
>      */
>     if ( unlikely(svc->flags & RTDS_scheduled) )
>     {
>         set_bit(__RTDS_delayed_runq_add, &svc->flags);
>         return;
>     }
>
>     /* insert svc to runq/depletedq because svc is not in queue now */
>     __runq_insert(ops, svc);
>     runq_tickle(ops, svc);
>
>     return;
> }
>
> And, at this point, I'm starting to think again that we want to call
> runq_tickle() in rt_context_saved(), at least in case
> __RTDS_delayed_runq_add was set, as, if we don't, we're missing the
> tickling for the vcpu being inserted because of that.
>
> Of course, all the above, taken with a little bit of salt... i.e., I do
> not expect to have got all the details right, especially in the code,
> but it should at least be effective in showing the idea.
>
> Thoughts?

This is smart and we should be able to put it into this patch series.
When a VCPU is stopped on a core, has not saved its context yet, and has
not been added back to the runq/depletedq, we should delay the tickle
until its context has been saved and it has been put back onto the queue
(runq/depletedq). (A rough sketch of what I mean is below.)

> > > > +    /* when the replenishment happens
> > > > +     * svc is either on a pcpu or on
> > > > +     * runq/depletedq
> > > > +     */
> > > > +    if( __vcpu_on_q(svc) )
> > > > +    {
> > > > +        /* put back to runq */
> > > > +        __q_remove(svc);
> > > > +        __runq_insert(ops, svc);
> > > > +        runq_tickle(ops, svc);
> > > > +    }
> > > > +
> > > Aha! So, it looks like the budget enforcement logic ended up here?
> > > Mmm... no, I think that doing it in rt_schedule(), as we agreed,
> > > would make things simpler and better looking, both here and there.
> > >
> > > This is the replenishment timer handler, and it really should do
> > > one thing: handle replenishment events. That's it! ;-P
> > >
> > Oh, I might be misunderstanding what should be put in rt_schedule().
> > Was it specifically for handling a vcpu that shouldn't be taken off a
> > core? Why are you referring to this piece of code as budget
> > enforcement?
> >
> > Isn't it necessary to modify the runq here after budget replenishment
> > has happened? If the runq goes out of order here, will rt_schedule()
> > be sorting it?
> >
> Oops, yes, you are right, this is not budget enforcement. So, budget
> enforcement looks to be still missing from this patch (and that is why
> I asked whether it is actually working).
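
As promised above, here is a rough and untested sketch of what I mean by
tickling from rt_context_saved() in the delayed_runq_add case (names
follow the current sched_rt.c, so please take it with a grain of salt):

static void
rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
{
    struct rt_vcpu *svc = rt_vcpu(vc);
    spinlock_t *lock = vcpu_schedule_lock_irq(vc);

    clear_bit(__RTDS_scheduled, &svc->flags);

    /* idle vcpus are never put on the runq */
    if ( is_idle_vcpu(vc) )
        goto out;

    /*
     * The wakeup was postponed while the context was being saved:
     * insert the vcpu now, and only then tickle, so that the wakeup
     * is not lost.
     */
    if ( test_and_clear_bit(__RTDS_delayed_runq_add, &svc->flags) &&
         likely(vcpu_runnable(vc)) )
    {
        __runq_insert(ops, svc);
        runq_tickle(ops, svc);
    }

out:
    vcpu_schedule_unlock_irq(lock, vc);
}

If that matches what you have in mind, then this, together with the
rt_vcpu_wake() variant you sketched, should cover the delayed-add case.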
>
> And this chunk of code (and perhaps this whole function) is probably
> ok, it's just a bit hard to understand what it does...
>
> So, do you mind if we take a step back again, and analyze the
> situation? When the timer fires, this handler is executed and a
> replenishment event is taken off the replenishment queue; the following
> situations are possible:
>
>  1) the replenishment is for a running vcpu
>  2) the replenishment is for a runnable vcpu in the runq
>  3) the replenishment is for a runnable vcpu in the depletedq
>  4) the replenishment is for a sleeping vcpu
>
> So, 4) is never going to happen, right?, as we're stopping the timer
> when a vcpu blocks. If we go for this, let's put an ASSERT() and a
> comment as a guard for that -- that is, if we do the optimization in
> this very patch. If we leave it for another patch, we need to handle
> this case, I guess.
>
> In all 1), 2) and 3) cases, we need to remove the replenishment event
> from the replenishment queue, so just (in the code) do it upfront.
>
> What other actions need to happen in each case, 1), 2) and 3)? Can you
> summarize them here, so we can decide how to handle each situation?
>
> I'd say that, in both case 2) and case 3), the vcpu needs to be taken
> out of the queue where it is, (re-)inserted in the runqueue, and then
> we need to tickle... Is that correct?
>
> I'd also say that, in case 1), we also need to tickle, because the
> deadline may now have become bigger than that of some other vcpu in
> the runqueue?
>
> Also, under what conditions do we need, as soon as the replenishment
> for vcpu X has happened, to queue another replenishment for the same
> vcpu, at its next deadline? Always, I would say (unless the vcpu is
> sleeping, because of the optimization you introduced)?

I'm thinking of the following example as to why we need to rearm the
timer for the just-replenished VCPU (although it's very likely I missed
something):

A runnable VCPU (period = 10ms, budget = 4ms) is scheduled to run on a
core at t = 0;
at t = 4, its budget is depleted and it is added to the depletedq;
at t = 10, the replenishment timer should fire and replenish the budget.
I think we should then arm the replenishment timer for t = 20, so that
the VCPU's budget will get replenished at t = 20 for the period [20, 30].
If we don't arm the timer for t = 20, how will the VCPU's budget be
replenished at t = 20?

Please correct me if I'm wrong.

Thanks and Best Regards,

Meng

-- 
-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/