From: Meng Xu
Subject: Re: [PATCH v4] xen: sched: convert RTDS from time to event driven model
Date: Wed, 3 Feb 2016 23:03:16 -0500
To: Dario Faggioli
Cc: "xen-devel@lists.xenproject.org", Tianyang Chen, George Dunlap, Dagaen Golomb, Meng Xu

On Wed, Feb 3, 2016 at 7:39 AM, Dario Faggioli wrote:
>
> On Tue, 2016-02-02 at 13:09 -0500, Tianyang Chen wrote:
> > On 2/2/2016 10:08 AM, Dario Faggioli wrote:
> > > On Sun, 2016-01-31 at 23:32 -0500, Tianyang Chen wrote:
> > > >
> > > > +    struct rt_private *prv = rt_priv(ops);
> > > > +    struct list_head *replq = rt_replq(ops);
> > > > +    struct list_head *iter;
> > > > +
> > > > +    ASSERT( spin_is_locked(&prv->lock) );
> > > > +
> > > Is this really useful? Considering where this is called, I don't
> > > think it is.
> > >
> > This basically comes from __q_insert().
> >
> I see. Well, it's still not necessary, IMO, so don't add it. At some
> point, we may want to remove it from __runq_insert() too.
>
> The bottom line is: prv->lock is, currently, the runqueue lock (it's
> the lock for everything! :-D). It is pretty obvious that you need the
> runq lock to manipulate the runqueue.
>
> It is not the runqueue that you are manipulating here, it is the
> replenishment queue, but as a matter of fact, prv->lock serializes that
> too. So, if anything, it would be more useful to add a comment
> somewhere, noting (reinforcing, I should probably say) the point that
> this new replenishment queue is also serialized by prv->lock, than an
> assert like this one.
>
> > I have been searching for the definitions of vcpu_schedule_lock_irq()
> > and pcpu_schedule_lock_irq(). Do you know where they are defined? I
> > did a name search in the entire xen directory but still found nothing.
> >
> They are auto-generated. Look in xen/include/xen/sched-if.h.
>
> > > > +    ASSERT( !__vcpu_on_p(svc) );
> > > > +
> > > > +    list_for_each(iter, replq)
> > > > +    {
> > > > +        struct rt_vcpu * iter_svc = __p_elem(iter);
> > > > +        if ( svc->cur_deadline <= iter_svc->cur_deadline )
> > > > +            break;
> > > > +    }
> > > > +
> > > > +    list_add_tail(&svc->p_elem, iter);
> > > > +}
> > > This looks ok. The list_for_each()+list_add_tail() part would be
> > > basically identical to what we have inside __runq_insert(), though,
> > > wouldn't it?
> > >
> > > It may make sense to define an _ordered_queue_insert() (or
> > > _deadline_queue_insert(), since it's always the deadline that is
> > > compared) utility function that we can then call in both cases.
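
Something like the helper below is what I understand is being suggested
(just a sketch with made-up names, to check we are on the same page):
the sorted-insert loop lives in one place, and the caller says which
queue, and which list_head inside the vcpu, to use.

/*
 * Sketch only: insert svc into queue, keeping the queue sorted by
 * increasing cur_deadline.  qelem() maps a list_head back to its
 * rt_vcpu (e.g. __q_elem() for the runq, __p_elem() for the
 * replenishment queue); elem is the corresponding list_head inside
 * svc (&svc->q_elem or &svc->p_elem).
 */
static void
deadline_queue_insert(struct rt_vcpu * (*qelem)(struct list_head *),
                      struct rt_vcpu *svc, struct list_head *elem,
                      struct list_head *queue)
{
    struct list_head *iter;

    list_for_each ( iter, queue )
    {
        struct rt_vcpu * iter_svc = qelem(iter);

        if ( svc->cur_deadline <= iter_svc->cur_deadline )
            break;
    }

    list_add_tail(elem, iter);
}

The runq branch of __runq_insert() could then call
deadline_queue_insert(__q_elem, svc, &svc->q_elem, runq), and
__replq_insert() could call
deadline_queue_insert(__p_elem, svc, &svc->p_elem, replq).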
> > > > > > > @@ -449,6 +507,7 @@ rt_init(struct scheduler *ops) > > > > INIT_LIST_HEAD(&prv->sdom); > > > > INIT_LIST_HEAD(&prv->runq); > > > > INIT_LIST_HEAD(&prv->depletedq); > > > > + INIT_LIST_HEAD(&prv->replq); > > > > > > > Is there a reason why you don't do at least the allocation of the > > > timer > > > here? Because, to me, this looks the best place for that. > > > > > > > cpumask_clear(&prv->tickled); > > > > > > > > @@ -473,6 +532,9 @@ rt_deinit(const struct scheduler *ops) > > > > xfree(_cpumask_scratch); > > > > _cpumask_scratch = NULL; > > > > } > > > > + > > > > + kill_timer(prv->repl_timer); > > > > + > > > > xfree(prv); > > > > > > > And here, you're freeing prv, without having freed prv->timer, > > > which is > > > therefore leaked. > > > > > > > @@ -632,6 +699,17 @@ rt_vcpu_insert(const struct scheduler *ops, > > > > struct vcpu *vc) > > > > vcpu_schedule_unlock_irq(lock, vc); > > > > > > > > SCHED_STAT_CRANK(vcpu_insert); > > > > + > > > > + if( prv->repl_timer == NULL ) > > > > + { > > > > + /* vc->processor has been set in schedule.c */ > > > > + cpu = vc->processor; > > > > + > > > > + repl_timer = xzalloc(struct timer); > > > > + > > > > + prv->repl_timer = repl_timer; > > > > + init_timer(repl_timer, repl_handler, (void *)ops, cpu); > > > > + } > > > > } > > > > > > > This belong to rt_init, IMO. > > > > > > > When a new pool is created, the corresponding scheduler is also > > initialized. But there is no pcpu assigned to that pool. > > > Sure, and in fact, above, when commenting on rt_init() I said "Is there > a reason why you don't do at least the allocation of the timer here?" > > But you're right, here I put it like all should belong to rt_init(), I > should have been more clear. > > > If init_timer() > > happens in rt_init, how does it know which cpu to pick? > > When move_domain() is called, there must be some pcpus in the > > destination pool and then vcpu_insert happens. > > > Allocating memory for the rt_priv copy of this particular scheduler > instance, definitely belong in rt_init(). > > Then, in this case, we could do the allocation in rt_init() and do the > initialization later, perhaps here, or the first time that the timer > needs to be started, or when the first pCPU is assigned to this > scheduler (by being assigned to the cpupool this scheduler services). > > I honestly prefer the latter. That is to say, in rt_alloc_pdata(), you > check whether prv->repl_timer is NULL and you allocate and initialize > it for the pCPU being brought up. I also would put an explicit 'prv- > >repl_timer=NULL', with a comment that proper allocation and > initialization will happen later, and the reason why that is the case, > in rt_init(). > > What do you think? > > > > > @@ -841,7 +881,6 @@ rt_schedule(const struct scheduler *ops, > > > > s_time_t > > > > now, bool_t tasklet_work_sched > > > > /* burn_budget would return for IDLE VCPU */ > > > > burn_budget(ops, scurr, now); > > > > > > > > - __repl_update(ops, now); > > > > > > > __repl_update() going away is of course fine. But we need to enact > > > the > > > budget enforcement logic somewhere in here, don't we? > > > > > > > Sorry about that, I meant to add it in this version. > > > More on this later. But then I'm curios. With this patch applied, it > you set b=400/p=1000 for a domain, was the 40% utilization being > enforced properly? 
> > > > @@ -1058,25 +1103,39 @@ rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> > > >      if ( unlikely(svc->flags & RTDS_scheduled) )
> > > >      {
> > > >          set_bit(__RTDS_delayed_runq_add, &svc->flags);
> > > > -        return;
> > > > +        goto out;
> > > >      }
> > > >
> > > In this case, I'm less sure. I think this should just return as
> > > well, but it may depend on how other stuff is done (like budget
> > > enforcement), so it's hard to think about it right now, but consider
> > > this when working on the next version.
> > >
> > OK. For the second case, I guess there is nothing preventing a vcpu
> > that sleeps for a long time on the runq from being picked before other
> > vcpus.
> >
> By "the second case" you mean this one about __RTDS_delayed_runq_add?
> If yes, I'm not following what you mean by "a vcpu sleeps for a long
> time on the runq etc."
>
> This is what we are talking about here. A vCPU is selected to be
> preempted, during rt_schedule(), but it is still runnable, so we need
> to put it back to the runqueue. You can't just do that (put it back in
> the runqueue), because, in RTDS, like in Credit2, the runqueue is
> shared between multiple pCPUs (see commit be9a0806e197f4fd3fa6).
>
> So what we do is raise the *_delayed_runq_add flag, continue, and
> leave the vcpu alone. This happens, for instance, when there is a race
> between scheduling out a vcpu that is going to sleep, and the vcpu in
> question being woken back up already! It's not frequent, but we need to
> take care of that. At some point (sooner rather than later, most likely)
> the context switch will actually happen, and the vcpu that is waking up
> is put in the runqueue.
>
> > And also, repl_handler removes vcpus that are not runnable, and if
> > they are not added back here, where should we start tracking them?
> > This goes back to whether it's necessary to remove a non-runnable
> > vcpu at all.
> >
> I also don't get what you mean by "tracking" (here and in other
> places).
>
> Anyway, what I meant when I said that what to do in the "delayed add"
> case depends on other things is, for instance, this: the vcpu is waking
> up right now, but it is going to be inserted in the runqueue later.
> When (as per "where in the code", "in what function") do you want to
> program the timer: here in rt_vcpu_wake(), or by changing
> __replq_insert(), as said somewhere else, or in rt_context_saved(),
> where the vcpu is actually added to the runq? This kind of thing...
>
> All this being said, taking a bit of a step back, and considering not
> only the replenishment event and/or timer (re)programming issue, but
> also the deadline that needs updating, I think you actually need to
> find a way to check for both, in this patch.
>
> In fact, if we aim at simplifying rt_context_saved() too --and we
> certainly are-- in this case where a vcpu is actually being woken up
> but is not put back in the runqueue until context switch, when is it
> that we check whether its deadline has passed, and hence refresh its
> scheduling parameters?
>
> So, big recap, I think a possible variant of rt_vcpu_wake() would look
> as follows:
>
>  - if vcpu is running or in a q      ==> /* nothing. */
>
>  - else if vcpu is delayed_runq_add  ==> if (now >= deadline?) {
>                                            update_deadline();
>                                            if ( __vcpu_on_p(svc) )
>                                              _replq_remove();
>                                          }
>                                          _replq_insert();
>
>  - else                              ==> if (now >= deadline?)
>                                          {
>                                            update_deadline();
>                                            if ( __vcpu_on_p(svc) )
>                                              _replq_remove();
>                                          }
>                                          _runq_insert();
>                                          _replq_insert();
>                                          runq_tickle();
>
> This would mean something like this:
>
> static void
> rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> {
>     struct rt_vcpu * const svc = rt_vcpu(vc);
>     s_time_t now = NOW();
>     struct rt_private *prv = rt_priv(ops);
>
>     BUG_ON( is_idle_vcpu(vc) );
>
>     if ( unlikely(curr_on_cpu(vc->processor) == vc) )
>     {
>         SCHED_STAT_CRANK(vcpu_wake_running);
>         return;
>     }
>
>     /* on RunQ/DepletedQ, just update info is ok */
>     if ( unlikely(__vcpu_on_q(svc)) )
>     {
>         SCHED_STAT_CRANK(vcpu_wake_onrunq);
>         return;
>     }
>
>     if ( likely(vcpu_runnable(vc)) )
>         SCHED_STAT_CRANK(vcpu_wake_runnable);
>     else
>         SCHED_STAT_CRANK(vcpu_wake_not_runnable);
>
>     if ( now >= svc->cur_deadline )
>     {
>         rt_update_deadline(now, svc);
>         __replq_remove(svc);
>     }
>
>     __replq_insert(svc);
>
>     /* If context hasn't been saved for this vcpu yet, we can't put it on
>      * the Runqueue/DepletedQ. Instead, we set a flag so that it will be
>      * put on the Runqueue/DepletedQ after the context has been saved.
>      */
>     if ( unlikely(svc->flags & RTDS_scheduled) )
>     {
>         set_bit(__RTDS_delayed_runq_add, &svc->flags);
>         return;
>     }
>
>     /* insert svc to runq/depletedq because svc is not in queue now */
>     __runq_insert(ops, svc);
>     runq_tickle(ops, svc);
>
>     return;
> }
>
> And, at this point, I'm starting to think again that we want to call
> runq_tickle() in rt_context_saved(), at least in case
> __RTDS_delayed_runq_add was set, as, if we don't, we're missing the
> tickling for the vcpu being inserted because of that.
>
> Of course, all the above, taken with a little bit of salt... i.e., I do
> not expect to have got all the details right, especially in the code,
> but it should at least be effective in showing the idea.
>
> Thoughts?

This is smart and we should be able to put it into this patch series.
When a VCPU is stopped on a core, has not saved its context yet, and has
not been added back to the runq/depletedq, we should delay the tickle
until its context has been saved and it has been put back onto the queue
(runq/depletedq). (A rough sketch of what I mean is below.)

> > > > +    /* when the replenishment happens
> > > > +     * svc is either on a pcpu or on
> > > > +     * runq/depletedq
> > > > +     */
> > > > +    if( __vcpu_on_q(svc) )
> > > > +    {
> > > > +        /* put back to runq */
> > > > +        __q_remove(svc);
> > > > +        __runq_insert(ops, svc);
> > > > +        runq_tickle(ops, svc);
> > > > +    }
> > > > +
> > > Aha! So, it looks like the budget enforcement logic ended up here?
> > > Mmm... no, I think that doing it in rt_schedule(), as we agreed,
> > > would make things simpler and better looking, both here and there.
> > >
> > > This is the replenishment timer handler, and it really should do
> > > one thing: handle replenishment events. That's it! ;-P
> > >
> > Oh, I might be misunderstanding what should be put in rt_schedule().
> > Was it specifically for handling a vcpu that shouldn't be taken off a
> > core? Why are you referring to this piece of code as budget
> > enforcement?
> >
> > Isn't it necessary to modify the runq here after budget replenishment
> > has happened? If the runq goes out of order here, will rt_schedule()
> > be sorting it?
> >
> Oops, yes, you are right, this is not budget enforcement. So, budget
> enforcement looks to be still missing from this patch (and that is why
> I asked whether it is actually working).
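
As promised above, here is a rough and untested sketch of what I mean by
tickling from rt_context_saved() in the delayed_runq_add case (names
follow the current sched_rt.c, so please take it with a grain of salt):

static void
rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
{
    struct rt_vcpu *svc = rt_vcpu(vc);
    spinlock_t *lock = vcpu_schedule_lock_irq(vc);

    clear_bit(__RTDS_scheduled, &svc->flags);

    /* idle vcpus are never put on the runq */
    if ( is_idle_vcpu(vc) )
        goto out;

    /*
     * The wakeup was postponed while the context was being saved:
     * insert the vcpu now, and only then tickle, so that the wakeup
     * is not lost.
     */
    if ( test_and_clear_bit(__RTDS_delayed_runq_add, &svc->flags) &&
         likely(vcpu_runnable(vc)) )
    {
        __runq_insert(ops, svc);
        runq_tickle(ops, svc);
    }

out:
    vcpu_schedule_unlock_irq(lock, vc);
}

If that matches what you have in mind, then this, together with the
rt_vcpu_wake() variant you sketched, should cover the delayed-add case.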
>
> And this chunk of code (and perhaps this whole function) is probably
> ok, it's just a bit hard to understand what it does...
>
> So, do you mind if we take a step back again, and analyze the
> situation? When the timer fires, this handler is executed and a
> replenishment event is taken off the replenishment queue; the following
> situations are possible:
>
>  1) the replenishment is for a running vcpu
>  2) the replenishment is for a runnable vcpu in the runq
>  3) the replenishment is for a runnable vcpu in the depletedq
>  4) the replenishment is for a sleeping vcpu
>
> So, 4) is never going to happen, right?, as we're stopping the timer
> when a vcpu blocks. If we go for this, let's put an ASSERT() and a
> comment as a guard for that -- that is, if we do the optimization in
> this very patch. If we leave it for another patch, we need to handle
> this case, I guess.
>
> In all 1), 2) and 3) cases, we need to remove the replenishment event
> from the replenishment queue, so just (in the code) do it upfront.
>
> What other actions need to happen in each case, 1), 2) and 3)? Can you
> summarize them here, so we can decide how to handle each situation?
>
> I'd say that, in both case 2) and case 3), the vcpu needs to be taken
> out of the queue where it is, (re-)inserted in the runqueue, and then
> we need to tickle... Is that correct?
>
> I'd also say that, in case 1), we also need to tickle, because the
> deadline may now have become bigger than that of some other vcpu in
> the runqueue?
>
> Also, under what conditions do we need, as soon as the replenishment
> for vcpu X has happened, to queue another replenishment for the same
> vcpu, at its next deadline? Always, I would say (unless the vcpu is
> sleeping, because of the optimization you introduced)?

I'm thinking of the following example as to why we need to rearm the
timer for the just-replenished VCPU (although it's very likely I missed
something):

A runnable VCPU (period = 10ms, budget = 4ms) is scheduled to run on a
core at t = 0;
at t = 4, its budget is depleted and it is added to the depletedq;
at t = 10, the replenishment timer should fire and replenish the budget.
I think we should then arm the replenishment timer for t = 20, so that
the VCPU's budget will get replenished at t = 20 for the period [20, 30].
If we don't arm the timer for t = 20, how will the VCPU's budget be
replenished at t = 20?

Please correct me if I'm wrong.

Thanks and Best Regards,

Meng

-- 
-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/