No HTML, please.
On Wed, 2015-06-10 at 00:18 -0400, Dagaen Golomb wrote:
> And note that, when I say "timer", I mean an actual Xen timer, i.e.,
> those things that are started, stopped, and with a timer handling
> routine being called when they expire. For an example, you can have a
> look, in sched_credit.c, at 'ticker' or 'master_ticker'.
>
>
> I will look at this. However, the current solution is likely
> functionally equivalent and, with only one timer and a single list
> used to arm it, I'm not sure if using a Xen timer is necessary.
>
It is functionally equivalent, by chance (see the issue about calling
runq_tickle() on current vcpus in my reply to Meng). The reason why a
different approach is required is making the code look better, reducing
(instead of increasing) the complexity of sched_rt.c, lowering the
overhead caused by long running operations performed while holding the
global scheduler spin lock, and improving scalability.
> Do they incur any extra overhead or coarsen granularity?
>
"extra overhead or coarsen granularity" as compared to what? s_timer in
schedule.c (which is what you're using) is one of them already!
What I meant with "an actual timer" is that you should add a new one,
and move some of the stuff currently being done in rt_schedule() into
its handling routine, rather than just adding a new queue of events to
be serviced by abusing s_timer.
> Deciding between a) or b) isn't easy, without actually trying to
> implement them. I personally prefer b), and I think it would be a
> superior long term solution (see right below), but given the current
> architecture of the RTDS scheduler (e.g., the fact that it has a
> global runq, etc), I won't nack a), which would most likely be
> simpler.
>
>
> I agree b) may be better in the long run, but with the current
> architecture a) is simpler. b) can be future work as the scheduler
> evolves.
>
Sure. And in fact, I'm fine with a), if done properly.
>
> With this patch, you're still using rt_schedule() to do both
> scheduling and replenishments, although you're now doing it at actual
> replenishment times, instead of checking for them every millisecond.
>
> I think once you understand that the timerq is not only
> replenishments, but any event that could cause a branch in the
> scheduler's behavior, it becomes more palatable to have them
> intertwined.
>
I got that, and I'm asking you to do it differently.
> Really, the timerq and scheduling aren't as intertwined as they
> appear, they are practically isolated functionally. It's just that
> the timerq is most naturally serviced during scheduling events, when
> data that affects scheduling decisions is changing. It's
> straightforward and efficient.
>
Yeah, replenishments are 'naturally serviced' in a 'straightforward and
efficient' way by looping on all runnable vcpus, in more than one place,
from within super-hot paths, with the global scheduler spin lock held.
Neat! :-/
Oh, and that's before your introduction of another list to be taken
care of, still under the same conditions. :-O
> Also, the bits and pieces that you need to add in order to deal with
> this new queue are, really, making things even more complex than
> they are now, which is undesirable.
>
> While it does add some complexity, I don't feel a single queue and
> managing it is that much extra complexity.
>
But in fact the point of making the scheduler 'event driven' is to take
advantage of the chances that this offers for getting stuff *out* of
rt_schedule(), and the purpose is not "not adding that much extra
complexity", it is making it _simpler_.
> > diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> > index 7c39a9e..25f0458 100644
> > --- a/xen/common/sched_rt.c
> > +++ b/xen/common/sched_rt.c
>
> > @@ -156,6 +160,7 @@ struct rt_vcpu {
> >      s_time_t cur_budget;        /* current budget */
> >      s_time_t last_start;        /* last start time */
> >      s_time_t cur_deadline;      /* current deadline for EDF */
> > +    s_time_t next_sched_needed; /* next time to make scheduling decision */
> >
> As an example of why I said that things should become simpler, and
> are instead becoming more complex: with my solution, you don't need
> anything like this. In fact, budget enforcement is already being done
> ok in rt_schedule(), so no need to do anything more/different about
> it. Replenishments should be programmed to fire at cur_deadline, so,
> again, no need to add this new field, and multiplex its actual value
> between budget and deadline (which, yes, counts as being something
> rather complex for me! :-D).
>
>
> As mentioned, the timerq is handling all events that could change the
> scheduling decision, not just replenishments.
>
Yes, exactly, it's handling *too many* events. :-)
For example, have a look at 'struct vcpu', in
xen/include/xen/sched.h. It's got 3 timers in it:
struct timer periodic_timer;
struct timer singleshot_timer;
struct timer poll_timer; /* timeout for SCHEDOP_poll */
It probably would have been possible to just use one, with a single and
mixed event queue, as you're doing, a 1k-line handling routine, etc.
Do you think it would have been easier or harder to implement,
understand, debug and maintain? I bet on harder.
> This tracks the earliest time anything could cause this vcpu to be
> scheduled differently, and could be extended if any more insights are
> found. Budget enforcement is being done in rt_schedule(), but it's
> being done by this very list: a budget expiration is one possible
> event that would require a vcpu reschedule.
>
OMG, 'extended'... You're thinking of actually putting more stuff in
there?!? :-O
It's a rather key and already too long and complex critical section, so
please, let's aim at shortening and making it simpler, i.e., at
improving things, not making them worse.
>
> > +/*
> > + * Return how far the lowest time on the timerq (that is after NOW)
> > + * is in the future.
> > + * Lock must be grabbed before calling this.
> > + */
> > +static s_time_t __timerq_next(const struct scheduler *ops, s_time_t now)
> > +{
> > +    struct list_head *timerq = rt_timerq(ops);
> > +    struct list_head *iter, *tmp;
> > +
> > +    list_for_each_safe(iter, tmp, timerq)
> > +    {
> > +        struct rt_vcpu * iter_svc = __t_elem(iter);
> > +
> > +        if ( iter_svc->next_sched_needed > now )
> > +            return (iter_svc->next_sched_needed - now);
> > +        else
> > +            __t_remove(iter_svc);
> > +    }
> > +
> > +    return MAX_SCHEDULE;
> > +}
> >
> If using a timer, you can just get rid of items while, in the timer
> routine, you handle the event associated to them.
>
> Also, why is MAX_SCHEDULE still there?
>
>
> Honestly, events do not have to be removed here, but it's done to
> prevent walking a list of stale values to get to the newest one on
> the next call. This is done more for performance. Their non-removal
> would be functionally equivalent.
>
Of course you should remove the stale entries, I certainly was not
arguing that the list should just grow indefinitely!
Point is, again, that this is another list walk occurring in a hot
path, with a big spin lock held. We want to get rid of such things,
not add more of them.
> With the timer alternative you mention, where would the timer events
> and their data be held? I think that could be extra complexity, unless
> I fail to understand the idea completely.
>
In an event queue like yours, of course. But you won't go through it
and/or bookkeep it while in hot paths, with the scheduler lock held.
See my email to Meng for more details on what I have in mind, or
feel free to ask for more details.
> MAX_SCHEDULE may not be required, but I have it there as an "in
> case." For example, it will cause the scheduler to be called after
> MAX_SCHEDULE even when no events are occurring (such as no vcpus).
> It's there to ensure progress in case of any bugs or unexpected
> behavior.
>
Wait, so, all this work, and then you still want to call rt_schedule()
every millisecond, when there's nothing to do?!?! :-O
> That being said, in this specific case, you're moving up
> runq_tickle(), the purpose of which is to trigger a call to
> rt_schedule() (after going through the softirq machinery)... in
> order to call it in rt_schedule(). That's pretty tricky, and is not
> how tickling should work.
>
> It's set up to only tickle if needed. I'm not sure if this is an
> issue,
>
It's wrong, AFAICT. See my reply to Meng and, please, comment by
replying to it, if you've got anything to say about this, to make the
discussion easier to follow.
Regards,
Dario
--
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)