From: Dario Faggioli
Subject: Re: [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.
Date: Wed, 17 Jun 2015 02:20:36 +0200
To: Dagaen Golomb
Cc: Wei Liu, Sisu Xi, George Dunlap, xen-devel@lists.xen.org, Meng Xu, Jan Beulich, Chong Li

On Tue, 2015-06-16 at 13:07 -0400, Dagaen Golomb wrote:
> I'm not replying inline because this is a more general response that
> is not limited to any particular comment.
>
That's fine. Thanks for this, actually... I love discussing these
things; it reminds me of the time when I was doing this stuff myself,
and makes me feel young! :-P

> Separating the replenishment from the scheduler may be problematic. The nature
> of the gEDF scheduler is that when a priority changes it should be instantly
> reflected.
>
'Instantly'. Is there such a thing? :-O

> If replenishments were done by themselves, then the scheduler may
> decide to wait for some period of time, and in this same time period a
> replenishment could change what should be executing.
>
Well, the scheduler then had better *not* decide to wait for any period
of time. It had better act like this: as soon as a replenishment
happens, and a new deadline is assigned to a vCPU, put the vCPU in the
runqueue, in the proper position; and if such a newly replenished vCPU
should now preempt any other, currently running, vCPU, 'tickle' the
pCPU in question and let it reschedule and pick up the newly
replenished vCPU.

> Either the gEDF policy will be violated or
> the replenishment routine would have to notify the scheduler after any
> replenishment, requiring some form of interaction and possibly more scheduler
> invocations than the current structure.
>
Obviously not. In fact, it's your 'current structure' that already
invokes the scheduler for every replenishment. Doing more than that
would be really hard (impossible, I should say). In fact, it's the
s_timer that you're making fire for every replenishment, and
replenishments happen inside rt_schedule() itself. This is
_the_definition_ of invoking the scheduler!

OTOH, by using a (some) replenishment timer(s), the scheduler *may*
*potentially* be invoked for every replenishment, yes, but it may also
very well not be. In fact, if, even after a replenishment, the new
deadline of the replenished vCPU is farther away than the deadlines of
all the currently running vCPUs, there's no need to tickle any pCPU,
and no need to call the scheduler.

BTW, from what you say, it seems to me that we are in agreement: we
_do_not_ want to call the scheduler upon every replenishment. Do as I
say and we'll achieve that.

> The other option is for the scheduler to
> check for replenishments, as it does now.
>
Yeah, the wrong option. :-D :-D
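To be a bit more concrete about the option I'm advocating, here is a
minimal sketch of the kind of replenishment handler I have in mind.
To be clear, this is an illustration, not a patch: names like
rt_priv(), __q_remove(), __runq_insert(), runq_tickle() and
rt_update_deadline() approximate what is already in sched_rt.c, while
the handler itself (and the assumption that the depleted queue is kept
sorted by replenishment time) is mine:

/*
 * Sketch only: a replenishment timer handler along the lines
 * described above. It replenishes whatever is due, re-inserts the
 * vCPUs in the runqueue in deadline order, and tickles a pCPU only
 * if a preemption is actually warranted.
 */
static void repl_timer_handler(void *data)
{
    const struct scheduler *ops = data;
    struct rt_private *prv = rt_priv(ops);
    struct rt_vcpu *svc, *tmp;
    s_time_t now = NOW();
    unsigned long flags;

    spin_lock_irqsave(&prv->lock, flags);

    list_for_each_entry_safe( svc, tmp, &prv->depletedq, q_elem )
    {
        /* Assumes the depleted queue is sorted by replenishment time. */
        if ( now < svc->cur_deadline )
            break;

        /* New deadline and full budget for this vCPU... */
        rt_update_deadline(now, svc);

        /* ...move it back to the runqueue, in the proper position... */
        __q_remove(svc);
        __runq_insert(ops, svc);

        /* ...and tickle a pCPU only if this vCPU should now preempt. */
        runq_tickle(ops, svc);
    }

    /* Re-arming the timer for the next replenishment is not shown. */

    spin_unlock_irqrestore(&prv->lock, flags);
}

Note that taking prv->lock is also all the 'interaction' with
rt_schedule() that is needed: the two simply serialize on the same
lock.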
> Without any interaction a violation of
> the policy could ensue. The gEDF is not like the credit scheduler where there is
> an accounting period where, during this time, policy may be violated
> temporarily.
>
Yep, I do have an idea of what EDF is.

> Further, with two timer routines we now need to deal with any ordering
> or preemption between them (or logic to prevent / order such) which I could
> argue is far more complexity than having them done at once as it is now.
>
Mmm... I think I'm starting to lose you. Ordering between two timers? I
don't see what you mean.

Allow me, though, to walk through the behaviour of your current
approach. I've done this already in a previous email, but it's no big
deal to re-do it. Oh, actually, the current approach is buggy, as you
tickle the wrong vCPU, but let's forget about that (let's assume it's
fixed).

Let's say that the time for a replenishment to vCPU w has come.

on pCPU X:
.   timer interrupt
.     TIMER_SOFTIRQ raised
.   process softirqs
.     SCHEDULE_SOFTIRQ raised
.   process softirqs
.     rt_schedule()
.       [spin_lock]
.       burn_budget(scurr)
.       __repl_update()  (goes through all runnable vcpus!!)
.       snext = __runq_pick()  -> says vCPU w should run on pCPU Y
.       runq_tickle(snext)  ----> tickle pCPU Y  [*]
.       [spin_unlock]

on pCPU Y:
.   process softirqs
.     SCHEDULE_SOFTIRQ
.       rt_schedule()
.         [spin_lock]
.         burn_budget(scurr)
.         __repl_update()  (goes through all runnable vcpus!! AGAIN!!)
.         snext = __runq_pick()  ----------> pick vCPU w
.         [spin_unlock]
.   context switch: vCPU w now runs on pCPU Y

So, you forced pCPU X through a full execution of rt_schedule(), with
the spinlock, the scan of the runqueue and the depleted queue, and
everything, just for the sake of figuring out that pCPU Y should
reschedule. Then you (as soon as practical) actually reschedule on
pCPU Y and context switch in favour of vCPU w, enacting EDF. That's 2
scheduler invocations, and rt_schedule() is still ugly and complex to
read, maintain and improve.

[*] and don't forget that this needs fixing, because right now it's
just incorrect, and (just speculating) making __runq_pick() behave in
the way you want may not be super-simple.

Oh, I was almost forgetting! Wonder what happens in the time interval
between when __repl_update() is called (from inside rt_schedule()) on
pCPU X, and the actual time instant of the context switch on pCPU Y?
Yeah, right, what happens is that you're violating EDF. :-/

This can be accounted as a form of overhead introduced by the
implementation, I guess. In this specific case, it's due to the fact
that you can't, from pCPU X, just change *instantly* what's running on
pCPU Y. No, you've got to send an IPI, then Y has to react, and it has
to schedule, etc. This is of course dependent on the architecture of
the scheduler. Xen's scheduler works like this. FWIW, Linux's
scheduler, wrt this, also does.

So, see? Transient violation of EDF is just there, no matter the
approach!

Let's look at the timers way:

on pCPU X:
.   timer interrupt
.     TIMER_SOFTIRQ raised
.   process softirqs
.     replenishment_timer_handler()
.       [spin_lock]
.       replenish vCPU w
.       runq_tickle(w)  -------> tickle pCPU Y
.       [spin_unlock]

on pCPU Y:
.   process softirqs
.     SCHEDULE_SOFTIRQ
.       rt_schedule()
.         [spin_lock]
.         burn_budget(scurr)
.         snext = __runq_pick()  ----------> pick vCPU w
.         [spin_unlock]
.   context switch: vCPU w now runs on pCPU Y

So this means only 1 invocation of the scheduler, and only on the pCPU
that actually needs to reschedule.
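In case it helps picture the 'timers way', the timer itself would just
be Xen's generic timer abstraction. Something along these lines (again
a sketch: init_timer() and set_timer() are the existing interface from
xen/common/timer.c, while the repl_timer field in rt_private and these
two helpers are assumptions of mine):

#include <xen/timer.h>

static void repl_timer_handler(void *data);   /* as sketched above */

/* Called once, e.g., at scheduler initialization time: */
static void repl_timer_setup(struct scheduler *ops)
{
    struct rt_private *prv = rt_priv(ops);

    /* Run the handler on CPU 0; it tickles other pCPUs as needed. */
    init_timer(&prv->repl_timer, repl_timer_handler, ops, 0);
}

/* Called whenever the earliest pending replenishment changes: */
static void repl_timer_rearm(const struct scheduler *ops, s_time_t next_repl)
{
    struct rt_private *prv = rt_priv(ops);

    set_timer(&prv->repl_timer, next_repl);
}

Whether the timer gets re-armed from the handler itself, from the
wakeup path, or from both, is a design detail; the point is that none
of it needs to live inside rt_schedule().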
There's some overhead due to the timer, of course but, notably, this
allows for shorter critical sections, not to mention better looking
and easier to maintain code.

Are we still violating EDF? Of course we are, between the
replenishment and the context switch time, as above. That's
unavoidable, I'm afraid.

Summarizing, the two solutions are on par wrt temporary violation of
the theoretical algorithm, but the timer-based approach has loads of
other benefits, so let's go with timers. :-)

> If both are armed for the same time, which should execute? Scheduler
> first before a possibly applicable replenishment? Or replenishment always first?
> Both with added logic to enforce this and prevent preemption, of course.
>
I really don't see what you mean here. There won't be anything like
that to take care of. Xen offers timers as an abstraction, and deals
with them itself. You only need to take care of properly serializing
rt_schedule() and the timer handling routine, for the code sections
that require that.

> Due to this, it is my belief that by the nature of the gEDF policy the
> current solution is better, mostly because it appears to actually be the least
> complex way that is functionally correct.
>
Not really, no. I wouldn't be convinced of this even if what you have
right now were functionally correct.

> The gEDF policy just isn't well suited for decoupling events, as
> the events are highly dependent on one another, and particularly dependent at an
> instantaneous timescale.
>
And here it comes, 'instantaneous timescale', again. No, sorry, but I
don't buy this. In fact, why does it not apply to vCPUs' wakeups? I
think it does, conceptually... So, should we move the wakeup logic
inside rt_schedule() too?

> It would be a hard pitch for a gEDF scheduler with a
> similar "accounting period" where the gEDF policy could be violated.
>
There's no accounting period of any sort. It's called overhead! What
we need to do is try to keep it as small as possible. And I'm quite
sure that an effective way to do so is to keep critical sections
short, especially when protected by (potentially) highly contended
spin locks. RTDS currently suffers from poor scalability (and, I
suspect, poor latency as well, at least under pressure), and one of
the reasons why the work you're doing is interesting is that it can
alleviate this... if done with that in mind, of course.

> That is blasphemy in the real-time world.
>
Blasphemy, that's a nice one! :-D :-D Well, I've never been good at
religions, I have to admit. So, yes, I can live with being called
blasphemous, I guess. :-)

> It's also worth noting that the stereotypical textbook event-driven
> model is as we have now: a single event loop that handles events. In this case
> the scheduler is conceptually the main loop and this makes perfect sense: it's
> deciding what to run (think of the running VCPUs as callback functions and the
> abstraction falls into place perfectly). The event loop needs some information
> (about replenishments) before deciding, and collecting this would be part of
> the decode and decision phase for an event signal.
>
Talking about 'stereotypes', I don't have any textbook describing an
event-driven model at hand right now. However, you RT-Xen people like
and use LitmusRT, don't you? Well, you're doing the right thing,
because it's a great piece of software! If you're interested, have a
look in there. Spoiler: you'll find a lot of timers. :-D :-D
More seriously, it's of course rather different, as Linux is not Xen,
but since you're looking for stereotypes:

 - Linux (or at least LitmusRT patch 2014.2, i.e., what I'm looking at
   right now) has ticks. This means a timer fires every X milliseconds
   (on every CPU), and task_tick_litmus() is run as a consequence.
   Such a function does not (at all!) invoke the scheduler directly;
   rather, it just checks whether doing so is necessary, in this case
   because some task has exhausted its budget. If that happened, it
   calls litmus_reschedule_local() (see below). I know, this is budget
   enforcement, not replenishment, but I think it works as a good
   example of my point about using timers.

 - Actually, budget enforcement can be done, in LitmusRT, in two ways.
   One is tick based (described above), the other, called 'precise',
   is timer based. To see that in action, check how the enforcement
   timer is handled, e.g., in update_enforcement_timer(). Look at
   on_enforcement_timeout(), and find that it also _does_not_
   schedule. It again just asks for the scheduler to be invoked as
   soon as practical, via (again) litmus_reschedule_local().

 - Finally, a 'job release', in LitmusRT terminology, is probably what
   is closest to a budget replenishment for us here. In fact, when a
   new RT job is released, it's given full budget, and its scheduling
   parameters (deadline, in most cases) are updated accordingly, like
   it happens, for us, with replenishments. Check, therefore, how that
   happens in add_release() and __add_release(). You'll find a call to
   arm_release_timer() which calls, when firing, on_release_timer().
   From there, go, e.g., to gsnedf_release_jobs(), and follow the call
   chain through check_for_preemptions() --> preempt() -->
   preempt_if_preemptable() --> litmus_reschedule(), which then calls
   either set_tsk_need_resched(current) or smp_send_reschedule(cpu).

litmus_reschedule_local(), and similar functions, do something very
similar to our tickling, i.e., they ask a specific CPU to go through
the scheduler. When this happens via set_tsk_need_resched(current)
it's like, in Xen, tickling the pCPU we're on. When it happens via
smp_send_reschedule(cpu) it's like, in Xen, tickling the remote pCPU
cpu.

Now, although some years have passed, I've interacted quite a bit with
the LitmusRT folks. They're very smart, and usually very good at what
they do, and I'd be surprised to find them guilty of that much
real-time blasphemy. :-P

You may also know that there is an EDF implementation in mainline
Linux, and you may want to go have a look there too. I'm not including
any example from there because I'm one of the main authors of it, so
it's pretty obvious that it uses the approach I think is best (and,
more importantly, it would be rather inelegant to cite myself! :-D).
Anyway, if you fancy doing that, the core is in kernel/sched/deadline.c.

> Not only is there a complexity issue, but as mentioned before this may be the
> best performing option, making the most information available and the best
> timing decision possible. Adding a few extra cycles to a hot path, even with
> a lock being held, is worth it if the scheduler and context switching is done
> less.
>
Right, so let's use timers: they require less than or equal scheduler
invocations wrt... always invoking the scheduler!! (Not sure why you
mention the number of context switches; that's independent of which of
the two approaches you choose.)
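And just to spell out the "less or equal" part: with the timer
approach, a replenishment leads to a scheduler invocation only when
gEDF actually demands a preemption. Conceptually, the decision amounts
to no more than this (an illustration of the idea, not the actual
runq_tickle() code; cpu_to_tickle() is a made-up helper):

/*
 * Illustration only. cpu_to_tickle() stands for "find an idle pCPU,
 * or the pCPU running the vCPU with the latest deadline, provided
 * that deadline is later than svc's; return -1 if there is none".
 */
static void tickle_if_needed(const struct scheduler *ops, struct rt_vcpu *svc)
{
    int cpu = cpu_to_tickle(ops, svc);   /* hypothetical helper */

    /* No preemption due: no softirq, no rt_schedule(), no context switch. */
    if ( cpu < 0 )
        return;

    /* Ask that (and only that) pCPU to go through the scheduler. */
    cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
}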
> For the above reasons I think you should reconsider the current
> implementation, as it appears it may be the least complex and easiest to
> reason about solution.
>
Well, no, sorry. :-(

I mean, I appreciate that what you have now (once fixed) may be
effective in avoiding some unnecessary invocations of the scheduler,
with respect to what we currently have in the repo, so I'm not saying
it can't be seen as an improvement. It's a "nice hack", actually.

However, if we're still aiming at making this scheduler a first class
citizen within Xen, we're interested in fewer hacks, rather than in
more --no matter how nice they are. So, with that in mind, it would be
a step in the wrong direction, IMO.

> Let me know if I'm missing some key insight into how the behavior
> could be implemented correctly and beautifully using the multiple timer
> approach.
>
Well, I tried. I really did.

> I simply don't see how it can
> be done without heavy interaction and information sharing between them
> which really defeats the purpose.
>
No, it doesn't.

Thanks and Regards,
Dario

-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)