* Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU
@ 2015-03-24  3:50 Meng Xu
  2015-03-24 11:54 ` George Dunlap
  0 siblings, 1 reply; 9+ messages in thread
From: Meng Xu @ 2015-03-24  3:50 UTC (permalink / raw)
  To: xen-devel, Dario Faggioli, George Dunlap; +Cc: Stefano Stabellini



Hi Dario and George,

I'm exploring the design choice of eliminating the Xen scheduler overhead
on a dedicated CPU. A dedicated CPU is a PCPU that has a full-capacity
VCPU pinned onto it and on which no other VCPUs will ever run. In other
words, when a full-capacity VCPU is dedicated to a dedicated CPU, other
VCPUs will never be scheduled onto that CPU. Because the dedicated CPU
only ever runs the full-capacity VCPU pinned to it, the scheduler does
not need to be invoked on it at all. With the current RTDS scheduler
implementation, eliminating the scheduler on the dedicated CPU could save
the scheduler overhead (i.e., 1000-2000 cycles) roughly once every 1ms on
that CPU.

This dedicated CPU feature could be useful for extremely low-latency
applications in domU, where a few microseconds matter. Because the
dedicated VCPU is "always available" on the dedicated CPU (the scheduler
overhead is eliminated), the processes inside domU that run on the
dedicated VCPU avoid the scheduling latency and become more responsive.

The dedicated CPU feature is called the Exclusive Affinity feature in
VMware's vSphere.  I watched a presentation from VMworld 2013, "Extreme
Performance Series: Network Speed Ahead"
(https://www.youtube.com/watch?v=I-D1a0QaZaU), which discusses this
Exclusive Affinity feature:
     At 34:00, they list the I/O latency introduced by the hypervisor
scheduler.
     At 35:00, they introduce the Exclusive Affinity feature, which
dedicates a full-capacity VCPU to a PCPU so that the scheduler overhead
and the context-switch overhead are eliminated.
     At 39:56, they discuss the side effects of the Exclusive Affinity
feature.

What I want to do is implement the Exclusive Affinity feature in Xen
(which I call the dedicated CPU feature) and measure how much scheduler
overhead we can save by using it.

[Design]
I added a per_cpu field, cpu_d_status, that has four statuses:
SCHED_CPU_D_STATUS_DISABLED: the cpu is a non-dedicated CPU; the
scheduler should be invoked on this cpu;
SCHED_CPU_D_STATUS_INIT: the cpu has been set as a dedicated CPU by the
user, but we haven't migrated the dedicated VCPU to it yet;
SCHED_CPU_D_STATUS_ENABLED: the cpu has been set as a dedicated CPU and
the dedicated VCPU is now running on it; the scheduler should never be
invoked on this cpu while it is in this status;
SCHED_CPU_D_STATUS_RESTORE: the cpu has been changed from a dedicated CPU
back to a non-dedicated CPU by the user; we need to do some housekeeping
work (e.g., update the parameters of the dedicated VCPU and re-arm the
timers) before marking it as a non-dedicated CPU.
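
For concreteness, the per_cpu field could look roughly like the sketch
below (a sketch only -- the names simply mirror the description above,
and the actual patch may differ; DEFINE_PER_CPU is the usual Xen per-cpu
mechanism):

    /* Sketch of the per-pcpu status field described above. */
    enum sched_cpu_d_status {
        SCHED_CPU_D_STATUS_DISABLED, /* normal cpu: scheduler runs here */
        SCHED_CPU_D_STATUS_INIT,     /* marked dedicated, vcpu not migrated yet */
        SCHED_CPU_D_STATUS_ENABLED,  /* dedicated vcpu running, scheduler bypassed */
        SCHED_CPU_D_STATUS_RESTORE,  /* being restored to a normal cpu */
    };

    static DEFINE_PER_CPU(enum sched_cpu_d_status, cpu_d_status);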

I added two hypercalls to add/remove a dedicated CPU:
Hypercall XEN_DOMCTL_SCHEDOP_add_dedvcpu pins a dedicated VCPU to a
dedicated PCPU;
Hypercall XEN_DOMCTL_SCHEDOP_remove_dedvcpu restores the dedicated PCPU
back to a non-dedicated PCPU.

When the hypercall XEN_DOMCTL_SCHEDOP_add_dedvcpu is called, it will:
Step 1) Mark the cpu_d_status of the dedicated CPU as
SCHED_CPU_D_STATUS_INIT and, if the dedicated VCPU is not on the
dedicated CPU right now, migrate it to the corresponding dedicated CPU.
Step 2) Exclude the dedicated CPU from the scheduling decisions made on
other cpus. In other words, the RTDS scheduler in sched_rt.c will never
raise SCHEDULE_SOFTIRQ on the dedicated CPU.
Step 3) After the dedicated VCPU is running on the dedicated CPU, mark
the dedicated CPU's cpu_d_status as SCHED_CPU_D_STATUS_ENABLED, and kill
the following timers (sd is the schedule_data on the pcpu and v is the
dedicated vcpu on that pcpu) so that the scheduler won't be triggered by
timers:
    kill_timer(&sd->s_timer);
    kill_timer(&v->periodic_timer);
    kill_timer(&v->singleshot_timer);
    kill_timer(&v->poll_timer);

When the hypercall XEN_DOMCTL_SCHEDOP_remove_dedvcpu is called, I just
do the reverse: re-init the timers on the pcpu and vcpu and raise
SCHEDULE_SOFTIRQ to invoke the scheduler on that pcpu.
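
To illustrate Step 2 (again just a sketch: the helper below is
hypothetical, and prv->tickled stands for the RTDS scheduler's mask of
already-tickled cpus), every path in sched_rt.c that would poke another
pcpu first checks the per-cpu status and never raises SCHEDULE_SOFTIRQ
on a dedicated CPU:

    /* Hypothetical helper: may the scheduler be invoked on this cpu? */
    static inline bool_t cpu_is_dedicated(unsigned int cpu)
    {
        return per_cpu(cpu_d_status, cpu) == SCHED_CPU_D_STATUS_ENABLED;
    }

    /* ... and in the tickling path, before poking a pcpu: */
    if ( cpu_is_dedicated(cpu_to_tickle) )
        return;                        /* never disturb a dedicated CPU */
    cpumask_set_cpu(cpu_to_tickle, &prv->tickled);
    cpu_raise_softirq(cpu_to_tickle, SCHEDULE_SOFTIRQ);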

[Problems]
The issue I'm encountering is as follows:
After I implemented the dedicated CPU feature, I compared the latency of
a cpu-intensive task in domU on a dedicated CPU (denoted as R_dedcpu)
with the latency on a non-dedicated CPU (denoted as R_nodedcpu). The
expected result should be R_dedcpu < R_nodedcpu, since we avoid the
scheduler overhead. However, the actual result is R_dedcpu > R_nodedcpu,
with R_dedcpu - R_nodedcpu ~= 1000 cycles.

After adding some tracing to every function that may raise
SCHEDULE_SOFTIRQ, I found:
When the cpu is not marked as a dedicated cpu and the scheduler on it is
not disabled, vcpu_block() is triggered 2896 times during
58,280,322,928ns (i.e., once every 20,124,421ns on average) on that cpu.
However, when I disable the scheduler on the dedicated cpu,
vcpu_block(void) @schedule.c is triggered very frequently: 644,824 times
during 8,918,636,761ns (i.e., once every 13,831ns on average) on the
dedicated cpu.

To sum up the problem I'm facing: vcpu_block(void) is triggered much
faster and more frequently when the scheduler is disabled on a cpu than
when the scheduler is enabled.

[My question]
I'm very confused about why vcpu_block(void) is triggered so frequently
when the scheduler is disabled.  vcpu_block(void) is called by the
SCHEDOP_block hypercall, but why would this hypercall be triggered so
frequently?

It would be great if you know the answer directly. (This is just a pure
hope and I cannot really expect it. :-) )
But I would really appreciate it if you could give me some directions on
how I should figure it out. I grepped for vcpu_block(void) and
SCHEDOP_block in the Xen code base, but didn't find many calls to them.

What confuses me most is that the dedicated VCPU should be blocked less
frequently, not more frequently, when the scheduler is disabled on the
dedicated CPU, because the dedicated VCPU is now always running on that
CPU without the hypervisor scheduler's interference.

The code that implements this feature is at:
https://github.com/PennPanda/xenproject/commit/fc6caec0b6ae794b05926cad92e833165fe45305
(I'm not sure whether it's a good idea to attach the patch at the end of
this email; it may just make the email too long and hard to read. Please
let me know if you need it and I will send it in a separate email.)

Thank you very much for your advice, time and help!

Best regards,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU
  2015-03-24  3:50 Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU Meng Xu
@ 2015-03-24 11:54 ` George Dunlap
  2015-03-24 15:27   ` Meng Xu
  0 siblings, 1 reply; 9+ messages in thread
From: George Dunlap @ 2015-03-24 11:54 UTC (permalink / raw)
  To: Meng Xu; +Cc: Dario Faggioli, Stefano Stabellini, xen-devel

On Tue, Mar 24, 2015 at 3:50 AM, Meng Xu <xumengpanda@gmail.com> wrote:
> Hi Dario and George,
>
> I'm exploring the design choice of eliminating the Xen scheduler overhead on
> the dedicated CPU. A dedicated CPU is a PCPU that has a full capacity VCPU
> pinned onto it and no other VCPUs will run on that PCPU.

Hey Meng!  This sounds awesome, thanks for looking into it.


> [Problems]
> The issue I'm encountering is as follows:
> After I implemented the dedicated cpu feature, I compared the latency of a
> cpu-intensive task in domU on dedicated CPU (denoted as R_dedcpu) and the
> latency on non-dedicated CPU (denoted as R_nodedcpu). The expected result
> should be R_dedcpu < R_nodedcpu since we avoid the scheduler overhead.
> However, the actual result is R_dedcpu > R_nodedcpu, and R_dedcpu -
> R_nodedcpu ~= 1000 cycles.
>
> After adding some trace to every function that may raise the
> SCHEDULE_SOFTIRQ, I found:
> When a cpu is not marked as dedicated cpu and the scheduler on it is not
> disabled, the vcpu_block() is triggered 2896 times during 58280322928ns
> (i.e., triggered once every 20,124,421ns in average) on the dedicated cpu.
> However,
> When I disable the scheduler on a dedicated cpu, the function
> vcpu_block(void) @schedule.c will be triggered very frequently; the
> vcpu_block(void) is triggered 644824 times during 8,918,636,761ns (i.e.,
> once every 13831ns in average) on the dedicated cpu.
>
> To sum up the problem I'm facing, the vcpu_block(void) is trigger much
> faster and more frequently when the scheduler is disabled on a cpu than when
> the scheduled is enabled.
>
> [My question]
> I'm very confused at the reason why vcpu_block(void) is triggered so
> frequently when the scheduler is disabled.  The vcpu_block(void) is called
> by the SCHEDOP_block hypercall, but why this hypercall will be triggered so
> frequently?
>
> It will be great if you know the answer directly. (This is just a pure hope
> and I cannot really expect it. :-) )
> But I really appreciate it if you could give me some directions on how I
> should figure it out. I grepped vcpu_block(void) and SCHEDOP_block  in the
> xen code base, but didn't found much call to them.
>
> What confused me most is that  the dedicated VCPU should be blocked less
> frequently instead of more frequently when the scheduler is disabled on the
> dedicated CPU, because the dedicated VCPU is always running on the CPU now
> without the hypervisor scheduler's interference.

So if I had to guess, I would guess that you're not actually blocking
when the guest tries to block.  Normally if the guest blocks, it
blocks in a loop like this:

do {
  enable_irqs();
  hlt;
  disable_irqs();
} while (!interrupt_pending);

For a PV guest, the hlt() would be replaced with a PV block() hypercall.

Normally, when a guest calls block(), then it's taken off the
runqueue; and if there's nothing on the runqueue, then the scheduler
will run the idle domain; it's the idle domain that actually does the
blocking.

If you've hardwired it always to return the vcpu in question rather
than the idle domain, then it will never block -- it will busy-wait,
calling block millions of times.

The simplest way to get your prototype working, in that case, would be
to return the idle vcpu for that pcpu if the guest is blocked.
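
Roughly something like this in the do_schedule path (a sketch only to
illustrate the idea: vcpu_runnable(), idle_vcpu[] and struct task_slice
are the existing interfaces, while the per-cpu "dedicated_vcpu" pointer
and the function itself are made up):

    static struct task_slice
    dedicated_schedule(const struct scheduler *ops, s_time_t now,
                       bool_t tasklet_work_scheduled)
    {
        const unsigned int cpu = smp_processor_id();
        struct vcpu *ded = per_cpu(dedicated_vcpu, cpu);
        struct task_slice ret = { .migrated = 0 };

        if ( tasklet_work_scheduled || !vcpu_runnable(ded) )
            ret.task = idle_vcpu[cpu];  /* let block() really block */
        else
            ret.task = ded;             /* keep running the pinned vcpu */

        /* Negative slice: assuming schedule.c then skips set_timer(). */
        ret.time = -1;

        return ret;
    }

That way a blocked guest ends up in the idle loop as it normally would,
instead of spinning through SCHEDOP_block.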

But a brief comment on your design:

Looking at your design at the moment, you will get rid of the overhead
of the scheduler-related interrupts, and any pluggable-cpu accounting
that needs to happen (e.g., calculating credits burned, &c).  And
that's certainly not nothing.  But it's not really accurate to say
that you're avoiding the scheduler entirely.  At the moment, as far as
I can tell, you're still going through all the normal schedule.c
machinery between wake-up and actually running the vm; and the normal
machinery for interrupt delivery.

I'm wondering -- are people really going to want to just pin a single
vcpu from a domain like this?  Or are they going to want to pin all
vcpus from a given domain?

For the first to be useful, the guest OS would need to understand
somehow that this cpu has better properties than the other vcpus on
its system.  Which I suppose could be handled manually (e.g., by the
guest admin pinning processes to that cpu or something).

The reason I'm asking is that another option, which would avoid the
need for special per-cpu flags, would be to make a "sched_place"
scheduler (sched_partition?), which would essentially do what you've
done here -- when you add a vcpu to the scheduler, it simply chooses
one of its free cpus and dedicates it to that vcpu.  If no such cpus
are available, it returns an error.  In that case, you could use the
normal cpupool machinery to assign cpus to that scheduler, without
needing to introduce these extra flags or making each of the pluggable
schedulers deal with the complexity of implementing the "dedicated"
scheduling.
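
The core of such a scheduler would be tiny -- something like this sketch
(struct partition_priv, its free_cpus mask and the function name are all
made up for illustration):

    /* Give a newly added vcpu a pcpu of its own, or fail if the pool is
     * already full. */
    static int
    partition_assign_cpu(struct partition_priv *prv, struct vcpu *vc)
    {
        unsigned int cpu = cpumask_first(&prv->free_cpus);

        if ( cpu >= nr_cpu_ids )
            return -ENOSPC;            /* no free pcpu left in this pool */

        cpumask_clear_cpu(cpu, &prv->free_cpus);
        vc->processor = cpu;           /* this vcpu runs here, and only here */

        return 0;
    }

The do_schedule hook would then be little more than the "run the assigned
vcpu, or idle if it's blocked" sketch above.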

The only downside is that at the moment you can't have a domain cross
cpupools; so either all vcpus of a domain would have to be dedicated,
or none.

Thoughts?

 -George


* Re: Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU
  2015-03-24 11:54 ` George Dunlap
@ 2015-03-24 15:27   ` Meng Xu
  2015-03-24 21:08     ` George Dunlap
  2015-03-25 18:50     ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 9+ messages in thread
From: Meng Xu @ 2015-03-24 15:27 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Dario Faggioli, Denys Drozdov, xen-devel,
	Linh Thi Xuan Phan, Insup Lee, andrii.tseglytskyi



2015-03-24 7:54 GMT-04:00 George Dunlap <George.Dunlap@eu.citrix.com>:

> On Tue, Mar 24, 2015 at 3:50 AM, Meng Xu <xumengpanda@gmail.com> wrote:
> > Hi Dario and George,
> >
> > I'm exploring the design choice of eliminating the Xen scheduler
> overhead on
> > the dedicated CPU. A dedicated CPU is a PCPU that has a full capacity
> VCPU
> > pinned onto it and no other VCPUs will run on that PCPU.
>
> Hey Meng!  This sounds awesome, thanks for looking into it.
>

:-) I think it is a useful feature for extremely low-latency applications.

>
>
> > [Problems]
> > The issue I'm encountering is as follows:
> > After I implemented the dedicated cpu feature, I compared the latency of
> a
> > cpu-intensive task in domU on dedicated CPU (denoted as R_dedcpu) and the
> > latency on non-dedicated CPU (denoted as R_nodedcpu). The expected result
> > should be R_dedcpu < R_nodedcpu since we avoid the scheduler overhead.
> > However, the actual result is R_dedcpu > R_nodedcpu, and R_dedcpu -
> > R_nodedcpu ~= 1000 cycles.
> >
> > After adding some trace to every function that may raise the
> > SCHEDULE_SOFTIRQ, I found:
> > When a cpu is not marked as dedicated cpu and the scheduler on it is not
> > disabled, the vcpu_block() is triggered 2896 times during 58280322928ns
> > (i.e., triggered once every 20,124,421ns in average) on the dedicated
> cpu.
> > However,
> > When I disable the scheduler on a dedicated cpu, the function
> > vcpu_block(void) @schedule.c will be triggered very frequently; the
> > vcpu_block(void) is triggered 644824 times during 8,918,636,761ns (i.e.,
> > once every 13831ns in average) on the dedicated cpu.
> >
> > To sum up the problem I'm facing, the vcpu_block(void) is trigger much
> > faster and more frequently when the scheduler is disabled on a cpu than
> when
> > the scheduled is enabled.
> >
> > [My question]
> > I'm very confused at the reason why vcpu_block(void) is triggered so
> > frequently when the scheduler is disabled.  The vcpu_block(void) is
> called
> > by the SCHEDOP_block hypercall, but why this hypercall will be triggered
> so
> > frequently?
> >
> > It will be great if you know the answer directly. (This is just a pure
> hope
> > and I cannot really expect it. :-) )
> > But I really appreciate it if you could give me some directions on how I
> > should figure it out. I grepped vcpu_block(void) and SCHEDOP_block  in
> the
> > xen code base, but didn't found much call to them.
> >
> > What confused me most is that  the dedicated VCPU should be blocked less
> > frequently instead of more frequently when the scheduler is disabled on
> the
> > dedicated CPU, because the dedicated VCPU is always running on the CPU
> now
> > without the hypervisor scheduler's interference.
>
> So if I had to guess, I would guess that you're not actually blocking
> when the guest tries to block.  Normally if the guest blocks, it
> blocks in a loop like this:
>
> do {
>   enable_irqs();
>   hlt;
>   disable_irqs;
> } while (!interrup_pending);
>
> For a PV guest, the hlt() would be replaced with a PV block() hypercall.
>
> Normally, when a guest calls block(), then it's taken off the
> runqueue; and if there's nothing on the runqueue, then the scheduler
> will run the idle domain; it's the idle domain that actually does the
> blocking.
>
> If you've hardwired it always to return the vcpu in question rather
> than the idle domain, then it will never block -- it will busy-wait,
> calling block millions of times.
>
> The simplest way to get your prototype working, in that case, would be
> to return the idle vcpu for that pcpu if the guest is blocked.
>

Exactly! Thank you so much for pointing this out!  I did hardwire it to
always return the vcpu that is supposed to be blocked. Now I totally
understand what happened. :-)

But this leads to another issue in my design:
If I return the idle vcpu when the dedicated VCPU is blocked, it will do a
context_switch(prev, next); when the dedicated VCPU is unblocked, another
context_switch() is triggered.
That means we cannot eliminate the context-switch overhead for the
dedicated CPU.
The ideal performance for the dedicated VCPU on the dedicated CPU should
be super-close to the bare-metal CPU. Here we still have the
context-switch overhead, which is about 1500-2000 cycles.

Can we avoid the context-switch overhead?


> But a brief comment on your design:
>
> Looking at your design at the moment, you will get rid of the overhead
> of the scheduler-related interrupts, and any pluggable-cpu accounting
> that needs to happen (e.g., calculating credits burned, &c).  And
> that's certainly not nothing.


Yes. The schedule() function is avoided.
Right now, I only apply the dedicated cpu feature to the RTDS scheduler.
So when a dedicated VCPU is pinned to and running on the dedicated CPU,
it is a full-capacity vcpu and we don't need to account for the budget it
burns.

However, because the credit2 scheduler counts credit at the domain level,
the function that accounts for the credit burned should not be avoided.

Actually, the tracing code in schedule() will also be bypassed on the
dedicated CPU. I'm not sure whether we need the tracing to work on the
dedicated CPU or not. Since we are aiming to provide a dedicated VCPU
with close-to-bare-metal CPU performance, the tracing mechanism in
schedule() is unnecessary IMHO.

> But it's not really accurate to say
> that you're avoiding the scheduler entirely.  At the moment, as far as
> I can tell, you're still going through all the normal schedule.c
> machinery between wake-up and actually running the vm; and the normal
> machinery for interrupt delivery.
>

Yes. :-(
Ideally, I want to isolate all such interference from the dedicated CPU
so that the dedicated VCPU on it has performance close to a bare-metal
cpu. However, I'm concerned about how complex it will be and how it will
affect the existing functions that rely on interrupts.


>
> I'm wondering -- are people really going to want to just pin a single
> vcpu from a domain like this?  Or are they going to want to pin all
> vcpus from a given domain?
>
> For the first to be useful, the guest OS would need to understand
> somehow that this cpu has better properties than the other vcpus on
> its system.  Which I suppose could be handled manually (e.g., by the
> guest admin pinning processes to that cpu or something).
>

Right. The guest OS will be running on heterogeneous cpus. In my mind,
not all processes in the guest ask for extremely low latency. So the
guest OS can pin the latency-critical processes onto the dedicated VCPU
(which is mapped to the dedicated CPU), and pin the other processes to
the non-dedicated VCPUs. This is more flexible for the guest OS and
accommodates more domains on the same number of cpus. But (of course) it
introduces more complexity into the hypervisor and into management inside
the guest OS.


>
> The reason I'm asking is because another option that would avoid the
> need for special per-cpu flags would to make a "sched_place" scheduler
> (sched_partition?), which would essentially do what you've done here
> -- when you add a vcpu to the scheduler, it simply chooses one of its
> free cpus and dedicates it to that vcpu.  If no such cpus are
> available, it returns an error.  In that case, you could use the
> normal cpupool machinery to assign cpus to that scheduler, without
> needing to introduce these extra flags, and to make each of the
> pluggable schedulers need to deal with the complexity of implementing
> the "dedicated" scheduling.
>

This is also a good idea, if we don't aim to avoid the context-switch
overhead and avoid calling the schedule() function. The biggest strength
of this approach is that it has as little impact as possible on the
existing functions.

Actually, I can extend the RTDS scheduler to include this feature. This
is more like a fast path in the scheduler on the dedicated CPU: instead
of scanning the runq and deciding which vcpu should run next, we just
always pick the dedicated VCPU if it is not blocked. (If the dedicated
VCPU is blocked, we pick the idle VCPU.)

However, this only reduces (instead of removes) the schedule() overhead,
and cannot avoid the context-switch overhead either.


> The only downside is that at the moment you can't have a domain cross
> cpupools; so either all vcpus of a domain would have to be dedicated,
> or none.
>

Yes. I think this is a secondary concern. I'm more concerned about how
much overhead we can remove by using the dedicated CPU. Ideally, the more
overhead we remove, the better performance we get.

Do you have any suggestions/insights on the performance goal of the
dedicated CPU feature? I think it will affect how far we should go in
removing the overheads.

Thank you very much!

Best,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU
  2015-03-24 15:27   ` Meng Xu
@ 2015-03-24 21:08     ` George Dunlap
  2015-03-25  1:48       ` Meng Xu
  2015-03-25 18:50     ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 9+ messages in thread
From: George Dunlap @ 2015-03-24 21:08 UTC (permalink / raw)
  To: Meng Xu
  Cc: Stefano Stabellini, Dario Faggioli, Tim Deegan, Denys Drozdov,
	xen-devel, Linh Thi Xuan Phan, Insup Lee, Jan Beulich,
	andrii.tseglytskyi

On Tue, Mar 24, 2015 at 3:27 PM, Meng Xu <xumengpanda@gmail.com> wrote:
>> The simplest way to get your prototype working, in that case, would be
>> to return the idle vcpu for that pcpu if the guest is blocked.
>
>
> Exactly! Thank you so much for pointing this out!  I did hardwired it always
> to return the vcpu that is supposed to be blocked. Now I totally understand
> what happened. :-)
>
> But this lead to another issue to my design:
> If I return the idle vcpu when the dedicated VCPU is blocked, it will do the
> context_switch(prev, next); when the dedicated VCPU is unblocked, another
> context_switch() is triggered.
> It means that we can not eliminate the context_switch overhead for the
> dedicated CPU.
> The ideal performance for the dedicated VCPU on the dedicated CPU should be
> super-close to the bare-metal CPU. Here we still have the context_switch
> overhead, which is about  1500-2000  cycles.
>
> Can we avoid the context switch overhead?

If you look at xen/arch/x86/domain.c:context_switch(), you'll see that
it's already got clever algorithms for avoiding as much context switch
work as possible.  In particular, __context_switch() (which on x86
does the actual work of context switching) won't be called when
switching *into* the idle vcpu; nor will it be called if you're
switching from the idle vcpu back to the vcpu it switched away from
(curr_vcpu == next).  Not familiar with the arm path, but hopefully
they do something similar.

IOW, a context switch to the idle domain isn't really a context switch. :-)
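
Conceptually the check is something like this (paraphrased from memory,
not the literal domain.c code):

    /* When does the expensive __context_switch() actually run on x86? */
    static bool_t needs_full_context_switch(unsigned int cpu,
                                            const struct vcpu *next)
    {
        /* Switching *into* idle: no guest state to load, skip it. */
        if ( is_idle_vcpu(next) )
            return 0;
        /* Switching back to the vcpu whose state never left this cpu
         * (per_cpu(curr_vcpu, cpu) still points at it): skip it too. */
        if ( per_cpu(curr_vcpu, cpu) == next )
            return 0;
        return 1;
    }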

> However, because credit2 scheduler counts the credit in domain level, the
> function of counting the credit burned should not be avoided.

Actually, that's not true.  In credit2, the weight is set at a domain
level, but that only changes the "burn rate".  Individual vcpus are
assigned and charged their own credits; and credit of a vcpu in one
runqueue has no comparison to or direct effect on the credit of a vcpu
in another runqueue.  It wouldn't be at all inconsistent to simply not
do the credit calculation for a "dedicated" vcpu.  The effect on other
vcpus would be exactly the same as having that vcpu on a runqueue by
itself.

>> But it's not really accurate to say
>> that you're avoiding the scheduler entirely.  At the moment, as far as
>> I can tell, you're still going through all the normal schedule.c
>> machinery between wake-up and actually running the vm; and the normal
>> machinery for interrupt delivery.
>
>
> Yes. :-(
> Ideally, I want to isolate all such interference from the dedicated CPU so
> that the dedicated VCPU on it will have the high-performance that is close
> to the bare-metal cpu. However, I'm concerning about how complex it will be
> and how it will affect the existing functions that relies on  interrupts.

Right; so there are several bits of overhead you might address:

1. The overhead of scheduling calculations -- credit, load balancing,
sorting lists, &c; and regular scheduling interrupts.

2. The overhead in the generic code of having the flexibility to run
more than one vcpu.  This would (probably) be measured in the number
of instructions from a waking interrupt to actually running the guest
OS handler.

3. The maintenance things that happen in softirq context, like
periodic clock synchronization, &c.

Addressing #1 is fairly easy.  The most simple thing to do would be to
make a new scheduler and use cpupools; but it shouldn't be terribly
difficult to build the functionality within existing schedulers.

My guess is that #2 would involve basically rewriting a parallel set
of entry / exit routines pared down to an absolute minimum, and then
having machinery in place to switch a CPU to use those routines (with a
specific vcpu) rather than the current, more fully-functional ones.  It
might also require cutting back on the functionality given to the guest
in terms of hypercalls -- making this "minimalist" Xen environment work
with all the existing hypercalls might be a lot of work.

That sounds like a lot of very complicated work, and before you tried
it I think you'd want to be very much convinced that it would pay off
in terms of reduced wake-up latency.  Getting from 5000 cycles down to
1000 cycles might be worth it; getting from 1400 cycles down to 1000,
or 5000 cycles down to 4600, maybe not so much. :-)

I'm not sure exactly what #3 would entail; it might involve basically
taking the cpu offline from Xen's perspective.  (Again, not sure if
it's possible or worth it.)

You might take a look at this presentation from FOSDEM last year, to
see if you can get any interesting ideas:

https://archive.fosdem.org/2014/schedule/event/virtiaas13/

Any opinions, Dario / Jan / Tim?

 -George


* Re: Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU
  2015-03-24 21:08     ` George Dunlap
@ 2015-03-25  1:48       ` Meng Xu
  2015-03-25 11:58         ` George Dunlap
  0 siblings, 1 reply; 9+ messages in thread
From: Meng Xu @ 2015-03-25  1:48 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Dario Faggioli, Tim Deegan, Denys Drozdov,
	xen-devel, Linh Thi Xuan Phan, Insup Lee, Jan Beulich,
	andrii.tseglytskyi



2015-03-24 17:08 GMT-04:00 George Dunlap <George.Dunlap@eu.citrix.com>:

> On Tue, Mar 24, 2015 at 3:27 PM, Meng Xu <xumengpanda@gmail.com> wrote:
> >> The simplest way to get your prototype working, in that case, would be
> >> to return the idle vcpu for that pcpu if the guest is blocked.
> >
> >
> > Exactly! Thank you so much for pointing this out!  I did hardwired it
> always
> > to return the vcpu that is supposed to be blocked. Now I totally
> understand
> > what happened. :-)
> >
> > But this lead to another issue to my design:
> > If I return the idle vcpu when the dedicated VCPU is blocked, it will do
> the
> > context_switch(prev, next); when the dedicated VCPU is unblocked, another
> > context_switch() is triggered.
> > It means that we can not eliminate the context_switch overhead for the
> > dedicated CPU.
> > The ideal performance for the dedicated VCPU on the dedicated CPU should
> be
> > super-close to the bare-metal CPU. Here we still have the context_switch
> > overhead, which is about  1500-2000  cycles.
> >
> > Can we avoid the context switch overhead?
>
> If you look at xen/arch/x86/domain.c:context_switch(), you'll see that
> it's already got clever algorithms for avoiding as much context switch
> work as possible.  In particular, __context_switch() (which on x86
> does the actual work of context switching) won't be called when
> switching *into* the idle vcpu; nor will it be called if you're
> switching from the idle vcpu back to the vcpu it switched away from
> (curr_vcpu == next).  Not familiar with the arm path, but hopefully
> they do something similar.
>
> IOW, a context switch to the idle domain isn't really a context switch. :-)
>

I see.


>
> > However, because credit2 scheduler counts the credit in domain level, the
> > function of counting the credit burned should not be avoided.
>
> Actually, that's not true.  In credit2, the weight is set at a domain
> level, but that only changes the "burn rate".  Individual vcpus are
> assigned and charged their own credits; and credit of a vcpu in one
> runqueue has no comparison to or direct effect on the credit of a vcpu
> in another runqueue.  It wouldn't be at all inconsistent to simply not
> do the credit calculation for a "dedicated" vcpu.  The effect on other
> vcpus would be exactly the same as having that vcpu on a runqueue by
> itself.
>

I see. If the accounting of the budget is done at the per-vcpu level,
then we don't need to keep accounting for the budget burned by the
dedicated VCPU. We just need to restore/re-enable the accounting
mechanism for the dedicated VCPU when it is changed from dedicated back
to non-dedicated. But this is not a key issue for the current design,
anyway. I will first do it for the RTDS scheduler and measure the
performance, and if it works well, I will do it for the credit2/credit
schedulers. :-)


>
> >> But it's not really accurate to say
> >> that you're avoiding the scheduler entirely.  At the moment, as far as
> >> I can tell, you're still going through all the normal schedule.c
> >> machinery between wake-up and actually running the vm; and the normal
> >> machinery for interrupt delivery.
> >
> >
> > Yes. :-(
> > Ideally, I want to isolate all such interference from the dedicated CPU
> so
> > that the dedicated VCPU on it will have the high-performance that is
> close
> > to the bare-metal cpu. However, I'm concerning about how complex it will
> be
> > and how it will affect the existing functions that relies on  interrupts.
>
> Right; so there are several bits of overhead you might address:
>
> 1. The overhead of scheduling calculations -- credit, load balancing,
> sorting lists, &c; and regular scheduling interrupts.
>
> 2. The overhead in the generic code of having the flexibility to run
> more than one vcpu.  This would (probably) be measured in the number
> of instructions from a waking interrupt to actually running the guest
> OS handler.
>
> 3. The maintenance things that happen in softirq context, like
> periodic clock synchronization, &c.
>
> Addressing #1 is fairly easy.  The most simple thing to do would be to
> make a new scheduler and use cpupools; but it shouldn't be terribly
> difficult to build the functionality within existing schedulers.
>

Right.



>
> My guess is that #2 would involve basically rewriting a parallel set
> of entry / exit routines which were pared down to an absolute minimum,
> and then having machinery in place to switch a CPU to use those
> routines (with a specific vcpu) rather than the current, more
> fully-functional ones.   It might also require cutting back on the
> functionality given to the guest as well in terms of hypecalls --
> making this "minimalist" Xen environment work with all the existing
> hypercalls might be a lot of work.
>
> That sounds like a lot of very complicated work, and before you tried
> it I think you'd want to be very much convinced that it would pay off
> in terms of reduced wake-up latency.  Getting from 5000 cycles down to
> 1000 cycles might be worth it; getting from 1400 cycles down to 1000,
> or 5000 cycles down to 4600, maybe not so much. :-)
>

Exactly! I will do some measurements of the overhead in #2 before I
really try to do it. Since #1 is fairly easy, I will first implement #1
and see how much of a gap remains to reach bare-metal performance.


>
> I'm not sure exactly what #3 would entail; it might involve basically
> taking the cpu offline from Xen's perspective.  (Again, not sure if
> it's possible or worth it.)
>
> You might take a look at this presentation from FOSDEM last year, to
> see if you can get any interesting ideas:
>
> https://archive.fosdem.org/2014/schedule/event/virtiaas13/


Thank you very much for sharing this video! It is very interesting. In my
mind, to really eliminate those softirqs, we have to remap/redirect those
interrupts to other cores. I'm unsure how difficult that would be and
what benefits it would bring. :-(

Thank you very much!

Best,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU
  2015-03-25  1:48       ` Meng Xu
@ 2015-03-25 11:58         ` George Dunlap
  2015-03-25 13:50           ` Meng Xu
  0 siblings, 1 reply; 9+ messages in thread
From: George Dunlap @ 2015-03-25 11:58 UTC (permalink / raw)
  To: Meng Xu
  Cc: Stefano Stabellini, Dario Faggioli, Tim Deegan, Denys Drozdov,
	xen-devel, Linh Thi Xuan Phan, Insup Lee, Jan Beulich,
	andrii.tseglytskyi

On 03/25/2015 01:48 AM, Meng Xu wrote:
> ​Exactly! I will do some measurement on the overhead in #2 before I really
> try to do it. Since #1 is fairly easy, I will first implement #1 and see
> how much gap it remains to achieve the bare-metal performance.

Interface-wise: I'm wondering if at the Xen or libxl level we really
need to have a whole new set of hypercalls, at least to implement #1.
Would it make sense for certain schedulers to automatically switch to
"no accounting" mode when the hard_affinity of a vcpu is limited to one
vcpu, and no other vcpus are also limited to that one vcpu?

 -George





* Re: Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU
  2015-03-25 11:58         ` George Dunlap
@ 2015-03-25 13:50           ` Meng Xu
  2015-03-25 13:52             ` George Dunlap
  0 siblings, 1 reply; 9+ messages in thread
From: Meng Xu @ 2015-03-25 13:50 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Dario Faggioli, Tim Deegan, Denys Drozdov,
	xen-devel, Linh Thi Xuan Phan, Insup Lee, Jan Beulich,
	andrii.tseglytskyi



2015-03-25 7:58 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:

> On 03/25/2015 01:48 AM, Meng Xu wrote:
> > ​Exactly! I will do some measurement on the overhead in #2 before I
> really
> > try to do it. Since #1 is fairly easy, I will first implement #1 and see
> > how much gap it remains to achieve the bare-metal performance.
>
> Interface-wise: I'm wondering if at the Xen or libxl level we really
> need to have a whole new set of hypercalls, at least to implement #1.
> Would it make sense for certain schedulers to automatically switch to
> "no accounting" mode when the hard_affinity of a vcpu is limited to one
> vcpu, and no other vcpus are also limited to that one vcpu?
>

I guess you mean: certain schedulers automatically switch to "no
accounting" mode when the hard_affinity of a vcpu is limited to one
"cpu" (not vcpu), and no other vcpus are also limited to that one "cpu"?

Am I correct? :-)

If so, I think yes: it would be better than having a new hypercall,
because in this case the user does want to dedicate a cpu to a single
vcpu.
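
A sketch of that check (the helper name and the per-cpu counter are
hypothetical, and I'm assuming the 4.5-era v->cpu_hard_affinity field):

    /* A vcpu qualifies for "no accounting" mode when its hard affinity
     * is exactly one pcpu and no other vcpu is pinned solely to that
     * pcpu.  nr_exclusively_pinned would be a per-cpu counter the
     * scheduler keeps up to date as hard affinities change. */
    static bool_t vcpu_gets_dedicated_cpu(const struct vcpu *v)
    {
        unsigned int cpu;

        if ( cpumask_weight(v->cpu_hard_affinity) != 1 )
            return 0;

        cpu = cpumask_first(v->cpu_hard_affinity);

        return per_cpu(nr_exclusively_pinned, cpu) == 1;
    }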

Best regards,

Meng
-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU
  2015-03-25 13:50           ` Meng Xu
@ 2015-03-25 13:52             ` George Dunlap
  0 siblings, 0 replies; 9+ messages in thread
From: George Dunlap @ 2015-03-25 13:52 UTC (permalink / raw)
  To: Meng Xu
  Cc: Stefano Stabellini, Dario Faggioli, Tim Deegan, Denys Drozdov,
	xen-devel, Linh Thi Xuan Phan, Insup Lee, Jan Beulich,
	andrii.tseglytskyi

On 03/25/2015 01:50 PM, Meng Xu wrote:
> 2015-03-25 7:58 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:
> 
>> On 03/25/2015 01:48 AM, Meng Xu wrote:
>>> ​Exactly! I will do some measurement on the overhead in #2 before I
>> really
>>> try to do it. Since #1 is fairly easy, I will first implement #1 and see
>>> how much gap it remains to achieve the bare-metal performance.
>>
>> Interface-wise: I'm wondering if at the Xen or libxl level we really
>> need to have a whole new set of hypercalls, at least to implement #1.
>> Would it make sense for certain schedulers to automatically switch to
>> "no accounting" mode when the hard_affinity of a vcpu is limited to one
>> vcpu, and no other vcpus are also limited to that one vcpu?
>>
> 
> I guess you mean: certain schedulers automatically switch to "no
> accounting" mode when the hard_affinity of a vcpu is limited to one
> "cpu" (not vcpu), and no other vcpus are also limited to that one "cpu"?
>
> Am I correct? :-)

Ah, yes, thanks. :-)

 -George



* Re: Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU
  2015-03-24 15:27   ` Meng Xu
  2015-03-24 21:08     ` George Dunlap
@ 2015-03-25 18:50     ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 9+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-03-25 18:50 UTC (permalink / raw)
  To: Meng Xu
  Cc: Stefano Stabellini, George Dunlap, Dario Faggioli, Denys Drozdov,
	xen-devel, Linh Thi Xuan Phan, Insup Lee, andrii.tseglytskyi

On Tue, Mar 24, 2015 at 11:27:35AM -0400, Meng Xu wrote:
> 2015-03-24 7:54 GMT-04:00 George Dunlap <George.Dunlap@eu.citrix.com>:
> 
> > On Tue, Mar 24, 2015 at 3:50 AM, Meng Xu <xumengpanda@gmail.com> wrote:
> > > Hi Dario and George,
> > >
> > > I'm exploring the design choice of eliminating the Xen scheduler
> > overhead on
> > > the dedicated CPU. A dedicated CPU is a PCPU that has a full capacity
> > VCPU
> > > pinned onto it and no other VCPUs will run on that PCPU.
> >
> > Hey Meng!  This sounds awesome, thanks for looking into it.
> >
> 
> :-) I think it is a useful feature for extreme low latency applications.
> ​
> 
> >
> >
> > > [Problems]
> > > The issue I'm encountering is as follows:
> > > After I implemented the dedicated cpu feature, I compared the latency of
> > a
> > > cpu-intensive task in domU on dedicated CPU (denoted as R_dedcpu) and the
> > > latency on non-dedicated CPU (denoted as R_nodedcpu). The expected result
> > > should be R_dedcpu < R_nodedcpu since we avoid the scheduler overhead.
> > > However, the actual result is R_dedcpu > R_nodedcpu, and R_dedcpu -
> > > R_nodedcpu ~= 1000 cycles.
> > >
> > > After adding some trace to every function that may raise the
> > > SCHEDULE_SOFTIRQ, I found:
> > > When a cpu is not marked as dedicated cpu and the scheduler on it is not
> > > disabled, the vcpu_block() is triggered 2896 times during 58280322928ns
> > > (i.e., triggered once every 20,124,421ns in average) on the dedicated
> > cpu.
> > > However,
> > > When I disable the scheduler on a dedicated cpu, the function
> > > vcpu_block(void) @schedule.c will be triggered very frequently; the
> > > vcpu_block(void) is triggered 644824 times during 8,918,636,761ns (i.e.,
> > > once every 13831ns in average) on the dedicated cpu.
> > >
> > > To sum up the problem I'm facing, the vcpu_block(void) is trigger much
> > > faster and more frequently when the scheduler is disabled on a cpu than
> > when
> > > the scheduled is enabled.
> > >
> > > [My question]
> > > I'm very confused at the reason why vcpu_block(void) is triggered so
> > > frequently when the scheduler is disabled.  The vcpu_block(void) is
> > called
> > > by the SCHEDOP_block hypercall, but why this hypercall will be triggered
> > so
> > > frequently?
> > >
> > > It will be great if you know the answer directly. (This is just a pure
> > hope
> > > and I cannot really expect it. :-) )
> > > But I really appreciate it if you could give me some directions on how I
> > > should figure it out. I grepped vcpu_block(void) and SCHEDOP_block  in
> > the
> > > xen code base, but didn't found much call to them.
> > >
> > > What confused me most is that  the dedicated VCPU should be blocked less
> > > frequently instead of more frequently when the scheduler is disabled on
> > the
> > > dedicated CPU, because the dedicated VCPU is always running on the CPU
> > now
> > > without the hypervisor scheduler's interference.
> >
> > So if I had to guess, I would guess that you're not actually blocking
> > when the guest tries to block.  Normally if the guest blocks, it
> > blocks in a loop like this:
> >
> > do {
> >   enable_irqs();
> >   hlt;
> >   disable_irqs;
> > } while (!interrup_pending);
> >
> > For a PV guest, the hlt() would be replaced with a PV block() hypercall.
> >
> > Normally, when a guest calls block(), then it's taken off the
> > runqueue; and if there's nothing on the runqueue, then the scheduler
> > will run the idle domain; it's the idle domain that actually does the
> > blocking.
> >
> > If you've hardwired it always to return the vcpu in question rather
> > than the idle domain, then it will never block -- it will busy-wait,
> > calling block millions of times.
> >
> > The simplest way to get your prototype working, in that case, would be
> > to return the idle vcpu for that pcpu if the guest is blocked.
> >
> 
> ​Exactly! Thank you so much for pointing this out!  I did hardwired it
> always to return the vcpu that is supposed to be blocked. Now I totally
> understand what happened. :-) ​

Or you could change the kernel to do an idle poll instead of using 'hlt'.

An idle poll is just:
 do {
 	nop; pause;
 } while (!interrupt_pending);

On Linux I believe you do 'idle=poll' to make it do that.


end of thread, other threads:[~2015-03-25 18:50 UTC | newest]

Thread overview: 9+ messages
2015-03-24  3:50 Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU Meng Xu
2015-03-24 11:54 ` George Dunlap
2015-03-24 15:27   ` Meng Xu
2015-03-24 21:08     ` George Dunlap
2015-03-25  1:48       ` Meng Xu
2015-03-25 11:58         ` George Dunlap
2015-03-25 13:50           ` Meng Xu
2015-03-25 13:52             ` George Dunlap
2015-03-25 18:50     ` Konrad Rzeszutek Wilk
