Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU

* Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU
@ 2015-03-24  3:50 Meng Xu
  2015-03-24 11:54 ` George Dunlap
  0 siblings, 1 reply; 9+ messages in thread
From: Meng Xu @ 2015-03-24  3:50 UTC (permalink / raw)
  To: xen-devel, Dario Faggioli, George Dunlap; +Cc: Stefano Stabellini

[-- Attachment #1.1: Type: text/plain, Size: 6739 bytes --]

Hi Dario and George,

I'm exploring the design choice of eliminating the Xen scheduler overhead
on the dedicated CPU. A dedicated CPU is a PCPU that has a full capacity
VCPU pinned onto it and no other VCPUs will run on that PCPU. In other
words, when a full-capacity VCPU is dedicated to a dedicated CPU, other
VCPUs will never be scheduled onto that dedicated CPU; Because the
dedicated CPU will only run the full-capacity VCPU pinned to it, the
scheduler does not need to be invoked on that dedicated CPU. Considering
the current RTDS scheduler implementation, eliminating the scheduler on the
dedicated CPU could save the scheduler overhead (i.e., 1000 - 2000 cycles)
per 1ms (roughly speaking) on that dedicated CPU.

This dedicated CPU feature could be useful for the extreme low latency
applications in domU, in which several microseconds matters. Because the
dedicated VCPU is "always available" on that dedicated CPU (since scheduler
overhead is eliminated), the processes inside domU that are running on the
dedicated VCPU will avoid the scheduling latency and be more responsive.

The dedicated CPU feature is called Exclusive Affinity feature in VMware's
vSphere.  I watched a presentation from VMWorld 2013: "Extreme Performance
Series Network Speed Ahead" (https://www.youtube.com/watch?v=I-D1a0QaZaU),
that discusses this Exclusive Affinity feature:
     At 34', they list the I/O latency introduced by the hypervisor
scheduler.
     At 35', they introduced the exclusive affinity feature that dedicate a
full capacity VCPU to a PCPU so that the scheduler overhead and the context
switch overhead are eliminated.
     At 39':56'', they discussed the side effects of the Exclusive Affinity
feature.

What I want to do is to implement the Exclusive Affinity feature in Xen
(which I called dedicated CPU) and measure how much scheduler overhead we
can save by using this feature.

[Design]
I added a per_cpu field, cpu_d_status, that has four statuses:
SCHED_CPU_D_STATUS_DISABLED
: the cpu is a non-dedicated CPU, scheduler should be invoked on this cpu;
SCHED_CPU_D_STATUS_
INIT: the cpu is set to be dedicated CPU by user, but we haven't migrated
the dedicated VCPU to this CPU;
SCHED_CPU_D_STATUS_ENABLED
: the cpu has been set to dedicated CPU, and it has the dedicated VCPU
running on it now; the scheduler should never be invoked on this cpu once
it's in this status.
SCHED_CPU_D_STATUS_RESTORE: This cpu has been set to a non-dedicated CPU
from a dedicated CPU by user, we need to do some house-keeping work (e.g.,
update the parameters of the dedicated vcpu and re-arm the timers) before
mark it as a non-dedicated CPU.

I added two hypercalls to add/remove a dedicated CPU:
Hypercall XEN_DOMCTL_SCHEDOP_add_dedvcpu pins a dedicated VCPU to a
dedicated PCPU;
Hypercall XEN_DOMCTL_SCHEDOP_remove_dedvcpu  restore the dedicated PCPU
back to the non-dedicated PCPU.

When the hypercall XEN_DOMCTL_SCHEDOP_add_dedvcpu is called, it will
Step 1) Mark the cpu_d_status on the dedicated CPU as SCHED_CPU_D_STATUS_
INIT; and
             I
f the VCPU is not on the dedicated CPU right now,
m
igrate the dedicated VCPU to the corresponding dedicated CPU, .
Step 2) Exclude the dedicated CPU from the scheduling decisions made from
other cpus. In other words, the RTDS scheduler in sched_rt.c will never
raise SCHEDULE_SOFTIRQ to the dedicated CPU.
Step 3) After the dedicated VCPU is running on the dedicated CPU, mark the
dedicated CPU's cpu_d_status as SCHED_CPU_D_STATUS_
ENABLED; and
kill the following timers (sd is the schedule_data on the pcpu and v is the
dedicated vcpu on the dedicated pcpu) so that scheduler won't be triggered
by timers:
    kill_timer(&sd->s_timer);
    kill_timer(&v->periodic_timer);
    kill_timer(&v->singleshot_timer);
    kill_timer(&v->poll_timer);

When the hypercall XEN_DOMCTL_SCHEDOP_
remove
_dedvcpu is called
, I just did the reverse operation to re-init the timers on the pcpu and
vcpu and raise the SCHEDULE_SOFTIRQ to call the scheduler on that pcpu.

[Problems]
The issue I'm encountering is as follows:
After I implemented the dedicated cpu feature, I compared the latency of a
cpu-intensive task in domU on dedicated CPU (denoted as R_dedcpu) and the
latency on non-dedicated CPU (denoted as R_nodedcpu). The expected result
should be R_dedcpu < R_nodedcpu since we avoid the scheduler overhead.
However, the actual result is R_dedcpu > R_nodedcpu, and R_dedcpu -
R_nodedcpu ~= 1000 cycles.

After adding some trace to every function that may raise the
SCHEDULE_SOFTIRQ, I found:
When a cpu is not marked as dedicated cpu and the scheduler on it is not
disabled, the vcpu_block() is triggered 2896 times during 58280322928ns
(i.e., triggered once every 20,124,421ns in average) on the dedicated cpu.
However,
When I disable the scheduler on a dedicated cpu, the function
vcpu_block(void) @schedule.c will be triggered very frequently; the
vcpu_block(void) is triggered 644824 times during 8,918,636,761ns (i.e.,
once every 13831ns in average) on the dedicated cpu.

To sum up the problem I'm facing, the vcpu_block(void) is trigger much
faster and more frequently when the scheduler is disabled on a cpu than
when the scheduled is enabled.

[My question]
I'm very confused at the reason why vcpu_block(void) is triggered so
frequently when the scheduler is disabled.  The vcpu_block(void) is called
by the SCHEDOP_block hypercall, but why this hypercall will be triggered so
frequently?

It will be great if you know the answer directly. (This is just a pure hope
and I cannot really expect it. :-) )
But I really appreciate it if you could give me some directions on how I
should figure it out. I grepped vcpu_block(void) and SCHEDOP_block  in the
xen code base, but didn't found much call to them.

What confused me most is that  the dedicated VCPU should be blocked less
frequently instead of more frequently when the scheduler is disabled on the
dedicated CPU, because the dedicated VCPU is always running on the CPU now
without the hypervisor scheduler's interference.

The code that implement this feature is at
https://github.com/PennPanda/xenproject/commit/fc6caec0b6ae794b05926cad92e833165fe45305
.
(I'm not sure if it's a good idea to attach the patch at the end of this
email. It may just make this email too long and hard to read. Please let me
know if you need it and I will send it in a separate email.)

Thank you very much for your advice, time and help!

Best regards,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 10080 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 9+ messages in thread