* Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: Meng Xu @ 2015-04-07 20:25 UTC
  To: xen-devel, George Dunlap, Dario Faggioli, Konrad Rzeszutek Wilk
  Cc: Oleg Sokolsky, Linh Thi Xuan Phan, Insup Lee, Dagaen Golomb

Hi George, Dario and Konrad,

I finished a prototype of the RTDS scheduler with the dedicated CPU
feature and did some quick evaluation on this feature. Right now, I
need to refactor the code (because it is kind of messy when I was
exploring different approaches :() and will send out the clean patch
later (this week or next week). But the design follows our discussion
at http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.

In a nutshell, the design is: when a CPU is marked as a dedicated CPU,
the scheduler on that CPU returns the dedicated VCPU with a negative
time, which disables the scheduler timer on that CPU, and other CPUs
no longer send SCHEDULE_SOFTIRQ to the dedicated CPU. The scheduler on
the dedicated CPU may still be invoked when the dedicated VCPU is
blocked/unblocked by the domU. When that happens, the scheduler goes
through a fast path that just returns the idle VCPU or the dedicated
VCPU instead of going through the runq.
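
To make that concrete, here is a rough sketch of the fast path
(illustrative only, not the actual patch: cpu_is_dedicated() and
dedicated_vcpu_on() are invented names for this example, and the
callback signature is only an approximation of the do_schedule hook):

    /* Illustrative sketch only -- not the real patch.  The helpers
     * cpu_is_dedicated() and dedicated_vcpu_on() are invented names;
     * task_slice is the structure scheduler callbacks return. */
    static struct task_slice
    rt_schedule(const struct scheduler *ops, s_time_t now,
                bool_t tasklet_work_scheduled)
    {
        const unsigned int cpu = smp_processor_id();
        struct task_slice ret = { .task = idle_vcpu[cpu], .time = -1 };

        if ( cpu_is_dedicated(cpu) )                   /* hypothetical flag */
        {
            struct vcpu *v = dedicated_vcpu_on(cpu);   /* hypothetical lookup */

            /* Fast path: no runq scan; run the dedicated VCPU if it is
             * runnable, otherwise idle.  A negative slice means the
             * scheduler timer is not re-armed on this CPU. */
            if ( vcpu_runnable(v) )
                ret.task = v;
            return ret;
        }

        /* ... normal RTDS path: pick a VCPU from the runq and return a
         * budget-based (<= 1ms) time slice ... */
        return ret;
    }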

I did the following evaluation to show the benefits of introducing
this dedicated CPU feature:

I created a simple cpu-intensive task which just does the
multiplication a specified number of times:
        start = rdtsc();
        while ( i++ < cpu_measurement->multiply_times )
            result += i * i;
        finish = rdtsc();
        latencies[k] = finish - start;
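
(The rdtsc() helper itself is not shown above; it is assumed to be a
thin wrapper around the x86 TSC, roughly like the sketch below. Note
that rdtsc is not a serializing instruction, so very short
measurements like the 1024-multiplication case can pick up some noise
from out-of-order execution.)

    #include <stdint.h>

    /* Minimal sketch of an rdtsc() helper on x86-64 -- an assumption
     * about what the test program uses, not the exact code. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }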

I ran this task and measured the execution time of the above piece of
code in different environments: native Linux on bare metal, domU on
Xen with the RTDS scheduler, domU on Xen with the RTDS scheduler with
the dedicated CPU feature, and domU on Xen with the Credit/Credit2
scheduler.

The difference between the execution time in the virtualized
environment and the execution time on native Linux on bare metal is
the virtualization overhead introduced by Xen.

I want to see that:
1) The virtualization overhead decreases a lot after the dedicated CPU
feature is employed for the RTDS scheduler (because the execution of
the task no longer suffers the scheduler overhead).
2) The frequency of invoking the scheduler on the dedicated CPU
becomes very low once the dedicated CPU feature is applied.

The results are as follows:
When the cpu-intensive task did the multiplication 1024 times, the
execution time of the piece of code was:
9264 cycles on native Linux on bare metal;
10320 cycles on Xen RTDS scheduler with the dedicated CPU feature;
10324 cycles on Xen RTDS scheduler without the dedicated CPU feature.

We didn't see any improvement from the dedicated CPU feature here
because the execution time is too short; the task may not have
experienced any scheduler overhead yet.

When the cpu-intensive task did the multiplication 536870912 times,
the execution time of the piece of code was:
4838016028 cycles on native Linux on bare metal;
4839649567 cycles on Xen RTDS scheduler with the dedicated CPU feature;
4855509977 cycles on Xen RTDS scheduler without the dedicated CPU feature.

We can see that the dedicated CPU feature did save time for the
cpu-intensive task. Without the dedicated CPU feature, the hypervisor
scheduler may steal time from the domU and delay the execution of the
task inside domU.

I varied the number of multiplications of the above piece of code in
the cpu-intensive task, and drew a figure to show the relation between
the overhead and the execution time of the task on native Linux. The
figure can be found at
http://www.cis.upenn.edu/~mengxu/xen-ml/cpu-base-alone_multiply_0_0_100.virtOhVSwcetnative.pdf.
Please note that the x-axis is the "log" of the execution time, so
the overhead is actually linear in the execution time of the task.

As to the frequency of invoking the RTDS scheduler with/without the
dedicated CPU feature, I added some code to trace which event triggers
the scheduler on the dedicated cpu and how frequently it is invoked.

Before we apply the dedicated CPU feature to the RTDS scheduler, the
scheduler on the dedicated CPU 3 was invoked once every 3.5us on
average.
(XEN) cpu 3 has invoked 356805936 SCHED_SOFTIRQ (sched) within 1267613845122 ns
(XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(356789129), do_pool(18)
(XEN) vcpu_yield(0), vcpu_block(10483)


After we apply the dedicated CPU feature to the RTDS scheduler, the
scheduler on the dedicated CPU 3 was invoked once every 136ms on
average, and only because of vcpu_block/vcpu_unblock events. (We
could modify Linux in domU, as Konrad suggests, to avoid the hypercall
when a vcpu is blocked/unblocked, but I'm unsure whether it is better
to do that since it involves a change in domU.)
(XEN) cpu 3 has invoked 5396 SCHED_SOFTIRQ (nooh) within 736973916783 ns
(XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(0), do_pool(0)
(XEN) vcpu_yield(0), vcpu_block(2698)
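
(As a sanity check, the averages quoted above follow directly from the
counters printed by the trace code:)

    #include <stdio.h>

    /* Cross-check of the quoted averages, using only the printed counters. */
    int main(void)
    {
        /* without dedicated CPU: 356805936 invocations in 1267613845122 ns */
        printf("%.1f ns\n", 1267613845122.0 / 356805936.0);  /* ~3553 ns ~ 3.5us */
        /* with dedicated CPU: 5396 invocations in 736973916783 ns */
        printf("%.1f ms\n", 736973916783.0 / 5396.0 / 1e6);  /* ~136.6 ms */
        return 0;
    }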


Here are some conclusions/observations we have:
1) The dedicated CPU feature can remove the scheduler overhead from the
domU and thus reduce the virtualization overhead.
2) The scheduler overhead of the current RTDS scheduler, as seen by the
application, is higher than that of the current credit/credit2
schedulers because the RTDS scheduler is invoked much more frequently
than the credit/credit2 schedulers. (The RTDS scheduler is invoked at
least once every 1ms, while the credit2 scheduler is invoked once every
30ms.) This shows we do need to move the RTDS scheduler from quantum
driven to event driven (i.e., timer-driven) and only call the
scheduler when it is necessary.
3) There exists some constant virtualization overhead (see the case
where the task's execution time is very small, 9264 cycles). I don't
know where this kind of constant virtualization overhead comes from
or whether we can eliminate/bound this kind of overhead. Do you have
any suggestions/advice on this?


What I'm thinking is that, since we want to target extremely
low-latency applications, we want to provide bare-metal performance to
these applications if possible. So I want to know where the
virtualization overheads come from and whether we can eliminate each of
them (by sacrificing some flexibility of virtualization). If we cannot
eliminate a source of overhead, we should at least be able to upper
bound its effect.
Do you have any suggestions?

Thank you very much for your help and advice!

Best,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: George Dunlap @ 2015-04-08  9:13 UTC
  To: Meng Xu, xen-devel, Dario Faggioli, Konrad Rzeszutek Wilk
  Cc: Oleg Sokolsky, Linh Thi Xuan Phan, Insup Lee, Dagaen Golomb

On 04/07/2015 09:25 PM, Meng Xu wrote:
> Hi George, Dario and Konrad,
> 
> I finished a prototype of the RTDS scheduler with the dedicated CPU
> feature and did some quick evaluation on this feature. Right now, I
> need to refactor the code (because it is kind of messy when I was
> exploring different approaches :() and will send out the clean patch
> later (this week or next week). But the design follows our discussion
> at http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.
> 
> In a nutshell, the design is: when a CPU is marked as a dedicated CPU,
> the scheduler on that CPU returns the dedicated VCPU with a negative
> time, which disables the scheduler timer on that CPU, and other CPUs
> no longer send SCHEDULE_SOFTIRQ to the dedicated CPU. The scheduler on
> the dedicated CPU may still be invoked when the dedicated VCPU is
> blocked/unblocked by the domU. When that happens, the scheduler goes
> through a fast path that just returns the idle VCPU or the dedicated
> VCPU instead of going through the runq.
> 
> I did the following evaluation to show the benefits of introducing
> this dedicated CPU feature:
> 
> I created a simple cpu-intensive task which just does the
> multiplication a specified number of times:
>         start = rdtsc();
>         while ( i++ < cpu_measurement->multiply_times )
>             result += i * i;
>         finish = rdtsc();
>         latencies[k] = finish - start;
> 
> I ran this task and measured the execution time of the above piece of
> code in different environments: native Linux on bare metal, domU on
> Xen with the RTDS scheduler, domU on Xen with the RTDS scheduler with
> the dedicated CPU feature, and domU on Xen with the Credit/Credit2
> scheduler.
> 
> The difference between the execution time in the virtualized
> environment and the execution time on native Linux on bare metal is
> the virtualization overhead introduced by Xen.
> 
> I want to see that:
> 1) The virtualization overhead decreases a lot after the dedicated CPU
> feature is employed for the RTDS scheduler (because the execution of
> the task no longer suffers the scheduler overhead).
> 2) The frequency of invoking the scheduler on the dedicated CPU
> becomes very low once the dedicated CPU feature is applied.
> 
> The results are as follows:
> When the cpu-intensive task did the multiplication 1024 times, the
> execution time of the piece of code was:
> 9264 cycles on native Linux on bare metal;
> 10320 cycles on Xen RTDS scheduler with the dedicated CPU feature;
> 10324 cycles on Xen RTDS scheduler without the dedicated CPU feature.
> 
> We didn't see any improvement from the dedicated CPU feature here
> because the execution time is too short; the task may not have
> experienced any scheduler overhead yet.
> 
> When the cpu-intensive task did the multiplication 536870912 times,
> the execution time of the piece of code was:
> 4838016028 cycles on native Linux on bare metal;
> 4839649567 cycles on Xen RTDS scheduler with the dedicated CPU feature;
> 4855509977 cycles on Xen RTDS scheduler without the dedicated CPU feature.

Hey Meng!  Thanks for looking at this.

One thing: it's not entirely clear to me whether the numbers for
"without dedicated CPU feature" are still with the equivalent of
"pinning' -- i.e., is it guaranteed that no other vcpu will be run on
the same cpu as the test program?

Assuming that's the case, the numbers you give above show a 0.3%
improvement for the "dedicated" cpu for cpu-intensive workloads.


> We can see that the dedicated CPU feature did save time for the
> cpu-intensive task. Without the dedicated CPU feature, the hypervisor
> scheduler may steal time from the domU and delay the execution of the
> task inside domU.
> 
> I varied the number of multiplications of the above piece of code in
> the cpu-intensive task, and drew a figure to show the relation between
> the overhead and the execution time of the task on native Linux. The
> figure can be found at
> http://www.cis.upenn.edu/~mengxu/xen-ml/cpu-base-alone_multiply_0_0_100.virtOhVSwcetnative.pdf.
> Please note that the x-axis is the "log" of the execution time, so
> the overhead is actually linear in the execution time of the task.

I'm not sure I can gain any useful information out of this graph.  A
more useful comparison  would be to graph the execution time as an
*overhead* compared to the Linux execution time.  For instance, in the
numbers above, you'd have Linux = 1, RTDS+dedicated = 1.000337, RTDS =
1.00361.
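
(For concreteness, a trivial program computing that normalization from
the numbers quoted above:)

    #include <stdio.h>

    /* The normalization suggested above, from the quoted cycle counts. */
    int main(void)
    {
        const double linux_cycles = 4838016028.0;

        printf("RTDS+dedicated: %f\n", 4839649567.0 / linux_cycles); /* ~1.0003 */
        printf("RTDS:           %f\n", 4855509977.0 / linux_cycles); /* ~1.0036 */
        return 0;
    }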

But what it sounds like you're saying is that if you did such a graph,
the overhead would be pretty flat.  That's what I'd expect -- a fairly
constant overhead, regardless of how long you were running the test.

> As to the frequency of invoking the RTDS scheduler with/without the
> dedicated CPU feature, I added some code to trace which event triggers
> the scheduler on the dedicated cpu and how frequently it is invoked.
>
> Before we apply the dedicated CPU feature to the RTDS scheduler, the
> scheduler on the dedicated CPU 3 was invoked once every 3.5us on
> average.
> (XEN) cpu 3 has invoked 356805936 SCHED_SOFTIRQ (sched) within 1267613845122 ns
> (XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(356789129), do_pool(18)
> (XEN) vcpu_yield(0), vcpu_block(10483)
> 
> 
> After we apply the dedicated CPU feature to the RTDS scheduler, the
> scheduler on the dedicated CPU 3 was invoked once every 136ms on
> average, and only because of vcpu_block/vcpu_unblock events. (We
> could modify Linux in domU, as Konrad suggests, to avoid the hypercall
> when a vcpu is blocked/unblocked, but I'm unsure whether it is better
> to do that since it involves a change in domU.)
> (XEN) cpu 3 has invoked 5396 SCHED_SOFTIRQ (nooh) within 736973916783 ns
> (XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(0), do_pool(0)
> (XEN) vcpu_yield(0), vcpu_block(2698)
> 
> 
> Here are some conclusions/observations we have:
> 1) The dedicated CPU feature can remove the scheduler overhead from the
> domU and thus reduce the virtualization overhead.
> 2) The scheduler overhead of the current RTDS scheduler, as seen by the
> application, is higher than that of the current credit/credit2
> schedulers because the RTDS scheduler is invoked much more frequently
> than the credit/credit2 schedulers. (The RTDS scheduler is invoked at
> least once every 1ms, while the credit2 scheduler is invoked once every
> 30ms.) This shows we do need to move the RTDS scheduler from quantum
> driven to event driven (i.e., timer-driven) and only call the
> scheduler when it is necessary.

So a couple of things.  First, the vast majority of people using
virtualization don't care *that much* about the CPU overhead.  Even in
the case of embedded, a 0.3% overhead reduction would probably translate
to a 0.3% improvement in battery life -- an amount so miniscule that it
would be lost in the noise.

Secondly, adding an entirely new interface, as implementing the
"dedicated cpu" would require, on the other hand, is a fairly
significant cost.  It's costly for users to learn and configure the new
interface, it's costly to document, and once it's there we have to
continue to support it perhaps for a long time to come; and the feature
itself is also fairly complicated and increases the code maintenance.

So the performance improvement you've shown so far I think is nowhere
near high enough a benefit to outweigh this cost.

And in any case, as you say, it looks like the source of the overhead is
the very frequent invocation of the RTDS scheduler.  You could probably
get the same kinds of benefits without adding any new interfaces by
reducing the amount of time the scheduler gets invoked when there are no
other tasks to run on that cpu.

What I was expecting you to test, for the RTDS scheduler, was the
wake-up latency.  Have you looked at that at all?

> 3) There exists some constant virtualization overhead (see the case
> where the task's execution time is very small, 9264 cycles). I don't
> know where this kind of constant virtualization overhead comes from
> or whether we can eliminate/bound this kind of overhead. Do you have
> any suggestions/advice on this?

How are you testing this -- running RDTSC?  Do you know what TSC mode
you're running in?  If you're trapping on TSCs, that might account for
some of the overhead for very small cycles.

Other than that, nothing comes to mind off the top of my head.

 -George


* Re: Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: Meng Xu @ 2015-04-08 20:52 UTC
  To: George Dunlap
  Cc: Dario Faggioli, xen-devel, Oleg Sokolsky, Linh Thi Xuan Phan,
	Insup Lee, Dagaen Golomb

2015-04-08 5:13 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:
> On 04/07/2015 09:25 PM, Meng Xu wrote:
>> Hi George, Dario and Konrad,
>>
>> I finished a prototype of the RTDS scheduler with the dedicated CPU
>> feature and did some quick evaluation on this feature. Right now, I
>> need to refactor the code (because it is kind of messy when I was
>> exploring different approaches :() and will send out the clean patch
>> later (this week or next week). But the design follows our discussion
>> at http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.
>>
>> In a nutshell, the design is: when a CPU is marked as a dedicated CPU,
>> the scheduler on that CPU returns the dedicated VCPU with a negative
>> time, which disables the scheduler timer on that CPU, and other CPUs
>> no longer send SCHEDULE_SOFTIRQ to the dedicated CPU. The scheduler on
>> the dedicated CPU may still be invoked when the dedicated VCPU is
>> blocked/unblocked by the domU. When that happens, the scheduler goes
>> through a fast path that just returns the idle VCPU or the dedicated
>> VCPU instead of going through the runq.
>>
>> I did the following evaluation to show the benefits of introducing
>> this dedicated CPU feature:
>>
>> I created a simple cpu-intensive task which just does the
>> multiplication a specified number of times:
>>         start = rdtsc();
>>         while ( i++ < cpu_measurement->multiply_times )
>>             result += i * i;
>>         finish = rdtsc();
>>         latencies[k] = finish - start;
>>
>> I ran this task and measured the execution time of the above piece of
>> code in different environments: native Linux on bare metal, domU on
>> Xen with the RTDS scheduler, domU on Xen with the RTDS scheduler with
>> the dedicated CPU feature, and domU on Xen with the Credit/Credit2
>> scheduler.
>>
>> The difference between the execution time in the virtualized
>> environment and the execution time on native Linux on bare metal is
>> the virtualization overhead introduced by Xen.
>>
>> I want to see that:
>> 1) The virtualization overhead decreases a lot after the dedicated CPU
>> feature is employed for the RTDS scheduler (because the execution of
>> the task no longer suffers the scheduler overhead).
>> 2) The frequency of invoking the scheduler on the dedicated CPU
>> becomes very low once the dedicated CPU feature is applied.
>>
>> The results are as follows:
>> When the cpu-intensive task did the multiplication 1024 times, the
>> execution time of the piece of code was:
>> 9264 cycles on native Linux on bare metal;
>> 10320 cycles on Xen RTDS scheduler with the dedicated CPU feature;
>> 10324 cycles on Xen RTDS scheduler without the dedicated CPU feature.
>>
>> We didn't see any improvement from the dedicated CPU feature here
>> because the execution time is too short; the task may not have
>> experienced any scheduler overhead yet.
>>
>> When the cpu-intensive task did the multiplication 536870912 times,
>> the execution time of the piece of code was:
>> 4838016028 cycles on native Linux on bare metal;
>> 4839649567 cycles on Xen RTDS scheduler with the dedicated CPU feature;
>> 4855509977 cycles on Xen RTDS scheduler without the dedicated CPU feature.
>
> Hey Meng!  Thanks for looking at this.
>
> One thing: it's not entirely clear to me whether the numbers for
> "without dedicated CPU feature" are still with the equivalent of
> "pinning' -- i.e., is it guaranteed that no other vcpu will be run on
> the same cpu as the test program?

Yes. In the experiment, every VCPU is pinned to one isolated CPU. So
no other VCPU will run on the same cpu as the test program.

>
> Assuming that's the case, the numbers you give above show a 0.3%
> improvement for the "dedicated" cpu for cpu-intensive workloads.
>
>

Yes. This is the saving when the cpu-intensive task did the
multiplication 536870912 times.
Another thing to note is that the scheduling quantum of the RTDS
scheduler is <= 1ms. If the scheduling quantum were smaller, the saving
should increase, but not by much, I speculate. (So I agree with your
conclusion that the benefits may not be worth the complexity.)

>> We can see that the dedicated CPU feature did save time for the
>> cpu-intensive task. Without the dedicated CPU feature, the hypervisor
>> scheduler may steal time from the domU and delay the execution of the
>> task inside domU.
>>
>> I varied the number of multiplications of the above piece of code in
>> the cpu-intensive task, and drew a figure to show the relation between
>> the overhead and the execution time of the task on native Linux. The
>> figure can be found at
>> http://www.cis.upenn.edu/~mengxu/xen-ml/cpu-base-alone_multiply_0_0_100.virtOhVSwcetnative.pdf.
>> Please note that the x-axis is the "log" of the execution time, so
>> the overhead is actually linear in the execution time of the task.
>
> I'm not sure I can gain any useful information out of this graph.  A
> more useful comparison  would be to graph the execution time as an
> *overhead* compared to the Linux execution time.  For instance, in the
> numbers above, you'd have Linux = 1, RTDS+dedicated = 1.000337, RTDS =
> 1.00361.

Do you mean I should normalize the execution times on RTDS+dedicated
and RTDS to the execution time on Linux?
I can do it if necessary. (Maybe it won't be necessary. :-) )

>
> But what it sounds like you're saying is that if you did such a graph,
> the overhead would be pretty flat.  That's what I'd expect -- a fairly
> constant overhead, regardless of how long you were running the test.

Yes.

>
>> As to the frequency of invoking the RTDS scheduler with/without the
>> dedicated CPU feature, I added some code to trace which event triggers
>> the scheduler on the dedicated cpu and how frequently it is invoked.
>>
>> Before we apply the dedicated CPU feature to the RTDS scheduler, the
>> scheduler on the dedicated CPU 3 was invoked once every 3.5us on
>> average.
>> (XEN) cpu 3 has invoked 356805936 SCHED_SOFTIRQ (sched) within 1267613845122 ns
>> (XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(356789129), do_pool(18)
>> (XEN) vcpu_yield(0), vcpu_block(10483)
>>
>>
>> After we apply the dedicated CPU feature to the RTDS scheduler, the
>> scheduler on the dedicated CPU 3 was invoked once every 136ms on
>> average, and only because of vcpu_block/vcpu_unblock events. (We
>> could modify Linux in domU, as Konrad suggests, to avoid the hypercall
>> when a vcpu is blocked/unblocked, but I'm unsure whether it is better
>> to do that since it involves a change in domU.)
>> (XEN) cpu 3 has invoked 5396 SCHED_SOFTIRQ (nooh) within 736973916783 ns
>> (XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(0), do_pool(0)
>> (XEN) vcpu_yield(0), vcpu_block(2698)
>>
>>
>> Here are some conclusions/observations we have:
>> 1) The dedicated CPU feature can remove the scheduler overhead from the
>> domU and thus reduce the virtualization overhead.
>> 2) The scheduler overhead of the current RTDS scheduler, as seen by the
>> application, is higher than that of the current credit/credit2
>> schedulers because the RTDS scheduler is invoked much more frequently
>> than the credit/credit2 schedulers. (The RTDS scheduler is invoked at
>> least once every 1ms, while the credit2 scheduler is invoked once every
>> 30ms.) This shows we do need to move the RTDS scheduler from quantum
>> driven to event driven (i.e., timer-driven) and only call the
>> scheduler when it is necessary.
>
> So a couple of things.  First, the vast majority of people using
> virtualization don't care *that much* about the CPU overhead.  Even in
> the case of embedded, a 0.3% overhead reduction would probably translate
> to a 0.3% improvement in battery life -- an amount so miniscule that it
> would be lost in the noise.

I see.

>
> Secondly, adding an entirely new interface, as implementing the
> "dedicated cpu" would require, on the other hand, is a fairly
> significant cost.  It's costly for users to learn and configure the new
> interface, it's costly to document, and once it's there we have to
> continue to support it perhaps for a long time to come; and the feature
> itself is also fairly complicated and increases the code maintenance.
>
> So the performance improvement you've shown so far I think is nowhere
> near high enough a benefit to outweigh this cost.

OK. I see and agree.

>
> And in any case, as you say, it looks like the source of the overhead is
> the very frequent invocation of the RTDS scheduler.  You could probably
> get the same kinds of benefits without adding any new interfaces by
> reducing the amount of time the scheduler gets invoked when there are no
> other tasks to run on that cpu.

Yes. This is what Dagaen (cc.ed) is doing right now. He had an RFC
patch and sent it to me last week. We are working on refining the
patch before sending it out to the mailing list.

>
> What I was expecting you to test, for the RTDS scheduler, was the
> wake-up latency.  Have you looked at that at all?

Ah, I didn't realize this... Do you have any concrete evaluation plan for this?
In my mind, I can issue hypercalls in domU to wake up and sleep a vcpu
and measure how long it takes to wake up a vcpu. Maybe you have some
better idea in mind?
(The wake-up latency of a vcpu will depend on the priority of the
vcpu and how heavily loaded the system is, I speculate.)

>
>> 3) There exists some constant virtualization overhead (see the case
>> where the task's execution time is very small, 9264 cycles). I don't
>> know where this kind of constant virtualization overhead comes from
>> or whether we can eliminate/bound this kind of overhead. Do you have
>> any suggestions/advice on this?
>
> How are you testing this -- running RDTSC?  Do you know what TSC mode
> you're running in?  If you're trapping on TSCs, that might account for
> some of the overhead for very small cycles.

I'm using rdtsc to read the TSC counter in the test program in domU.
I didn't configure the TSC mode for domU, so I think it should be the
default mode (i.e., tsc_mode=1 emulated mode?).  Do you think this
could be the source of the ~1000 cycles of overhead for the task with
a small execution time?

You mentioned "If you're trapping on TSCs"; when might this (trapping)
happen? Is it related to the always-emulated mode, the never-emulated
mode, or the PVRDTSCP mode? Do you have any suggestion on how I should
investigate this?  (I had a look at
http://xenbits.xen.org/docs/4.3-testing/misc/tscmode.txt, but didn't
get an idea of how to look into this. :-( )

>
> Other than that, nothing comes to mind off the top of my head.

I see.

Thank you very much for your advice and time!

Best,

Meng

>
>  -George
>



-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: Konrad Rzeszutek Wilk @ 2015-04-10 14:28 UTC
  To: Meng Xu, andrii.tseglytskyi
  Cc: George Dunlap, Dario Faggioli, xen-devel, Oleg Sokolsky,
	Linh Thi Xuan Phan, Insup Lee, Dagaen Golomb

Andrii,

I've a question to you at the bottom that I hope you can answer.

> >> 3) There exists some constant virtualization overhead (see the case
> >> where the task's execution time is very small, 9264 cycles). I don't
> >> know where this kind of constant virtualization overhead comes from
> >> or whether we can eliminate/bound this kind of overhead. Do you have
> >> any suggestions/advice on this?
> >
> > How are you testing this -- running RDTSC?  Do you know what TSC mode
> > you're running in?  If you're trapping on TSCs, that might account for
> > some of the overhead for very small cycles.
> 
> I'm using rdtsc to read the TSC counter in the test program in domU.
> I didn't configure the TSC mode for domU, so I think it should be the
> default mode (i.e., tsc_mode=1 emulated mode?).  Do you think this
> could be the source of the ~1000 cycles of overhead for the task with
> a small execution time?

It should be the =2, which is native (no emulation). At least
that is what libxl sets by default.

> 
> You mentioned "If you're trapping on TSCs"; when might this (trapping)
> happen? Is it related to the always-emulated mode, the never-emulated
> mode, or the PVRDTSCP mode? Do you have any suggestion on how I should
> investigate this?  (I had a look at

We would trap (VMEXIT), and if you look at the EIP of the guest, the
virtual address should correspond to the 'rdtsc' opcode.

> http://xenbits.xen.org/docs/4.3-testing/misc/tscmode.txt, but didn't
> get an idea of how to look into this. :-( )
> 
> >
> > Other than that, nothing comes to mind off the top of my head.
> 
> I see.

I had dug through the path of the different scenarios when the
guest does a HALT operation - and the codepaths we select.

You might want to see http://mid.gmane.org/20140423212824.GB12560@phenom.dumpdata.com
and http://mid.gmane.org/20140506173627.GA6942@phenom.dumpdata.com

The point is that you probably want to run 'xentrace' and use
xentrace_format to get the raw state of all the traces.

Then comes the hard part of figuring out what happens in between
the traces. I figured out that my problem was due to three
softirqs being scheduled - TIMER, TASKLET, SCHEDULE - and TASKLET
was taking a global spinlock.

In your case you probably don't have the TASKLET.

Since you are looking at this from the perspective of the guest,
you can set up a page shared between the hypervisor and your guest and
add a marker in there (set a bit). When the hypervisor
goes into the VMEXIT path it can check if the marker is there and
do a trace op - and also one when it is right about to
go back to the guest.

Andrii, CC-ed here, I believe did something like that?

> 
> Thank you very much for your advice and time!
> 
> Best,
> 
> Meng
> 
> >
> >  -George
> >
> 
> 
> 
> -- 
> 
> 
> -----------
> Meng Xu
> PhD Student in Computer and Information Science
> University of Pennsylvania


* Re: Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: Dario Faggioli @ 2015-04-23 12:48 UTC
  To: Meng Xu
  Cc: George Dunlap, xen-devel, Oleg Sokolsky, Linh Thi Xuan Phan,
	Insup Lee, Dagaen Golomb



Hey, guys,

I know, I know, I'm soooo much late to the party! Sorry, I got trapped
into a thing that I really needed to finish... :-/

I've got no intention to resurrect this old thread, just wanted to
point out a few things.

On Wed, 2015-04-08 at 16:52 -0400, Meng Xu wrote:
> 2015-04-08 5:13 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:
> > On 04/07/2015 09:25 PM, Meng Xu wrote:
> >> Hi George, Dario and Konrad,
> >>
> >> I finished a prototype of the RTDS scheduler with the dedicated CPU
> >> feature and did some quick evaluation on this feature. Right now, I
> >> need to refactor the code (because it is kind of messy when I was
> >> exploring different approaches :() and will send out the clean patch
> >> later (this week or next week). But the design follows our discussion
> >> at http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.
> >>
The idea of 'dedicated CPU' makes sense. It's also always been quite
common, in the Linux community, to see it as a real-time oriented
feature. I personally don't agree much, as real-time is about
determinism and dedicating a CPU to a task (in our case, that would mean
dedicating a pCPU to a vCPU and then, in the guest, that vCPU to a task)
does not automatically give you determinism.

Sure, it cuts off some overhead and some sources of unpredictable
behavior (e.g., scheduler code), but not all of them (what about, for
instance, caches shared with non-isolated pCPUs). No, IMO, if you want
determinism, you should make the code deterministic, not get rid of
it! :-D

In fact, Linux has a feature similar to the one Meng investigated, and that
has traditionally been used (at least until I was involved with Linux
scheduling) by HPC people, database engines and high frequency trading
use cases (which are also often categorized as 'real-time workloads' but
just aren't, IMO).

It's called isolcpus. For sure there was a boot time parameter for it,
and it looks like it is still there:
http://wiki.linuxcnc.org/cgi-bin/wiki.pl?The_Isolcpus_Boot_Parameter_And_GRUB2
http://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
http://lxr.linux.no/linux+v3.19.1/Documentation/kernel-parameters.txt#L1530

I'm not sure whether they grew interfaces to setup this at runtime, but
doubt it.

For us, I'm not sure whether something like that would be useful. To be
fruitfully used together with something similar to Linux's isolcpus, it
needs to look like what Meng is doing, i.e., it ought to be possible to
handle single vCPUs, not full domains. However...

> > Secondly, adding an entirely new interface, as implementing the
> > "dedicated cpu" would require, on the other hand, is a fairly
> > significant cost.  It's costly for users to learn and configure the new
> > interface, it's costly to document, and once it's there we have to
> > continue to support it perhaps for a long time to come; and the feature
> > itself is also fairly complicated and increases the code maintenance.
> >
> > So the performance improvement you've shown so far I think is nowhere
> > near high enough a benefit to outweigh this cost.
> 
... I agree with George on this...

> OK. I see and agree.
> 
... and I'm happy you also do! :-D

> > And in any case, as you say, it looks like the source of the overhead is
> > the very frequent invocation of the RTDS scheduler.
>
Exactly! I'd put it this way: there are more urgent and more useful
optimizations, in general, but especially in RTDS, to be done before
thinking about something like this.

>   You could probably
> > get the same kinds of benefits without adding any new interfaces by
> > reducing the amount of time the scheduler gets invoked when there are no
> > other tasks to run on that cpu.
> 
Exactly. And again, that is particularly relevant to RTDS, as numbers
show. Looking again at Linux world, this (i.e., avoiding invoking the
scheduler when there is only one task on a CPU) is also something
they've introduced rather recently.

It's called full dynticks:
https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
http://ertl.jp/~shinpei/conf/ospert13/slides/FredericWeisbecker.pdf
https://lwn.net/Articles/549580/
http://thread.gmane.org/gmane.linux.kernel/1485210 [*]

[*] check out Linus' replies... "awesome" as usual, he even managed to
rant about virtualization, all by himself!! :-P

That is IMO a line of action that may deserve some investigation.
RTDS-wise, for sure... the Credit-s, are not at all bad from that
perspective (as your numbers also show), but it might be possible to do
better.

> Yes. This is what Dagaen (cc.ed) is doing right now. He had an RFC
> patch and sent it to me last week. We are working on refining the
> patch before sending it out to the mailing list.
> 
I'll be super glad to see this! :-D

> >
> > What I was expecting you to test, for the RTDS scheduler, was the
> > wake-up latency.  Have you looked at that at all?
> 
Indeed, that would be really interesting.

> Ah, I didn't realize this... Do you have any concrete evaluation plan for this?
> In my mind, I can issue hypercalls in domU to wake up and sleep a vcpu
> and measure how long it takes to wake up a vcpu. Maybe you have some
> better idea in mind?
> (The wake-up latency of a vcpu will depend on the priority of the
> vcpu and how heavily loaded the system is, I speculate.)
> 
Yes, that is something that could (should?) be done, as the wakeup
latency of a vcpu is a lower bound for the wakeup latency of in-guest
workloads, so we really want to know where we stand wrt that, if we
need to improve things, and if yes how.

It's priority and load dependent... yes, of course, but that's what we
have real-time schedulers for, isn't it? :-P Jokes apart, for the actual
'lower bound', we're clearly interested in measuring a vcpu when running
alone on a pCPU, or with top priority.

On the other hand, to look at wakeup latency from within the guest,
cyclictest is the way to go:
https://rt.wiki.kernel.org/index.php/Cyclictest

What we want is to run it inside a guest, under different host and guest
load conditions (and using different schedulers, varying the scheduling
parameters, etc), and see what happens... Ever looked at that? I think
it would be interesting.

I've done something similar while preparing this talk:
https://archive.fosdem.org/2014/schedule/event/virtiaas16/

But I never got the chance to repeat the experiments (nor did I do any
further reasoning or investigation about how the timestamps are
obtained, TSC emulation, etc., as George pointed out).

That's all... Sorry again for chiming in only now. :-(

Regards,
Dario



* Re: Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: Meng Xu @ 2015-04-23 19:20 UTC
  To: Dario Faggioli
  Cc: George Dunlap, xen-devel, Oleg Sokolsky, Linh Thi Xuan Phan,
	Insup Lee, Dagaen Golomb

Hi Dario,

2015-04-23 8:48 GMT-04:00 Dario Faggioli <dario.faggioli@citrix.com>:
> Hey, guys,
>
> I know, I know, I'm soooo much late to the party! Sorry, I got trapped
> into a thing that I really needed to finish... :-/
>
> I've got no intention to resurrect this old thread, just wanted to
> point out a few things.

What you pointed out is very interesting! I will have a look at those
links and then report the measurement numbers.
(I will also have to redo some experiments to see how the TSC may
affect the results, which I didn't realize before. :-( )

>
> On Wed, 2015-04-08 at 16:52 -0400, Meng Xu wrote:
>> 2015-04-08 5:13 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:
>> > On 04/07/2015 09:25 PM, Meng Xu wrote:
>> >> Hi George, Dario and Konrad,
>> >>
>> >> I finished a prototype of the RTDS scheduler with the dedicated CPU
>> >> feature and did some quick evaluation on this feature. Right now, I
>> >> need to refactor the code (because it is kind of messy when I was
>> >> exploring different approaches :() and will send out the clean patch
>> >> later (this week or next week). But the design follows our discussion
>> >> at http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.
>> >>
> The idea of 'dedicated CPU' makes sense. It's also always been quite
> common, in the Linux community, to see it as a real-time oriented
> feature. I personally don't agree much, as real-time is about
> determinism and dedicating a CPU to a task (in our case, that would mean
> dedicating a pCPU to a vCPU and then, in the guest, that vCPU to a task)
> does not automatically give you determinism.
>
> Sure, it cuts off some overhead and some sources of unpredictable
> behavior (e.g., scheduler code), but not all of them (what about, for
> instance, caches shared with non-isolated pCPUs).

I'm working on eliminating the shared cache interference among guest
domains on non-isolated pCPUs. I have a prototype that can statically
partition the LLC into equal-size partitions and assign them to guest
domains based on each domain's configuration file. I'm running some
evaluations to show the effectiveness of the cache partitioning
approach (via page coloring) and will send you guys some slides about
this soon. :-)
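
(For the curious: the core of the page coloring idea is tiny. A
minimal sketch, with made-up cache geometry rather than the values the
prototype actually uses:)

    #include <stdint.h>

    /* Minimal sketch of page coloring -- the cache geometry here is an
     * example, not a measured value.  With a physically indexed LLC,
     * pages of different colors can never evict each other. */
    #define PAGE_SHIFT  12
    #define LLC_SIZE    (8u << 20)   /* assume an 8 MiB LLC */
    #define LLC_WAYS    16u          /* assume 16-way associativity */
    #define NUM_COLORS  ((LLC_SIZE / LLC_WAYS) >> PAGE_SHIFT)  /* way size / page size */

    /* Color of a machine frame number: the set-index bits that lie
     * above the page offset. */
    static inline unsigned int page_color(uint64_t mfn)
    {
        return (unsigned int)(mfn & (NUM_COLORS - 1));
    }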

>> > And in any case, as you say, it looks like the source of the overhead is
>> > the very frequent invocation of the RTDS scheduler.
>>
> Exactly! I'd put it this way: there are more urgent and more useful
> optimizations, in general, but especially in RTDS, to be done before
> thinking about something like this.

Right.

>> > What I was expecting you to test, for the RTDS scheduler, was the
>> > wake-up latency.  Have you looked at that at all?
>>
> Indeed, that would be really interesting.
>
>> Ah, I didn't realize this... Do you have any concrete evaluation plan for this?
>> In my mind, I can issue hypercalls in domU to wake up and sleep a vcpu
>> and measure how long it takes to wake up a vcpu. Maybe you have some
>> better idea in mind?
>> (The wake-up latency of a vcpu will depend on the priority of the
>> vcpu and how heavily loaded the system is, I speculate.)
>>
> Yes, that is something that could (should?) be done, as the wakeup
> latency of a vcpu is a lower bound for the wakeup latency of in-guest
> workloads, so we really want to know where we stand wrt that, if we
> need to improve things, and if yes how.
>
> It's priority and load dependent... yes, of course, but that's what we
> have real-time schedulers for, isn't it? :-P Jokes apart, for the actual
> 'lower bound', we're clearly interested in measuring a vcpu when running
> alone on a pCPU, or with top priority.
>
> On the other hand, to look at wakeup latency from within the guest,
> cyclictest is the way to go:
> https://rt.wiki.kernel.org/index.php/Cyclictest
>
> What we want is to run it inside a guest, under different host and guest
> load conditions (and using different schedulers, varying the scheduling
> parameters, etc), and see what happens... Ever looked at that? I think
> it would be interesting.

I will have a look at cyclictest and set up some evaluation.
I totally agree that the wakeup latency (of the highest-priority vcpu)
is very important for real-time applications. I will start a new
thread about the cyclictest results once I get them.

>
> I've done something similar while preparing this talk:
> https://archive.fosdem.org/2014/schedule/event/virtiaas16/
>
> But I never got the chance to repeat the experiments (nor did I do any
> further reasoning or investigation about how the timestamps are
> obtained, TSC emulation, etc., as George pointed out).
>
> That's all... Sorry again for chiming in only now. :-(

No problem at all! :-) Your advice and information are very useful!

Thanks and best regards,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

