* Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: Meng Xu @ 2015-04-07 20:25 UTC
  To: xen-devel, George Dunlap, Dario Faggioli, Konrad Rzeszutek Wilk
  Cc: Oleg Sokolsky, Linh Thi Xuan Phan, Insup Lee, Dagaen Golomb

Hi George, Dario and Konrad,

I finished a prototype of the RTDS scheduler with the dedicated CPU
feature and did some quick evaluation on this feature. Right now, I
need to refactor the code (because it is kind of messy when I was
exploring different approaches :() and will send out the clean patch
later (this week or next week). But the design follows our discussion
at http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.

In a nutshell, the design is: when a CPU is marked as a dedicated CPU,
the scheduler on that CPU returns the dedicated VCPU with a negative
time, which disables the scheduler timer on that CPU, and other CPUs
no longer send SCHEDULE_SOFTIRQ to the dedicated CPU. The scheduler on
the dedicated CPU may still be invoked when the dedicated VCPU is
blocked/unblocked by the domU. When that happens, the scheduler goes
through a fast path that just returns the idle VCPU or the dedicated
VCPU instead of going through the runq.
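
To make that concrete, here is a rough sketch of the fast path
(illustrative only, not the actual patch: cpu_is_dedicated() and
dedicated_vcpu_on() are invented names for this example, and the
callback signature is only an approximation of the do_schedule hook):

    /* Illustrative sketch only -- not the real patch.  The helpers
     * cpu_is_dedicated() and dedicated_vcpu_on() are invented names;
     * task_slice is the structure scheduler callbacks return. */
    static struct task_slice
    rt_schedule(const struct scheduler *ops, s_time_t now,
                bool_t tasklet_work_scheduled)
    {
        const unsigned int cpu = smp_processor_id();
        struct task_slice ret = { .task = idle_vcpu[cpu], .time = -1 };

        if ( cpu_is_dedicated(cpu) )                   /* hypothetical flag */
        {
            struct vcpu *v = dedicated_vcpu_on(cpu);   /* hypothetical lookup */

            /* Fast path: no runq scan; run the dedicated VCPU if it is
             * runnable, otherwise idle.  A negative slice means the
             * scheduler timer is not re-armed on this CPU. */
            if ( vcpu_runnable(v) )
                ret.task = v;
            return ret;
        }

        /* ... normal RTDS path: pick a VCPU from the runq and return a
         * budget-based (<= 1ms) time slice ... */
        return ret;
    }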

I did the following evaluation to show the benefits of introducing
this dedicated CPU feature:

I created a simple cpu-intensive task which just does the
multiplication a specified number of times:
        start = rdtsc();
        while ( i++ < cpu_measurement->multiply_times )
            result += i * i;
        finish = rdtsc();
        latencies[k] = finish - start;
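
(The rdtsc() helper itself is not shown above; it is assumed to be a
thin wrapper around the x86 TSC, roughly like the sketch below. Note
that rdtsc is not a serializing instruction, so very short
measurements like the 1024-multiplication case can pick up some noise
from out-of-order execution.)

    #include <stdint.h>

    /* Minimal sketch of an rdtsc() helper on x86-64 -- an assumption
     * about what the test program uses, not the exact code. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }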

I ran this task and measured the execution time of the above piece of
code in different environments: native Linux on bare metal, domU on
Xen with the RTDS scheduler, domU on Xen with the RTDS scheduler with
the dedicated CPU feature, and domU on Xen with the Credit/Credit2
scheduler.

The difference between the execution time in the virtualized
environment and the execution time on native Linux on bare metal is
the virtualization overhead introduced by Xen.

I want to see that:
1) The virtualization overhead decreases a lot after the dedicated CPU
feature is employed for the RTDS scheduler (because the execution of
the task no longer suffers the scheduler overhead).
2) The frequency of invoking the scheduler on the dedicated CPU
becomes very low once the dedicated CPU feature is applied.

The results are as follows:
When the cpu-intensive task did the multiplication 1024 times, the
execution time of the piece of code was:
9264 cycles on native Linux on bare metal;
10320 cycles on Xen RTDS scheduler with the dedicated CPU feature;
10324 cycles on Xen RTDS scheduler without the dedicated CPU feature.

We didn't see any improvement from the dedicated CPU feature here
because the execution time is too short; the task may not have
experienced any scheduler overhead yet.

When the cpu-intensive task did the multiplication 536870912 times,
the execution time of the piece of code was:
4838016028 cycles on native Linux on bare metal;
4839649567 cycles on Xen RTDS scheduler with the dedicated CPU feature;
4855509977 cycles on Xen RTDS scheduler without the dedicated CPU feature.

We can see that the dedicated CPU feature did save time for the
cpu-intensive task. Without the dedicated CPU feature, the hypervisor
scheduler may steal time from the domU and delay the execution of the
task inside domU.

I varied the number of multiplications of the above piece of code in
the cpu-intensive task, and drew a figure to show the relation between
the overhead and the execution time of the task on native Linux. The
figure can be found at
http://www.cis.upenn.edu/~mengxu/xen-ml/cpu-base-alone_multiply_0_0_100.virtOhVSwcetnative.pdf.
Please note that the x-axis is the "log" of the execution time, so
the overhead is actually linear in the execution time of the task.

As to the frequency of invoking the RTDS scheduler with/without the
dedicated CPU feature, I added some code to trace which event triggers
the scheduler on the dedicated cpu and how frequently it is invoked.

Before we apply the dedicated CPU feature to the RTDS scheduler, the
scheduler on the dedicated CPU 3 was invoked once every 3.5us on
average.
(XEN) cpu 3 has invoked 356805936 SCHED_SOFTIRQ (sched) within 1267613845122 ns
(XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(356789129), do_pool(18)
(XEN) vcpu_yield(0), vcpu_block(10483)


After we apply the dedicated CPU feature to the RTDS scheduler, the
scheduler on the dedicated CPU 3 was invoked once every 136ms on
average, and only because of vcpu_block/vcpu_unblock events. (We
could modify Linux in domU, as Konrad suggests, to avoid the hypercall
when a vcpu is blocked/unblocked, but I'm unsure whether it is better
to do that since it involves a change in domU.)
(XEN) cpu 3 has invoked 5396 SCHED_SOFTIRQ (nooh) within 736973916783 ns
(XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(0), do_pool(0)
(XEN) vcpu_yield(0), vcpu_block(2698)
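
(As a sanity check, the averages quoted above follow directly from the
counters printed by the trace code:)

    #include <stdio.h>

    /* Cross-check of the quoted averages, using only the printed counters. */
    int main(void)
    {
        /* without dedicated CPU: 356805936 invocations in 1267613845122 ns */
        printf("%.1f ns\n", 1267613845122.0 / 356805936.0);  /* ~3553 ns ~ 3.5us */
        /* with dedicated CPU: 5396 invocations in 736973916783 ns */
        printf("%.1f ms\n", 736973916783.0 / 5396.0 / 1e6);  /* ~136.6 ms */
        return 0;
    }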


Here are some conclusions/observations we have:
1) The dedicated CPU feature can remove the scheduler overhead from the
domU and thus reduce the virtualization overhead.
2) The scheduler overhead of the current RTDS scheduler, as seen by the
application, is higher than that of the current credit/credit2
schedulers because the RTDS scheduler is invoked much more frequently
than the credit/credit2 schedulers. (The RTDS scheduler is invoked at
least once every 1ms, while the credit2 scheduler is invoked once every
30ms.) This shows we do need to move the RTDS scheduler from quantum
driven to event driven (i.e., timer-driven) and only call the
scheduler when it is necessary.
3) There exists some constant virtualization overhead (see the case
where the task's execution time is very small, 9264 cycles). I don't
know where this kind of constant virtualization overhead comes from
or whether we can eliminate/bound this kind of overhead. Do you have
any suggestions/advice on this?


What I'm thinking is that, since we want to target extremely
low-latency applications, we want to provide bare-metal performance to
these applications if possible. So I want to know where the
virtualization overheads come from and whether we can eliminate each of
them (by sacrificing some flexibility of virtualization). If we cannot
eliminate a source of overhead, we should at least be able to upper
bound its effect.
Do you have any suggestions?

Thank you very much for your help and advice!

Best,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: George Dunlap @ 2015-04-08  9:13 UTC
  To: Meng Xu, xen-devel, Dario Faggioli, Konrad Rzeszutek Wilk
  Cc: Oleg Sokolsky, Linh Thi Xuan Phan, Insup Lee, Dagaen Golomb

On 04/07/2015 09:25 PM, Meng Xu wrote:
> Hi George, Dario and Konrad,
> 
> I finished a prototype of the RTDS scheduler with the dedicated CPU
> feature and did some quick evaluation on this feature. Right now, I
> need to refactor the code (because it is kind of messy when I was
> exploring different approaches :() and will send out the clean patch
> later (this week or next week). But the design follows our discussion
> at http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.
> 
> In a nutshell, the design is: when a CPU is marked as a dedicated CPU,
> the scheduler on that CPU returns the dedicated VCPU with a negative
> time, which disables the scheduler timer on that CPU, and other CPUs
> no longer send SCHEDULE_SOFTIRQ to the dedicated CPU. The scheduler on
> the dedicated CPU may still be invoked when the dedicated VCPU is
> blocked/unblocked by the domU. When that happens, the scheduler goes
> through a fast path that just returns the idle VCPU or the dedicated
> VCPU instead of going through the runq.
> 
> I did the following evaluation to show the benefits of introducing
> this dedicated CPU feature:
> 
> I created a simple cpu-intensive task which just does the
> multiplication a specified number of times:
>         start = rdtsc();
>         while ( i++ < cpu_measurement->multiply_times )
>             result += i * i;
>         finish = rdtsc();
>         latencies[k] = finish - start;
> 
> I ran this task and measured the execution time of the above piece of
> code in different environments: native Linux on bare metal, domU on
> Xen with the RTDS scheduler, domU on Xen with the RTDS scheduler with
> the dedicated CPU feature, and domU on Xen with the Credit/Credit2
> scheduler.
> 
> The difference between the execution time in the virtualized
> environment and the execution time on native Linux on bare metal is
> the virtualization overhead introduced by Xen.
> 
> I want to see that:
> 1) The virtualization overhead decreases a lot after the dedicated CPU
> feature is employed for the RTDS scheduler (because the execution of
> the task no longer suffers the scheduler overhead).
> 2) The frequency of invoking the scheduler on the dedicated CPU
> becomes very low once the dedicated CPU feature is applied.
> 
> The results are as follows:
> When the cpu-intensive task did the multiplication 1024 times, the
> execution time of the piece of code was:
> 9264 cycles on native Linux on bare metal;
> 10320 cycles on Xen RTDS scheduler with the dedicated CPU feature;
> 10324 cycles on Xen RTDS scheduler without the dedicated CPU feature.
> 
> We didn't see any improvement from the dedicated CPU feature here
> because the execution time is too short; the task may not have
> experienced any scheduler overhead yet.
> 
> When the cpu-intensive task did the multiplication 536870912 times,
> the execution time of the piece of code was:
> 4838016028 cycles on native Linux on bare metal;
> 4839649567 cycles on Xen RTDS scheduler with the dedicated CPU feature;
> 4855509977 cycles on Xen RTDS scheduler without the dedicated CPU feature.

Hey Meng!  Thanks for looking at this.

One thing: it's not entirely clear to me whether the numbers for
"without dedicated CPU feature" are still with the equivalent of
"pinning' -- i.e., is it guaranteed that no other vcpu will be run on
the same cpu as the test program?

Assuming that's the case, the numbers you give above show a 0.3%
improvement for the "dedicated" cpu for cpu-intensive workloads.


> We can see that the dedicated CPU feature did save time for the
> cpu-intensive task. Without the dedicated CPU feature, the hypervisor
> scheduler may steal time from the domU and delay the execution of the
> task inside domU.
> 
> I varied the number of multiplications of the above piece of code in
> the cpu-intensive task, and drew a figure to show the relation between
> the overhead and the execution time of the task on native Linux. The
> figure can be found at
> http://www.cis.upenn.edu/~mengxu/xen-ml/cpu-base-alone_multiply_0_0_100.virtOhVSwcetnative.pdf.
> Please note that the x-axis is the "log" of the execution time, so
> the overhead is actually linear in the execution time of the task.

I'm not sure I can gain any useful information out of this graph.  A
more useful comparison  would be to graph the execution time as an
*overhead* compared to the Linux execution time.  For instance, in the
numbers above, you'd have Linux = 1, RTDS+dedicated = 1.000337, RTDS =
1.00361.
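
(For concreteness, a trivial program computing that normalization from
the numbers quoted above:)

    #include <stdio.h>

    /* The normalization suggested above, from the quoted cycle counts. */
    int main(void)
    {
        const double linux_cycles = 4838016028.0;

        printf("RTDS+dedicated: %f\n", 4839649567.0 / linux_cycles); /* ~1.0003 */
        printf("RTDS:           %f\n", 4855509977.0 / linux_cycles); /* ~1.0036 */
        return 0;
    }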

But what it sounds like you're saying is that if you did such a graph,
the overhead would be pretty flat.  That's what I'd expect -- a fairly
constant overhead, regardless of how long you were running the test.

> As to the frequency of invoking the RTDS scheduler with/without the
> dedicated CPU feature, I added some code to trace which event triggers
> the scheduler on the dedicated cpu and how frequently it is invoked.
>
> Before we apply the dedicated CPU feature to the RTDS scheduler, the
> scheduler on the dedicated CPU 3 was invoked once every 3.5us on
> average.
> (XEN) cpu 3 has invoked 356805936 SCHED_SOFTIRQ (sched) within 1267613845122 ns
> (XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(356789129), do_pool(18)
> (XEN) vcpu_yield(0), vcpu_block(10483)
> 
> 
> After we apply the dedicated CPU feature to the RTDS scheduler, the
> scheduler on the dedicated CPU 3 was invoked once every 136ms on
> average, and only because of vcpu_block/vcpu_unblock events. (We
> could modify Linux in domU, as Konrad suggests, to avoid the hypercall
> when a vcpu is blocked/unblocked, but I'm unsure whether it is better
> to do that since it involves a change in domU.)
> (XEN) cpu 3 has invoked 5396 SCHED_SOFTIRQ (nooh) within 736973916783 ns
> (XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(0), do_pool(0)
> (XEN) vcpu_yield(0), vcpu_block(2698)
> 
> 
> Here are some conclusions/observations we have:
> 1) The dedicated CPU feature can remove the scheduler overhead from the
> domU and thus reduce the virtualization overhead.
> 2) The scheduler overhead of the current RTDS scheduler, as seen by the
> application, is higher than that of the current credit/credit2
> schedulers because the RTDS scheduler is invoked much more frequently
> than the credit/credit2 schedulers. (The RTDS scheduler is invoked at
> least once every 1ms, while the credit2 scheduler is invoked once every
> 30ms.) This shows we do need to move the RTDS scheduler from quantum
> driven to event driven (i.e., timer-driven) and only call the
> scheduler when it is necessary.

So a couple of things.  First, the vast majority of people using
virtualization don't care *that much* about the CPU overhead.  Even in
the case of embedded, a 0.3% overhead reduction would probably translate
to a 0.3% improvement in battery life -- an amount so miniscule that it
would be lost in the noise.

Secondly, adding an entirely new interface, as implementing the
"dedicated cpu" would require, on the other hand, is a fairly
significant cost.  It's costly for users to learn and configure the new
interface, it's costly to document, and once it's there we have to
continue to support it perhaps for a long time to come; and the feature
itself is also fairly complicated and increases the code maintenance.

So the performance improvement you've shown so far I think is nowhere
near high enough a benefit to outweigh this cost.

And in any case, as you say, it looks like the source of the overhead is
the very frequent invocation of the RTDS scheduler.  You could probably
get the same kinds of benefits without adding any new interfaces by
reducing the amount of time the scheduler gets invoked when there are no
other tasks to run on that cpu.

What I was expecting you to test, for the RTDS scheduler, was the
wake-up latency.  Have you looked at that at all?

> 3) There exists some constant virtualization overhead (see the case
> where the task's execution time is very small, 9264 cycles). I don't
> know where this kind of constant virtualization overhead comes from
> or whether we can eliminate/bound this kind of overhead. Do you have
> any suggestions/advice on this?

How are you testing this -- running RDTSC?  Do you know what TSC mode
you're running in?  If you're trapping on TSCs, that might account for
some of the overhead for very small cycles.

Other than that, nothing comes to mind off the top of my head.

 -George


* Re: Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: Meng Xu @ 2015-04-08 20:52 UTC
  To: George Dunlap
  Cc: Dario Faggioli, xen-devel, Oleg Sokolsky, Linh Thi Xuan Phan,
	Insup Lee, Dagaen Golomb

2015-04-08 5:13 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:
> On 04/07/2015 09:25 PM, Meng Xu wrote:
>> Hi George, Dario and Konrad,
>>
>> I finished a prototype of the RTDS scheduler with the dedicated CPU
>> feature and did some quick evaluation on this feature. Right now, I
>> need to refactor the code (because it is kind of messy when I was
>> exploring different approaches :() and will send out the clean patch
>> later (this week or next week). But the design follows our discussion
>> at http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.
>>
>> In a nutshell, the design is: when a CPU is marked as a dedicated CPU,
>> the scheduler on that CPU returns the dedicated VCPU with a negative
>> time, which disables the scheduler timer on that CPU, and other CPUs
>> no longer send SCHEDULE_SOFTIRQ to the dedicated CPU. The scheduler on
>> the dedicated CPU may still be invoked when the dedicated VCPU is
>> blocked/unblocked by the domU. When that happens, the scheduler goes
>> through a fast path that just returns the idle VCPU or the dedicated
>> VCPU instead of going through the runq.
>>
>> I did the following evaluation to show the benefits of introducing
>> this dedicated CPU feature:
>>
>> I created a simple cpu-intensive task which just does the
>> multiplication a specified number of times:
>>         start = rdtsc();
>>         while ( i++ < cpu_measurement->multiply_times )
>>             result += i * i;
>>         finish = rdtsc();
>>         latencies[k] = finish - start;
>>
>> I ran this task and measured the execution time of the above piece of
>> code in different environments: native Linux on bare metal, domU on
>> Xen with the RTDS scheduler, domU on Xen with the RTDS scheduler with
>> the dedicated CPU feature, and domU on Xen with the Credit/Credit2
>> scheduler.
>>
>> The difference between the execution time in the virtualized
>> environment and the execution time on native Linux on bare metal is
>> the virtualization overhead introduced by Xen.
>>
>> I want to see that:
>> 1) The virtualization overhead decreases a lot after the dedicated CPU
>> feature is employed for the RTDS scheduler (because the execution of
>> the task no longer suffers the scheduler overhead).
>> 2) The frequency of invoking the scheduler on the dedicated CPU
>> becomes very low once the dedicated CPU feature is applied.
>>
>> The results are as follows:
>> When the cpu-intensive task did the multiplication 1024 times, the
>> execution time of the piece of code was:
>> 9264 cycles on native Linux on bare metal;
>> 10320 cycles on Xen RTDS scheduler with the dedicated CPU feature;
>> 10324 cycles on Xen RTDS scheduler without the dedicated CPU feature.
>>
>> We didn't see any improvement from the dedicated CPU feature here
>> because the execution time is too short; the task may not have
>> experienced any scheduler overhead yet.
>>
>> When the cpu-intensive task did the multiplication 536870912 times,
>> the execution time of the piece of code was:
>> 4838016028 cycles on native Linux on bare metal;
>> 4839649567 cycles on Xen RTDS scheduler with the dedicated CPU feature;
>> 4855509977 cycles on Xen RTDS scheduler without the dedicated CPU feature.
>
> Hey Meng!  Thanks for looking at this.
>
> One thing: it's not entirely clear to me whether the numbers for
> "without dedicated CPU feature" are still with the equivalent of
> "pinning' -- i.e., is it guaranteed that no other vcpu will be run on
> the same cpu as the test program?

Yes. In the experiment, every VCPU is pinned to one isolated CPU. So
no other VCPU will run on the same cpu as the test program.

>
> Assuming that's the case, the numbers you give above show a 0.3%
> improvement for the "dedicated" cpu for cpu-intensive workloads.
>
>

Yes. This is the saving when the cpu-intensive task did the
multiplication 536870912 times.
Another thing to note is that the scheduling quantum of the RTDS
scheduler is <= 1ms. If the scheduling quantum were smaller, the saving
should increase, but not by much, I speculate. (So I agree with your
conclusion that the benefits may not be worth the complexity.)

>> We can see that the dedicated CPU feature did save time for the
>> cpu-intensive task. Without the dedicated CPU feature, the hypervisor
>> scheduler may steal time from the domU and delay the execution of the
>> task inside domU.
>>
>> I varied the number of multiplications of the above piece of code in
>> the cpu-intensive task, and drew a figure to show the relation between
>> the overhead and the execution time of the task on native Linux. The
>> figure can be found at
>> http://www.cis.upenn.edu/~mengxu/xen-ml/cpu-base-alone_multiply_0_0_100.virtOhVSwcetnative.pdf.
>> Please note that the x-axis is the "log" of the execution time, so
>> the overhead is actually linear in the execution time of the task.
>
> I'm not sure I can gain any useful information out of this graph.  A
> more useful comparison  would be to graph the execution time as an
> *overhead* compared to the Linux execution time.  For instance, in the
> numbers above, you'd have Linux = 1, RTDS+dedicated = 1.000337, RTDS =
> 1.00361.

Do you mean I should normalize the execution times on RTDS+dedicated
and RTDS to the execution time on Linux?
I can do it if necessary. (Maybe it won't be necessary. :-) )

>
> But what it sounds like you're saying is that if you did such a graph,
> the overhead would be pretty flat.  That's what I'd expect -- a fairly
> constant overhead, regardless of how long you were running the test.

Yes.

>
>> As to the frequency of invoking the RTDS scheduler with/without the
>> dedicated CPU feature, I added some code to trace which event triggers
>> the scheduler on the dedicated cpu and how frequently it is invoked.
>>
>> Before we apply the dedicated CPU feature to the RTDS scheduler, the
>> scheduler on the dedicated CPU 3 was invoked once every 3.5us on
>> average.
>> (XEN) cpu 3 has invoked 356805936 SCHED_SOFTIRQ (sched) within 1267613845122 ns
>> (XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(356789129), do_pool(18)
>> (XEN) vcpu_yield(0), vcpu_block(10483)
>>
>>
>> After we apply the dedicated CPU feature to the RTDS scheduler, the
>> scheduler on the dedicated CPU 3 was invoked once every 136ms on
>> average, and only because of vcpu_block/vcpu_unblock events. (We
>> could modify Linux in domU, as Konrad suggests, to avoid the hypercall
>> when a vcpu is blocked/unblocked, but I'm unsure whether it is better
>> to do that since it involves a change in domU.)
>> (XEN) cpu 3 has invoked 5396 SCHED_SOFTIRQ (nooh) within 736973916783 ns
>> (XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(0), do_pool(0)
>> (XEN) vcpu_yield(0), vcpu_block(2698)
>>
>>
>> Here are some conclusions/observations we have:
>> 1) The dedicated CPU feature can remove the scheduler overhead from the
>> domU and thus reduce the virtualization overhead.
>> 2) The scheduler overhead of the current RTDS scheduler, as seen by the
>> application, is higher than that of the current credit/credit2
>> schedulers because the RTDS scheduler is invoked much more frequently
>> than the credit/credit2 schedulers. (The RTDS scheduler is invoked at
>> least once every 1ms, while the credit2 scheduler is invoked once every
>> 30ms.) This shows we do need to move the RTDS scheduler from quantum
>> driven to event driven (i.e., timer-driven) and only call the
>> scheduler when it is necessary.
>
> So a couple of things.  First, the vast majority of people using
> virtualization don't care *that much* about the CPU overhead.  Even in
> the case of embedded, a 0.3% overhead reduction would probably translate
> to a 0.3% improvement in battery life -- an amount so miniscule that it
> would be lost in the noise.

I see.

>
> Secondly, adding an entirely new interface, as implementing the
> "dedicated cpu" would require, on the other hand, is a fairly
> significant cost.  It's costly for users to learn and configure the new
> interface, it's costly to document, and once it's there we have to
> continue to support it perhaps for a long time to come; and the feature
> itself is also fairly complicated and increases the code maintenance.
>
> So the performance improvement you've shown so far I think is nowhere
> near high enough a benefit to outweigh this cost.

OK. I see and agree.

>
> And in any case, as you say, it looks like the source of the overhead is
> the very frequent invocation of the RTDS scheduler.  You could probably
> get the same kinds of benefits without adding any new interfaces by
> reducing the amount of time the scheduler gets invoked when there are no
> other tasks to run on that cpu.

Yes. This is what Dagaen (cc.ed) is doing right now. He had an RFC
patch and sent it to me last week. We are working on refining the
patch before sending it out to the mailing list.

>
> What I was expecting you to test, for the RTDS scheduler, was the
> wake-up latency.  Have you looked at that at all?

Ah, I didn't realize this... Do you have any concrete evaluation plan for this?
In my mind, I can issue hypercalls in domU to wake up and sleep a vcpu
and measure how long it takes to wake up a vcpu. Maybe you have some
better idea in mind?
(The wake-up latency of a vcpu will depend on the priority of the
vcpu and how heavily loaded the system is, I speculate.)

>
>> 3) There exists some constant virtualization overhead (see the case
>> where the task's execution time is very small, 9264 cycles). I don't
>> know where this kind of constant virtualization overhead comes from
>> or whether we can eliminate/bound this kind of overhead. Do you have
>> any suggestions/advice on this?
>
> How are you testing this -- running RDTSC?  Do you know what TSC mode
> you're running in?  If you're trapping on TSCs, that might account for
> some of the overhead for very small cycles.

I'm using rdtsc to read the TSC counter in the test program in domU.
I didn't configure the TSC mode for domU, so I think it should be the
default mode (i.e., tsc_mode=1 emulated mode?).  Do you think this
could be the source of the ~1000 cycles of overhead for the task with
a small execution time?

You mentioned "If you're trapping on TSCs"; when might this (trapping)
happen? Is it related to the always-emulated mode, the never-emulated
mode, or the PVRDTSCP mode? Do you have any suggestion on how I should
investigate this?  (I had a look at
http://xenbits.xen.org/docs/4.3-testing/misc/tscmode.txt, but didn't
get an idea of how to look into this. :-( )

>
> Other than that, nothing comes to mind off the top of my head.

I see.

Thank you very much for your advice and time!

Best,

Meng

>
>  -George
>



-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: Konrad Rzeszutek Wilk @ 2015-04-10 14:28 UTC
  To: Meng Xu, andrii.tseglytskyi
  Cc: George Dunlap, Dario Faggioli, xen-devel, Oleg Sokolsky,
	Linh Thi Xuan Phan, Insup Lee, Dagaen Golomb

Andrii,

I've a question to you at the bottom that I hope you can answer.

> >> 3) There exists some constant virtualization overhead (see the case
> >> where the task's execution time is very small, 9264 cycles). I don't
> >> know where this kind of constant virtualization overhead comes from
> >> or whether we can eliminate/bound this kind of overhead. Do you have
> >> any suggestions/advice on this?
> >
> > How are you testing this -- running RDTSC?  Do you know what TSC mode
> > you're running in?  If you're trapping on TSCs, that might account for
> > some of the overhead for very small cycles.
> 
> I'm using rdtsc to read the TSC counter in the test program in domU.
> I didn't configure the TSC mode for domU, so I think it should be the
> default mode (i.e., tsc_mode=1 emulated mode?).  Do you think this
> could be the source of the ~1000 cycles of overhead for the task with
> a small execution time?

It should be the =2, which is native (no emulation). At least
that is what libxl sets by default.

> 
> You mentioned "If you're trapping on TSCs"; when might this (trapping)
> happen? Is it related to the always-emulated mode, the never-emulated
> mode, or the PVRDTSCP mode? Do you have any suggestion on how I should
> investigate this?  (I had a look at

We would trap (VMEXIT), and if you look at the EIP of the guest, the
virtual address should correspond to the 'rdtsc' opcode.

> http://xenbits.xen.org/docs/4.3-testing/misc/tscmode.txt, but didn't
> get an idea of how to look into this. :-( )
> 
> >
> > Other than that, nothing comes to mind off the top of my head.
> 
> I see.

I had dug through the path of the different scenarios when the
guest does a HALT operation - and the codepaths we select.

You might want to see http://mid.gmane.org/20140423212824.GB12560@phenom.dumpdata.com
and http://mid.gmane.org/20140506173627.GA6942@phenom.dumpdata.com

The point is that you probably want to run 'xentrace' and use
xentrace_format to get the raw state of all the traces.

Then comes the hard part of figuring out what happens in between
the traces. I figured out that my problem was due to three
softirqs being scheduled - TIMER, TASKLET, SCHEDULE - and TASKLET
was taking a global spinlock.

In your case you probably don't have the TASKLET.

Since you are looking at this from the perspective of the guest,
you can set up a page shared between the hypervisor and your guest and
add a marker in there (set a bit). When the hypervisor
goes into the VMEXIT path it can check if the marker is there and
do a trace op - and also one when it is right about to
go back to the guest.

Andrii, CC-ed here, I believe did something like that?

> 
> Thank you very much for your advice and time!
> 
> Best,
> 
> Meng
> 
> >
> >  -George
> >
> 
> 
> 
> -- 
> 
> 
> -----------
> Meng Xu
> PhD Student in Computer and Information Science
> University of Pennsylvania


* Re: Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: Dario Faggioli @ 2015-04-23 12:48 UTC
  To: Meng Xu
  Cc: George Dunlap, xen-devel, Oleg Sokolsky, Linh Thi Xuan Phan,
	Insup Lee, Dagaen Golomb



Hey, guys,

I know, I know, I'm soooo much late to the party! Sorry, I got trapped
into a thing that I really needed to finish... :-/

I've got no intention to resurrect this old thread, just wanted to
point out a few things.

On Wed, 2015-04-08 at 16:52 -0400, Meng Xu wrote:
> 2015-04-08 5:13 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:
> > On 04/07/2015 09:25 PM, Meng Xu wrote:
> >> Hi George, Dario and Konrad,
> >>
> >> I finished a prototype of the RTDS scheduler with the dedicated CPU
> >> feature and did some quick evaluation on this feature. Right now, I
> >> need to refactor the code (because it is kind of messy when I was
> >> exploring different approaches :() and will send out the clean patch
> >> later (this week or next week). But the design follows our discussion
> >> at http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.
> >>
The idea of 'dedicated CPU' makes sense. It's also always been quite
common, in the Linux community, to see it as a real-time oriented
feature. I personally don't agree much, as real-time is about
determinism and dedicating a CPU to a task (in our case, that would mean
dedicating a pCPU to a vCPU and then, in the guest, that vCPU to a task)
does not automatically give you determinism.

Sure, it cuts off some overhead and some sources of unpredictable
behavior (e.g., scheduler code), but not all of them (what about, for
instance, caches shared with non-isolated pCPUs). No, IMO, if you want
determinism, you should make the code deterministic, not get rid of
it! :-D

In fact, Linux has a feature similar to the one Meng investigated, and that
has traditionally been used (at least until I was involved with Linux
scheduling) by HPC people, database engines and high frequency trading
use cases (which are also often categorized as 'real-time workloads' but
just aren't, IMO).

It's called isolcpus. For sure there was a boot time parameter for it,
and it looks like it is still there:
http://wiki.linuxcnc.org/cgi-bin/wiki.pl?The_Isolcpus_Boot_Parameter_And_GRUB2
http://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
http://lxr.linux.no/linux+v3.19.1/Documentation/kernel-parameters.txt#L1530

I'm not sure whether they grew interfaces to setup this at runtime, but
doubt it.

For us, I'm not sure whether something like that would be useful. To be
fruitfully used together with something similar to Linux's isolcpus, it
needs to look like what Meng is doing, i.e., it ought to be possible to
handle single vCPUs, not full domains. However...

> > Secondly, adding an entirely new interface, as implementing the
> > "dedicated cpu" would require, on the other hand, is a fairly
> > significant cost.  It's costly for users to learn and configure the new
> > interface, it's costly to document, and once it's there we have to
> > continue to support it perhaps for a long time to come; and the feature
> > itself is also fairly complicated and increases the code maintenance.
> >
> > So the performance improvement you've shown so far I think is nowhere
> > near high enough a benefit to outweigh this cost.
> 
... I agree with George on this...

> OK. I see and agree.
> 
... and I'm happy you also do! :-D

> > And in any case, as you say, it looks like the source of the overhead is
> > the very frequent invocation of the RTDS scheduler.
>
Exactly! I'd put it this way: there are more urgent and more useful
optimizations, in general, but especially in RTDS, to be done before
thinking about something like this.

>   You could probably
> > get the same kinds of benefits without adding any new interfaces by
> > reducing the amount of time the scheduler gets invoked when there are no
> > other tasks to run on that cpu.
> 
Exactly. And again, that is particularly relevant to RTDS, as numbers
show. Looking again at Linux world, this (i.e., avoiding invoking the
scheduler when there is only one task on a CPU) is also something
they've introduced rather recently.

It's called full dynticks:
https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
http://ertl.jp/~shinpei/conf/ospert13/slides/FredericWeisbecker.pdf
https://lwn.net/Articles/549580/
http://thread.gmane.org/gmane.linux.kernel/1485210 [*]

[*] check out Linus' replies... "awesome" as usual, he even managed to
rant about virtualization, all by himself!! :-P

That is IMO a line of action that may deserve some investigation.
RTDS-wise, for sure... the Credit-s, are not at all bad from that
perspective (as your numbers also show), but it might be possible to do
better.

> Yes. This is what Dagaen (cc.ed) is doing right now. He had an RFC
> patch and sent it to me last week. We are working on refining the
> patch before sending it out to the mailing list.
> 
I'll be super glad to see this! :-D

> >
> > What I was expecting you to test, for the RTDS scheduler, was the
> > wake-up latency.  Have you looked at that at all?
> 
Indeed, that would be really interesting.

> Ah, I didn't realize this... Do you have any concrete evaluation plan for this?
> In my mind, I can issue hypercalls in domU to wake up and sleep a vcpu
> and measure how long it takes to wake up a vcpu. Maybe you have some
> better idea in mind?
> (The wake-up latency of a vcpu will depend on the priority of the
> vcpu and how heavily loaded the system is, I speculate.)
> 
Yes, that is something that could (should?) be done, as the wakeup
latency of a vcpu is a lower bound for the wakeup latency of in-guest
workloads, so we really want to know where we stand wrt that, if we
need to improve things, and if yes how.

It's priority and load dependent... yes, of course, but that's what we
have real-time schedulers for, isn't it? :-P Jokes apart, for the actual
'lower bound', we're clearly interested in measuring a vcpu when running
alone on a pCPU, or with top priority.

On the other hand, to look at wakeup latency from within the guest,
cyclictest is the way to go:
https://rt.wiki.kernel.org/index.php/Cyclictest

What we want is to run it inside a guest, under different host and guest
load conditions (and using different schedulers, varying the scheduling
parameters, etc), and see what happens... Ever looked at that? I think
it would be interesting.

I've done something similar while preparing this talk:
https://archive.fosdem.org/2014/schedule/event/virtiaas16/

But I never got the chance to repeat the experiments (nor did I do any
further reasoning or investigation about how the timestamps are
obtained, TSC emulation, etc., as George pointed out).

That's all... Sorry again for chiming in only now. :-(

Regards,
Dario



* Re: Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU
From: Meng Xu @ 2015-04-23 19:20 UTC
  To: Dario Faggioli
  Cc: George Dunlap, xen-devel, Oleg Sokolsky, Linh Thi Xuan Phan,
	Insup Lee, Dagaen Golomb

Hi Dario,

2015-04-23 8:48 GMT-04:00 Dario Faggioli <dario.faggioli@citrix.com>:
> Hey, guys,
>
> I know, I know, I'm soooo much late to the party! Sorry, I got trapped
> into a thing that I really needed to finish... :-/
>
> I've got no intention to resurrect this old thread, just wanted to
> point out a few things.

What you pointed out is very interesting! I will have a look at those
links and then report the measurement numbers.
(I will also have to redo some experiments to see how the TSC may
affect the results, which I didn't realize before. :-( )

>
> On Wed, 2015-04-08 at 16:52 -0400, Meng Xu wrote:
>> 2015-04-08 5:13 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:
>> > On 04/07/2015 09:25 PM, Meng Xu wrote:
>> >> Hi George, Dario and Konrad,
>> >>
>> >> I finished a prototype of the RTDS scheduler with the dedicated CPU
>> >> feature and did some quick evaluation on this feature. Right now, I
>> >> need to refactor the code (because it is kind of messy when I was
>> >> exploring different approaches :() and will send out the clean patch
>> >> later (this week or next week). But the design follows our discussion
>> >> at http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.
>> >>
> The idea of 'dedicated CPU' makes sense. It's also always been quite
> common, in the Linux community, to see it as a real-time oriented
> feature. I personally don't agree much, as real-time is about
> determinism and dedicating a CPU to a task (in our case, that would mean
> dedicating a pCPU to a vCPU and then, in the guest, that vCPU to a task)
> does not automatically give you determinism.
>
> Sure, it cuts off some overhead and some sources of unpredictable
> behavior (e.g., scheduler code), but not all of them (what about, for
> instance, caches shared with non-isolated pCPUs).

I'm working on eliminating the shared cache interference among guest
domains on non-isolated pCPUs. I have a prototype that can statically
partition the LLC into equal-size partitions and assign them to guest
domains based on each domain's configuration file. I'm running some
evaluations to show the effectiveness of the cache partitioning
approach (via page coloring) and will send you guys some slides about
this soon. :-)
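
(For the curious: the core of the page coloring idea is tiny. A
minimal sketch, with made-up cache geometry rather than the values the
prototype actually uses:)

    #include <stdint.h>

    /* Minimal sketch of page coloring -- the cache geometry here is an
     * example, not a measured value.  With a physically indexed LLC,
     * pages of different colors can never evict each other. */
    #define PAGE_SHIFT  12
    #define LLC_SIZE    (8u << 20)   /* assume an 8 MiB LLC */
    #define LLC_WAYS    16u          /* assume 16-way associativity */
    #define NUM_COLORS  ((LLC_SIZE / LLC_WAYS) >> PAGE_SHIFT)  /* way size / page size */

    /* Color of a machine frame number: the set-index bits that lie
     * above the page offset. */
    static inline unsigned int page_color(uint64_t mfn)
    {
        return (unsigned int)(mfn & (NUM_COLORS - 1));
    }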

>> > And in any case, as you say, it looks like the source of the overhead is
>> > the very frequent invocation of the RTDS scheduler.
>>
> Exactly! I'd put it this way: there are more urgent and more useful
> optimizations, in general, but especially in RTDS, to be done before
> thinking about something like this.

Right.

>> > What I was expecting you to test, for the RTDS scheduler, was the
>> > wake-up latency.  Have you looked at that at all?
>>
> Indeed, that would be really interesting.
>
>> Ah, I didn't realize this... Do you have any concrete evaluation plan for this?
>> In my mind, I can issue hypercalls in domU to wake up and sleep a vcpu
>> and measure how long it takes to wake up a vcpu. Maybe you have some
>> better idea in mind?
>> (The wake-up latency of a vcpu will depend on the priority of the
>> vcpu and how heavily loaded the system is, I speculate.)
>>
> Yes, that is something that could (should?) be done, as the wakeup
> latency of a vcpu is a lower bound for the wakeup latency of in-guest
> workloads, so we really want to know where we stand wrt that, if we
> need to improve things, and if yes how.
>
> It's priority and load dependent... yes, of course, but that's what we
> have real-time schedulers for, isn't it? :-P Jokes apart, for the actual
> 'lower bound', we're clearly interested in measuring a vcpu when running
> alone on a pCPU, or with top priority.
>
> On the other hand, to look at wakeup latency from within the guest,
> cyclictest is the way to go:
> https://rt.wiki.kernel.org/index.php/Cyclictest
>
> What we want is to run it inside a guest, under different host and guest
> load conditions (and using different schedulers, varying the scheduling
> parameters, etc), and see what happens... Ever looked at that? I think
> it would be interesting.

I will have a look at cyclictest and set up some evaluation.
I totally agree that the wakeup latency (of the highest-priority vcpu)
is very important for real-time applications. I will start a new
thread about the cyclictest results once I get them.

>
> I've done something similar while preparing this talk:
> https://archive.fosdem.org/2014/schedule/event/virtiaas16/
>
> But I never got the chance to repeat the experiments (nor did I do any
> further reasoning or investigation about how the timestamps are
> obtained, TSC emulation, etc., as George pointed out).
>
> That's all... Sorry again for chiming in only now. :-(

No problem at all! :-) Your advice and information are very useful!

Thanks and best regards,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

