* TSC scaling and softtsc reprise, and PROPOSAL
@ 2009-07-20 17:05 Dan Magenheimer
  2009-07-20 17:14 ` Keir Fraser
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-20 17:05 UTC (permalink / raw)
  To: Xen-Devel (E-mail)
  Cc: Ian Pratt, Dong, Eddie, Keir Fraser, Zhang, Xiantao, John Levon

While at Linux Symposium last week, I heard a rumor that VMware ESX
always traps and emulates all rdtsc instructions.  (Can anyone confirm
or deny this?)

This reminded me that I'm not sure we came to any conclusion
for proper handling of TSC in Xen, though I think that the
scaling patch was taken into xen-unstable, meaning that some
users will unknowingly be using softtsc (all rdtsc instructions
fully emulated) when live migrating between machines with
different Hz rates.  This could lead to the bizarre situation
where a time-sensitive SMP app might fail in cryptic ways if
it has never migrated, but work fine if it has.

(Here's the last discussion I think:
http://lists.xensource.com/archives/html/xen-devel/2009-06/msg00980.html)

I dug up some old measurements from when we first implemented
softtsc; I believe they showed that emulating TSC averages
around one microsecond on my Conroe box.  John Levon's
measurements showed that Solaris' mstate accounting was doing
rdtsc at a frequency of about 3000/sec (per processor on an
idle system), which at ~1 usec per emulated rdtsc works out to
roughly 3 ms of CPU per second -- a fraction of a percent of
CPU time, even for this very excessive use of rdtsc.
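
For anyone who wants to reproduce the per-rdtsc cost number, a
trivial guest-side microbenchmark along these lines would do.
(This is just a sketch of my own, not the tool I actually used;
it assumes gcc on x86 and reports raw cycles, so converting to
time requires the host TSC frequency -- at ~3GHz, 2500-3000
cycles is roughly the 1 usec quoted above.)

/* Average cost of back-to-back rdtsc.  Run natively this is a
 * handful of cycles; run in a guest with rdtsc trapped (softtsc)
 * it shows the full emulation cost per instruction. */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    const int iters = 1000000;
    uint64_t start, end;
    int i;

    start = rdtsc();
    for (i = 0; i < iters; i++)
        (void)rdtsc();
    end = rdtsc();

    printf("average cycles per rdtsc: %.1f\n",
           (double)(end - start) / iters);
    return 0;
}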

I'd like to see my measurement independently confirmed (and on
a wider variety of old and new systems), along with better
(heavy-workload) data on the mstate accounting rdtsc frequency,
and a rerun of the oltp workload that showed poor (10%?) results,
to prove that this number is real and not just apocryphal.
Still, this raw data leads me to the following:

PROPOSAL:

The default mode for all xen systems should be that all rdtsc
instructions should be emulated by xen using xen system time
as the timestamp counter (i.e. nanosecond frequency).

The no-softtsc Xen boot option remains available to force the
non-trapping mechanism if desired.  It might make sense to
add a per-guest config option to override per guest.

The Xen CPU info emulation should reflect that tsc is constant
and safe to use on an SMP.
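
For concreteness, the emulation itself is essentially what the
existing softtsc path already does for HVM guests, just made the
default: on a trapped rdtsc, return Xen system time rather than
the host TSC.  A minimal sketch (hvm_get_guest_time() is the
existing accessor for Xen system time; the handler name and the
EDX:EAX write-back below are schematic, not the actual code path):

/* Sketch only: "emulate rdtsc from Xen system time".  Since Xen
 * system time is nanoseconds since boot, the virtual TSC ticks
 * at a constant 1GHz on every host, so it stays consistent
 * across vcpus and across migration. */
static void softtsc_emulate_rdtsc(struct vcpu *v, struct cpu_user_regs *regs)
{
    uint64_t tsc = hvm_get_guest_time(v);

    regs->eax = (uint32_t)tsc;
    regs->edx = (uint32_t)(tsc >> 32);
}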

Comments?  I think someone at Intel (Eddie?) was studying the
TSC emulation path to see if it could be faster, but I'm not
sure where that ended up.

Thanks,
Dan


* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-20 17:05 TSC scaling and softtsc reprise, and PROPOSAL Dan Magenheimer
@ 2009-07-20 17:14 ` Keir Fraser
  2009-07-20 20:02   ` Dan Magenheimer
  0 siblings, 1 reply; 36+ messages in thread
From: Keir Fraser @ 2009-07-20 17:14 UTC (permalink / raw)
  To: Dan Magenheimer, Xen-Devel (E-mail)
  Cc: Ian Pratt, Dong, Eddie, Zhang, Xiantao, John Levon

On 20/07/2009 18:05, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> The default mode for all xen systems should be that all rdtsc
> instructions should be emulated by xen using xen system time
> as the timestamp counter (i.e. nanosecond frequency).
> 
> The no-softtsc Xen boot option remains available to force the
> non-trapping mechanism if desired.  It might make sense to
> add a per-guest config option to override per guest.
> 
> The Xen CPU info emulation should reflect that tsc is constant
> and safe to use on an SMP.
> 
> Comments?  I think someone at Intel (Eddie?) was studying the
> TSC emulation path to see if it could be faster, but I'm not
> sure where that ended up.

Defaults which slow things down are never popular. The slowdown on a
non-idle Solaris guest, for example, could be significant. It is a
correctness/accuracy vs performance tradeoff though. But I don't think there
are many real-world complaints about the TSC accuracy now -- I think the
default is set appropriately.

 -- Keir


* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-20 17:14 ` Keir Fraser
@ 2009-07-20 20:02   ` Dan Magenheimer
  2009-07-20 21:02     ` Keir Fraser
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-20 20:02 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail)
  Cc: Ian Pratt, Dong, Eddie, Zhang, Xiantao, John Levon

> > The default mode for all xen systems should be that all rdtsc
> > instructions should be emulated by xen using xen system time
> > as the timestamp counter (i.e. nanosecond frequency).
> > 
> > The no-softtsc Xen boot option remains available to force the
> > non-trapping mechanism if desired.  It might make sense to
> > add a per-guest config option to override per guest.
> > 
> > The Xen CPU info emulation should reflect that tsc is constant
> > and safe to use on an SMP.
> > 
> > Comments?  I think someone at Intel (Eddie?) was studying the
> > TSC emulation path to see if it could be faster, but I'm not
> > sure where that ended up.
> 
> Defaults which slow things down are never popular. The slowdown on a
> non-idle Solaris guest, for example, could be significant. It is a
> correctness/accuracy vs performance tradeoff though. But I 
> don't think there
> are many real-world complaints about the TSC accuracy now -- 
> I think the
> default is set appropriately.

Just wondering... are there other known cases in Xen where
a correctness-vs-performance tradeoff has been made in favor
of performance?

I agree that if the performance is *really bad*, the default
should not change.  But I think we are still flying on rumors
of data collected years ago in a very different world, and
the performance data should be re-collected to prove that
it is still *really bad*.  If the degradation is a fraction
of a percent even in worst case analysis, I think the default
should be changed so that correctness prevails.

Why now?  Because more and more real-world applications are
built on top of multi-core platforms where TSC is reliable
and (by far) the best timesource.  And I think(?) we all agree
now that softtsc is the only way to guarantee correctness
in a virtual environment.


* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-20 20:02   ` Dan Magenheimer
@ 2009-07-20 21:02     ` Keir Fraser
  2009-07-20 23:52       ` Dan Magenheimer
  2009-07-22  5:05       ` Zhang, Xiantao
  0 siblings, 2 replies; 36+ messages in thread
From: Keir Fraser @ 2009-07-20 21:02 UTC (permalink / raw)
  To: Dan Magenheimer, Xen-Devel (E-mail)
  Cc: Ian Pratt, Dong, Eddie, Zhang, Xiantao, John Levon

On 20/07/2009 21:02, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I agree that if the performance is *really bad*, the default
> should not change.  But I think we are still flying on rumors
> of data collected years ago in a very different world, and
> the performance data should be re-collected to prove that
> it is still *really bad*.  If the degradation is a fraction
> of a percent even in worst case analysis, I think the default
> should be changed so that correctness prevails.
> 
> Why now?  Because more and more real-world applications are
> built on top of multi-core platforms where TSC is reliable
> and (by far) the best timesource.  And I think(?) we all agree
> now that softtsc is the only way to guarantee correctness
> in a virtual environment.

So how bad is the non-softtsc default mode anyway? Our default timer_mode
has guest TSCs track host TSC (plus a fixed per-vcpu offset that defaults to
having all vcpus of a domain aligned to vcpu0 boot = zero tsc).

Looking at the email thread you cited, all I see is someone from Intel
saying something about how their code to improve TSC consistency across
migration avoids RDTSC exiting where possible (which I do not see -- if the
TSC rates across the hosts do not match closely then RDTSC exiting is
enabled forever for that domain), and, most bizarrely, that their 'solution'
may have a tsc drift >10^5 cycles. Where did this huge number come from?
What solution is being talked about, and under what conditions might the
claim hold? Who knows!

I don't think we have really solid data on either the performance or the
accuracy side of the debate. And that means we don't have much to argue
over.

 -- Keir


* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-20 21:02     ` Keir Fraser
@ 2009-07-20 23:52       ` Dan Magenheimer
  2009-07-22  5:05       ` Zhang, Xiantao
  1 sibling, 0 replies; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-20 23:52 UTC (permalink / raw)
  To: Keir Fraser, Xen-Devel (E-mail)
  Cc: Ian Pratt, Dong, Eddie, Zhang, Xiantao, John Levon

> So how bad is the non-softtsc default mode anyway?

A fair question.  To me, "bad" means that TSC going backwards
can be detected by an application that samples TSC in
different threads that have been synchronized through some
simple "ordering" semaphore.  I admit this is a difficult
goal to achieve and few applications in reality will depend
on this exactly, but it is certainly feasible for a database or
a tracing tool to timestamp ordered events this way and
expect to be able to replay them in timestamp order.
(For the sake of any further discussion, let's call this
tsc-epsilon.... if the skew exceeds tsc-epsilon then the
app might observe time going backwards.)
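
To make the failure mode concrete, here is a minimal sketch of
the kind of check such an application effectively performs (my
own illustration, not code from any of the apps mentioned; a
real checker would also add serializing fences around rdtsc,
omitted here for brevity).  If the cross-CPU TSC skew exceeds
tsc-epsilon, the assert can fire even though the two samples are
strictly ordered:

/* Two threads take timestamps under a lock that imposes a strict
 * ordering on the events; each checks that its TSC sample is not
 * earlier than the previously recorded one.  Build with -pthread. */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t last_stamp;

static void *worker(void *arg)
{
    int i;
    for (i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);      /* the "ordering" semaphore */
        uint64_t now = rdtsc();
        assert(now >= last_stamp);      /* time must not go backwards */
        last_stamp = now;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int i;
    for (i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("no backward timestamps observed\n");
    return 0;
}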

> Our default timer_mode
> has guest TSCs track host TSC (plus a fixed per-vcpu offset 
> that defaults to
> having all vcpus of a domain aligned to vcpu0 boot = zero tsc).

Are you referring to c/s 19506?  It looks like this code
only runs on a physical machine on which tsc is already
well-behaved.  Is this because the X86_FEATURE_CONSTANT_TSC
bit is passed through unchanged to the guest so that
you are assuming guests "know" whether they can trust TSC
or not?  AFAIK, this bit is not particularly reliable (reflects
the socket, not the system) and not well-exposed to applications.

> Looking at the email thread you cited, all I see is someone from Intel
> saying something about how their code to improve TSC 
> consistency across
> migration avoids RDTSC exiting where possible (which I do not 
> see -- if the
> TSC rates across the hosts do not match closely then RDTSC exiting is
> enabled forever for that domain), and, most bizarrely, that 
> their 'solution'
> may have a tsc drift >10^5 cycles. Where did this huge number 
> come from?

Yes, I don't know where that number comes from either.

> What solution is being talked about, and under what 
> conditions might the
> claim hold? Who knows!
> 
> I don't think we have really solid data on either the 
> performance or the
> accuracy side of the debate. And that means we don't have 
> much to argue
> over.

I'm concerned with correctness.  Although sufficient accuracy
provides correctness, I don't think we are anywhere near
tsc-epsilon.  So the only way to guarantee correctness is
via softtsc on all vcpus.


* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-20 21:02     ` Keir Fraser
  2009-07-20 23:52       ` Dan Magenheimer
@ 2009-07-22  5:05       ` Zhang, Xiantao
  2009-07-23 13:24         ` Dan Magenheimer
  1 sibling, 1 reply; 36+ messages in thread
From: Zhang, Xiantao @ 2009-07-22  5:05 UTC (permalink / raw)
  To: Keir Fraser, Dan Magenheimer, Xen-Devel (E-mail)
  Cc: Ian Pratt, Dong, Eddie, John Levon

[-- Attachment #1: Type: text/plain, Size: 4341 bytes --]

Keir Fraser wrote:
> On 20/07/2009 21:02, "Dan Magenheimer" <dan.magenheimer@oracle.com>
> wrote: 
> 
>> I agree that if the performance is *really bad*, the default
>> should not change.  But I think we are still flying on rumors
>> of data collected years ago in a very different world, and
>> the performance data should be re-collected to prove that
>> it is still *really bad*.  If the degradation is a fraction
>> of a percent even in worst case analysis, I think the default
>> should be changed so that correctness prevails.
>> 
>> Why now?  Because more and more real-world applications are
>> built on top of multi-core platforms where TSC is reliable
>> and (by far) the best timesource.  And I think(?) we all agree
>> now that softtsc is the only way to guarantee correctness
>> in a virtual environment.
> 
> So how bad is the non-softtsc default mode anyway? Our default
> timer_mode has guest TSCs track host TSC (plus a fixed per-vcpu
> offset that defaults to having all vcpus of a domain aligned to vcpu0
> boot = zero tsc). 
> 
> Looking at the email thread you cited, all I see is someone from Intel
> saying something about how their code to improve TSC consistency
> across migration avoids RDTSC exiting where possible (which I do not
> see -- if the TSC rates across the hosts do not match closely then
> RDTSC exiting is enabled forever for that domain), and, most
> bizarrely, that their 'solution' may have a tsc drift >10^5 cycles.
> Where did this huge number come from? What solution is being talked
> about, and under what conditions might the claim hold? Who knows!

We ran an experiment to measure the performance impact of softtsc using an oltp workload, and we saw ~10% performance loss when the rdtsc rate is more than 120,000/second.  We also did some other tests, and the results show that ~1% performance loss is caused by a rate of 10,000 rdtsc instructions per second.  So if the rdtsc rate is not that high (i.e. below ~10,000/second), the performance impact can be ignored.

We also introduced some performance optimization solutions, but as we claimed before, they may bring some TSC drift (10^5~10^6 cycles) between virtual processors in SMP cases.  One solution is described below.  For example, say the guest is migrated from a machine with a low TSC frequency (low_freq) to one with a high TSC frequency (high_freq); the low frequency is the guest's expected frequency (exp_freq), and any optimization solution should keep the guest seeing a TSC that runs at exp_freq, to avoid possible issues caused by a faster TSC.

1. In this solution, we only guarantee that the guest's TSC increases monotonically and that its average frequency equals the guest's expected frequency (exp_freq) over a fixed time slot (e.g. ~1ms).
2. To keep it simple, let the guest run on the high_freq TSC (using the hardware TSC-offset feature, so no performance loss) for 1ms, and then enable rdtsc exiting and use trap-and-emulate (which does suffer a performance loss) to let the guest run on a *VERY VERY* low frequency TSC (e.g. 0.2 GHz) for some time; the specific duration is calculated, in units of the 1ms slot, from the formula that guarantees the average TSC frequency == exp_freq:
		time = (high_freq - low_freq) / (low_freq - 0.2).

3. If the guest migrates from a 2.4GHz machine to a 3.0GHz machine, the guest only suffers the emulation cost for (3.0-2.4)/(2.4-0.2) == ~0.273ms out of every 1ms+0.273ms; that is to say, most of the time the guest can leverage the hardware TSC-offset feature and avoid the performance loss.  (A numeric check of this calculation follows after this list.)

4.  Over the whole 1.273ms the guest's TSC frequency is emulated to its expected value through combined hardware and software emulation, and the performance loss is very minor compared with a pure softtsc solution.
5.  But at the same time, since each vcpu's TSC is emulated independently for an SMP guest, a drift of roughly 10^5~10^6 cycles may accumulate between vcpus, and we don't know whether such drift between vcpus can bring other side-effects.  At least one side-effect we can identify: an application running on one vcpu may see a backward TSC value after migrating to another vcpu.  Not sure this is a real problem, but it should exist in theory.
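
A quick numeric sanity check of the formula, plugging in the 2.4GHz -> 3.0GHz example from point 3 (my own sketch, using the same rounded numbers):

/* Check that 1ms at 3.0GHz followed by ~0.273ms at 0.2GHz averages
 * out to the guest's expected 2.4GHz over the whole 1.273ms slot. */
#include <stdio.h>

int main(void)
{
    double high_freq = 3.0e9, exp_freq = 2.4e9, slow_freq = 0.2e9;
    double high_ms = 1.0;

    /* Low-phase duration per 1ms high phase, from the formula above */
    double low_ms = high_ms * (high_freq - exp_freq) / (exp_freq - slow_freq);

    double got  = high_freq * high_ms + slow_freq * low_ms;  /* ticks delivered */
    double want = exp_freq * (high_ms + low_ms);             /* ticks expected */

    printf("low phase = %.3f ms, delivered/expected = %.6f\n",
           low_ms, got / want);
    return 0;
}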

Attached is a draft patch implementing the solution, based on the old changeset 19591.

Xiantao








[-- Attachment #2: two_phase.patch --]
[-- Type: application/octet-stream, Size: 17976 bytes --]

# HG changeset patch
# User root@localhost.localdomain
# Date 1242883048 14400
# Node ID 8fc79e85b57b7749162b9fc03bfd0943ac923580
# Parent  f8187a343ad2bdbfe3166d7ee7e3d55a9f157fdc
Commit two-phased tsc scaling

Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>

diff -r f8187a343ad2 -r 8fc79e85b57b xen/arch/x86/Makefile
--- a/xen/arch/x86/Makefile	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/arch/x86/Makefile	Thu May 21 01:17:28 2009 -0400
@@ -54,6 +54,7 @@ obj-y += tboot.o
 obj-y += tboot.o
 obj-y += hpet.o
 obj-y += bzimage.o
+obj-y += lib.o
 
 obj-$(crash_debug) += gdbstub.o
 
diff -r f8187a343ad2 -r 8fc79e85b57b xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/arch/x86/hvm/hvm.c	Thu May 21 01:17:28 2009 -0400
@@ -149,16 +149,181 @@ void hvm_set_guest_tsc(struct vcpu *v, u
     hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
 }
 
+static uint64_t hvm_scale_guest_tsc(struct vcpu *v, uint64_t host_tsc)
+{
+    int64_t tsc_delta;
+    struct hvm_tsc_scale *ts = &v->arch.hvm_vcpu.tsc_scale;
+
+    tsc_delta = host_tsc - ts->last_tsc;
+
+    if (tsc_delta < 0)
+        return ts->last_tsc;
+
+    tsc_delta = muldiv64(tsc_delta, ts->low_freq, ts->high_freq);
+
+    return ts->last_tsc + tsc_delta;
+}
+
 u64 hvm_get_guest_tsc(struct vcpu *v)
 {
     u64 host_tsc;
 
-    if ( opt_softtsc )
-        host_tsc = hvm_get_guest_time(v);
-    else
+    rdtscll(host_tsc);
+
+    if (dom_tsc_scaled(v->domain)) { 
+        host_tsc = hvm_scale_guest_tsc(v, host_tsc);
+    }
+
+    return host_tsc + v->arch.hvm_vcpu.cache_tsc_offset;
+}
+
+int hvm_gtsc_needs_scale(struct domain *d)
+{
+    uint32_t gtsc_freq;
+
+    gtsc_freq = d->arch.hvm_domain.gtsc_freq / 1000;
+
+    if (gtsc_freq && gtsc_freq != (uint32_t)cpu_khz / 1000) {
+        d->arch.hvm_domain.is_tsc_scaled = 1;
+        return 1;
+    }
+
+    d->arch.hvm_domain.is_tsc_scaled = 0;
+    return 0;
+}
+
+void hvm_init_gtsc_scale(struct domain *d)
+{
+    struct vcpu *v;
+    struct hvm_tsc_scale *ts = &d->vcpu[0]->arch.hvm_vcpu.tsc_scale;
+
+    if (d->arch.hvm_domain.gtsc_freq < cpu_khz) {
+
+        ts->high_freq = (uint32_t)cpu_khz;
+        ts->low_freq = d->arch.hvm_domain.gtsc_freq / 32;
+        ts->gtsc_freq = d->arch.hvm_domain.gtsc_freq ;
+
+        ts->freq_delta1 = d->arch.hvm_domain.gtsc_freq - ts->low_freq; 
+        ts->freq_delta2 = (uint32_t)cpu_khz - d->arch.hvm_domain.gtsc_freq;
+
+        ts->delta2 = cpu_khz * 15 ; /* 2ms */
+
+        ts->rdtsc_exiting = 0;
+        ts->is_low_to_high = 1;
+        ts->is_initialized = 0;
+
+    } else {
+        ts->freq_delta = d->arch.hvm_domain.gtsc_freq - (uint32_t)cpu_khz;
+
+        ts->delta1 = cpu_khz / 2; /* 0.5ms */
+
+        ts->is_low_to_high = 0;
+    }
+
+    for_each_vcpu(d, v) {
+        if (v != d->vcpu[0])
+            v->arch.hvm_vcpu.tsc_scale = *ts;
+
+        if (opt_softtsc && hvm_funcs.enable_rdtsc_exiting)
+                    hvm_funcs.enable_rdtsc_exiting(v);
+    }
+
+    d->arch.hvm_domain.is_tsc_scaled = 1;
+}
+
+void hvm_save_gtsc_for_scale(struct vcpu *v)
+{
+    uint64_t host_tsc;
+    struct hvm_tsc_scale *ts = &v->arch.hvm_vcpu.tsc_scale;
+
+    if (ts->is_low_to_high && ts->rdtsc_exiting)
+            ts->saved_tsc = hvm_get_guest_tsc(v);
+    else {
         rdtscll(host_tsc);
-
-    return host_tsc + v->arch.hvm_vcpu.cache_tsc_offset;
+        ts->saved_tsc = host_tsc + v->arch.hvm_vcpu.cache_tsc_offset;
+    }
+}
+
+
+static void hvm_adjust_gtsc_to_normal(struct vcpu *v)
+{
+    uint64_t guest_tsc, host_tsc;
+    struct hvm_tsc_scale *ts = &v->arch.hvm_vcpu.tsc_scale;
+
+    rdtscll(host_tsc);
+
+    guest_tsc = ts->init_gtsc + muldiv64(host_tsc - ts->init_htsc, 
+                ts->gtsc_freq, cpu_khz);
+
+    if (guest_tsc < ts->saved_tsc)
+        gdprintk(XENLOG_WARNING, "Guest tsc maybe going backward!\n");
+
+    hvm_set_guest_tsc(v, guest_tsc);
+    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
+}
+
+void hvm_check_tsc_scale(struct vcpu *v)
+{
+    uint64_t host_tsc;
+    int64_t tsc_delta;
+    struct hvm_tsc_scale *ts = &v->arch.hvm_vcpu.tsc_scale;
+
+    rdtscll(host_tsc);
+    tsc_delta = host_tsc - ts->last_tsc;
+
+    if (ts->is_low_to_high) {
+
+        if (unlikely(!ts->is_initialized)) {
+            ts->init_gtsc = v->arch.hvm_vcpu.cache_tsc_offset + host_tsc;
+            ts->init_htsc = host_tsc;
+            ts->last_tsc = host_tsc;
+            ts->is_initialized = 1;
+        }
+
+        if (likely(!ts->rdtsc_exiting)) {
+            if (tsc_delta >= ts->delta2) {
+
+                if (tsc_delta > ts->delta2 * 2) /*Maybe schedule-out ? */
+                    hvm_adjust_gtsc_to_normal(v);
+                else {
+                    if ( hvm_funcs.enable_rdtsc_exiting ) {
+                        hvm_funcs.enable_rdtsc_exiting(v);
+                        ts->rdtsc_exiting = 1;
+                    }
+
+                    ts->delta1 = muldiv64(tsc_delta, ts->freq_delta2,
+                            ts->freq_delta1); 
+                } 
+
+                rdtscll(host_tsc);
+                ts->last_tsc = host_tsc;
+            }
+
+        } else {
+            if (tsc_delta >= ts->delta1) {
+
+                hvm_adjust_gtsc_to_normal(v);
+
+                if ( hvm_funcs.disable_rdtsc_exiting ) {
+                    hvm_funcs.disable_rdtsc_exiting(v);
+                    ts->rdtsc_exiting = 0;
+                }
+
+                rdtscll(host_tsc);
+                ts->last_tsc = host_tsc;
+            }
+        }
+    } else {
+        if (tsc_delta > ts->delta1) {
+            uint64_t lagged_tsc;
+
+            lagged_tsc = muldiv64(tsc_delta, ts->freq_delta, cpu_khz);
+            v->arch.hvm_vcpu.cache_tsc_offset += lagged_tsc;
+            hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
+
+            ts->last_tsc = host_tsc;
+        } 
+    }
 }
 
 void hvm_migrate_timers(struct vcpu *v)
diff -r f8187a343ad2 -r 8fc79e85b57b xen/arch/x86/hvm/i8254.c
--- a/xen/arch/x86/hvm/i8254.c	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/arch/x86/hvm/i8254.c	Thu May 21 01:17:28 2009 -0400
@@ -56,30 +56,6 @@ static int handle_speaker_io(
 
 #define get_guest_time(v) \
    (is_hvm_vcpu(v) ? hvm_get_guest_time(v) : (u64)get_s_time())
-
-/* Compute with 96 bit intermediate result: (a*b)/c */
-static uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
-{
-    union {
-        uint64_t ll;
-        struct {
-#ifdef WORDS_BIGENDIAN
-            uint32_t high, low;
-#else
-            uint32_t low, high;
-#endif            
-        } l;
-    } u, res;
-    uint64_t rl, rh;
-
-    u.ll = a;
-    rl = (uint64_t)u.l.low * (uint64_t)b;
-    rh = (uint64_t)u.l.high * (uint64_t)b;
-    rh += (rl >> 32);
-    res.l.high = rh / c;
-    res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
-    return res.ll;
-}
 
 static int pit_get_count(PITState *pit, int channel)
 {
@@ -481,8 +457,6 @@ void pit_init(struct vcpu *v, unsigned l
     register_portio_handler(v->domain, PIT_BASE, 4, handle_pit_io);
     register_portio_handler(v->domain, 0x61, 1, handle_speaker_io);
 
-    ticks_per_sec(v) = cpu_khz * (int64_t)1000;
-
     pit_reset(v->domain);
 }
 
diff -r f8187a343ad2 -r 8fc79e85b57b xen/arch/x86/hvm/save.c
--- a/xen/arch/x86/hvm/save.c	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/arch/x86/hvm/save.c	Thu May 21 01:17:28 2009 -0400
@@ -32,7 +32,9 @@ void arch_hvm_save(struct domain *d, str
     cpuid(1, &eax, &ebx, &ecx, &edx);
     hdr->cpuid = eax;
 
-    hdr->pad0 = 0;
+    /* Save guest's preferred TSC. */
+    hdr->gtsc_freq = d->arch.hvm_domain.gtsc_freq;
+
 }
 
 int arch_hvm_load(struct domain *d, struct hvm_save_header *hdr)
@@ -59,6 +61,17 @@ int arch_hvm_load(struct domain *d, stru
         gdprintk(XENLOG_WARNING, "HVM restore: saved CPUID (%#"PRIx32") "
                "does not match host (%#"PRIx32").\n", hdr->cpuid, eax);
 
+    /* Restore guest's preferred TSC frequency. */
+    d->arch.hvm_domain.gtsc_freq = hdr->gtsc_freq;
+
+    if ( hdr->gtsc_freq && hvm_gtsc_needs_scale(d) ) {
+        hvm_init_gtsc_scale(d);
+
+        gdprintk(XENLOG_WARNING, "Migrate to the platform with different "
+                "freq:%ldMhz, expected freq:%dMhz, scaling guest tsc..!\n",
+                    cpu_khz / 1000 , hdr->gtsc_freq / 1000 );
+    }
+
     /* VGA state is not saved/restored, so we nobble the cache. */
     d->arch.hvm_domain.stdvga.cache = 0;
 
diff -r f8187a343ad2 -r 8fc79e85b57b xen/arch/x86/hvm/vmx/intr.c
--- a/xen/arch/x86/hvm/vmx/intr.c	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/arch/x86/hvm/vmx/intr.c	Thu May 21 01:17:28 2009 -0400
@@ -169,6 +169,10 @@ asmlinkage void vmx_intr_assist(void)
     if ( unlikely(intack.source != hvm_intsrc_none) )
         enable_intr_window(v, intack);
 
+    /* Handle domain tsc scale if necessary */
+    if (dom_tsc_scaled(v->domain)) 
+        hvm_check_tsc_scale(v);
+
  out:
     if ( cpu_has_vmx_tpr_shadow )
         __vmwrite(TPR_THRESHOLD, tpr_threshold);
diff -r f8187a343ad2 -r 8fc79e85b57b xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu May 21 01:17:28 2009 -0400
@@ -1340,6 +1340,22 @@ static void vmx_set_info_guest(struct vc
     vmx_vmcs_exit(v);
 }
 
+static void vmx_enable_rdtsc_exiting(struct vcpu *v)
+{
+    vmx_vmcs_enter(v);
+    v->arch.hvm_vmx.exec_control |= CPU_BASED_RDTSC_EXITING;
+    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+     vmx_vmcs_exit(v);
+}
+
+static void vmx_disable_rdtsc_exiting(struct vcpu *v)
+{
+    vmx_vmcs_enter(v);
+    v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
+    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+     vmx_vmcs_exit(v);
+}
+
 static struct hvm_function_table vmx_function_table = {
     .name                 = "VMX",
     .domain_initialise    = vmx_domain_initialise,
@@ -1371,7 +1387,9 @@ static struct hvm_function_table vmx_fun
     .msr_write_intercept  = vmx_msr_write_intercept,
     .invlpg_intercept     = vmx_invlpg_intercept,
     .set_uc_mode          = vmx_set_uc_mode,
-    .set_info_guest       = vmx_set_info_guest
+    .set_info_guest       = vmx_set_info_guest,
+    .enable_rdtsc_exiting = vmx_enable_rdtsc_exiting,
+    .disable_rdtsc_exiting = vmx_disable_rdtsc_exiting
 };
 
 static unsigned long *vpid_bitmap;
@@ -2184,6 +2202,7 @@ static void ept_handle_violation(unsigne
 
     domain_crash(d);
 }
+//extern void hvm_save_guest_tsc(struct vcpu *v);
 
 static void vmx_failed_vmentry(unsigned int exit_reason,
                                struct cpu_user_regs *regs)
@@ -2564,6 +2583,9 @@ asmlinkage void vmx_vmexit_handler(struc
         domain_crash(v->domain);
         break;
     }
+    
+    if (dom_tsc_scaled(v->domain))
+        hvm_save_gtsc_for_scale(v);
 }
 
 asmlinkage void vmx_trace_vmentry(void)
diff -r f8187a343ad2 -r 8fc79e85b57b xen/arch/x86/hvm/vpt.c
--- a/xen/arch/x86/hvm/vpt.c	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/arch/x86/hvm/vpt.c	Thu May 21 01:17:28 2009 -0400
@@ -32,6 +32,9 @@ void hvm_init_guest_time(struct domain *
     spin_lock_init(&pl->pl_time_lock);
     pl->stime_offset = -(u64)get_s_time();
     pl->last_guest_time = 0;
+
+    /* Initialize guest's expected freqency */
+    d->arch.hvm_domain.gtsc_freq = cpu_khz;
 }
 
 u64 hvm_get_guest_time(struct vcpu *v)
diff -r f8187a343ad2 -r 8fc79e85b57b xen/arch/x86/lib.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/arch/x86/lib.c	Thu May 21 01:17:28 2009 -0400
@@ -0,0 +1,62 @@
+/*
+ * lib.c: Common library for Xen x86.
+ *
+ * Copyright (c) 2009, Intel Corporation.
+ * 
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ */
+
+#include <xen/ctype.h>
+#include <xen/lib.h>
+#include <xen/types.h>
+#include <asm/byteorder.h>
+
+
+#ifdef __x86_64__
+/* Compute (a*b)/c in x86 64-bit mode. */
+uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
+{
+    __asm__ __volatile__ ("mul %%rdx;"
+                          "div %%rcx;"
+                          : "=a"(a)
+                          : "0"(a), "d"(b), "c"(c)
+    ); 
+    return a;
+}
+#else
+/* Compute with 96 bit intermediate result: (a*b)/c */
+uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
+{
+    union {
+        uint64_t ll;
+        struct {
+#ifdef WORDS_BIGENDIAN
+            uint32_t high, low;
+#else
+            uint32_t low, high;
+#endif            
+        } l;
+    } u, res;
+    uint64_t rl, rh;
+
+    u.ll = a;
+    rl = (uint64_t)u.l.low * (uint64_t)b;
+    rh = (uint64_t)u.l.high * (uint64_t)b;
+    rh += (rl >> 32);
+    res.l.high = rh / c;
+    res.l.low = (((rh % c) << 32) + (rl & 0xffffffff)) / c;
+    return res.ll;
+}
+#endif
+
diff -r f8187a343ad2 -r 8fc79e85b57b xen/include/asm-x86/hvm/domain.h
--- a/xen/include/asm-x86/hvm/domain.h	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/include/asm-x86/hvm/domain.h	Thu May 21 01:17:28 2009 -0400
@@ -40,11 +40,16 @@ struct hvm_ioreq_page {
     void *va;
 };
 
+#define dom_tsc_scaled(d) \
+    d->arch.hvm_domain.is_tsc_scaled
+
 struct hvm_domain {
     struct hvm_ioreq_page  ioreq;
     struct hvm_ioreq_page  buf_ioreq;
 
-    s64                    tsc_frequency;
+    uint32_t               gtsc_freq; /* kHz */
+    uint32_t               is_tsc_scaled;
+
     struct pl_time         pl_time;
 
     struct hvm_io_handler  io_handler;
diff -r f8187a343ad2 -r 8fc79e85b57b xen/include/asm-x86/hvm/hvm.h
--- a/xen/include/asm-x86/hvm/hvm.h	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/include/asm-x86/hvm/hvm.h	Thu May 21 01:17:28 2009 -0400
@@ -129,6 +129,8 @@ struct hvm_function_table {
     void (*invlpg_intercept)(unsigned long vaddr);
     void (*set_uc_mode)(struct vcpu *v);
     void (*set_info_guest)(struct vcpu *v);
+    void (*enable_rdtsc_exiting)(struct vcpu *v);
+    void (*disable_rdtsc_exiting)(struct vcpu *v);
 };
 
 extern struct hvm_function_table hvm_funcs;
@@ -153,6 +155,11 @@ void hvm_init_guest_time(struct domain *
 void hvm_init_guest_time(struct domain *d);
 void hvm_set_guest_time(struct vcpu *v, u64 guest_time);
 u64 hvm_get_guest_time(struct vcpu *v);
+
+int hvm_gtsc_needs_scale(struct domain *d);
+void hvm_init_gtsc_scale(struct domain *d);
+void hvm_check_tsc_scale(struct vcpu *v);
+void hvm_save_gtsc_for_scale(struct vcpu *v);
 
 #define hvm_paging_enabled(v) \
     (!!((v)->arch.hvm_vcpu.guest_cr[0] & X86_CR0_PG))
diff -r f8187a343ad2 -r 8fc79e85b57b xen/include/asm-x86/hvm/vcpu.h
--- a/xen/include/asm-x86/hvm/vcpu.h	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/include/asm-x86/hvm/vcpu.h	Thu May 21 01:17:28 2009 -0400
@@ -32,6 +32,32 @@ enum hvm_io_state {
     HVMIO_awaiting_completion,
     HVMIO_handle_mmio_awaiting_completion,
     HVMIO_completed
+};
+
+struct hvm_tsc_scale {
+    union {
+        struct { /* From low to high frequency */
+            uint32_t is_initialized;
+            uint32_t rdtsc_exiting;
+            uint32_t low_freq;
+            uint32_t high_freq;
+            uint32_t gtsc_freq;
+            uint32_t freq_delta1;
+            uint32_t freq_delta2;
+            uint64_t init_htsc;
+            uint64_t init_gtsc;
+        };
+
+        struct {/* From high to low frequency */
+            uint32_t freq_delta;
+            uint64_t delta;
+        };
+    };
+    uint32_t is_low_to_high;
+    uint64_t delta1;
+    uint64_t delta2;
+    uint64_t last_tsc;
+    uint64_t saved_tsc;
 };
 
 struct hvm_vcpu {
@@ -97,6 +123,9 @@ struct hvm_vcpu {
     /* We may write up to m128 as a number of device-model transactions. */
     paddr_t mmio_large_write_pa;
     unsigned int mmio_large_write_bytes;
+
+    /* Used for scaling guest TSC */
+    struct hvm_tsc_scale tsc_scale;
 };
 
 #endif /* __ASM_X86_HVM_VCPU_H__ */
diff -r f8187a343ad2 -r 8fc79e85b57b xen/include/asm-x86/hvm/vpt.h
--- a/xen/include/asm-x86/hvm/vpt.h	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/include/asm-x86/hvm/vpt.h	Thu May 21 01:17:28 2009 -0400
@@ -136,8 +136,6 @@ struct pl_time {    /* platform time */
     spinlock_t pl_time_lock;
 };
 
-#define ticks_per_sec(v) (v->domain->arch.hvm_domain.tsc_frequency)
-
 void pt_save_timer(struct vcpu *v);
 void pt_restore_timer(struct vcpu *v);
 void pt_update_irq(struct vcpu *v);
diff -r f8187a343ad2 -r 8fc79e85b57b xen/include/public/arch-x86/hvm/save.h
--- a/xen/include/public/arch-x86/hvm/save.h	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/include/public/arch-x86/hvm/save.h	Thu May 21 01:17:28 2009 -0400
@@ -38,7 +38,7 @@ struct hvm_save_header {
     uint32_t version;           /* File format version */
     uint64_t changeset;         /* Version of Xen that saved this file */
     uint32_t cpuid;             /* CPUID[0x01][%eax] on the saving machine */
-    uint32_t pad0;
+    uint32_t gtsc_freq;
 };
 
 DECLARE_HVM_SAVE_TYPE(HEADER, 1, struct hvm_save_header);
diff -r f8187a343ad2 -r 8fc79e85b57b xen/include/xen/lib.h
--- a/xen/include/xen/lib.h	Fri Feb 20 17:02:36 2009 +0000
+++ b/xen/include/xen/lib.h	Thu May 21 01:17:28 2009 -0400
@@ -91,6 +91,8 @@ unsigned long long simple_strtoull(
 
 unsigned long long parse_size_and_unit(const char *s, const char **ps);
 
+uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c);
+
 #define TAINT_UNSAFE_SMP                (1<<0)
 #define TAINT_MACHINE_CHECK             (1<<1)
 #define TAINT_BAD_PAGE                  (1<<2)


* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-22  5:05       ` Zhang, Xiantao
@ 2009-07-23 13:24         ` Dan Magenheimer
  2009-07-23 14:54           ` Ian Pratt
  2009-07-28  0:55           ` Zhang, Xiantao
  0 siblings, 2 replies; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-23 13:24 UTC (permalink / raw)
  To: Zhang, Xiantao, Keir Fraser, Xen-Devel (E-mail)
  Cc: Ian Pratt, Dong, Eddie, John Levon

Hi Xiantao --

Sorry for delayed response.  A few comments/questions:

Thanks very much for the additional detail on the 10%
performance loss.  What is this oltp benchmark?  Is
it available for others to run?  Also is the rdtsc
rate 120000/sec on EACH processor?

Assuming a 3GHz machine, your results seem to show that
emulating a rdtsc with softtsc takes about 2500 cycles
(10% of 3x10^9 cycles/sec is 3x10^8 cycles/sec; spread over
120,000 rdtsc/sec, that is ~2500 cycles each).
This agrees with my approximation of about 1 usec.

Have you analyzed where this 2500 cycles is being used?
My suggestion about performance optimization was not
to try a different algorithm but to see if it is possible
to code the existing algorithm much faster using a
special trap path and assembly code. (We called this
a "fast path" on Xen/ia64.)  Even if the 2500 cycles
can be cut in half, that would be a big win.

Am I correct in reading that your patch is ONLY for
HVM guests?  If so, since some (maybe most) workloads
that rely on tsc for transaction timestamps will be
PV, your patch doesn't solve the whole problem.

Can someone at Intel confirm or deny that VMware ESX
always traps rdtsc?  If so, it is probably not hard
to write an application that works on VMware ESX (on
certain hardware) but fails on Xen.

Thanks,
Dan

> -----Original Message-----
> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
> Sent: Tuesday, July 21, 2009 11:05 PM
> To: Keir Fraser; Dan Magenheimer; Xen-Devel (E-mail)
> Cc: John Levon; Ian Pratt; Dong, Eddie
> Subject: RE: TSC scaling and softtsc reprise, and PROPOSAL
> 
> 
> Keir Fraser wrote:
> > On 20/07/2009 21:02, "Dan Magenheimer" <dan.magenheimer@oracle.com>
> > wrote: 
> > 
> >> I agree that if the performance is *really bad*, the default
> >> should not change.  But I think we are still flying on rumors
> >> of data collected years ago in a very different world, and
> >> the performance data should be re-collected to prove that
> >> it is still *really bad*.  If the degradation is a fraction
> >> of a percent even in worst case analysis, I think the default
> >> should be changed so that correctness prevails.
> >> 
> >> Why now?  Because more and more real-world applications are
> >> built on top of multi-core platforms where TSC is reliable
> >> and (by far) the best timesource.  And I think(?) we all agree
> >> now that softtsc is the only way to guarantee correctness
> >> in a virtual environment.
> > 
> > So how bad is the non-softtsc default mode anyway? Our default
> > timer_mode has guest TSCs track host TSC (plus a fixed per-vcpu
> > offset that defaults to having all vcpus of a domain 
> aligned to vcpu0
> > boot = zero tsc). 
> > 
> > Looking at the email thread you cited, all I see is someone 
> from Intel
> > saying something about how their code to improve TSC consistency
> > across migration avoids RDTSC exiting where possible (which I do not
> > see -- if the TSC rates across the hosts do not match closely then
> > RDTSC exiting is enabled forever for that domain), and, most
> > bizarrely, that their 'solution' may have a tsc drift >10^5 cycles.
> > Where did this huge number come from? What solution is being talked
> > about, and under what conditions might the claim hold? Who knows!
> 
> We had done the experiment to measure the performance impact 
> with softtsc using oltp workload, and we saw ~10% performance 
> loss if rdtsc rate is more than 120,000/second. And we also 
> did some other tests, and the results show that ~1% 
> perfomance loss is caused by 10000 rdtsc instructions.  So if 
> the rdtsc rate is not that high(>10000/second), the 
> performance impact can be ignored.  
> 
> We also introduced some performance optimization solutions, 
> but as we claimed before, they may bring some TSC drift ( 
> 10^5~10^6 cycles) between virtual processors in SMP cases.  
> One solution is described below, for example, the guest is 
> migrated from low TSC freq(low_freq) machine to a high TSC 
> freq one(high_freq), you know, the low frequency is guest's 
> expected frequency(exp_freq), and we should let guest be 
> aware that it is running on the machine with exp_freq TSC to 
> avoid possbile issues caused by faster TSC in any 
> optimization solution. 
> 
> 1. In this solution, we only guarantee guest's TSC is 
> increasing monotonically and the average frequency equals 
> guest's expected frequency(exp_freq) in a fixed time slot (eg. ~1ms). 
> 2. To be simple,  let guest running in high_freq TSC (with 
> hardware TSC offset feature, no perfomrance loss) for 1ms, 
> and then enable rdtsc exiting and use trap and emulation 
> method(suffers perfomance loss) to let guest running in a 
> *VERY VERY* low frequency TSC(e.g 0.2 G Hz) for some time, 
> and the specific value can be calculated with the formula to 
> guarantee average TSC frquency == exp_freq:     
> 		time = (high_freq - low_freq) / (low_freq - 0.2). 
> 
> 3. If the guest migrate from 2.4G machine to 3.0G machine, 
> only in (3.0-2.4) /(2.4-0.2) == ~0.273ms guest has to suffer 
> performance loss in the total time 1ms+0.273ms ,and that is 
> also to say, in most of the time guest can leverage 
> hardware's TSC offset feature to reduce perfomrance loss. 
> 
> 4.  In the 1.273ms, we can say guest's TSC frequency is 
> emulated to its expected one through the hardware and 
> software's co-emulation. And the perfomance loss is very 
> minor compared with purely softtsc solution. 
> 5.  But at the same time, since each vcpu's TSC is emulated 
> indpendently for SMP guest, and they may generate a drift 
> value between vcpus, and the drift vaule's range should be 
> 10^5 ~10^6 cycles, and we don't know such drift between vcpus 
> whether can bring other side-effects.  At least, one 
> side-effect case we can figure out is when one application 
> running on one vcpu, and it may see backward TSC value after 
> its migrating to another vcpu.  Not sure this is a real 
> problem, but it should exist in theory. 
> 
> Attached the draft patch to implement the solution based on 
> an old #Cset19591. 
> 
> Xiantao


* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-23 13:24         ` Dan Magenheimer
@ 2009-07-23 14:54           ` Ian Pratt
  2009-07-23 15:18             ` Dan Magenheimer
  2009-07-27 14:47             ` Dan Magenheimer
  2009-07-28  0:55           ` Zhang, Xiantao
  1 sibling, 2 replies; 36+ messages in thread
From: Ian Pratt @ 2009-07-23 14:54 UTC (permalink / raw)
  To: Dan Magenheimer, Zhang, Xiantao, Keir Fraser, Xen-Devel (E-mail)
  Cc: Ian Pratt, Dong, Eddie, John Levon


> Am I correct in reading that your patch is ONLY for
> HVM guests?  If so, since some (maybe most) workloads
> that rely on tsc for transaction timestamps will be
> PV, your patch doesn't solve the whole problem.

pre-VT it wasn't possible to trap RDTSC, so this can't help PV guests.

> Can someone at Intel confirm or deny that VMware ESX
> always traps rdtsc?  If so, it is probably not hard
> to write an application that works on VMware ESX (on
> certain hardware) but fails on Xen.

I'd be rather surprised if VMware trapped RDTSC. From what I gather, ESX3 doesn't make a great deal of use of VT for 32b guests, so at the very least it would be tricky to do anything about user space use of rdtsc.

I've informally heard that certain version of the JVM and Oracle Db have a habit of pounding rdtsc hard from user space, but I don't know what rates.

Ian


> 
> Thanks,
> Dan
> 
> > -----Original Message-----
> > From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
> > Sent: Tuesday, July 21, 2009 11:05 PM
> > To: Keir Fraser; Dan Magenheimer; Xen-Devel (E-mail)
> > Cc: John Levon; Ian Pratt; Dong, Eddie
> > Subject: RE: TSC scaling and softtsc reprise, and PROPOSAL
> >
> >
> > Keir Fraser wrote:
> > > On 20/07/2009 21:02, "Dan Magenheimer" <dan.magenheimer@oracle.com>
> > > wrote:
> > >
> > >> I agree that if the performance is *really bad*, the default
> > >> should not change.  But I think we are still flying on rumors
> > >> of data collected years ago in a very different world, and
> > >> the performance data should be re-collected to prove that
> > >> it is still *really bad*.  If the degradation is a fraction
> > >> of a percent even in worst case analysis, I think the default
> > >> should be changed so that correctness prevails.
> > >>
> > >> Why now?  Because more and more real-world applications are
> > >> built on top of multi-core platforms where TSC is reliable
> > >> and (by far) the best timesource.  And I think(?) we all agree
> > >> now that softtsc is the only way to guarantee correctness
> > >> in a virtual environment.
> > >
> > > So how bad is the non-softtsc default mode anyway? Our default
> > > timer_mode has guest TSCs track host TSC (plus a fixed per-vcpu
> > > offset that defaults to having all vcpus of a domain
> > aligned to vcpu0
> > > boot = zero tsc).
> > >
> > > Looking at the email thread you cited, all I see is someone
> > from Intel
> > > saying something about how their code to improve TSC consistency
> > > across migration avoids RDTSC exiting where possible (which I do not
> > > see -- if the TSC rates across the hosts do not match closely then
> > > RDTSC exiting is enabled forever for that domain), and, most
> > > bizarrely, that their 'solution' may have a tsc drift >10^5 cycles.
> > > Where did this huge number come from? What solution is being talked
> > > about, and under what conditions might the claim hold? Who knows!
> >
> > We had done the experiment to measure the performance impact
> > with softtsc using oltp workload, and we saw ~10% performance
> > loss if rdtsc rate is more than 120,000/second. And we also
> > did some other tests, and the results show that ~1%
> > perfomance loss is caused by 10000 rdtsc instructions.  So if
> > the rdtsc rate is not that high(>10000/second), the
> > performance impact can be ignored.
> >
> > We also introduced some performance optimization solutions,
> > but as we claimed before, they may bring some TSC drift (
> > 10^5~10^6 cycles) between virtual processors in SMP cases.
> > One solution is described below, for example, the guest is
> > migrated from low TSC freq(low_freq) machine to a high TSC
> > freq one(high_freq), you know, the low frequency is guest's
> > expected frequency(exp_freq), and we should let guest be
> > aware that it is running on the machine with exp_freq TSC to
> > avoid possbile issues caused by faster TSC in any
> > optimization solution.
> >
> > 1. In this solution, we only guarantee guest's TSC is
> > increasing monotonically and the average frequency equals
> > guest's expected frequency(exp_freq) in a fixed time slot (eg. ~1ms).
> > 2. To be simple,  let guest running in high_freq TSC (with
> > hardware TSC offset feature, no perfomrance loss) for 1ms,
> > and then enable rdtsc exiting and use trap and emulation
> > method(suffers perfomance loss) to let guest running in a
> > *VERY VERY* low frequency TSC(e.g 0.2 G Hz) for some time,
> > and the specific value can be calculated with the formula to
> > guarantee average TSC frquency == exp_freq:
> > 		time = (high_freq - low_freq) / (low_freq - 0.2).
> >
> > 3. If the guest migrate from 2.4G machine to 3.0G machine,
> > only in (3.0-2.4) /(2.4-0.2) == ~0.273ms guest has to suffer
> > performance loss in the total time 1ms+0.273ms ,and that is
> > also to say, in most of the time guest can leverage
> > hardware's TSC offset feature to reduce perfomrance loss.
> >
> > 4.  In the 1.273ms, we can say guest's TSC frequency is
> > emulated to its expected one through the hardware and
> > software's co-emulation. And the perfomance loss is very
> > minor compared with purely softtsc solution.
> > 5.  But at the same time, since each vcpu's TSC is emulated
> > indpendently for SMP guest, and they may generate a drift
> > value between vcpus, and the drift vaule's range should be
> > 10^5 ~10^6 cycles, and we don't know such drift between vcpus
> > whether can bring other side-effects.  At least, one
> > side-effect case we can figure out is when one application
> > running on one vcpu, and it may see backward TSC value after
> > its migrating to another vcpu.  Not sure this is a real
> > problem, but it should exist in theory.
> >
> > Attached the draft patch to implement the solution based on
> > an old #Cset19591.
> >
> > Xiantao


* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-23 14:54           ` Ian Pratt
@ 2009-07-23 15:18             ` Dan Magenheimer
  2009-07-23 15:29               ` Keir Fraser
  2009-07-23 15:45               ` Keir Fraser
  2009-07-27 14:47             ` Dan Magenheimer
  1 sibling, 2 replies; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-23 15:18 UTC (permalink / raw)
  To: Ian Pratt, Zhang, Xiantao, Keir Fraser, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

> From: Ian Pratt [mailto:Ian.Pratt@eu.citrix.com]

> pre-VT it wasn't possible to trap RDTSC, so this can't help PV guests.

For PV guests, CR4.TSD would always be set, generating
a general protection fault for every rdtsc.  (Or perhaps
I am missing some x86 architectural subtlety?  This is
how it is done on ia64.)
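
To illustrate what that demultiplexing might look like (a sketch
only -- guest_fetch_insn_bytes() and pv_soft_tsc() are
hypothetical names, and the register field names are schematic;
the RDTSC opcode 0x0F 0x31 and the EDX:EAX result convention
are architectural):

/* In the #GP(0) path: if the faulting instruction is rdtsc,
 * supply a software TSC value and skip over the instruction;
 * otherwise let the fault be delivered to the guest as usual. */
static int pv_emulate_rdtsc(struct cpu_user_regs *regs)
{
    unsigned char insn[2];
    uint64_t tsc;

    if (guest_fetch_insn_bytes(regs->eip, insn, 2) != 2)
        return 0;                   /* can't read it; not handled */
    if (insn[0] != 0x0f || insn[1] != 0x31)
        return 0;                   /* some other #GP source */

    tsc = pv_soft_tsc();            /* e.g. Xen system time in ns */
    regs->eax = (uint32_t)tsc;
    regs->edx = (uint32_t)(tsc >> 32);
    regs->eip += 2;                 /* advance past the rdtsc */
    return 1;                       /* handled */
}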

> I'd be rather surprised if VMware trapped RDTSC. From what I 
> gather, ESX3 doesn't make a great deal of use of VT for 32b 
> guests, so at the very least it would be tricky to do 
> anything about user space use of rdtsc.

I had not heard it before, so am very interested in
independent confirmation (or denial).  Given that
it is impossible (I think) to guarantee correct SMP
behavior without it, and given VMware's attention
to correctness details, I guess it doesn't surprise
me.

> I've informally heard that certain version of the JVM and 
> Oracle Db have a habit of pounding rdtsc hard from user 
> space, but I don't know what rates.

Indeed they do and they use it for timestamping
events/transactions, so these are the very same
apps that need to guarantee SMP timestamp ordering.

I realize this is an ugly problem and am searching for
the best middle ground.  For example, if tsc emulation
can be made "fast enough", that's a good answer.

> -----Original Message-----
> From: Ian Pratt [mailto:Ian.Pratt@eu.citrix.com]
> Sent: Thursday, July 23, 2009 8:54 AM
> To: Dan Magenheimer; Zhang, Xiantao; Keir Fraser; Xen-Devel (E-mail)
> Cc: John Levon; Dong, Eddie; Ian Pratt
> Subject: RE: TSC scaling and softtsc reprise, and PROPOSAL
> 
> 
> 
> > Am I correct in reading that your patch is ONLY for
> > HVM guests?  If so, since some (maybe most) workloads
> > that rely on tsc for transaction timestamps will be
> > PV, your patch doesn't solve the whole problem.
> 
> pre-VT it wasn't possible to trap RDTSC, so this can't help PV guests.
> 
> > Can someone at Intel confirm or deny that VMware ESX
> > always traps rdtsc?  If so, it is probably not hard
> > to write an application that works on VMware ESX (on
> > certain hardware) but fails on Xen.
> 
> I'd be rather surprised if VMware trapped RDTSC. From what I 
> gather, ESX3 doesn't make a great deal of use of VT for 32b 
> guests, so at the very least it would be tricky to do 
> anything about user space use of rdtsc.
> 
> I've informally heard that certain version of the JVM and 
> Oracle Db have a habit of pounding rdtsc hard from user 
> space, but I don't know what rates.
> 
> Ian
> 
> 
> > 
> > Thanks,
> > Dan
> > 
> > > -----Original Message-----
> > > From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
> > > Sent: Tuesday, July 21, 2009 11:05 PM
> > > To: Keir Fraser; Dan Magenheimer; Xen-Devel (E-mail)
> > > Cc: John Levon; Ian Pratt; Dong, Eddie
> > > Subject: RE: TSC scaling and softtsc reprise, and PROPOSAL
> > >
> > >
> > > Keir Fraser wrote:
> > > > On 20/07/2009 21:02, "Dan Magenheimer" 
> <dan.magenheimer@oracle.com>
> > > > wrote:
> > > >
> > > >> I agree that if the performance is *really bad*, the default
> > > >> should not change.  But I think we are still flying on rumors
> > > >> of data collected years ago in a very different world, and
> > > >> the performance data should be re-collected to prove that
> > > >> it is still *really bad*.  If the degradation is a fraction
> > > >> of a percent even in worst case analysis, I think the default
> > > >> should be changed so that correctness prevails.
> > > >>
> > > >> Why now?  Because more and more real-world applications are
> > > >> built on top of multi-core platforms where TSC is reliable
> > > >> and (by far) the best timesource.  And I think(?) we all agree
> > > >> now that softtsc is the only way to guarantee correctness
> > > >> in a virtual environment.
> > > >
> > > > So how bad is the non-softtsc default mode anyway? Our default
> > > > timer_mode has guest TSCs track host TSC (plus a fixed per-vcpu
> > > > offset that defaults to having all vcpus of a domain
> > > aligned to vcpu0
> > > > boot = zero tsc).
> > > >
> > > > Looking at the email thread you cited, all I see is someone
> > > from Intel
> > > > saying something about how their code to improve TSC consistency
> > > > across migration avoids RDTSC exiting where possible 
> (which I do not
> > > > see -- if the TSC rates across the hosts do not match 
> closely then
> > > > RDTSC exiting is enabled forever for that domain), and, most
> > > > bizarrely, that their 'solution' may have a tsc drift 
> >10^5 cycles.
> > > > Where did this huge number come from? What solution is 
> being talked
> > > > about, and under what conditions might the claim hold? 
> Who knows!
> > >
> > > We had done the experiment to measure the performance impact
> > > with softtsc using oltp workload, and we saw ~10% performance
> > > loss if rdtsc rate is more than 120,000/second. And we also
> > > did some other tests, and the results show that ~1%
> > > perfomance loss is caused by 10000 rdtsc instructions.  So if
> > > the rdtsc rate is not that high(>10000/second), the
> > > performance impact can be ignored.
> > >
> > > We also introduced some performance optimization solutions,
> > > but as we claimed before, they may bring some TSC drift (
> > > 10^5~10^6 cycles) between virtual processors in SMP cases.
> > > One solution is described below, for example, the guest is
> > > migrated from low TSC freq(low_freq) machine to a high TSC
> > > freq one(high_freq), you know, the low frequency is guest's
> > > expected frequency(exp_freq), and we should let guest be
> > > aware that it is running on the machine with exp_freq TSC to
> > > avoid possbile issues caused by faster TSC in any
> > > optimization solution.
> > >
> > > 1. In this solution, we only guarantee guest's TSC is
> > > increasing monotonically and the average frequency equals
> > > guest's expected frequency(exp_freq) in a fixed time slot 
> (eg. ~1ms).
> > > 2. To be simple,  let guest running in high_freq TSC (with
> > > hardware TSC offset feature, no perfomrance loss) for 1ms,
> > > and then enable rdtsc exiting and use trap and emulation
> > > method(suffers perfomance loss) to let guest running in a
> > > *VERY VERY* low frequency TSC(e.g 0.2 G Hz) for some time,
> > > and the specific value can be calculated with the formula to
> > > guarantee average TSC frquency == exp_freq:
> > > 		time = (high_freq - low_freq) / (low_freq - 0.2).
> > >
> > > 3. If the guest migrate from 2.4G machine to 3.0G machine,
> > > only in (3.0-2.4) /(2.4-0.2) == ~0.273ms guest has to suffer
> > > performance loss in the total time 1ms+0.273ms ,and that is
> > > also to say, in most of the time guest can leverage
> > > hardware's TSC offset feature to reduce perfomrance loss.
> > >
> > > 4.  In the 1.273ms, we can say guest's TSC frequency is
> > > emulated to its expected one through the hardware and
> > > software's co-emulation. And the perfomance loss is very
> > > minor compared with purely softtsc solution.
> > > 5.  But at the same time, since each vcpu's TSC is emulated
> > > indpendently for SMP guest, and they may generate a drift
> > > value between vcpus, and the drift vaule's range should be
> > > 10^5 ~10^6 cycles, and we don't know such drift between vcpus
> > > whether can bring other side-effects.  At least, one
> > > side-effect case we can figure out is when one application
> > > running on one vcpu, and it may see backward TSC value after
> > > its migrating to another vcpu.  Not sure this is a real
> > > problem, but it should exist in theory.
> > >
> > > Attached the draft patch to implement the solution based on
> > > an old #Cset19591.
> > >
> > > Xiantao
>


* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-23 15:18             ` Dan Magenheimer
@ 2009-07-23 15:29               ` Keir Fraser
  2009-07-23 16:39                 ` Dan Magenheimer
  2009-07-23 15:45               ` Keir Fraser
  1 sibling, 1 reply; 36+ messages in thread
From: Keir Fraser @ 2009-07-23 15:29 UTC (permalink / raw)
  To: Dan Magenheimer, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

On 23/07/2009 16:18, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> I've informally heard that certain version of the JVM and
>> Oracle Db have a habit of pounding rdtsc hard from user
>> space, but I don't know what rates.
> 
> Indeed they do and they use it for timestamping
> events/transactions, so these are the very same
> apps that need to guarantee SMP timestamp ordering.

Why would you expect host TSC consistency running on Xen to be worse than
when running on a native OS?

 -- Keir


* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-23 15:18             ` Dan Magenheimer
  2009-07-23 15:29               ` Keir Fraser
@ 2009-07-23 15:45               ` Keir Fraser
  2009-07-23 16:45                 ` Dan Magenheimer
  1 sibling, 1 reply; 36+ messages in thread
From: Keir Fraser @ 2009-07-23 15:45 UTC (permalink / raw)
  To: Dan Magenheimer, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

On 23/07/2009 16:18, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> pre-VT it wasn't possible to trap RDTSC, so this can't help PV guests.
> 
> For PV guests, CR4.TSD would always be set, generating
> a general protection fault for every rdtsc.  (Or perhaps
> I am missing some x86 architectural subtlety?  This is
> how it is done on ia64.)

Forgot about CR4.TSD. Of course it's not going to be a super fast path,
since #GP(0) will need to be demuxed by decoding the faulting instruction in
the hypervisor, via a not-so-short path.

 -- Keir
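
For PV guests the CR4.TSD path could in principle be special-cased
along the lines of the sketch below.  This is purely illustrative: the
helper names are invented (they are not Xen entry points), and a real
implementation would have to fetch the instruction bytes from guest
memory safely and check that the fault is not one the guest's own
CR4.TSD/CPL settings should receive.

#include <stdint.h>
#include <stdbool.h>

struct cpu_regs { uint64_t rip, rax, rdx; };

extern uint64_t virtual_tsc_now(void);   /* e.g. Xen system time in ns */
extern bool copy_from_guest(void *dst, uint64_t gva, unsigned int len);

/* Try to short-circuit a #GP(0) raised by RDTSC under CR4.TSD. */
bool gp_fault_fast_rdtsc(struct cpu_regs *regs)
{
    uint8_t insn[2];

    if (!copy_from_guest(insn, regs->rip, 2))
        return false;                 /* punt to the full decode path */
    if (insn[0] != 0x0f || insn[1] != 0x31)
        return false;                 /* not RDTSC (opcode 0F 31) */

    uint64_t tsc = virtual_tsc_now();
    regs->rax = (uint32_t)tsc;        /* EDX:EAX <- emulated TSC */
    regs->rdx = tsc >> 32;
    regs->rip += 2;                   /* step over the RDTSC */
    return true;
}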

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-23 15:29               ` Keir Fraser
@ 2009-07-23 16:39                 ` Dan Magenheimer
  2009-07-24  8:04                   ` Keir Fraser
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-23 16:39 UTC (permalink / raw)
  To: Keir Fraser, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

> >> I've informally heard that certain version of the JVM and
> >> Oracle Db have a habit of pounding rdtsc hard from user
> >> space, but I don't know what rates.
> > 
> > Indeed they do and they use it for timestamping
> > events/transactions, so these are the very same
> > apps that need to guarantee SMP timestamp ordering.
> 
> Why would you expect host TSC consistency running on Xen to 
> be worse than
> when running on a native OS?

In short, it is because a new class of machine
is emerging in the virtualization space that
is really a NUMA machine, tries to look like
a SMP (non-NUMA) machine by making memory access
fast enough that NUMA-ness can be ignored,
but for the purposes of time, is still a
NUMA machine.

Let's consider three physical platforms:

SMALL = single socket (multi-core)
MEDIUM = multiple sockets, same motherboard
LARGE = multiple sockets, multiple motherboards

The LARGE is becoming more widely available (e.g.
HP DL785) because multiple motherboards are
very convenient for field upgradeability (which
has a major impact on support costs).  They
also make a very nice consolidation target for
virtualizing a bunch of SMALL  machines.  However,
SMALL and MEDIUM are much less expensive so much
more prevalent (especially as development machines!).

On SMALL, TSC is always consistent between cores
(at least on all but the first dual-core processors).

On MEDIUM, some claim that TSC is always consistent
between cores on different sockets because the
sockets share a motherboard crystal.  I don't
know if this is true; if it is true, MEDIUM can
be considered the same as SMALL, if not MEDIUM
can be considered the same as LARGE.  So
ignore MEDIUM as a subcase of one of the others.

On LARGE, the motherboards are connected by
HT or QPI, but neither has any form of clock
synchronization.  So, from a clock perspective,
LARGE needs to be "partitioned"; OR there has
to be sophisticated system software that does
its best to synchronize TSC across all of
the cores (which enterprise OS's like HP-UX
and AIX have, Linux is working on, and Xen
has... though it remains to be seen if any
of these work "good enough"); OR TSC has to
be abandoned altogether by all software that
relies on it (OR TSC needs to be emulated).

This problem on LARGE machines is obscure enough
that software is developed (on SMALL machines)
that has a hidden timebomb if TSC is not perfectly
consistent. Admittedly, all such software should
have a switch that abandons TSC altogether in favor
of an OS "gettimeofday", but this either depends
on TSC as well or on a verrryyy sllloooowwww
platform timer that if used frequently probably
has as bad or worse a performance impact as
emulating TSC.

So what is "good enough"?  If Xen's existing
algorithm works poorly on LARGE systems (or
even on older SMALL systems), applications
should abandon TSC.  If Xen's existing algorithm
works "well", then applications can and should
use TSC.  But unless "good enough" can be carefully
defined and agreed upon between Xen and the
applications AND Xen can communicate "YES
this platform is good enough or NOT" to any
software that cares, we are caught in a gray
area.  Unfortunately, neither is true:  "good
enough" is not defined, AND there is no clean
way to communicate it even if it were.

And living in the gray area means some very
infrequent, very bizarre bugs can arise because
sometimes, unbeknownst to that application,
rarely and irreproducibly, time will appear to
go backwards.  And if timestamps are used,
for example, to replay transactions, data
corruption occurs.

So the choices are:
1) Ignore the problem and hope it never happens (or
   if it does that Xen doesn't get blamed)
2) Tell all Xen users that TSC should not be used
   as a timestamp.  (In other words, fix your apps
   or always turn on the app's TSC-is-bad option when
   running virtualized on a "bad" physical machine.)
3) Always emulate TSC and let the heavy TSC users
   pay the performance cost.

Last, as Intel has pointed out, a related kind of
issue occurs when live migration moves a running
VM from a machine with one TSC rate to another machine
with a different TSC rate (or if TSC rate varies
on the same machine, i.e. for power-savings reasons).
It would be nice if our choice (above) solves this
problem too.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-23 15:45               ` Keir Fraser
@ 2009-07-23 16:45                 ` Dan Magenheimer
  0 siblings, 0 replies; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-23 16:45 UTC (permalink / raw)
  To: Keir Fraser, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

> >> pre-VT it wasn't possible to trap RDTSC, so this can't 
> help PV guests.
> > 
> > For PV guests, CR4.TSD would always be set, generating
> > a general protection fault for every rdtsc.  (Or perhaps
> > I am missing some x86 architectural subtlety?  This is
> > how it is done on ia64.)
> 
> Forgot about CR4.TSD. Of course it's not going to be a super 
> fast path,
> since #GP(0) will need to be demuxed by decoding the faulting 
> instruction in
> the hypervisor, via a not-so-short path.

And demuxing/decoding can never be "fast enough"?
Even if this particular situation is special-cased?

I'm not saying it will be pretty, or maintainable,
just wondering what the best possible could be?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-23 16:39                 ` Dan Magenheimer
@ 2009-07-24  8:04                   ` Keir Fraser
  2009-07-24 14:47                     ` Dan Magenheimer
  2009-08-05  0:05                     ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 36+ messages in thread
From: Keir Fraser @ 2009-07-24  8:04 UTC (permalink / raw)
  To: Dan Magenheimer, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

On 23/07/2009 17:39, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> Why would you expect host TSC consistency running on Xen to
>> be worse than
>> when running on a native OS?
> 
> In short, it is because a new class of machine
> is emerging in the virtualization space that
> is really a NUMA machine, tries to look like
> a SMP (non-NUMA) machine by making memory access
> fast enough that NUMA-ness can be ignored,
> but for the purposes of time, is still a
> NUMA machine.

Okay, so the issue you are worried about is not specific to Xen. So how is
native Linux tackling this, for example?

 -- Keir

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-24  8:04                   ` Keir Fraser
@ 2009-07-24 14:47                     ` Dan Magenheimer
  2009-08-05  0:05                     ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-24 14:47 UTC (permalink / raw)
  To: Keir Fraser, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

> >> Why would you expect host TSC consistency running on Xen to
> >> be worse than
> >> when running on a native OS?
> > 
> > In short, it is because a new class of machine
> > is emerging in the virtualization space that
> > is really a NUMA machine, tries to look like
> > a SMP (non-NUMA) machine by making memory access
> > fast enough that NUMA-ness can be ignored,
> > but for the purposes of time, is still a
> > NUMA machine.
> 
> Okay, so the issue you are worried about is not specific to 
> Xen. So how is
> native Linux tackling this, for example?

I'm not sure that it is, though I'll look into it.

But the difference is that, in a virtual environment,
sometimes it is "safe" to use TSC and sometimes it is
not and, on a LARGE machine, this changes dynamically.
Further, a guest may "originate" on a physical machine
where it is safe and migrate to a physical machine
where it is not.

OS's may ask "is TSC safe", but do so once at startup,
and unfortunately the method to ask is ill-defined.
Applications have no way of asking "is TSC safe" so
either use a one-time startup configuration option
or depend on the OS to make the determination by
always using something like gettimeofdayns().

So if Xen ever responds to an OS asking "is TSC safe",
it should answer it for the whole datacenter (which
itself is not static as new machines might be added
while a VM is live).  As a result, Xen's response must
always be NO.  (Unless, softtsc is the default in which
case the answer can be YES.)

If Xen's response is always NO, apps must use,
indirectly through the OS, a platform timer (which is
probably a lot slower than softtsc!)

So, in the end, to guarantee correctness, high-
frequency-time-stamping apps are going to have slow
access anyway.  And so my conclusion is that we should
always trap TSC, which can guarantee a fixed-frequency
monotonically-increasing timestamp source across all
machines of all frequencies, whether an app or OS
asks "is TSC safe" or not.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-23 14:54           ` Ian Pratt
  2009-07-23 15:18             ` Dan Magenheimer
@ 2009-07-27 14:47             ` Dan Magenheimer
  2009-07-27 14:55               ` Keir Fraser
  2009-07-28  1:46               ` Zhang, Xiantao
  1 sibling, 2 replies; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-27 14:47 UTC (permalink / raw)
  To: Ian Pratt, Zhang, Xiantao, Keir Fraser, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

> > Can someone at Intel confirm or deny that VMware ESX
> > always traps rdtsc?  If so, it is probably not hard
> > to write an application that works on VMware ESX (on
> > certain hardware) but fails on Xen.
> 
> I'd be rather surprised if VMware trapped RDTSC. From what I 
> gather, ESX3 doesn't make a great deal of use of VT for 32b 
> guests, so at the very least it would be tricky to do 
> anything about user space use of rdtsc.

Some googling and reading provides evidence that VMware
does indeed virtualize the TSC.  The timekeeping paper
http://www.vmware.com/pdf/vmware_timekeeping.pdf
tells how to turn vTSC off, but says that turning it
off is no longer recommended.  The ASPLOS paper
http://www.vmware.com/pdf/asplos235_adams.pdf
uses rdtsc as an example of how binary translation
is much faster than emulation or callout (though
their BT version fetches a stale TSC which afaict
doesn't solve the ordering problem).

Also, Avi Kivity tells me that the KVM folks have
also recently come to the conclusion that it is necessary
to emulate TSC, though KVM currently does not.

Dan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-27 14:47             ` Dan Magenheimer
@ 2009-07-27 14:55               ` Keir Fraser
  2009-07-27 17:25                 ` Dan Magenheimer
  2009-07-28  1:46               ` Zhang, Xiantao
  1 sibling, 1 reply; 36+ messages in thread
From: Keir Fraser @ 2009-07-27 14:55 UTC (permalink / raw)
  To: Dan Magenheimer, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

On 27/07/2009 15:47, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Some googling and reading provides evidence that VMware
> does indeed virtualize the TSC.  The timekeeping paper
> http://www.vmware.com/pdf/vmware_timekeeping.pdf
> tells how to turn vTSC off, but says that turning it
> off is no longer recommended.

I believe this affects the guest OS executing RDTSC, not guest apps, and is
only to delay the TSC to not 'run past' pending timer ticks (typically where
they have been delayed due to the guest being preempted).

 -- Keir

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-27 14:55               ` Keir Fraser
@ 2009-07-27 17:25                 ` Dan Magenheimer
  2009-07-27 19:55                   ` Keir Fraser
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-27 17:25 UTC (permalink / raw)
  To: Keir Fraser, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

> > Some googling and reading provides evidence that VMware
> > does indeed virtualize the TSC.  The timekeeping paper
> > http://www.vmware.com/pdf/vmware_timekeeping.pdf
> > tells how to turn vTSC off, but says that turning it
> > off is no longer recommended.
> 
> I believe this affects the guest OS executing RDTSC, not 
> guest apps, and is
> only to delay the TSC to not 'run past' pending timer ticks 
> (typically where
> they have been delayed due to the guest being preempted).
> 
>  -- Keir

Could be.  The text would lead me to believe otherwise
though.  Read the section on "Virtual TSC" in the
above PDF. Specifically the Virtual TSC "advances even
when the virtual CPU is not running" and "In the
past, this feature had sometimes been recommended to
improve performance of APPLICATIONS that read the
TSC frequently..." (my emphasis)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-27 17:25                 ` Dan Magenheimer
@ 2009-07-27 19:55                   ` Keir Fraser
  2009-07-27 22:14                     ` Dan Magenheimer
  2009-08-03 20:19                     ` Dan Magenheimer
  0 siblings, 2 replies; 36+ messages in thread
From: Keir Fraser @ 2009-07-27 19:55 UTC (permalink / raw)
  To: Dan Magenheimer, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

On 27/07/2009 18:25, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> I believe this affects the guest OS executing RDTSC, not
>> guest apps, and is
>> only to delay the TSC to not 'run past' pending timer ticks
>> (typically where
>> they have been delayed due to the guest being preempted).
>> 
>>  -- Keir
> 
> Could be.  The text would lead me to believe otherwise
> though.  Read the section on "Virtual TSC" in the
> above PDF. Specifically the Virtual TSC "advances even
> when the virtual CPU is not running" and "In the
> past, this feature had sometimes been recommended to
> improve performance of APPLICATIONS that read the
> TSC frequently..." (my emphasis)

Yes, then it sounds like they virtualise it for apps too. Also there is an
option to virtualise the TSC at a specified frequency -- that would be
pretty weird if it applied only to guest-OS RDTSCs but not guest-app RDTSCs.

Interesting...

 -- Keir

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-27 19:55                   ` Keir Fraser
@ 2009-07-27 22:14                     ` Dan Magenheimer
  2009-07-27 22:39                       ` Keir Fraser
  2009-08-03 20:19                     ` Dan Magenheimer
  1 sibling, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-27 22:14 UTC (permalink / raw)
  To: Keir Fraser, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

> >> I believe this affects the guest OS executing RDTSC, not
> >> guest apps, and is
> >> only to delay the TSC to not 'run past' pending timer ticks
> >> (typically where
> >> they have been delayed due to the guest being preempted).
> > 
> > Could be.  The text would lead me to believe otherwise
> > though.  Read the section on "Virtual TSC" in the
> > above PDF. Specifically the Virtual TSC "advances even
> > when the virtual CPU is not running" and "In the
> > past, this feature had sometimes been recommended to
> > improve performance of APPLICATIONS that read the
> > TSC frequently..." (my emphasis)
> 
> Yes, then it sounds like they virtualise it for apps too. 
> Also there is an
> option to virtualise the TSC at a specified frequency -- that would be
> pretty weird if it applied only to guest-OS RDTSCs but not 
> guest-app RDTSCs.
> 
> Interesting...
> 
>  -- Keir

And further, the frequency is "sticky" across migration, with
the frequency set to whatever machine the VM originated on.

I'd be inclined to just use Xen system time and thus
the TSC frequency would be always 1GHz on all systems.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-27 22:14                     ` Dan Magenheimer
@ 2009-07-27 22:39                       ` Keir Fraser
  0 siblings, 0 replies; 36+ messages in thread
From: Keir Fraser @ 2009-07-27 22:39 UTC (permalink / raw)
  To: Dan Magenheimer, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

On 27/07/2009 23:14, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> Yes, then it sounds like they virtualise it for apps too.
>> Also there is an
>> option to virtualise the TSC at a specified frequency -- that would be
>> pretty weird if it applied only to guest-OS RDTSCs but not
>> guest-app RDTSCs.
>> 
>> Interesting...
> 
> And further, the frequency is "sticky" across migration, with
> the frequency set to whatever machine the VM originated on.
> 
> I'd be inclined to just use Xen system time and thus
> the TSC frequency would be always 1GHz on all systems.

Well that is what softtsc does already, albeit only for HVM guests so far.

 -- Keir

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-23 13:24         ` Dan Magenheimer
  2009-07-23 14:54           ` Ian Pratt
@ 2009-07-28  0:55           ` Zhang, Xiantao
  1 sibling, 0 replies; 36+ messages in thread
From: Zhang, Xiantao @ 2009-07-28  0:55 UTC (permalink / raw)
  To: Dan Magenheimer, Keir Fraser, Xen-Devel (E-mail)
  Cc: Ian Pratt, Dong, Eddie, John Levon

Hi, Dan
	Sorry for the late reply!  See my comments below.
> 
> Thanks very much for the additional detail on the 10%
> performance loss.  What is this oltp benchmark?  Is
> it available for others to run?  Also is the rdtsc
> rate 120000/sec on EACH processor?

OLTP benchmark is a test case of sysbench, and you can get it through the following link:
http://sysbench.sourceforge.net/

And we only configured one virtual processor for the VM, and I don't know whether oltp can use two virtual processors.

> 
> Assuming a 3GHz machine, your results seem to show that
> emulating a rdtsc with softtsc takes about 2500 cycles.
> This agrees with my approximation of about 1 usec.
> 
> Have you analyzed where this 2500 cycles is being used?
> My suggestion about performance optimization was not
> to try a different algorithm but to see if it is possible
> to code the existing algorithm much faster using a
> special trap path and assembly code. (We called this
> a "fast path" on Xen/ia64.)  Even if the 2500 cycles
> can be cut in half, that would be a big win.

There is no fast path for emulating rdtsc on the x86 side, and the main cost should come from the hardware context switch. Since I was using an old machine when running this benchmark, the cost should be reduced sharply on the latest processors, where I haven't done the test.

> Am I correct in reading that your patch is ONLY for
> HVM guests?  If so, since some (maybe most) workloads
> that rely on tsc for transaction timestamps will be
> PV, your patch doesn't solve the whole problem.

Yes, this patch is only for HVM guests, because only HVM guests can use the TSC offset feature (one of the VT features), and also I don't think PV guests need it.

> Can someone at Intel confirm or deny that VMware ESX
> always traps rdtsc?  If so, it is probably not hard
> to write an application that works on VMware ESX (on
> certain hardware) but fails on Xen.
\
> 
>> -----Original Message-----
>> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
>> Sent: Tuesday, July 21, 2009 11:05 PM
>> To: Keir Fraser; Dan Magenheimer; Xen-Devel (E-mail)
>> Cc: John Levon; Ian Pratt; Dong, Eddie
>> Subject: RE: TSC scaling and softtsc reprise, and PROPOSAL
>> 
>> 
>> Keir Fraser wrote:
>>> On 20/07/2009 21:02, "Dan Magenheimer" <dan.magenheimer@oracle.com>
>>> wrote: 
>>> 
>>>> I agree that if the performance is *really bad*, the default
>>>> should not change.  But I think we are still flying on rumors
>>>> of data collected years ago in a very different world, and
>>>> the performance data should be re-collected to prove that
>>>> it is still *really bad*.  If the degradation is a fraction
>>>> of a percent even in worst case analysis, I think the default
>>>> should be changed so that correctness prevails.
>>>> 
>>>> Why now?  Because more and more real-world applications are
>>>> built on top of multi-core platforms where TSC is reliable
>>>> and (by far) the best timesource.  And I think(?) we all agree
>>>> now that softtsc is the only way to guarantee correctness
>>>> in a virtual environment.
>>> 
>>> So how bad is the non-softtsc default mode anyway? Our default
>>> timer_mode has guest TSCs track host TSC (plus a fixed per-vcpu
>>> offset that defaults to having all vcpus of a domain aligned to
>>> vcpu0 boot = zero tsc). 
>>> 
>>> Looking at the email thread you cited, all I see is someone from
>>> Intel saying something about how their code to improve TSC
>>> consistency across migration avoids RDTSC exiting where possible
>>> (which I do not see -- if the TSC rates across the hosts do not
>>> match closely then RDTSC exiting is enabled forever for that
>>> domain), and, most bizarrely, that their 'solution' may have a tsc
>>> drift >10^5 cycles. Where did this huge number come from? What
>>> solution is being talked about, and under what conditions might the
>>> claim hold? Who knows! 
>> 
>> We had done the experiment to measure the performance impact
>> with softtsc using oltp workload, and we saw ~10% performance
>> loss if rdtsc rate is more than 120,000/second. And we also
>> did some other tests, and the results show that ~1%
>> performance loss is caused by 10000 rdtsc instructions.  So if
>> the rdtsc rate is not that high (>10000/second), the
>> performance impact can be ignored.
>> 
>> We also introduced some performance optimization solutions,
>> but as we claimed before, they may bring some TSC drift (
>> 10^5~10^6 cycles) between virtual processors in SMP cases.
>> One solution is described below, for example, the guest is
>> migrated from low TSC freq(low_freq) machine to a high TSC
>> freq one(high_freq), you know, the low frequency is guest's
>> expected frequency(exp_freq), and we should let guest be
>> aware that it is running on the machine with exp_freq TSC to
>> avoid possible issues caused by a faster TSC in any
>> optimization solution.
>> 
>> 1. In this solution, we only guarantee guest's TSC is
>> increasing monotonically and the average frequency equals
>> guest's expected frequency (exp_freq) in a fixed time slot (e.g. ~1ms).
>> 2. To be simple, let the guest run on the high_freq TSC (with the
>> hardware TSC offset feature, no performance loss) for 1ms,
>> and then enable rdtsc exiting and use the trap-and-emulation
>> method (suffers performance loss) to let the guest run on a
>> *VERY VERY* low frequency TSC (e.g. 0.2 GHz) for some time,
>> and the specific value can be calculated with the formula to
>> guarantee average TSC frequency == exp_freq:
>> 		time = (high_freq - low_freq) / (low_freq - 0.2).
>> 
>> 3. If the guest migrates from a 2.4G machine to a 3.0G machine,
>> the guest has to suffer the performance loss only for
>> (3.0-2.4)/(2.4-0.2) == ~0.273ms out of the total time 1ms+0.273ms,
>> which is also to say, most of the time the guest can leverage
>> the hardware's TSC offset feature to reduce the performance loss.
>> 
>> 4.  In the 1.273ms, we can say the guest's TSC frequency is
>> emulated to its expected one through the hardware and
>> software's co-emulation. And the performance loss is very
>> minor compared with the purely softtsc solution.
>> 5.  But at the same time, since each vcpu's TSC is emulated
>> independently for an SMP guest, they may develop a drift
>> value between vcpus, and the drift value's range should be
>> 10^5 ~10^6 cycles, and we don't know whether such drift between
>> vcpus can bring other side-effects.  At least, one
>> side-effect case we can figure out is that an application
>> running on one vcpu may see a backward TSC value after
>> it migrates to another vcpu.  Not sure this is a real
>> problem, but it should exist in theory.
>> 
>> Attached the draft patch to implement the solution based on
>> an old #Cset19591.
>> 
>> Xiantao

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-27 14:47             ` Dan Magenheimer
  2009-07-27 14:55               ` Keir Fraser
@ 2009-07-28  1:46               ` Zhang, Xiantao
  2009-07-28 14:45                 ` Dan Magenheimer
  1 sibling, 1 reply; 36+ messages in thread
From: Zhang, Xiantao @ 2009-07-28  1:46 UTC (permalink / raw)
  To: Dan Magenheimer, Ian Pratt, Keir Fraser, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

Dan Magenheimer wrote:
>>> Can someone at Intel confirm or deny that VMware ESX
>>> always traps rdtsc?  If so, it is probably not hard
>>> to write an application that works on VMware ESX (on
>>> certain hardware) but fails on Xen.
>> 
>> I'd be rather surprised if VMware trapped RDTSC. From what I
>> gather, ESX3 doesn't make a great deal of use of VT for 32b
>> guests, so at the very least it would be tricky to do
>> anything about user space use of rdtsc.
> 
> Some googling and reading provides evidence that VMware
> does indeed virtualize the TSC.  The timekeeping paper
> http://www.vmware.com/pdf/vmware_timekeeping.pdf
> tells how to turn vTSC off, but says that turning it
> off is no longer recommended.  The ASPLOS paper
> http://www.vmware.com/pdf/asplos235_adams.pdf
> uses rdtsc as an example of how binary translation
> is much faster than emulation or callout (though
> their BT version fetches a stale TSC which afaict
> doesn't solve the ordering problem).
> Also, Avi Kivity tells me that the KVM folks have
> also recently come to the conclusion that it is necessary
> to emulate TSC, though KVM currently does not. 

Hi, Dan 
   I am still confused about why emulating rdtsc is necessary. Even if we emulate it in software, we still need to find a stable time source, right?  If you think the TSC is not stable on an SMP system, then I think the issue should also exist in a native OS which depends on the TSC as its time source, rather than being a Xen-specific issue.  If the host's TSC is stable enough, I think the hardware's TSC offset feature should be the right way to go?

I have a proposal on it. If Xen finds the hardware's TSC is not reliable, it can tell the guest about that at the guest's boot stage, and the guest should use other time sources (eg, hpet) instead of the TSC. And if the TSC is reliable in hardware, I think we should let Xen try its best to use the hardware's feature and just leave it as the current implementation. If users know from their own knowledge that the hardware's TSC is not reliable, they may set softtsc to solve the possible issue manually.  So maybe we only need to create a way to tell the guest the TSC's status from the Xen hypervisor.
Xiantao

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-28  1:46               ` Zhang, Xiantao
@ 2009-07-28 14:45                 ` Dan Magenheimer
  2009-07-28 15:00                   ` Keir Fraser
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-28 14:45 UTC (permalink / raw)
  To: Zhang, Xiantao, Ian Pratt, Keir Fraser, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

>    I am still confused about why emulating rdtsc is
> necessary. Even if we emulate it in software, we still need to
> find a stable time source, right?  If you think the TSC is not
> stable on an SMP system, then I think the issue should also
> exist in a native OS which depends on the TSC as its time
> source, rather than being a Xen-specific issue.  If the host's
> TSC is stable enough, I think the hardware's TSC offset feature
> should be the right way to go?
> 
> I have a proposal on it. If Xen finds the hardware's TSC is not
> reliable, it can tell the guest about that at the guest's boot
> stage, and the guest should use other time sources (eg, hpet)
> instead of the TSC. And if the TSC is reliable in hardware, I
> think we should let Xen try its best to use the hardware's
> feature and just leave it as the current implementation. If
> users know from their own knowledge that the hardware's TSC is
> not reliable, they may set softtsc to solve the possible issue
> manually.  So maybe we only need to create a way to tell the
> guest the TSC's status from the Xen hypervisor.

Hi Xiantao --

Thanks for the info in your previous reply.

The issue is that there's no easy way for Xen to
determine for sure if the hardware has a reliable TSC.
The TSC_CONSTANT bit in the MSR only applies to the
socket, not to the entire system.  Even if
it is possible on one box, it is not possible to
determine it for the whole data center (to handle
live migration).  Even if it is possible for the
whole data center, new machines may be live-added
to a data center that might be different.  So
no SMP application in a virtualized data center
can assume TSC is monotonic.  But SMP applications
designed on smaller multi-core physical systems
CAN assume TSC is monotonic; when these apps
are moved to virtual systems, problems will
occur (including possible data corruption).
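
(For reference, the usual way to check this per package is the
"invariant TSC" bit, CPUID leaf 0x80000007, EDX bit 8; a minimal check
is sketched below.  Note that a set bit only describes the package the
check runs on; it promises nothing about synchronization across
sockets or node boards, which is exactly the problem.)

#include <stdbool.h>
#include <stdint.h>

/* Check the "invariant TSC" bit: CPUID leaf 0x80000007, EDX bit 8.
 * This only promises a constant-rate TSC for this package; it says
 * nothing about cross-socket or cross-board synchronization. */
static bool cpu_has_invariant_tsc(void)
{
    uint32_t eax, ebx, ecx, edx;

    __asm__ volatile("cpuid"
                     : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                     : "a" (0x80000007u), "c" (0u));
    return (edx >> 8) & 1;
}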

As you've pointed out, there are other issues
with migration and power management where the
TSC frequency changes.  Unless/until there is
a TSC-scaling feature as well as a TSC-offset
feature, frequency changes will have to be
handled in software.

A virtual TSC (always trapping all rdtsc instructions)
allows us to guarantee monotonicity and provide a
constant rate (Xen time*, 1GHz) across all processors
in a machine and all machines in a data center.
There is a performance impact for applications
that execute RDTSC at a high frequency but I hope
that we can reduce this penalty somewhat.

I am proposing that the virtual TSC be the default.
We should provide a per-VM option and a Xen boot
option to allow VMs to NOT trap rdtsc, but this
should have a warning that it is not recommended
and may result in data corruption in some apps.

* Xen time by itself is not monotonic across multiple
processors but can be supplemented with a global
variable to always provide "last_tsc + 1" to
enforce monotonicity.
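
A minimal sketch of that clamping, assuming a per-domain shared
variable and placeholder helpers for Xen system time and a 64-bit
compare-and-swap (the real data structure, naming and locking would
need more care):

#include <stdint.h>

/* Placeholders; the real Xen equivalents are assumed, not named here. */
extern uint64_t xen_system_time_ns(void);                  /* ~1GHz tick */
extern uint64_t cmpxchg64(volatile uint64_t *p, uint64_t o, uint64_t n);

static volatile uint64_t domain_last_tsc;  /* one per domain in practice */

/* Return an emulated TSC value that never goes backwards on any vcpu. */
uint64_t virtual_rdtsc(void)
{
    uint64_t now, last;

    do {
        last = domain_last_tsc;
        now  = xen_system_time_ns();
        if (now <= last)
            now = last + 1;    /* clamp to enforce strict monotonicity */
    } while (cmpxchg64(&domain_last_tsc, last, now) != last);

    return now;
}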

Dan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-28 14:45                 ` Dan Magenheimer
@ 2009-07-28 15:00                   ` Keir Fraser
  2009-07-28 15:46                     ` Dan Magenheimer
  0 siblings, 1 reply; 36+ messages in thread
From: Keir Fraser @ 2009-07-28 15:00 UTC (permalink / raw)
  To: Dan Magenheimer, Zhang, Xiantao, Ian Pratt, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

On 28/07/2009 15:45, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I am proposing that the virtual TSC be the default.
> We should provide a per-VM option and a Xen boot
> option to allow VMs to NOT trap rdtsc, but this
> should have a warning that it is not recommended
> and may result in data corruption in some apps.

This I can agree with. The softtsc boot option is just lazy. This should
properly be a per-VM option, for both HVM and PV guests. For example
tsc_freq=x sets virtual TSC of x MHz. And not specifying tsc_freq gets you
current behaviour (pass through host TSC where possible). That I could quite
happily live with, although I'm not planning to implement it myself.

 -- Keir
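
As a sketch only, such a per-VM knob might end up looking like the
fragment below in a guest config file; neither line is an existing
option today, and the names simply follow the suggestion above.

# Hypothetical per-VM setting: emulate RDTSC at a fixed virtual
# frequency, in MHz.  Leaving it unset would keep the current
# behaviour (pass through the host TSC where possible).
tsc_freq = 1000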

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-28 15:00                   ` Keir Fraser
@ 2009-07-28 15:46                     ` Dan Magenheimer
  2009-07-28 15:58                       ` Keir Fraser
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-28 15:46 UTC (permalink / raw)
  To: Keir Fraser, Zhang, Xiantao, Ian Pratt, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

> And not specifying tsc_freq gets you
> current behaviour (pass through host TSC where possible). 

I fear that unless the default is changed, it will
not be possible to sufficiently explain the problem
to users/administrators and the option will not get
turned on. In which case, it might as well not be
done at all... just one more obscure option that
nobody understands or uses.

Given that correctness is at stake (and given that
Xen's primary competitors are choosing correctness
over performance), I see this as a Xen developer
problem to fix, not one to pawn off on harried
system admins.

Savvy system admins (who know every app in their
data center and/or are willing to take the risk
for better performance) should be able to easily
disable softtsc though on all servers with a xen boot
option, or on a per VM basis.

We could quibble about details (maybe softtsc
should only be automatically enabled on SMP guests
or on 64-bit SMP guests or ?? ) but I suspect
that just creates a mess and IMHO we should just
bite the bullet.

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Tuesday, July 28, 2009 9:00 AM
> To: Dan Magenheimer; Zhang, Xiantao; Ian Pratt; Xen-Devel (E-mail)
> Cc: John Levon; Dong, Eddie
> Subject: Re: TSC scaling and softtsc reprise, and PROPOSAL
> 
> 
> On 28/07/2009 15:45, "Dan Magenheimer" 
> <dan.magenheimer@oracle.com> wrote:
> 
> > I am proposing that the virtual TSC be the default.
> > We should provide a per-VM option and a Xen boot
> > option to allow VMs to NOT trap rdtsc, but this
> > should have a warning that it is not recommended
> > and may result in data corruption in some apps.
> 
> This I can agree with. The softtsc boot option is just lazy. 
> This should
> properly be a per-VM option, for both HVM and PV guests. For example
> tsc_freq=x sets virtual TSC of x MHz. And not specifying 
> tsc_freq gets you
> current behaviour (pass through host TSC where possible). 
> That I could quite
> happily live with, although I'm not planning to implement it myself.
> 
>  -- Keir
> 
> 
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-28 15:46                     ` Dan Magenheimer
@ 2009-07-28 15:58                       ` Keir Fraser
  2009-07-28 18:15                         ` Dan Magenheimer
  0 siblings, 1 reply; 36+ messages in thread
From: Keir Fraser @ 2009-07-28 15:58 UTC (permalink / raw)
  To: Dan Magenheimer, Zhang, Xiantao, Ian Pratt, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

On 28/07/2009 16:46, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Savvy system admins (who know every app in their
> data center and/or are willing to take the risk
> for better performance) should be able to easily
> disable softtsc though on all servers with a xen boot
> option, or on a per VM basis.
> 
> We could quibble about details (maybe softtsc
> should only be automatically enabled on SMP guests
> or on 64-bit SMP guests or ?? ) but I suspect
> that just creates a mess and IMHO we should just
> bite the bullet.

I can live with that, if it is driven from the xend toolstack. It will have
to default off in the hypervisor for compatibility with old saved images.

 -- Keir

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-28 15:58                       ` Keir Fraser
@ 2009-07-28 18:15                         ` Dan Magenheimer
  2009-07-28 18:43                           ` Keir Fraser
  2009-07-28 18:48                           ` Keir Fraser
  0 siblings, 2 replies; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-28 18:15 UTC (permalink / raw)
  To: Keir Fraser, Zhang, Xiantao, Ian Pratt, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

 
> > Savvy system admins (who know every app in their
> > data center and/or are willing to take the risk
> > for better performance) should be able to easily
> > disable softtsc though on all servers with a xen boot
> > option, or on a per VM basis.
> > 
> > We could quibble about details (maybe softtsc
> > should only be automatically enabled on SMP guests
> > or on 64-bit SMP guests or ?? ) but I suspect
> > that just creates a mess and IMHO we should just
> > bite the bullet.
> 
> I can live with that, if it is driven from the xend 
> toolstack. It will have
> to default off in the hypervisor for compatibility with old 
> saved images.
> 
>  -- Keir

Hmmm... one could argue that with the current model,
any VM using TSC is "at your own peril" and there are
certainly cases of restore that will break whatever
assumptions the VM is making about pre-save TSC
values.   So while I'm a believer in compatibility,
I'd suggest default ON in the hypervisor and
a new restore option that force-overrides the
softtsc boot-time default for any VM being restored.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-28 18:15                         ` Dan Magenheimer
@ 2009-07-28 18:43                           ` Keir Fraser
  2009-07-28 19:10                             ` Dan Magenheimer
  2009-07-28 18:48                           ` Keir Fraser
  1 sibling, 1 reply; 36+ messages in thread
From: Keir Fraser @ 2009-07-28 18:43 UTC (permalink / raw)
  To: Dan Magenheimer, Zhang, Xiantao, Ian Pratt, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

On 28/07/2009 19:15, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> I can live with that, if it is driven from the xend
>> toolstack. It will have
>> to default off in the hypervisor for compatibility with old
>> saved images.
> 
> Hmmm... one could argue that with the current model,
> any VM using TSC is "at your own peril" and there are
> certainly cases of restore that will break whatever
> assumptions the VM is making about pre-save TSC
> values.   So while I'm a believer in compatibility,
> I'd suggest default ON in the hypervisor and
> a new restore option that force-overrides the
> softtsc boot-time default for any VM being restored.

It would be defaulted on by the toolstack for all newly created guests.
That's quite sufficient I think.

 -- Keir

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-28 18:15                         ` Dan Magenheimer
  2009-07-28 18:43                           ` Keir Fraser
@ 2009-07-28 18:48                           ` Keir Fraser
  1 sibling, 0 replies; 36+ messages in thread
From: Keir Fraser @ 2009-07-28 18:48 UTC (permalink / raw)
  To: Dan Magenheimer, Zhang, Xiantao, Ian Pratt, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

On 28/07/2009 19:15, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> a new restore option that force-overrides the
> softtsc boot-time default for any VM being restored.

Not sure if you mean guest boot-time or host boot-time here, by the way. If
you mean the latter, I should point out that a per-VM option for this would
supplant the softtsc boot parameter (which would be completely removed).

 -- Keir

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-28 18:43                           ` Keir Fraser
@ 2009-07-28 19:10                             ` Dan Magenheimer
  0 siblings, 0 replies; 36+ messages in thread
From: Dan Magenheimer @ 2009-07-28 19:10 UTC (permalink / raw)
  To: Keir Fraser, Zhang, Xiantao, Ian Pratt, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

> >> I can live with that, if it is driven from the xend
> >> toolstack. It will have
> >> to default off in the hypervisor for compatibility with old
> >> saved images.
> > 
> > Hmmm... one could argue that with the current model,
> > any VM using TSC is "at your own peril" and there are
> > certainly cases of restore that will break whatever
> > assumptions the VM is making about pre-save TSC
> > values.   So while I'm a believer in compatibility,
> > I'd suggest default ON in the hypervisor and
> > a new restore option that force-overrides the
> > softtsc boot-time default for any VM being restored.
> 
> It would be defaulted on by the toolstack for all newly 
> created guests.
> That's quite sufficient I think.

I guess I'm concerned that there are many toolstacks
that will need to be fixed, but there is one hypervisor.
Defaulting to softtsc in the hypervisor essentially
fixes the problem for the future and makes it clear
that the Xen developers have made a decision; waiting
for various vendor toolstacks to enforce a default (not
to mention going through the argument to convince
each vendor) presents a mixed message, prolongs the agony,
and almost guarantees chaos for years to come.

This is a subtle but fundamental change in the way
Xen works, necessary for correctness.  I think we
should bite the bullet and do it right.

Can the hypervisor itself tell the difference whether
a domain is being created vs restored?  I think not,
but if it can, that might be a good compromise.

Dan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-27 19:55                   ` Keir Fraser
  2009-07-27 22:14                     ` Dan Magenheimer
@ 2009-08-03 20:19                     ` Dan Magenheimer
  1 sibling, 0 replies; 36+ messages in thread
From: Dan Magenheimer @ 2009-08-03 20:19 UTC (permalink / raw)
  To: Keir Fraser, Ian Pratt, Zhang, Xiantao, Xen-Devel (E-mail)
  Cc: Dong, Eddie, John Levon

FYI, I have confirmed with a VMware expert that
TSC is always emulated (unless a flag is set).

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Monday, July 27, 2009 1:55 PM
> To: Dan Magenheimer; Ian Pratt; Zhang, Xiantao; Xen-Devel (E-mail)
> Cc: John Levon; Dong, Eddie
> Subject: Re: TSC scaling and softtsc reprise, and PROPOSAL
> 
> 
> On 27/07/2009 18:25, "Dan Magenheimer" 
> <dan.magenheimer@oracle.com> wrote:
> 
> >> I believe this affects the guest OS executing RDTSC, not
> >> guest apps, and is
> >> only to delay the TSC to not 'run past' pending timer ticks
> >> (typically where
> >> they have been delayed due to the guest being preempted).
> >> 
> >>  -- Keir
> > 
> > Could be.  The text would lead me to believe otherwise
> > though.  Read the section on "Virtual TSC" in the
> > above PDF. Specifically the Virtual TSC "advances even
> > when the virtual CPU is not running" and "In the
> > past, this feature had sometimes been recommended to
> > improve performance of APPLICATIONS that read the
> > TSC frequently..." (my emphasis)
> 
> Yes, then it sounds like they virtualise it for apps too. 
> Also there is an
> option to virtualise the TSC at a specified frequency -- that would be
> pretty weird if it applied only to guest-OS RDTSCs but not 
> guest-app RDTSCs.
> 
> Interesting...
> 
>  -- Keir
> 
> 
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-07-24  8:04                   ` Keir Fraser
  2009-07-24 14:47                     ` Dan Magenheimer
@ 2009-08-05  0:05                     ` Jeremy Fitzhardinge
  2009-08-05  5:35                       ` Tian, Kevin
  1 sibling, 1 reply; 36+ messages in thread
From: Jeremy Fitzhardinge @ 2009-08-05  0:05 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Dan Magenheimer, Xen-Devel (E-mail),
	Dong, Eddie, John Levon, Ian Pratt, Zhang, Xiantao

On 07/24/09 01:04, Keir Fraser wrote:
> Okay, so the issue you are worried about is not specific to Xen. So how is
> native Linux tackling this, for example?
>   

Linux will use the tsc where possible, but regularly assesses its
perceived accuracy and will move to a different clocksource if the tsc
appears to be playing up.  I don't think it ever assumes the tsc is
synced between CPU/cores.

It allows rdtsc from usermode, but it is generally considered to be very
buggy and ill-defined behaviour.  It makes no attempt to make usermode
rdtsc in any way meaningful.  The exception is the vgettimeofday
vsyscall which does Xen-like timekeeping, in which it gets the tsc,cpu
tuple atomically, then scales it with timing parameters from the kernel.

    J
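
Roughly, that style of lockless timekeeping read boils down to the
sketch below; the structure, names and retry scheme here are generic
illustrations, not lifted from either the Linux vsyscall or the Xen
code, and a real implementation widens the multiply to avoid overflow.

#include <stdint.h>

/* Parameters published by the kernel/hypervisor; version goes odd
 * while they are being updated, so readers retry on a torn read. */
struct time_params {
    volatile uint32_t version;
    uint64_t base_tsc;     /* TSC value at the last calibration point */
    uint64_t base_ns;      /* system time (ns) at base_tsc */
    uint32_t mul;          /* scale factor for tsc delta -> ns */
    uint8_t  shift;
};

extern uint64_t rdtsc(void);   /* placeholder for the RDTSC intrinsic */

uint64_t read_time_ns(const struct time_params *p)
{
    uint32_t ver;
    uint64_t ns;

    do {
        ver = p->version;
        ns  = p->base_ns +
              (((rdtsc() - p->base_tsc) * p->mul) >> p->shift);
    } while ((ver & 1) || ver != p->version);

    return ns;
}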

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-08-05  0:05                     ` Jeremy Fitzhardinge
@ 2009-08-05  5:35                       ` Tian, Kevin
  2009-08-06 21:13                         ` Dan Magenheimer
  0 siblings, 1 reply; 36+ messages in thread
From: Tian, Kevin @ 2009-08-05  5:35 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Keir Fraser
  Cc: Dan Magenheimer, Xen-Devel (E-mail),
	Dong, Eddie, Levon, Ian Pratt, John, Zhang, Xiantao


>From: Jeremy Fitzhardinge
>Sent: 5 August 2009 8:06
>
>On 07/24/09 01:04, Keir Fraser wrote:
>> Okay, so the issue you are worried about is not specific to 
>Xen. So how is
>> native Linux tackling this, for example?
>>   
>
>Linux will use the tsc where possible, but regularly assesses its
>perceived accuracy and will move to a different clocksource if the tsc
>appears to be playing up.  I don't think it ever assumes the tsc is
>synced between CPU/cores.

It cares. See tsc_sync.c under x86 arch, where unsynced warps
mark tsc as unstable. 

Thanks,
Kevin

>
>It allows rdtsc from usermode, but it is generally considered 
>to be very
>buggy and ill-defined behaviour.  It makes no attempt to make usermode
>rdtsc in any way meaningful.  The exception is the vgettimeofday
>vsyscall which does Xen-like timekeeping, in which it gets the tsc,cpu
>tuple atomically, then scales it with timing parameters from 
>the kernel.
>
>    J
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-08-05  5:35                       ` Tian, Kevin
@ 2009-08-06 21:13                         ` Dan Magenheimer
  2009-08-06 21:41                           ` Dan Magenheimer
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2009-08-06 21:13 UTC (permalink / raw)
  To: Tian, Kevin, Jeremy Fitzhardinge, Keir Fraser
  Cc: Ian Pratt, Xen-Devel (E-mail), Dong, Eddie, Zhang, Xiantao, John Levon

Well actually, "how Linux handles this" is subject to
a dizzying matrix of hardware-dependent, CONFIG_-dependent,
linux-boot-parameter-dependent choices
that have evolved/changed at nearly every Linux
kernel version.  While it might be useful to steal
some recent Linux code to help determine if it is
safe to build Xen system time on top of TSC on some
processors, I don't know if Linux is of much use as
a design guide for how to expose TSC to guests/apps,
especially when said guests/apps may be moving back and
forth between hardware with widely varying TSC
characteristics.

But, yes, as Kevin points out, on some recent versions
of Linux on some hardware with some CONFIG/boot-params,
the kernel does indeed try to use TSC as a reliable
foundation for delivering ticks and gettimeofday-ish
services.

> -----Original Message-----
> From: Tian, Kevin [mailto:kevin.tian@intel.com]
> Sent: Tuesday, August 04, 2009 11:36 PM
> To: Jeremy Fitzhardinge; Keir Fraser
> Cc: Dan Magenheimer; Xen-Devel (E-mail); Dong, Eddie; John
> Levon; Ian Pratt; Zhang, Xiantao
> Subject: RE: [Xen-devel] Re: TSC scaling and softtsc reprise,
> and PROPOSAL
>
>
> >From: Jeremy Fitzhardinge
> >Sent: 5 August 2009 8:06
> >
> >On 07/24/09 01:04, Keir Fraser wrote:
> >> Okay, so the issue you are worried about is not specific to
> >Xen. So how is
> >> native Linux tackling this, for example?
> >>  
> >
> >Linux will use the tsc where possible, but regularly assesses its
> >perceived accuracy and will move to a different clocksource
> if the tsc
>appears to be playing up.  I don't think it ever assumes the tsc is
> >synced between CPU/cores.
>
> It cares. See tsc_sync.c under x86 arch, where unsynced warps
> mark tsc as unstable.
>
> Thanks,
> Kevin
>
> >
> >It allows rdtsc from usermode, but it is generally considered
> >to be very
> >buggy and ill-defined behaviour.  It makes no attempt to
> make usermode
> >rdtsc in any way meaningful.  The exception is the vgettimeofday
> >vsyscall which does Xen-like timekeeping, in which it gets
> the tsc,cpu
> >tuple atomically, then scales it with timing parameters from
> >the kernel.
> >
> >    J
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Re: TSC scaling and softtsc reprise, and PROPOSAL
  2009-08-06 21:13                         ` Dan Magenheimer
@ 2009-08-06 21:41                           ` Dan Magenheimer
  0 siblings, 0 replies; 36+ messages in thread
From: Dan Magenheimer @ 2009-08-06 21:41 UTC (permalink / raw)
  To: dan.magenheimer, Tian, Kevin, Jeremy Fitzhardinge, Keir Fraser
  Cc: Ian Pratt, Xen-Devel (E-mail), Dong, Eddie, Zhang, Xiantao, John Levon

Oops, forgot to add...

>>>Linux will use the tsc where possible, but regularly
>>>assesses its

"Regularly assesses" is a big misleading... according
to my reading of the 2.6.30 code, it checks for "good
synchronization" once at boot, then after that only
ensures that things haven't gone completely whacko
by checking that multiple TSCs haven't diverged by
more than ~60msec(?).  While I suppose this will catch
the most likely divergent hardware cases, I suspect
Xen's periodically-attempt-to-sync-the-TSC code might
lull the linux kernel into complacency (and 60msec
accuracy is just not good enough for applications).

> -----Original Message-----
> From: Dan Magenheimer 
> Sent: Thursday, August 06, 2009 3:13 PM
> To: Tian, Kevin; Jeremy Fitzhardinge; Keir Fraser
> Cc: Ian Pratt; Xen-Devel (E-mail); Dong, Eddie; Zhang, Xiantao; John
> Levon
> Subject: RE: [Xen-devel] Re: TSC scaling and softtsc reprise, and
> PROPOSAL
> 
> 
> Well actually, "how Linux handles this" is subject to
> a dizzying matrix of hardware-dependent, CONFIG_-dependent,
> linux-boot-parameter-dependent choices
> that have evolved/changed at nearly every Linux
> kernel version.  While it might be useful to steal
> some recent Linux code to help determine if it is
> safe to build Xen system time on top of TSC on some
> processors, I don't know if Linux is of much use as
> a design guide for how to expose TSC to guests/apps,
> especially when said guests/apps may be moving back and
> forth between hardware with widely varying TSC
> characteristics.
> 
> But, yes, as Kevin points out, on some recent versions
> of Linux on some hardware with some CONFIG/boot-params,
> the kernel does indeed try to use TSC as a reliable
> foundation for delivering ticks and gettimeofday-ish
> services.
> 
> > -----Original Message-----
> > From: Tian, Kevin [mailto:kevin.tian@intel.com]
> > Sent: Tuesday, August 04, 2009 11:36 PM
> > To: Jeremy Fitzhardinge; Keir Fraser
> > Cc: Dan Magenheimer; Xen-Devel (E-mail); Dong, Eddie; John
> > Levon; Ian Pratt; Zhang, Xiantao
> > Subject: RE: [Xen-devel] Re: TSC scaling and softtsc reprise,
> > and PROPOSAL
> >
> >
> > >From: Jeremy Fitzhardinge
> > >Sent: 5 August 2009 8:06
> > >
> > >On 07/24/09 01:04, Keir Fraser wrote:
> > >> Okay, so the issue you are worried about is not specific to
> > >Xen. So how is
> > >> native Linux tackling this, for example?
> > >>  
> > >
> > >Linux will use the tsc where possible, but regularly assesses its
> > >perceived accuracy and will move to a different clocksource
> > if the tsc
> > >appears to be playing up.  I don't think it ever assumes
> the tsc is
> > >synced between CPU/cores.
> >
> > It cares. See tsc_sync.c under x86 arch, where unsynced warps
> > mark tsc as unstable.
> >
> > Thanks,
> > Kevin
> >
> > >
> > >It allows rdtsc from usermode, but it is generally considered
> > >to be very
> > >buggy and ill-defined behaviour.  It makes no attempt to
> > make usermode
> > >rdtsc in any way meaningful.  The exception is the vgettimeofday
> > >vsyscall which does Xen-like timekeeping, in which it gets
> > the tsc,cpu
> > >tuple atomically, then scales it with timing parameters from
> > >the kernel.
> > >
> > >    J
> > >

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2009-08-06 21:41 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-20 17:05 TSC scaling and softtsc reprise, and PROPOSAL Dan Magenheimer
2009-07-20 17:14 ` Keir Fraser
2009-07-20 20:02   ` Dan Magenheimer
2009-07-20 21:02     ` Keir Fraser
2009-07-20 23:52       ` Dan Magenheimer
2009-07-22  5:05       ` Zhang, Xiantao
2009-07-23 13:24         ` Dan Magenheimer
2009-07-23 14:54           ` Ian Pratt
2009-07-23 15:18             ` Dan Magenheimer
2009-07-23 15:29               ` Keir Fraser
2009-07-23 16:39                 ` Dan Magenheimer
2009-07-24  8:04                   ` Keir Fraser
2009-07-24 14:47                     ` Dan Magenheimer
2009-08-05  0:05                     ` Jeremy Fitzhardinge
2009-08-05  5:35                       ` Tian, Kevin
2009-08-06 21:13                         ` Dan Magenheimer
2009-08-06 21:41                           ` Dan Magenheimer
2009-07-23 15:45               ` Keir Fraser
2009-07-23 16:45                 ` Dan Magenheimer
2009-07-27 14:47             ` Dan Magenheimer
2009-07-27 14:55               ` Keir Fraser
2009-07-27 17:25                 ` Dan Magenheimer
2009-07-27 19:55                   ` Keir Fraser
2009-07-27 22:14                     ` Dan Magenheimer
2009-07-27 22:39                       ` Keir Fraser
2009-08-03 20:19                     ` Dan Magenheimer
2009-07-28  1:46               ` Zhang, Xiantao
2009-07-28 14:45                 ` Dan Magenheimer
2009-07-28 15:00                   ` Keir Fraser
2009-07-28 15:46                     ` Dan Magenheimer
2009-07-28 15:58                       ` Keir Fraser
2009-07-28 18:15                         ` Dan Magenheimer
2009-07-28 18:43                           ` Keir Fraser
2009-07-28 19:10                             ` Dan Magenheimer
2009-07-28 18:48                           ` Keir Fraser
2009-07-28  0:55           ` Zhang, Xiantao
