All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support
@ 2015-12-28 16:59 Joao Martins
  2015-12-28 16:59 ` [PATCH RFC 1/6] public/xen.h: add flags field to vcpu_time_info Joao Martins
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Joao Martins @ 2015-12-28 16:59 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Joao Martins, Keir Fraser, Jan Beulich

Hey!

I've been working on pvclock vdso support on Linux guests, and came
across Xen lacking support for PVCLOCK_TSC_STABLE_BIT flag which is
required for vdso as of the latest pvclock ABI shared with KVM.

In addition, I've found some problems which aren't necessarily visible
to kernel as the pvclock algorithm in linux keeps the highest pvti
time read among all cpus. But as is, a process using vdso gettimeofday
observes a significant amount of warps (i.e. time going backwards) and
it could be due to 1) time calibration skew in xen rendezvous
algorithm, 2) xen clocksource not in sync with TSC and 3) in
situations when guests unaware of VCPUS moving to a different PCPU.
The warps are seen more frequently on PV guests (potentially because
vcpu time infos are only updated when guest is in kernel mode, and
perhaps lack of tsc_offset?), and in rare ocasions on HVM guests. And
it is worth noting that with guests VCPUs pinned, only PV guests see
these warps. But on HVM guests specifically: such warps only occur
when one of guest VCPUs is pinned to CPU0.

PVCLOCK_TSC_STABLE_BIT is the flag telling the guest that the
vcpu_time_info (pvti) are monotonic as seen by any CPU,
and supporting it could help fixing the issue mentioned above. This
series aims to propose a solution to that and it's divided as
following:

	* Patch 1: Adds the missing flag field to vcpu_time_info.
	* Patch 2: Adds a new clocksource based on TSC
	* Patch 3, 4: Cleanups for patch 5
	* Patch 5: Implements the PVCLOCK_TSC_STABLE_BIT.
	* Patch 6: Same as 5 before but for other platform timers

I have some doubts on the correctness of Patch 6 but was the only
solution I found for supporting PVCLOCK_TSC_STABLE_BIT when using
other clocksources (i.e. != tsc). The test was running time-warp-test,
that constantly calls clock_gettime/gettimeofday on every CPU. It
measures a delta with the previous returned value and mark a warp if
it's negative. I measured it during periods of 1h and 6h and check how
many warps and their values (alongside the amount of time skew).
Measurements are included in individual patches.

Note that most of the testing has been done with Linux 4.4 in which
these warps/skew could be easily observed as vdso would use each vCPU
pvti. Though Linux 4.5 will change this behaviour and use only the
vCPU0 pvti though still requiring PVCLOCK_TSC_STABLE_BIT usage.

I've been using it for a while in the past weeks with no issues so far
though still requires testing on bad TSC machines. But I would like to
get folks comments/suggestions if this is the correct approach in
implementing this flag.

Thanks!
Joao

Joao Martins (6):
  public/xen.h: add flags field to vcpu_time_info
  x86/time: implement tsc as clocksource
  x86/time: streamline platform time init on plt_init()
  x86/time: refactor read_platform_stime()
  x86/time: implement PVCLOCK_TSC_STABLE_BIT
  x86/time: convert counter to tsc for non-tsc clocksource

 xen/arch/x86/time.c      | 141 +++++++++++++++++++++++++++++++++++++++++------
 xen/include/public/xen.h |   6 +-
 2 files changed, 128 insertions(+), 19 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC 1/6] public/xen.h: add flags field to vcpu_time_info
  2015-12-28 16:59 [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
@ 2015-12-28 16:59 ` Joao Martins
  2016-01-25 20:11   ` Konrad Rzeszutek Wilk
  2015-12-28 16:59 ` [PATCH RFC 2/6] x86/time: implement tsc as clocksource Joao Martins
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Joao Martins @ 2015-12-28 16:59 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, Jan Beulich,
	Joao Martins

This field has two possible flags (as of latest pvclock ABI
shared with KVM).

flags: bits in this field indicate extended capabilities
coordinated between the guest and the hypervisor.  Specifically
on KVM, availability of specific flags has to be checked in
0x40000001 cpuid leaf. On Xen, we don't have that but we can
still check some of the flags after registering the time info
page since a force_update_vcpu_system_time is performed.

Current flags are:

 flag bit   | cpuid bit    | meaning
-------------------------------------------------------------
            |              | time measures taken across
     0      |      24      | multiple cpus are guaranteed to
            |              | be monotonic
-------------------------------------------------------------
            |              | guest vcpu has been paused by
     1      |     N/A      | the host
            |              |
-------------------------------------------------------------

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 xen/include/public/xen.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index ff5547e..1223686 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -601,10 +601,14 @@ struct vcpu_time_info {
      */
     uint32_t tsc_to_system_mul;
     int8_t   tsc_shift;
-    int8_t   pad1[3];
+    int8_t   flags;
+    int8_t   pad1[2];
 }; /* 32 bytes */
 typedef struct vcpu_time_info vcpu_time_info_t;
 
+#define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
+#define PVCLOCK_GUEST_STOPPED	(1 << 1)
+
 struct vcpu_info {
     /*
      * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC 2/6] x86/time: implement tsc as clocksource
  2015-12-28 16:59 [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
  2015-12-28 16:59 ` [PATCH RFC 1/6] public/xen.h: add flags field to vcpu_time_info Joao Martins
@ 2015-12-28 16:59 ` Joao Martins
  2015-12-29 15:11   ` Andrew Cooper
  2015-12-28 16:59 ` [PATCH RFC 3/6] x86/time: streamline platform time init on plt_init() Joao Martins
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Joao Martins @ 2015-12-28 16:59 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Joao Martins, Keir Fraser, Jan Beulich

Introduce support for using TSC as platform time which is the highest
resolution time and most performant to get (~20 nsecs).  Though there
are also several problems associated with its usage, and there isn't a
complete (and architecturally defined) guarantee that all machines
will provide reliable and monotonic TSC across all CPUs, on different
sockets and different P/C states.  I believe Intel to be the only that
can guarantee that. For this reason it's set with less priority when
compared to HPET unless adminstrator changes "clocksource" boot option
to "tsc". Upon initializing it, we also check for time warps and
appropriate CPU features i.e.  TSC_RELIABLE, CONSTANT_TSC and
NONSTOP_TSC. And in case none of these conditions are met, we fail in
order to fallback to the next available clocksource.

It is also worth nothing that with clocksource=tsc there isn't no
synchronization with another clocksource, and I could verify that
great portion the time skew was eliminated and seeing much less time
warps happening. With HPET I used to observe ~500 warps in the period
of 1h of around 27 us, and with TSC down to 50 warps in the same
period having each warp < 100 ns. The warps still exist though but
are only related to cross CPU calibration (being how much it takes to
rendezvous with master), in which a later patch in this series aims to
solve.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 xen/arch/x86/time.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 61 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 30d52c4..c9e5c14 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -433,6 +433,64 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
 }
 
 /************************************************************
+ * PLATFORM TIMER 4: TSC
+ */
+static bool_t clocksource_is_tsc = 0;
+static u64 tsc_freq;
+static unsigned long tsc_max_warp;
+static void tsc_check_reliability(void);
+
+static int __init init_tsctimer(struct platform_timesource *pts)
+{
+    bool_t tsc_reliable = 0;
+
+    tsc_check_reliability();
+
+    if ( tsc_max_warp > 0 )
+    {
+        tsc_reliable = 0;
+        printk("TSC: didn't passed warp test\n");
+    }
+    else if ( boot_cpu_has(X86_FEATURE_TSC_RELIABLE) ||
+              ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+                boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ) )
+    {
+        tsc_reliable = 1;
+    }
+    else if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
+    {
+        tsc_reliable = (max_cstate <= 2);
+
+        if (tsc_reliable)
+            printk("TSC: no deep Cstates, deemed reliable\n");
+        else
+            printk("TSC: deep Cstates possible, so not reliable\n");
+    }
+
+    pts->frequency = tsc_freq;
+    return ( clocksource_is_tsc = tsc_reliable );
+}
+
+static u64 read_tsc(void)
+{
+    return rdtsc();
+}
+
+static void resume_tsctimer(struct platform_timesource *pts)
+{
+}
+
+static struct platform_timesource __initdata plt_tsc =
+{
+    .id = "tsc",
+    .name = "TSC",
+    .read_counter = read_tsc,
+    .counter_bits = 64,
+    .init = init_tsctimer,
+    .resume = resume_tsctimer,
+};
+
+/************************************************************
  * GENERIC PLATFORM TIMER INFRASTRUCTURE
  */
 
@@ -537,7 +595,7 @@ static void resume_platform_timer(void)
 static void __init init_platform_timer(void)
 {
     static struct platform_timesource * __initdata plt_timers[] = {
-        &plt_hpet, &plt_pmtimer, &plt_pit
+        &plt_hpet, &plt_tsc, &plt_pmtimer, &plt_pit
     };
 
     struct platform_timesource *pts = NULL;
@@ -1174,7 +1232,7 @@ static void check_tsc_warp(unsigned long tsc_khz, unsigned long *max_warp)
     }
 }
 
-static unsigned long tsc_max_warp, tsc_check_count;
+static unsigned long tsc_check_count;
 static cpumask_t tsc_check_cpumask;
 
 static void tsc_check_slave(void *unused)
@@ -1458,6 +1516,7 @@ void __init early_time_init(void)
     struct cpu_time *t = &this_cpu(cpu_time);
     u64 tmp = init_pit_and_calibrate_tsc();
 
+    tsc_freq = tmp;
     set_time_scale(&t->tsc_scale, tmp);
     t->local_tsc_stamp = boot_tsc_stamp;
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC 3/6] x86/time: streamline platform time init on plt_init()
  2015-12-28 16:59 [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
  2015-12-28 16:59 ` [PATCH RFC 1/6] public/xen.h: add flags field to vcpu_time_info Joao Martins
  2015-12-28 16:59 ` [PATCH RFC 2/6] x86/time: implement tsc as clocksource Joao Martins
@ 2015-12-28 16:59 ` Joao Martins
  2016-01-25 20:26   ` Konrad Rzeszutek Wilk
  2015-12-28 16:59 ` [PATCH RFC 4/6] x86/time: refactor read_platform_stime() Joao Martins
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Joao Martins @ 2015-12-28 16:59 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Joao Martins, Keir Fraser, Jan Beulich

And use to initialize platform time solely for clocksource=tsc,
as opposed to initializing platform overflow timer, which would
only fire in ~180 years (on 2.2 Ghz Broadwell processor).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 xen/arch/x86/time.c | 41 +++++++++++++++++++++++++++++++++--------
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index c9e5c14..df42d31 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -518,17 +518,31 @@ static s_time_t __read_platform_stime(u64 platform_time)
     return (stime_platform_stamp + scale_delta(diff, &plt_scale));
 }
 
+static void __plt_init(void)
+{
+    u64 count;
+
+    ASSERT(spin_is_locked(&platform_timer_lock));
+    count = plt_src.read_counter();
+    plt_stamp64 += (count - plt_stamp) & plt_mask;
+    plt_stamp = count;
+}
+
+static void plt_init(void)
+{
+    spin_lock_irq(&platform_timer_lock);
+    __plt_init();
+    spin_unlock_irq(&platform_timer_lock);
+}
+
 static void plt_overflow(void *unused)
 {
     int i;
-    u64 count;
     s_time_t now, plt_now, plt_wrap;
 
     spin_lock_irq(&platform_timer_lock);
 
-    count = plt_src.read_counter();
-    plt_stamp64 += (count - plt_stamp) & plt_mask;
-    plt_stamp = count;
+    __plt_init();
 
     now = NOW();
     plt_wrap = __read_platform_stime(plt_stamp64);
@@ -635,11 +649,22 @@ static void __init init_platform_timer(void)
 
     set_time_scale(&plt_scale, pts->frequency);
 
-    plt_overflow_period = scale_delta(
-        1ull << (pts->counter_bits-1), &plt_scale);
-    init_timer(&plt_overflow_timer, plt_overflow, NULL, 0);
     plt_src = *pts;
-    plt_overflow(NULL);
+
+    if ( clocksource_is_tsc )
+    {
+        plt_init();
+    }
+    else
+    {
+        plt_overflow_period = scale_delta(
+            1ull << (pts->counter_bits-1), &plt_scale);
+        init_timer(&plt_overflow_timer, plt_overflow, NULL, 0);
+        plt_overflow(NULL);
+
+        printk("Platform timer overflow period is %lu secs\n",
+               plt_overflow_period/1000000000);
+    }
 
     platform_timer_stamp = plt_stamp64;
     stime_platform_stamp = NOW();
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC 4/6] x86/time: refactor read_platform_stime()
  2015-12-28 16:59 [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
                   ` (2 preceding siblings ...)
  2015-12-28 16:59 ` [PATCH RFC 3/6] x86/time: streamline platform time init on plt_init() Joao Martins
@ 2015-12-28 16:59 ` Joao Martins
  2015-12-28 16:59 ` [PATCH RFC 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT Joao Martins
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Joao Martins @ 2015-12-28 16:59 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Joao Martins, Keir Fraser, Jan Beulich

To fetch the last read from the clocksource which was used to
calculate system_time. In the case of clocksource=tsc we will
use it to set tsc_timestamp.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 xen/arch/x86/time.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index df42d31..3f96ce6 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -509,6 +509,7 @@ static s_time_t stime_platform_stamp; /* System time at below platform time */
 static u64 platform_timer_stamp;      /* Platform time at above system time */
 static u64 plt_stamp64;          /* 64-bit platform counter stamp           */
 static u64 plt_stamp;            /* hardware-width platform counter stamp   */
+static u64 plt_stamp_counter;    /* last read since read_counter */
 static struct timer plt_overflow_timer;
 
 static s_time_t __read_platform_stime(u64 platform_time)
@@ -567,7 +568,7 @@ static void plt_overflow(void *unused)
     set_timer(&plt_overflow_timer, NOW() + plt_overflow_period);
 }
 
-static s_time_t read_platform_stime(void)
+static s_time_t read_platform_stime(u64 *stamp)
 {
     u64 count;
     s_time_t stime;
@@ -575,8 +576,11 @@ static s_time_t read_platform_stime(void)
     ASSERT(!local_irq_is_enabled());
 
     spin_lock(&platform_timer_lock);
-    count = plt_stamp64 + ((plt_src.read_counter() - plt_stamp) & plt_mask);
+    plt_stamp_counter = plt_src.read_counter();
+    count = plt_stamp64 + ((plt_stamp_counter - plt_stamp) & plt_mask);
     stime = __read_platform_stime(count);
+    if (stamp)
+        *stamp = plt_stamp_counter;
     spin_unlock(&platform_timer_lock);
 
     return stime;
@@ -694,7 +698,7 @@ void cstate_restore_tsc(void)
     if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
         return;
 
-    write_tsc(stime2tsc(read_platform_stime()));
+    write_tsc(stime2tsc(read_platform_stime(NULL)));
 }
 
 /***************************************************************************
@@ -1005,7 +1009,7 @@ int cpu_frequency_change(u64 freq)
 
     local_irq_disable();
     /* Platform time /first/, as we may be delayed by platform_timer_lock. */
-    t->stime_master_stamp = read_platform_stime();
+    t->stime_master_stamp = read_platform_stime(NULL);
     /* TSC-extrapolated time may be bogus after frequency change. */
     /*t->stime_local_stamp = get_s_time();*/
     t->stime_local_stamp = t->stime_master_stamp;
@@ -1323,7 +1327,7 @@ static void time_calibration_tsc_rendezvous(void *_r)
 
             if ( r->master_stime == 0 )
             {
-                r->master_stime = read_platform_stime();
+                r->master_stime = read_platform_stime(NULL);
                 r->master_tsc_stamp = rdtsc();
             }
             atomic_inc(&r->semaphore);
@@ -1415,7 +1419,7 @@ void init_percpu_time(void)
 
     local_irq_save(flags);
     t->local_tsc_stamp = rdtsc();
-    now = read_platform_stime();
+    now = read_platform_stime(NULL);
     local_irq_restore(flags);
 
     t->stime_master_stamp = now;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT
  2015-12-28 16:59 [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
                   ` (3 preceding siblings ...)
  2015-12-28 16:59 ` [PATCH RFC 4/6] x86/time: refactor read_platform_stime() Joao Martins
@ 2015-12-28 16:59 ` Joao Martins
  2015-12-28 16:59 ` [PATCH RFC 6/6] x86/time: convert counter to tsc for non-tsc clocksource Joao Martins
  2015-12-29 14:58 ` [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Andrew Cooper
  6 siblings, 0 replies; 19+ messages in thread
From: Joao Martins @ 2015-12-28 16:59 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Joao Martins, Keir Fraser, Jan Beulich

When using TSC as clocksource we will solely rely on TSC for updating
vcpu time infos (pvti). Right now, each vCPU takes the tsc_timestamp at
different instants meaning every EPOCH + delta. This delta is variable
depending on the time the CPU calibrates with CPU 0 (master), and will
likely be different and variable across vCPUS. This means that each VCPU
pvti won't account to its calibration error which could lead to time
going backwards, and allowing a situation where time read on VCPU B
immediately after A being smaller. While this doesn't happen a lot, I
was able to observe (for clocksource=tsc) around 50 times in an hour
having warps of < 100 ns.

This patch proposes relying on host TSC synchronization and passthrough
of the master tsc to the guest, when running on a TSC-safe platform. On
the rendezvous function we will retrieve the platform time in ns and the
last count read by the clocksource that was used to compute system time.
master will write both master_tsc_stamp and master_stime, and the other
vCPUS (slave) will use it to update their correspondent time infos.
This way we can guarantee that on a platform with a constant and
reliable TSC, that the time read on vcpu B right after A is bigger
independently of the VCPU calibration error. Since pvclock time infos
are monotonic as seen by any vCPU we set PVCLOCK_TSC_STABLE_BIT flag,
which then enables usage of VDSO on Linux. IIUC, this is similar to how
it's implemented on KVM.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 xen/arch/x86/time.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 3f96ce6..e623891 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -910,6 +910,8 @@ static void __update_vcpu_system_time(struct vcpu *v, int force)
 
     _u.tsc_timestamp = tsc_stamp;
     _u.system_time   = t->stime_local_stamp;
+    if ( clocksource_is_tsc )
+        _u.flags    |= PVCLOCK_TSC_STABLE_BIT;
 
     if ( is_hvm_domain(d) )
         _u.tsc_timestamp += v->arch.hvm_vcpu.cache_tsc_offset;
@@ -1370,9 +1372,12 @@ static void time_calibration_std_rendezvous(void *_r)
 
     if ( smp_processor_id() == 0 )
     {
+        u64 last_counter;
         while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
             cpu_relax();
-        r->master_stime = read_platform_stime();
+        r->master_stime = read_platform_stime(&last_counter);
+        if ( clocksource_is_tsc )
+            r->master_tsc_stamp = last_counter;
         mb(); /* write r->master_stime /then/ signal */
         atomic_inc(&r->semaphore);
     }
@@ -1384,7 +1389,10 @@ static void time_calibration_std_rendezvous(void *_r)
         mb(); /* receive signal /then/ read r->master_stime */
     }
 
-    c->local_tsc_stamp = rdtsc();
+    if ( clocksource_is_tsc )
+        c->local_tsc_stamp = r->master_tsc_stamp;
+    else
+        c->local_tsc_stamp = rdtsc();
     c->stime_local_stamp = get_s_time();
     c->stime_master_stamp = r->master_stime;
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC 6/6] x86/time: convert counter to tsc for non-tsc clocksource
  2015-12-28 16:59 [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
                   ` (4 preceding siblings ...)
  2015-12-28 16:59 ` [PATCH RFC 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT Joao Martins
@ 2015-12-28 16:59 ` Joao Martins
  2015-12-29 14:58 ` [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Andrew Cooper
  6 siblings, 0 replies; 19+ messages in thread
From: Joao Martins @ 2015-12-28 16:59 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Joao Martins, Keir Fraser, Jan Beulich

This commit is an attempt at suporting PVCLOCK_TSC_STABLE_BIT when
using other clocksources. master_tsc_stamp is set by converting
master_stime (returned from read_platform_stime()) and converting it
to a TSC by scaling the ns value accordingly.  We scale the system
time if TSC is reliable, and under the assumption there is a
considerable skew between platform timer and TSC. In my machine I was
measuring between -15 and 40 us between HPET and TSC.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 xen/arch/x86/time.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index e623891..fa5f948 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -910,7 +910,7 @@ static void __update_vcpu_system_time(struct vcpu *v, int force)
 
     _u.tsc_timestamp = tsc_stamp;
     _u.system_time   = t->stime_local_stamp;
-    if ( clocksource_is_tsc )
+    if ( clocksource_is_tsc || host_tsc_is_safe() )
         _u.flags    |= PVCLOCK_TSC_STABLE_BIT;
 
     if ( is_hvm_domain(d) )
@@ -1378,6 +1378,15 @@ static void time_calibration_std_rendezvous(void *_r)
         r->master_stime = read_platform_stime(&last_counter);
         if ( clocksource_is_tsc )
             r->master_tsc_stamp = last_counter;
+        else if ( host_tsc_is_safe() )
+        {
+            struct cpu_time *t = &this_cpu(cpu_time);
+            struct time_scale plt_scale_r =
+		    scale_reciprocal(t->tsc_scale);
+
+            r->master_tsc_stamp = scale_delta(r->master_stime,
+                                              &plt_scale_r);
+        }
         mb(); /* write r->master_stime /then/ signal */
         atomic_inc(&r->semaphore);
     }
@@ -1389,7 +1398,7 @@ static void time_calibration_std_rendezvous(void *_r)
         mb(); /* receive signal /then/ read r->master_stime */
     }
 
-    if ( clocksource_is_tsc )
+    if ( clocksource_is_tsc || host_tsc_is_safe() )
         c->local_tsc_stamp = r->master_tsc_stamp;
     else
         c->local_tsc_stamp = rdtsc();
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support
  2015-12-28 16:59 [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
                   ` (5 preceding siblings ...)
  2015-12-28 16:59 ` [PATCH RFC 6/6] x86/time: convert counter to tsc for non-tsc clocksource Joao Martins
@ 2015-12-29 14:58 ` Andrew Cooper
  2015-12-29 17:37   ` Joao Martins
  2015-12-30  3:47   ` Haozhong Zhang
  6 siblings, 2 replies; 19+ messages in thread
From: Andrew Cooper @ 2015-12-29 14:58 UTC (permalink / raw)
  To: Joao Martins, xen-devel; +Cc: Haozhong Zhang, Keir Fraser, Jan Beulich

On 28/12/2015 16:59, Joao Martins wrote:
> Hey!
>
> I've been working on pvclock vdso support on Linux guests, and came
> across Xen lacking support for PVCLOCK_TSC_STABLE_BIT flag which is
> required for vdso as of the latest pvclock ABI shared with KVM.

, and originally borrowed from Xen.

Please be aware that all of this was originally the Xen ABI (c/s 
1271b793, 2005) and was added to Linux (c/s 7af192c, 2008) for join use 
with KVM.  In particular, Linux c/s 424c32f1a (which introduces 'flags') 
and every subsequent change in pvclock-abi.h violates the comment at the 
top, reminding people that the hypervisors must be kept in sync.

By the looks of things, the structures are still compatible, and having 
the two in sync is in everyones best interest. The first steps here need 
to be Linux upstreaming their local modifications, and further efforts 
made to ensuring that ABI changes don't go unnoticed as far as Xen is 
concerned (entry in the maintainers file with xen-devel listed?)

> In addition, I've found some problems which aren't necessarily visible
> to kernel as the pvclock algorithm in linux keeps the highest pvti
> time read among all cpus. But as is, a process using vdso gettimeofday
> observes a significant amount of warps (i.e. time going backwards) and
> it could be due to 1) time calibration skew in xen rendezvous
> algorithm, 2) xen clocksource not in sync with TSC and 3) in
> situations when guests unaware of VCPUS moving to a different PCPU.
> The warps are seen more frequently on PV guests (potentially because
> vcpu time infos are only updated when guest is in kernel mode, and
> perhaps lack of tsc_offset?), and in rare ocasions on HVM guests. And
> it is worth noting that with guests VCPUs pinned, only PV guests see
> these warps. But on HVM guests specifically: such warps only occur
> when one of guest VCPUs is pinned to CPU0.

These are all concerning findings (especially the pinning on cpu0). 
Which version of Xen have you been developing on?

Haozhong Zhang (CC'd) found and fixed several timing related bugs as 
part of his VMX TSC Scaling support series (Message root at 
<1449435529-12989-1-git-send-email-haozhong.zhang@intel.com>). I would 
be surprised if your identified bugs and his identified bugs didn't at 
least partially overlap.  (Note that most of the series has yet to be 
applied).

As to the specific options you identify, the time calibration rendezvous 
is (or should be) a nop on modern boxes with constant/invarient/etc 
TSCs.  I have a low priority TODO item to investigating conditionally 
disabling it, as it is a scalability concern on larger boxes.  Option 2 
seems more likely.

As for option 3, on a modern system it shouldn't make any difference.  
On an older system, it can presumably only be fixed by a guest 
performing its rdtsc between the two version checks, to ensure that it 
sees a consistent timestamp and scale, along with the hypervisor bumping 
version on deschedule, and against on schedule. However, this might be 
contrary to proposed plan to have one global pv wallclock.

>
> PVCLOCK_TSC_STABLE_BIT is the flag telling the guest that the
> vcpu_time_info (pvti) are monotonic as seen by any CPU,
> and supporting it could help fixing the issue mentioned above.

Surely fixing some of these bugs are prerequisites to supporting 
TSC_STABLE_BIT ?  Either way, we should see about doing both.

>   This
> series aims to propose a solution to that and it's divided as
> following:
>
> 	* Patch 1: Adds the missing flag field to vcpu_time_info.
> 	* Patch 2: Adds a new clocksource based on TSC
> 	* Patch 3, 4: Cleanups for patch 5
> 	* Patch 5: Implements the PVCLOCK_TSC_STABLE_BIT.
> 	* Patch 6: Same as 5 before but for other platform timers
>
> I have some doubts on the correctness of Patch 6 but was the only
> solution I found for supporting PVCLOCK_TSC_STABLE_BIT when using
> other clocksources (i.e. != tsc). The test was running time-warp-test,
> that constantly calls clock_gettime/gettimeofday on every CPU. It
> measures a delta with the previous returned value and mark a warp if
> it's negative. I measured it during periods of 1h and 6h and check how
> many warps and their values (alongside the amount of time skew).
> Measurements are included in individual patches.
>
> Note that most of the testing has been done with Linux 4.4 in which
> these warps/skew could be easily observed as vdso would use each vCPU
> pvti. Though Linux 4.5 will change this behaviour and use only the
> vCPU0 pvti though still requiring PVCLOCK_TSC_STABLE_BIT usage.
>
> I've been using it for a while in the past weeks with no issues so far
> though still requires testing on bad TSC machines. But I would like to
> get folks comments/suggestions if this is the correct approach in
> implementing this flag.

If you have some test instructions, I can probably find some bad TSC 
machines amongst the selection of older hardware in the XenServer test pool.

~Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 2/6] x86/time: implement tsc as clocksource
  2015-12-28 16:59 ` [PATCH RFC 2/6] x86/time: implement tsc as clocksource Joao Martins
@ 2015-12-29 15:11   ` Andrew Cooper
  2015-12-29 17:37     ` Joao Martins
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2015-12-29 15:11 UTC (permalink / raw)
  To: Joao Martins, xen-devel; +Cc: Keir Fraser, Jan Beulich

On 28/12/2015 16:59, Joao Martins wrote:
> Introduce support for using TSC as platform time which is the highest
> resolution time and most performant to get (~20 nsecs).  Though there
> are also several problems associated with its usage, and there isn't a
> complete (and architecturally defined) guarantee that all machines
> will provide reliable and monotonic TSC across all CPUs, on different
> sockets and different P/C states.  I believe Intel to be the only that
> can guarantee that. For this reason it's set with less priority when
> compared to HPET unless adminstrator changes "clocksource" boot option
> to "tsc". Upon initializing it, we also check for time warps and
> appropriate CPU features i.e.  TSC_RELIABLE, CONSTANT_TSC and
> NONSTOP_TSC. And in case none of these conditions are met, we fail in
> order to fallback to the next available clocksource.
>
> It is also worth nothing that with clocksource=tsc there isn't no

I presume you mean "worth noting" and either "is no" or "isn't any"?

However, looking at the sentence below, do you mean "isn't any need to 
synchronise with another clocksource" ?

> synchronization with another clocksource, and I could verify that
> great portion the time skew was eliminated and seeing much less time
> warps happening. With HPET I used to observe ~500 warps in the period
> of 1h of around 27 us, and with TSC down to 50 warps in the same
> period having each warp < 100 ns. The warps still exist though but
> are only related to cross CPU calibration (being how much it takes to
> rendezvous with master), in which a later patch in this series aims to
> solve.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   xen/arch/x86/time.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 61 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> index 30d52c4..c9e5c14 100644
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -433,6 +433,64 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>   }
>   
>   /************************************************************
> + * PLATFORM TIMER 4: TSC
> + */
> +static bool_t clocksource_is_tsc = 0;

What does clocksource_is_tsc signify? It looks to be unused in this patch.

~Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support
  2015-12-29 14:58 ` [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Andrew Cooper
@ 2015-12-29 17:37   ` Joao Martins
  2016-02-22 21:14     ` Joao Martins
  2015-12-30  3:47   ` Haozhong Zhang
  1 sibling, 1 reply; 19+ messages in thread
From: Joao Martins @ 2015-12-29 17:37 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: Haozhong Zhang, Keir Fraser, Jan Beulich



On 12/29/2015 02:58 PM, Andrew Cooper wrote:
> On 28/12/2015 16:59, Joao Martins wrote:
>> Hey!
>>
>> I've been working on pvclock vdso support on Linux guests, and came
>> across Xen lacking support for PVCLOCK_TSC_STABLE_BIT flag which is
>> required for vdso as of the latest pvclock ABI shared with KVM.
> 
> , and originally borrowed from Xen.
> 
> Please be aware that all of this was originally the Xen ABI (c/s 
> 1271b793, 2005) and was added to Linux (c/s 7af192c, 2008) for join use 
> with KVM.  In particular, Linux c/s 424c32f1a (which introduces 'flags') 
> and every subsequent change in pvclock-abi.h violates the comment at the 
> top, reminding people that the hypervisors must be kept in sync.
> 
OK, I indeed was aware that this came originally from Xen. Should perhaps
include a comment similar to what you state above?

> By the looks of things, the structures are still compatible, and having 
> the two in sync is in everyones best interest. The first steps here need 
> to be Linux upstreaming their local modifications, and further efforts 
> made to ensuring that ABI changes don't go unnoticed as far as Xen is 
> concerned (entry in the maintainers file with xen-devel listed?)
Good point. Fortunately the changed part was the padding region (which was
always zeroed out on update_vcpu_system_time) so compatibility was kept.

> 
>> In addition, I've found some problems which aren't necessarily visible
>> to kernel as the pvclock algorithm in linux keeps the highest pvti
>> time read among all cpus. But as is, a process using vdso gettimeofday
>> observes a significant amount of warps (i.e. time going backwards) and
>> it could be due to 1) time calibration skew in xen rendezvous
>> algorithm, 2) xen clocksource not in sync with TSC and 3) in
>> situations when guests unaware of VCPUS moving to a different PCPU.
>> The warps are seen more frequently on PV guests (potentially because
>> vcpu time infos are only updated when guest is in kernel mode, and
>> perhaps lack of tsc_offset?), and in rare ocasions on HVM guests. And
>> it is worth noting that with guests VCPUs pinned, only PV guests see
>> these warps. But on HVM guests specifically: such warps only occur
>> when one of guest VCPUs is pinned to CPU0.
> 
> These are all concerning findings (especially the pinning on cpu0). 
> Which version of Xen have you been developing on?
> 
When I started working on this it was based on xen 4.5 (linux <= 4.4), but
lately (~ 1 month) I have been working on xen-unstable.

> Haozhong Zhang (CC'd) found and fixed several timing related bugs as 
> part of his VMX TSC Scaling support series (Message root at 
> <1449435529-12989-1-git-send-email-haozhong.zhang@intel.com>). I would 
> be surprised if your identified bugs and his identified bugs didn't at 
> least partially overlap.  (Note that most of the series has yet to be 
> applied).
I am afraid this might be a different bug: I have been basing my patches on top
of his work and also testing with a non-native vtsc_khz. (as part of the testing
requested from Boris). But I think these sort of issues might not affect in
kernel since pvclock in linux takes care of ensuring monotonicity even when
pvti's aren't guaranteed to be.

> 
> As to the specific options you identify, the time calibration rendezvous 
> is (or should be) a nop on modern boxes with constant/invarient/etc 
> TSCs.  I have a low priority TODO item to investigating conditionally 
> disabling it, as it is a scalability concern on larger boxes.  Option 2 
> seems more likely.
Hm, from what I understood it's not a nop: it's just that there is a different
rendezvous function depending on whether TSC is safe or not. And I believe there
to be where it lies the problem. The tsc_timestamp written in the pvti that is
used to calculate the delta on the guest takes into account the time that the
slave CPU rendezvous with master, thus could mark the TSC at a later instant
depending on the CPU. And this would lead to a situation such as: CPU 3 writing
an earlier tsc than say CPU 2 (thus leading to a smaller delta between both as
seen by the guest, even though TSC is guaranteed to be monotonic) and guest
would see a warp when doing a read on CPU 3 first right before CPU 2 and see
time going backwards. Patch 5 also extends on that and I followed a similar
approach to KVM.

But you suggestion of basically being a nop could makes things simpler (and
perhaps cleaner?), and eliminates potential scalability issues when host has a
large number of CPUs: on a system with a safe TSC every EPOCH it would read
platform time and "request" the vCPU to update the pvti next time it's
schedule() or continue_running() is called? This way we could remove the CPUS
spinning for master to write stime and tsc_timestamp.

> 
> As for option 3, on a modern system it shouldn't make any difference.  
> On an older system, it can presumably only be fixed by a guest 
> performing its rdtsc between the two version checks, to ensure that it 
> sees a consistent timestamp and scale, along with the hypervisor bumping 
> version on deschedule, and against on schedule. However, this might be 
> contrary to proposed plan to have one global pv wallclock.
> 
Indeed, and vdso gettimeofday/clock_gettime already do that i.e. versions checks
around rdtsc/pvti reads.

>>
>> PVCLOCK_TSC_STABLE_BIT is the flag telling the guest that the
>> vcpu_time_info (pvti) are monotonic as seen by any CPU,
>> and supporting it could help fixing the issue mentioned above.
> 
> Surely fixing some of these bugs are prerequisites to supporting 
> TSC_STABLE_BIT ?  Either way, we should see about doing both.
> 
I think so too.

>>   This
>> series aims to propose a solution to that and it's divided as
>> following:
>>
>> 	* Patch 1: Adds the missing flag field to vcpu_time_info.
>> 	* Patch 2: Adds a new clocksource based on TSC
>> 	* Patch 3, 4: Cleanups for patch 5
>> 	* Patch 5: Implements the PVCLOCK_TSC_STABLE_BIT.
>> 	* Patch 6: Same as 5 before but for other platform timers
>>
>> I have some doubts on the correctness of Patch 6 but was the only
>> solution I found for supporting PVCLOCK_TSC_STABLE_BIT when using
>> other clocksources (i.e. != tsc). The test was running time-warp-test,
>> that constantly calls clock_gettime/gettimeofday on every CPU. It
>> measures a delta with the previous returned value and mark a warp if
>> it's negative. I measured it during periods of 1h and 6h and check how
>> many warps and their values (alongside the amount of time skew).
>> Measurements are included in individual patches.
>>
>> Note that most of the testing has been done with Linux 4.4 in which
>> these warps/skew could be easily observed as vdso would use each vCPU
>> pvti. Though Linux 4.5 will change this behaviour and use only the
>> vCPU0 pvti though still requiring PVCLOCK_TSC_STABLE_BIT usage.
>>
>> I've been using it for a while in the past weeks with no issues so far
>> though still requires testing on bad TSC machines. But I would like to
>> get folks comments/suggestions if this is the correct approach in
>> implementing this flag.
> 
> If you have some test instructions, I can probably find some bad TSC 
> machines amongst the selection of older hardware in the XenServer test pool.
Thanks! I have a bad TSC machine for testing already, and these tests would be
to make sure there aren't regressions, since with a bad TSC
PVCLOCK_TSC_STABLE_BIT wouldn't be set. Most of my tests were on a Broadwell CPU
single socket.

Joao

> 
> ~Andrew
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 2/6] x86/time: implement tsc as clocksource
  2015-12-29 15:11   ` Andrew Cooper
@ 2015-12-29 17:37     ` Joao Martins
  0 siblings, 0 replies; 19+ messages in thread
From: Joao Martins @ 2015-12-29 17:37 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Keir Fraser, Jan Beulich, xen-devel

On 12/29/2015 03:11 PM, Andrew Cooper wrote:
> On 28/12/2015 16:59, Joao Martins wrote:
>> Introduce support for using TSC as platform time which is the highest
>> resolution time and most performant to get (~20 nsecs).  Though there
>> are also several problems associated with its usage, and there isn't a
>> complete (and architecturally defined) guarantee that all machines
>> will provide reliable and monotonic TSC across all CPUs, on different
>> sockets and different P/C states.  I believe Intel to be the only that
>> can guarantee that. For this reason it's set with less priority when
>> compared to HPET unless adminstrator changes "clocksource" boot option
>> to "tsc". Upon initializing it, we also check for time warps and
>> appropriate CPU features i.e.  TSC_RELIABLE, CONSTANT_TSC and
>> NONSTOP_TSC. And in case none of these conditions are met, we fail in
>> order to fallback to the next available clocksource.
>>
>> It is also worth nothing that with clocksource=tsc there isn't no
> 
> I presume you mean "worth noting" and either "is no" or "isn't any"?
> 
Yeap.

> However, looking at the sentence below, do you mean "isn't any need to 
> synchronise with another clocksource" ?
That's correct.

Will fix the commit message with these on the next version.

> 
>> synchronization with another clocksource, and I could verify that
>> great portion the time skew was eliminated and seeing much less time
>> warps happening. With HPET I used to observe ~500 warps in the period
>> of 1h of around 27 us, and with TSC down to 50 warps in the same
>> period having each warp < 100 ns. The warps still exist though but
>> are only related to cross CPU calibration (being how much it takes to
>> rendezvous with master), in which a later patch in this series aims to
>> solve.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   xen/arch/x86/time.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 61 insertions(+), 2 deletions(-)
>>
>> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
>> index 30d52c4..c9e5c14 100644
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -433,6 +433,64 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>>   }
>>   
>>   /************************************************************
>> + * PLATFORM TIMER 4: TSC
>> + */
>> +static bool_t clocksource_is_tsc = 0;
> 
> What does clocksource_is_tsc signify? It looks to be unused in this patch.
> 
Ah right, it's unused here which is wrong. It should only be added on patch 5 of
this series, which is where it gets used.

> ~Andrew
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support
  2015-12-29 14:58 ` [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Andrew Cooper
  2015-12-29 17:37   ` Joao Martins
@ 2015-12-30  3:47   ` Haozhong Zhang
  1 sibling, 0 replies; 19+ messages in thread
From: Haozhong Zhang @ 2015-12-30  3:47 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Joao Martins, Keir Fraser, Jan Beulich, xen-devel

On 12/29/15 14:58, Andrew Cooper wrote:
> On 28/12/2015 16:59, Joao Martins wrote:
> >Hey!
> >
> >I've been working on pvclock vdso support on Linux guests, and came
> >across Xen lacking support for PVCLOCK_TSC_STABLE_BIT flag which is
> >required for vdso as of the latest pvclock ABI shared with KVM.
> 
> , and originally borrowed from Xen.
> 
> Please be aware that all of this was originally the Xen ABI (c/s 1271b793,
> 2005) and was added to Linux (c/s 7af192c, 2008) for join use with KVM.  In
> particular, Linux c/s 424c32f1a (which introduces 'flags') and every
> subsequent change in pvclock-abi.h violates the comment at the top,
> reminding people that the hypervisors must be kept in sync.
> 
> By the looks of things, the structures are still compatible, and having the
> two in sync is in everyones best interest. The first steps here need to be
> Linux upstreaming their local modifications, and further efforts made to
> ensuring that ABI changes don't go unnoticed as far as Xen is concerned
> (entry in the maintainers file with xen-devel listed?)
> 
> >In addition, I've found some problems which aren't necessarily visible
> >to kernel as the pvclock algorithm in linux keeps the highest pvti
> >time read among all cpus. But as is, a process using vdso gettimeofday
> >observes a significant amount of warps (i.e. time going backwards) and
> >it could be due to 1) time calibration skew in xen rendezvous
> >algorithm, 2) xen clocksource not in sync with TSC and 3) in
> >situations when guests unaware of VCPUS moving to a different PCPU.
> >The warps are seen more frequently on PV guests (potentially because
> >vcpu time infos are only updated when guest is in kernel mode, and
> >perhaps lack of tsc_offset?), and in rare ocasions on HVM guests. And
> >it is worth noting that with guests VCPUs pinned, only PV guests see
> >these warps. But on HVM guests specifically: such warps only occur
> >when one of guest VCPUs is pinned to CPU0.
> 
> These are all concerning findings (especially the pinning on cpu0). Which
> version of Xen have you been developing on?
> 
> Haozhong Zhang (CC'd) found and fixed several timing related bugs as part of
> his VMX TSC Scaling support series (Message root at
> <1449435529-12989-1-git-send-email-haozhong.zhang@intel.com>). I would be
> surprised if your identified bugs and his identified bugs didn't at least
> partially overlap.  (Note that most of the series has yet to be applied).
> 

My VMX TSC scaling support patch series takes more concern on bugs
that can cause problems when TSC scaling is used for HVM guests. I
didn't test much for PV guests and HVM guests with pinned vcpu, so I
think Joao's patch series is fixing different timing bugs.

Haozhong

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 1/6] public/xen.h: add flags field to vcpu_time_info
  2015-12-28 16:59 ` [PATCH RFC 1/6] public/xen.h: add flags field to vcpu_time_info Joao Martins
@ 2016-01-25 20:11   ` Konrad Rzeszutek Wilk
  2016-01-26 10:31     ` Joao Martins
  0 siblings, 1 reply; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-25 20:11 UTC (permalink / raw)
  To: Joao Martins
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, xen-devel,
	Jan Beulich

On Mon, Dec 28, 2015 at 04:59:40PM +0000, Joao Martins wrote:
> This field has two possible flags (as of latest pvclock ABI
> shared with KVM).

<sigh>

Wish they had CC-ed xen-devel instead of just doing their
change
> 
> flags: bits in this field indicate extended capabilities
> coordinated between the guest and the hypervisor.  Specifically
> on KVM, availability of specific flags has to be checked in
> 0x40000001 cpuid leaf. On Xen, we don't have that but we can
> still check some of the flags after registering the time info
> page since a force_update_vcpu_system_time is performed.
> 
> Current flags are:
> 
>  flag bit   | cpuid bit    | meaning
> -------------------------------------------------------------
>             |              | time measures taken across
>      0      |      24      | multiple cpus are guaranteed to
>             |              | be monotonic
> -------------------------------------------------------------
>             |              | guest vcpu has been paused by
>      1      |     N/A      | the host
>             |              |
> -------------------------------------------------------------
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  xen/include/public/xen.h | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
> index ff5547e..1223686 100644
> --- a/xen/include/public/xen.h
> +++ b/xen/include/public/xen.h
> @@ -601,10 +601,14 @@ struct vcpu_time_info {
>       */
>      uint32_t tsc_to_system_mul;
>      int8_t   tsc_shift;
> -    int8_t   pad1[3];
> +    int8_t   flags;
> +    int8_t   pad1[2];
>  }; /* 32 bytes */
>  typedef struct vcpu_time_info vcpu_time_info_t;
>  
> +#define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
> +#define PVCLOCK_GUEST_STOPPED	(1 << 1)
> +
>  struct vcpu_info {
>      /*
>       * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
> -- 
> 2.1.4
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 3/6] x86/time: streamline platform time init on plt_init()
  2015-12-28 16:59 ` [PATCH RFC 3/6] x86/time: streamline platform time init on plt_init() Joao Martins
@ 2016-01-25 20:26   ` Konrad Rzeszutek Wilk
  2016-01-26 10:31     ` Joao Martins
  0 siblings, 1 reply; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-01-25 20:26 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, Keir Fraser, Jan Beulich, xen-devel

> +    if ( clocksource_is_tsc )
> +    {
> +        plt_init();
> +    }
> +    else
> +    {
> +        plt_overflow_period = scale_delta(
> +            1ull << (pts->counter_bits-1), &plt_scale);
> +        init_timer(&plt_overflow_timer, plt_overflow, NULL, 0);
> +        plt_overflow(NULL);
> +
> +        printk("Platform timer overflow period is %lu secs\n",
> +               plt_overflow_period/1000000000);

s/1000000000/SECONDS(1) ?

> +    }
>  
>      platform_timer_stamp = plt_stamp64;
>      stime_platform_stamp = NOW();
> -- 
> 2.1.4
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 3/6] x86/time: streamline platform time init on plt_init()
  2016-01-25 20:26   ` Konrad Rzeszutek Wilk
@ 2016-01-26 10:31     ` Joao Martins
  0 siblings, 0 replies; 19+ messages in thread
From: Joao Martins @ 2016-01-26 10:31 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Andrew Cooper, Keir Fraser, Jan Beulich, xen-devel



On 01/25/2016 08:26 PM, Konrad Rzeszutek Wilk wrote:
>> +    if ( clocksource_is_tsc )
>> +    {
>> +        plt_init();
>> +    }
>> +    else
>> +    {
>> +        plt_overflow_period = scale_delta(
>> +            1ull << (pts->counter_bits-1), &plt_scale);
>> +        init_timer(&plt_overflow_timer, plt_overflow, NULL, 0);
>> +        plt_overflow(NULL);
>> +
>> +        printk("Platform timer overflow period is %lu secs\n",
>> +               plt_overflow_period/1000000000);
> 
> s/1000000000/SECONDS(1) ?
> 
Yeah, looks much better that way.

>> +    }
>>  
>>      platform_timer_stamp = plt_stamp64;
>>      stime_platform_stamp = NOW();
>> -- 
>> 2.1.4
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 1/6] public/xen.h: add flags field to vcpu_time_info
  2016-01-25 20:11   ` Konrad Rzeszutek Wilk
@ 2016-01-26 10:31     ` Joao Martins
  0 siblings, 0 replies; 19+ messages in thread
From: Joao Martins @ 2016-01-26 10:31 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, xen-devel,
	Jan Beulich



On 01/25/2016 08:11 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Dec 28, 2015 at 04:59:40PM +0000, Joao Martins wrote:
>> This field has two possible flags (as of latest pvclock ABI
>> shared with KVM).
> 
> <sigh>
> 
> Wish they had CC-ed xen-devel instead of just doing their
> change
Indeed, Andrew was suggesting that an entry could perhaps be added to the
maintainers file with xen-devel, to avoid situations like this.


>>
>> flags: bits in this field indicate extended capabilities
>> coordinated between the guest and the hypervisor.  Specifically
>> on KVM, availability of specific flags has to be checked in
>> 0x40000001 cpuid leaf. On Xen, we don't have that but we can
>> still check some of the flags after registering the time info
>> page since a force_update_vcpu_system_time is performed.
>>
>> Current flags are:
>>
>>  flag bit   | cpuid bit    | meaning
>> -------------------------------------------------------------
>>             |              | time measures taken across
>>      0      |      24      | multiple cpus are guaranteed to
>>             |              | be monotonic
>> -------------------------------------------------------------
>>             |              | guest vcpu has been paused by
>>      1      |     N/A      | the host
>>             |              |
>> -------------------------------------------------------------
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Thanks!

>> ---
>>  xen/include/public/xen.h | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
>> index ff5547e..1223686 100644
>> --- a/xen/include/public/xen.h
>> +++ b/xen/include/public/xen.h
>> @@ -601,10 +601,14 @@ struct vcpu_time_info {
>>       */
>>      uint32_t tsc_to_system_mul;
>>      int8_t   tsc_shift;
>> -    int8_t   pad1[3];
>> +    int8_t   flags;
>> +    int8_t   pad1[2];
>>  }; /* 32 bytes */
>>  typedef struct vcpu_time_info vcpu_time_info_t;
>>  
>> +#define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
>> +#define PVCLOCK_GUEST_STOPPED	(1 << 1)
>> +
>>  struct vcpu_info {
>>      /*
>>       * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
>> -- 
>> 2.1.4
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support
  2015-12-29 17:37   ` Joao Martins
@ 2016-02-22 21:14     ` Joao Martins
  2016-02-23  8:09       ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Joao Martins @ 2016-02-22 21:14 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich; +Cc: Haozhong Zhang, Keir Fraser, xen-devel



On 12/29/2015 05:37 PM, Joao Martins wrote:
> 
> 
> On 12/29/2015 02:58 PM, Andrew Cooper wrote:
>> On 28/12/2015 16:59, Joao Martins wrote:
>>> Hey!
>>>
>>> I've been working on pvclock vdso support on Linux guests, and came
>>> across Xen lacking support for PVCLOCK_TSC_STABLE_BIT flag which is
>>> required for vdso as of the latest pvclock ABI shared with KVM.
>>
>> , and originally borrowed from Xen.
>>
>> Please be aware that all of this was originally the Xen ABI (c/s 
>> 1271b793, 2005) and was added to Linux (c/s 7af192c, 2008) for join use 
>> with KVM.  In particular, Linux c/s 424c32f1a (which introduces 'flags') 
>> and every subsequent change in pvclock-abi.h violates the comment at the 
>> top, reminding people that the hypervisors must be kept in sync.
>>
> OK, I indeed was aware that this came originally from Xen. Should perhaps
> include a comment similar to what you state above?
> 
>> By the looks of things, the structures are still compatible, and having 
>> the two in sync is in everyones best interest. The first steps here need 
>> to be Linux upstreaming their local modifications, and further efforts 
>> made to ensuring that ABI changes don't go unnoticed as far as Xen is 
>> concerned (entry in the maintainers file with xen-devel listed?)
> Good point. Fortunately the changed part was the padding region (which was
> always zeroed out on update_vcpu_system_time) so compatibility was kept.
> 
>>
>>> In addition, I've found some problems which aren't necessarily visible
>>> to kernel as the pvclock algorithm in linux keeps the highest pvti
>>> time read among all cpus. But as is, a process using vdso gettimeofday
>>> observes a significant amount of warps (i.e. time going backwards) and
>>> it could be due to 1) time calibration skew in xen rendezvous
>>> algorithm, 2) xen clocksource not in sync with TSC and 3) in
>>> situations when guests unaware of VCPUS moving to a different PCPU.
>>> The warps are seen more frequently on PV guests (potentially because
>>> vcpu time infos are only updated when guest is in kernel mode, and
>>> perhaps lack of tsc_offset?), and in rare ocasions on HVM guests. And
>>> it is worth noting that with guests VCPUs pinned, only PV guests see
>>> these warps. But on HVM guests specifically: such warps only occur
>>> when one of guest VCPUs is pinned to CPU0.
>>
>> These are all concerning findings (especially the pinning on cpu0). 
>> Which version of Xen have you been developing on?
>>
> When I started working on this it was based on xen 4.5 (linux <= 4.4), but
> lately (~ 1 month) I have been working on xen-unstable.
> 
>> Haozhong Zhang (CC'd) found and fixed several timing related bugs as 
>> part of his VMX TSC Scaling support series (Message root at 
>> <1449435529-12989-1-git-send-email-haozhong.zhang@intel.com>). I would 
>> be surprised if your identified bugs and his identified bugs didn't at 
>> least partially overlap.  (Note that most of the series has yet to be 
>> applied).
> I am afraid this might be a different bug: I have been basing my patches on top
> of his work and also testing with a non-native vtsc_khz. (as part of the testing
> requested from Boris). But I think these sort of issues might not affect in
> kernel since pvclock in linux takes care of ensuring monotonicity even when
> pvti's aren't guaranteed to be.
> 
>>
>> As to the specific options you identify, the time calibration rendezvous 
>> is (or should be) a nop on modern boxes with constant/invarient/etc 
>> TSCs.  I have a low priority TODO item to investigating conditionally 
>> disabling it, as it is a scalability concern on larger boxes.  Option 2 
>> seems more likely.
> Hm, from what I understood it's not a nop: it's just that there is a different
> rendezvous function depending on whether TSC is safe or not. And I believe there
> to be where it lies the problem. The tsc_timestamp written in the pvti that is
> used to calculate the delta on the guest takes into account the time that the
> slave CPU rendezvous with master, thus could mark the TSC at a later instant
> depending on the CPU. And this would lead to a situation such as: CPU 3 writing
> an earlier tsc than say CPU 2 (thus leading to a smaller delta between both as
> seen by the guest, even though TSC is guaranteed to be monotonic) and guest
> would see a warp when doing a read on CPU 3 first right before CPU 2 and see
> time going backwards. Patch 5 also extends on that and I followed a similar
> approach to KVM.
> 
> But you suggestion of basically being a nop could makes things simpler (and
> perhaps cleaner?), and eliminates potential scalability issues when host has a
> large number of CPUs: on a system with a safe TSC every EPOCH it would read
> platform time and "request" the vCPU to update the pvti next time it's
> schedule() or continue_running() is called? This way we could remove the CPUS
> spinning for master to write stime and tsc_timestamp.
> 
>>
>> As for option 3, on a modern system it shouldn't make any difference.  
>> On an older system, it can presumably only be fixed by a guest 
>> performing its rdtsc between the two version checks, to ensure that it 
>> sees a consistent timestamp and scale, along with the hypervisor bumping 
>> version on deschedule, and against on schedule. However, this might be 
>> contrary to proposed plan to have one global pv wallclock.
>>
> Indeed, and vdso gettimeofday/clock_gettime already do that i.e. versions checks
> around rdtsc/pvti reads.


>>>
>>> PVCLOCK_TSC_STABLE_BIT is the flag telling the guest that the
>>> vcpu_time_info (pvti) are monotonic as seen by any CPU,
>>> and supporting it could help fixing the issue mentioned above.
>>
>> Surely fixing some of these bugs are prerequisites to supporting 
>> TSC_STABLE_BIT ?  Either way, we should see about doing both.
>>
> I think so too.
> 
>>>   This
>>> series aims to propose a solution to that and it's divided as
>>> following:
>>>
>>> 	* Patch 1: Adds the missing flag field to vcpu_time_info.
>>> 	* Patch 2: Adds a new clocksource based on TSC
>>> 	* Patch 3, 4: Cleanups for patch 5
>>> 	* Patch 5: Implements the PVCLOCK_TSC_STABLE_BIT.
>>> 	* Patch 6: Same as 5 before but for other platform timers
>>>
>>> I have some doubts on the correctness of Patch 6 but was the only
>>> solution I found for supporting PVCLOCK_TSC_STABLE_BIT when using
>>> other clocksources (i.e. != tsc). The test was running time-warp-test,
>>> that constantly calls clock_gettime/gettimeofday on every CPU. It
>>> measures a delta with the previous returned value and mark a warp if
>>> it's negative. I measured it during periods of 1h and 6h and check how
>>> many warps and their values (alongside the amount of time skew).
>>> Measurements are included in individual patches.
>>>
>>> Note that most of the testing has been done with Linux 4.4 in which
>>> these warps/skew could be easily observed as vdso would use each vCPU
>>> pvti. Though Linux 4.5 will change this behaviour and use only the
>>> vCPU0 pvti though still requiring PVCLOCK_TSC_STABLE_BIT usage.
>>>
>>> I've been using it for a while in the past weeks with no issues so far
>>> though still requires testing on bad TSC machines. But I would like to
>>> get folks comments/suggestions if this is the correct approach in
>>> implementing this flag.
>>
>> If you have some test instructions, I can probably find some bad TSC 
>> machines amongst the selection of older hardware in the XenServer test pool.
> Thanks! I have a bad TSC machine for testing already, and these tests would be
> to make sure there aren't regressions, since with a bad TSC
> PVCLOCK_TSC_STABLE_BIT wouldn't be set. Most of my tests were on a Broadwell CPU
> single socket.
Ping: Or would you prefer resubmitting with the comments so far? I just didn't
do so because I didn't get comments on the main parts of this series (Patch
2,5,6). Thanks, and sorry for the noise!

Joao

>>
>> ~Andrew
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support
  2016-02-22 21:14     ` Joao Martins
@ 2016-02-23  8:09       ` Jan Beulich
  2016-02-23 12:13         ` Joao Martins
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2016-02-23  8:09 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel, Keir Fraser, Haozhong Zhang

>>> On 22.02.16 at 22:14, <joao.m.martins@oracle.com> wrote:

(Please trim you replies.)

> Ping: Or would you prefer resubmitting with the comments so far?

Either way is fine. It's on my list of thing to look at, but being RFC
it ranges priority wise behind various other things; I've been hoping
to get to it a number of times, but sadly didn't.

Jan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support
  2016-02-23  8:09       ` Jan Beulich
@ 2016-02-23 12:13         ` Joao Martins
  0 siblings, 0 replies; 19+ messages in thread
From: Joao Martins @ 2016-02-23 12:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel, Keir Fraser, Haozhong Zhang



On 02/23/2016 08:09 AM, Jan Beulich wrote:
>>>> On 22.02.16 at 22:14, <joao.m.martins@oracle.com> wrote:
> 
> (Please trim you replies.)
> 
>> Ping: Or would you prefer resubmitting with the comments so far?
> 
> Either way is fine. It's on my list of thing to look at, but being RFC
> it ranges priority wise behind various other things; I've been hoping
> to get to it a number of times, but sadly didn't.
> 
OK, I guess I'll wait then for the remaining patches. Thanks for the heads up!

Joao

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-02-23 12:13 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-28 16:59 [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
2015-12-28 16:59 ` [PATCH RFC 1/6] public/xen.h: add flags field to vcpu_time_info Joao Martins
2016-01-25 20:11   ` Konrad Rzeszutek Wilk
2016-01-26 10:31     ` Joao Martins
2015-12-28 16:59 ` [PATCH RFC 2/6] x86/time: implement tsc as clocksource Joao Martins
2015-12-29 15:11   ` Andrew Cooper
2015-12-29 17:37     ` Joao Martins
2015-12-28 16:59 ` [PATCH RFC 3/6] x86/time: streamline platform time init on plt_init() Joao Martins
2016-01-25 20:26   ` Konrad Rzeszutek Wilk
2016-01-26 10:31     ` Joao Martins
2015-12-28 16:59 ` [PATCH RFC 4/6] x86/time: refactor read_platform_stime() Joao Martins
2015-12-28 16:59 ` [PATCH RFC 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT Joao Martins
2015-12-28 16:59 ` [PATCH RFC 6/6] x86/time: convert counter to tsc for non-tsc clocksource Joao Martins
2015-12-29 14:58 ` [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Andrew Cooper
2015-12-29 17:37   ` Joao Martins
2016-02-22 21:14     ` Joao Martins
2016-02-23  8:09       ` Jan Beulich
2016-02-23 12:13         ` Joao Martins
2015-12-30  3:47   ` Haozhong Zhang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.