* [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support
@ 2016-08-24 12:43 Joao Martins
  2016-08-24 12:43 ` [PATCH v3 1/6] x86/time: refactor init_platform_time() Joao Martins
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Joao Martins @ 2016-08-24 12:43 UTC (permalink / raw)
  To: xen-devel

Hey!

This is v3 of the pvclock TSC stable bit series.

The complete changelog is on the individual patches, but overall this revision
addresses Jan's comments, plus some other changes related to the recent
monotonicity improvements in x86/time.

Series is divided as follows:

 R      * Patch 1: Small refactor around init_platform_time to reuse
                   initialization code when switching to TSC.
 U      * Patch 2: Adds a new clocksource based on TSC
 U,U    * Patch 3, 4: Adjustments for patch 5
 U      * Patch 5: Implements the PVCLOCK_TSC_STABLE_BIT
 N      * Patch 6: Document new clocksource

[ R := Reviewed-by ;; U := Updated ;; N := New ]

I kept the series structured the same, but a fundamental difference from
previous versions is that I now stop clocksource=tsc from being used at all if
CPU hotplug is possible. To facilitate review I kept that check in Patch 5 as
originally posted, whereas the clocksource itself is added in Patch 2; if
preferred I can merge these two.

The main benefit of this series is two-fold:
 
 1. Provide the guarantee of monotonic results on Xen's own system time as
 seen by any CPU when using TSC as clocksource.

 2. Provide this same guarantee to guests and thus set the
 TSC_STABLE_BIT (both FreeBSD and Linux support it), which then allows guests
 to skip the expensive monotonicity checks between PV CPU time infos. Plus, on
 Linux specifically this also means it can support the vDSO, which greatly
 increases performance (~10x) for gettimeofday and clock_gettime, since they
 no longer need a system call to get a reliable snapshot of system time. For
 reference, on my laptop the speed of gettimeofday under xen pvclock is
 ~2 Mops/sec (million ops per sec), whereas with the vDSO it's in the range
 of ~22 Mops/sec on <= 4.4 kernels and ~37 Mops/sec on >= 4.5 (see the rough
 sketch below).
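
 A loop of roughly this shape reproduces such numbers (a minimal sketch,
 illustrative only, not the exact harness I used):

    #include <stdio.h>
    #include <sys/time.h>
    #include <time.h>

    int main(void)
    {
        struct timeval tv;
        struct timespec start, end;
        unsigned long i, iters = 10 * 1000 * 1000;
        double secs;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for ( i = 0; i < iters; i++ )
            gettimeofday(&tv, NULL); /* vDSO or syscall, depending on setup */
        clock_gettime(CLOCK_MONOTONIC, &end);

        secs = (end.tv_sec - start.tv_sec) +
               (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%.1f Mops/sec\n", iters / secs / 1e6);
        return 0;
    }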
 
 I have been running a long time-warp test for the past few days on a
 dual-socket Haswell machine and haven't yet seen time going backwards.

Thanks!
Joao

Joao Martins (6):
  x86/time: refactor init_platform_time()
  x86/time: implement tsc as clocksource
  x86/time: streamline platform time init on plt_update()
  x86/time: refactor read_platform_stime()
  x86/time: implement PVCLOCK_TSC_STABLE_BIT
  docs: update clocksource option

 docs/misc/xen-command-line.markdown |   6 +-
 xen/arch/x86/platform_hypercall.c   |   3 +-
 xen/arch/x86/time.c                 | 226 +++++++++++++++++++++++++++++++-----
 xen/include/asm-x86/time.h          |   1 +
 4 files changed, 205 insertions(+), 31 deletions(-)

-- 
2.1.4



* [PATCH v3 1/6] x86/time: refactor init_platform_time()
  2016-08-24 12:43 [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
@ 2016-08-24 12:43 ` Joao Martins
  2016-08-25 10:03   ` Jan Beulich
  2016-08-24 12:43 ` [PATCH v3 2/6] x86/time: implement tsc as clocksource Joao Martins
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-24 12:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Jan Beulich

And accommodate platform time source initialization in
try_platform_timer(). This is a preparatory patch for deferring
TSC clocksource initialization to the stage where all CPUs are
up (verify_tsc_reliability init call).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes since v2:
 - Remove pointless initializer and replace it with the
 platform_time init return.
---
 xen/arch/x86/time.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index b316f23..6750e46 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -576,6 +576,24 @@ static void resume_platform_timer(void)
     plt_stamp = plt_src.read_counter();
 }
 
+static s64 __init try_platform_timer(struct platform_timesource *pts)
+{
+    s64 rc = pts->init(pts);
+
+    if ( rc <= 0 )
+        return rc;
+
+    plt_mask = (u64)~0ull >> (64 - pts->counter_bits);
+
+    set_time_scale(&plt_scale, pts->frequency);
+
+    plt_overflow_period = scale_delta(
+        1ull << (pts->counter_bits - 1), &plt_scale);
+    plt_src = *pts;
+
+    return rc;
+}
+
 static u64 __init init_platform_timer(void)
 {
     static struct platform_timesource * __initdata plt_timers[] = {
@@ -593,7 +611,7 @@ static u64 __init init_platform_timer(void)
             pts = plt_timers[i];
             if ( !strcmp(opt_clocksource, pts->id) )
             {
-                rc = pts->init(pts);
+                rc = try_platform_timer(pts);
                 break;
             }
         }
@@ -609,21 +627,13 @@ static u64 __init init_platform_timer(void)
         for ( i = 0; i < ARRAY_SIZE(plt_timers); i++ )
         {
             pts = plt_timers[i];
-            if ( (rc = pts->init(pts)) > 0 )
+            if ( (rc = try_platform_timer(pts)) > 0 )
                 break;
         }
     }
 
     BUG_ON(rc <= 0);
 
-    plt_mask = (u64)~0ull >> (64 - pts->counter_bits);
-
-    set_time_scale(&plt_scale, pts->frequency);
-
-    plt_overflow_period = scale_delta(
-        1ull << (pts->counter_bits-1), &plt_scale);
-    plt_src = *pts;
-
     printk("Platform timer is %s %s\n",
            freq_string(pts->frequency), pts->name);
 
-- 
2.1.4



* [PATCH v3 2/6] x86/time: implement tsc as clocksource
  2016-08-24 12:43 [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
  2016-08-24 12:43 ` [PATCH v3 1/6] x86/time: refactor init_platform_time() Joao Martins
@ 2016-08-24 12:43 ` Joao Martins
  2016-08-25 10:06   ` Jan Beulich
  2016-08-24 12:43 ` [PATCH v3 3/6] x86/time: streamline platform time init on plt_update() Joao Martins
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-24 12:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Jan Beulich

Recent x86/time changes improved the monotonicity of Xen timekeeping a lot,
making it much harder to observe time going backwards. However, the platform
timer can't be expected to be perfectly in sync with the TSC, so get_s_time
isn't guaranteed to always return monotonically increasing values across CPUs.
This is the case on some of the boxes I am testing with, where I sometimes
observe ~100 warps (of a few nanoseconds each) after a few hours.

This patch introduces support for using TSC as platform time source
which is the highest resolution time and most performant to get (~20
nsecs). Though there are also several problems associated with its
usage, and there isn't a complete (and architecturally defined)
guarantee that all machines will provide reliable and monotonic TSC in
all cases (I believe Intel to be the only one that can guarantee that?) For
this reason it's set with less priority when compared to HPET unless the
administrator changes the "clocksource" boot option to "tsc". Initializing
TSC clocksource requires all CPUs up to have the tsc reliability checks
performed. init_xen_time is called before all CPUs are up, and so we
would start with HPET at boot time, and switch later to TSC. The switch
then happens on verify_tsc_reliability initcall that is invoked when all
CPUs are up. When attempting to initialize TSC we also check for time
warps and if it has invariant TSC. And in case none of these conditions
are met, we keep the clocksource that was previously initialized on
init_xen_time. Note that while we deem reliable a CONSTANT_TSC with no
deep C-states, it might not always be the case, so we're conservative
and allow TSC to be used as platform timer only with invariant TSC.

Since b64438c7c ("x86/time: use correct (local) time stamp in
constant-TSC calibration fast path") updates to cpu time use local
stamps, which means platform timer is only used to seed the initial cpu
time. With clocksource=tsc there is no need to be in sync with another
clocksource, so we reseed the local/master stamps to be values of TSC
and update the platform time stamps accordingly. Time calibration is set
to 1sec after we switch to TSC, thus these stamps are reseeded to also
ensure monotonic returning values right after the point we switch to
TSC. This is also to avoid the possibility of having inconsistent
readings in this short period (i.e. until calibration fires).
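
Recall that per-CPU system time is extrapolated from these stamps roughly as
follows (condensed from the existing get_s_time_fixed(), not code added by
this patch):

    static s_time_t get_s_time_fixed(u64 at_tsc)
    {
        const struct cpu_time *t = &this_cpu(cpu_time);
        u64 tsc = at_tsc ?: rdtsc_ordered();

        return t->stamp.local_stime +
               scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
    }

so reseeding stamp.local_tsc/local_stime at the switch point is what removes
the skew between the previous platform timer and the TSC.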

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes since v2:
 - Suggest "HPET switching to TSC" only as an example as otherwise it
 would be misleading on platforms not having one.
 - Change init_tsctimer to skip all the tests and assume it's called
 only on reliable TSC conditions and no warps observed. Tidy
 initialization on verify_tsc_reliability as suggested by Konrad.
 - CONSTANT_TSC and max_cstate <= 2 case removed and only allow tsc
   clocksource in invariant TSC boxes.
 - Prefer omitting != 0 on init_platform_timer for the tsc case.
 - Change comment on init_platform_timer.
 - Add comment on plt_tsc declaration.
 - Reinit CPU time for all online cpus instead of just CPU 0.
 - Use rdtsc_ordered() as opposed to rdtsc()
 - Remove tsc_freq variable and set plt_tsc clocksource frequency
 with the refined tsc calibration.
 - Rework a bit the commit message.

Changes since v1:
 - s/printk/printk(XENLOG_INFO
 - Remove extra space on inner brackets
 - Add missing space around brackets
 - Defer TSC initialization when all CPUs are up.

Changes since RFC:
 - Spelling fixes in the commit message.
 - Remove unused clocksource_is_tsc variable and introduce it instead
 on the patch that uses it.
 - Move plt_tsc from second to last in the available clocksources.
---
 xen/arch/x86/time.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 81 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 6750e46..b2a11a8 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -475,6 +475,30 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
 }
 
 /************************************************************
+ * PLATFORM TIMER 4: TSC
+ */
+
+static s64 __init init_tsctimer(struct platform_timesource *pts)
+{
+    return pts->frequency;
+}
+
+static u64 read_tsc(void)
+{
+    return rdtsc_ordered();
+}
+
+static struct platform_timesource __initdata plt_tsc =
+{
+    .id = "tsc",
+    .name = "TSC",
+    .read_counter = read_tsc,
+    .counter_bits = 64,
+    /* Not called by init_platform_timer as it is not on the plt_timers array */
+    .init = init_tsctimer,
+};
+
+/************************************************************
  * GENERIC PLATFORM TIMER INFRASTRUCTURE
  */
 
@@ -576,6 +600,21 @@ static void resume_platform_timer(void)
     plt_stamp = plt_src.read_counter();
 }
 
+static void __init reset_platform_timer(void)
+{
+    /* Deactivate any timers running */
+    kill_timer(&plt_overflow_timer);
+    kill_timer(&calibration_timer);
+
+    /* Reset counters and stamps */
+    spin_lock_irq(&platform_timer_lock);
+    plt_stamp = 0;
+    plt_stamp64 = 0;
+    platform_timer_stamp = 0;
+    stime_platform_stamp = 0;
+    spin_unlock_irq(&platform_timer_lock);
+}
+
 static s64 __init try_platform_timer(struct platform_timesource *pts)
 {
     s64 rc = pts->init(pts);
@@ -583,6 +622,10 @@ static s64 __init try_platform_timer(struct platform_timesource *pts)
     if ( rc <= 0 )
         return rc;
 
+    /* We have a platform timesource already so reset it */
+    if ( plt_src.counter_bits != 0 )
+        reset_platform_timer();
+
     plt_mask = (u64)~0ull >> (64 - pts->counter_bits);
 
     set_time_scale(&plt_scale, pts->frequency);
@@ -604,7 +647,9 @@ static u64 __init init_platform_timer(void)
     unsigned int i;
     s64 rc = -1;
 
-    if ( opt_clocksource[0] != '\0' )
+    /* clocksource=tsc is initialized via __initcalls (when CPUs are up). */
+    if ( (opt_clocksource[0] != '\0') &&
+         (strcmp(opt_clocksource, "tsc")) )
     {
         for ( i = 0; i < ARRAY_SIZE(plt_timers); i++ )
         {
@@ -1481,6 +1526,40 @@ static int __init verify_tsc_reliability(void)
                    __func__);
             setup_clear_cpu_cap(X86_FEATURE_TSC_RELIABLE);
         }
+        else if ( !strcmp(opt_clocksource, "tsc") )
+        {
+            int cpu;
+
+            if ( try_platform_timer(&plt_tsc) <= 0 )
+                return 0;
+
+            /*
+             * Platform timer has changed and CPU time will only be updated
+             * after we set again the calibration timer, which means we need to
+             * seed again each local CPU time. At this stage TSC is known to be
+             * reliable i.e. monotonically increasing across all CPUs so this
+             * lets us remove the skew between platform timer and TSC, since
+             * these are now effectively the same.
+             */
+            for_each_online_cpu( cpu )
+            {
+                struct cpu_time *t = &per_cpu(cpu_time, cpu);
+
+                t->stamp.local_tsc = boot_tsc_stamp;
+                t->stamp.local_stime = 0;
+                t->stamp.local_stime = get_s_time_fixed(boot_tsc_stamp);
+                t->stamp.master_stime = t->stamp.local_stime;
+            }
+
+            platform_timer_stamp = plt_stamp64;
+            stime_platform_stamp = get_s_time_fixed(plt_stamp64);
+
+            printk(XENLOG_INFO "Switched to Platform timer %s TSC\n",
+                   freq_string(plt_src.frequency));
+
+            init_timer(&calibration_timer, time_calibration, NULL, 0);
+            set_timer(&calibration_timer, NOW() + EPOCH);
+        }
     }
 
     return 0;
@@ -1528,6 +1607,7 @@ void __init early_time_init(void)
 
     preinit_pit();
     tmp = init_platform_timer();
+    plt_tsc.frequency = tmp;
 
     set_time_scale(&t->tsc_scale, tmp);
     t->stamp.local_tsc = boot_tsc_stamp;
-- 
2.1.4



* [PATCH v3 3/6] x86/time: streamline platform time init on plt_update()
  2016-08-24 12:43 [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
  2016-08-24 12:43 ` [PATCH v3 1/6] x86/time: refactor init_platform_time() Joao Martins
  2016-08-24 12:43 ` [PATCH v3 2/6] x86/time: implement tsc as clocksource Joao Martins
@ 2016-08-24 12:43 ` Joao Martins
  2016-08-25 10:13   ` Jan Beulich
  2016-08-24 12:43 ` [PATCH v3 4/6] x86/time: refactor read_platform_stime() Joao Martins
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-24 12:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Jan Beulich

And use it to initialize platform time solely for clocksource=tsc,
as opposed to initializing the platform overflow timer, which would
only fire in ~180 years (on a 2.2 GHz Broadwell processor).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes since v2:
 - Remove pointless initializer and replace it with the
 platform_time init return.
 - s/plt_init/plt_update/g
---
 xen/arch/x86/time.c | 37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index b2a11a8..a03127e 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -526,17 +526,31 @@ static s_time_t __read_platform_stime(u64 platform_time)
     return (stime_platform_stamp + scale_delta(diff, &plt_scale));
 }
 
+static void __plt_update(void)
+{
+    u64 count;
+
+    ASSERT(spin_is_locked(&platform_timer_lock));
+    count = plt_src.read_counter();
+    plt_stamp64 += (count - plt_stamp) & plt_mask;
+    plt_stamp = count;
+}
+
+static void plt_update(void)
+{
+    spin_lock_irq(&platform_timer_lock);
+    __plt_update();
+    spin_unlock_irq(&platform_timer_lock);
+}
+
 static void plt_overflow(void *unused)
 {
     int i;
-    u64 count;
     s_time_t now, plt_now, plt_wrap;
 
     spin_lock_irq(&platform_timer_lock);
 
-    count = plt_src.read_counter();
-    plt_stamp64 += (count - plt_stamp) & plt_mask;
-    plt_stamp = count;
+    __plt_update();
 
     now = NOW();
     plt_wrap = __read_platform_stime(plt_stamp64);
@@ -630,10 +644,21 @@ static s64 __init try_platform_timer(struct platform_timesource *pts)
 
     set_time_scale(&plt_scale, pts->frequency);
 
-    plt_overflow_period = scale_delta(
-        1ull << (pts->counter_bits - 1), &plt_scale);
     plt_src = *pts;
 
+    if ( pts == &plt_tsc )
+    {
+        plt_update();
+    }
+    else
+    {
+        plt_overflow_period = scale_delta(
+            1ull << (pts->counter_bits - 1), &plt_scale);
+
+        printk(XENLOG_INFO "Platform timer overflow period is %lu msecs\n",
+               plt_overflow_period / MILLISECS(1));
+    }
+
     return rc;
 }
 
-- 
2.1.4



* [PATCH v3 4/6] x86/time: refactor read_platform_stime()
  2016-08-24 12:43 [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
                   ` (2 preceding siblings ...)
  2016-08-24 12:43 ` [PATCH v3 3/6] x86/time: streamline platform time init on plt_update() Joao Martins
@ 2016-08-24 12:43 ` Joao Martins
  2016-08-25 10:17   ` Jan Beulich
  2016-08-24 12:43 ` [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT Joao Martins
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-24 12:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Jan Beulich

To fetch the last read from the clocksource which was used to
calculate system_time. In the case of clocksource=tsc we will
use it to set tsc_timestamp.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes since v2:
 - s/plt_stamp_counter/plt_counter/g
 - Move conditional write of stamp out of the locked region.

Changes since v1:
 - Add missing spaces between brackets
 - Move plt_stamp_counter to read_platform_stime()
 - Add signature change in time_calibration_std_rendezvous()
---
 xen/arch/x86/time.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index a03127e..57c1b47 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -575,18 +575,22 @@ static void plt_overflow(void *unused)
     set_timer(&plt_overflow_timer, NOW() + plt_overflow_period);
 }
 
-static s_time_t read_platform_stime(void)
+static s_time_t read_platform_stime(u64 *stamp)
 {
-    u64 count;
+    u64 plt_counter, count;
     s_time_t stime;
 
     ASSERT(!local_irq_is_enabled());
 
     spin_lock(&platform_timer_lock);
-    count = plt_stamp64 + ((plt_src.read_counter() - plt_stamp) & plt_mask);
+    plt_counter = plt_src.read_counter();
+    count = plt_stamp64 + ((plt_counter - plt_stamp) & plt_mask);
     stime = __read_platform_stime(count);
     spin_unlock(&platform_timer_lock);
 
+    if ( stamp )
+        *stamp = plt_counter;
+
     return stime;
 }
 
@@ -731,7 +735,7 @@ void cstate_restore_tsc(void)
     if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
         return;
 
-    write_tsc(stime2tsc(read_platform_stime()));
+    write_tsc(stime2tsc(read_platform_stime(NULL)));
 }
 
 /***************************************************************************
@@ -1050,7 +1054,7 @@ int cpu_frequency_change(u64 freq)
 
     local_irq_disable();
     /* Platform time /first/, as we may be delayed by platform_timer_lock. */
-    t->stamp.master_stime = read_platform_stime();
+    t->stamp.master_stime = read_platform_stime(NULL);
     curr_tsc = rdtsc_ordered();
     /* TSC-extrapolated time may be bogus after frequency change. */
     /*t->stamp.local_stime = get_s_time_fixed(curr_tsc);*/
@@ -1355,7 +1359,7 @@ static void time_calibration_tsc_rendezvous(void *_r)
 
             if ( r->master_stime == 0 )
             {
-                r->master_stime = read_platform_stime();
+                r->master_stime = read_platform_stime(NULL);
                 r->master_tsc_stamp = rdtsc_ordered();
             }
             atomic_inc(&r->semaphore);
@@ -1395,7 +1399,7 @@ static void time_calibration_std_rendezvous(void *_r)
     {
         while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
             cpu_relax();
-        r->master_stime = read_platform_stime();
+        r->master_stime = read_platform_stime(NULL);
         smp_wmb(); /* write r->master_stime /then/ signal */
         atomic_inc(&r->semaphore);
     }
@@ -1434,7 +1438,7 @@ void time_latch_stamps(void)
     unsigned long flags;
 
     local_irq_save(flags);
-    ap_bringup_ref.master_stime = read_platform_stime();
+    ap_bringup_ref.master_stime = read_platform_stime(NULL);
     ap_bringup_ref.local_tsc = rdtsc_ordered();
     local_irq_restore(flags);
 
@@ -1452,7 +1456,7 @@ void init_percpu_time(void)
     t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
 
     local_irq_save(flags);
-    now = read_platform_stime();
+    now = read_platform_stime(NULL);
     tsc = rdtsc_ordered();
     local_irq_restore(flags);
 
-- 
2.1.4



* [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT
  2016-08-24 12:43 [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
                   ` (3 preceding siblings ...)
  2016-08-24 12:43 ` [PATCH v3 4/6] x86/time: refactor read_platform_stime() Joao Martins
@ 2016-08-24 12:43 ` Joao Martins
  2016-08-25 10:37   ` Jan Beulich
  2016-08-24 12:43 ` [PATCH v3 6/6] docs: update clocksource option Joao Martins
  2016-08-24 12:50 ` [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
  6 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-24 12:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Jan Beulich

This patch proposes relying on host TSC synchronization and
passthrough to the guest, when running on a TSC-safe platform. On
time_calibration we retrieve the platform time in ns and the counter
read by the clocksource that was used to compute system time. We
introduce a new rendezvous function which doesn't require
synchronization between master and slave CPUs and just reads
calibration_rendezvous struct and writes it down the stime and stamp
to the cpu_calibration struct to be used later on. We can guarantee that
on a platform with a constant and reliable TSC, that the time read on
vcpu B right after A is bigger independently of the VCPU calibration
error. Since pvclock time infos are monotonic as seen by any vCPU set
PVCLOCK_TSC_STABLE_BIT, which then enables usage of VDSO on Linux.
IIUC, this is similar to how it's implemented on KVM.

This also changes clocksource=tsc initialization to be used only when CPU
hotplug isn't meant to be performed on the host, i.e. when the maximum number
of CPUs and num_present_cpus() are the same. This is because a newly hotplugged
CPU may not satisfy the condition of having all TSCs synchronized - so while
the tsc clocksource is in use we allow offlining CPUs but not onlining any
back. I should note that I have yet to see time going backwards in a long
running test over the past few days (on a dual-socket machine), plus a few
other tests I did on older platforms.
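
To illustrate what the flag buys the guest, here is a simplified,
self-contained sketch of the guest-side pvclock read protocol (field names
follow the pvclock ABI, helper names are made up for the example; this is not
code added by this patch):

    #include <stdint.h>
    #include <x86intrin.h>             /* __rdtsc() */

    struct vcpu_time_info {            /* abridged pvclock layout */
        uint32_t version;
        uint32_t pad0;
        uint64_t tsc_timestamp;
        uint64_t system_time;
        uint32_t tsc_to_system_mul;
        int8_t   tsc_shift;
        uint8_t  flags;                /* XEN_PVCLOCK_TSC_STABLE_BIT etc. */
        uint8_t  pad1[2];
    };

    /* ns = ((delta << shift) * mul_frac) >> 32 */
    static inline uint64_t pvclock_scale(uint64_t delta, uint32_t mul_frac,
                                         int8_t shift)
    {
        if ( shift < 0 )
            delta >>= -shift;
        else
            delta <<= shift;
        return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
    }

    static uint64_t pvclock_read(const volatile struct vcpu_time_info *t)
    {
        uint32_t ver;
        uint64_t ns;

        do {
            ver = t->version;          /* odd => hypervisor update in flight */
            asm volatile ( "" ::: "memory" );
            ns = t->system_time +
                 pvclock_scale(__rdtsc() - t->tsc_timestamp,
                               t->tsc_to_system_mul, t->tsc_shift);
            asm volatile ( "" ::: "memory" );
        } while ( (ver & 1) || ver != t->version );

        return ns;
    }

Without the stable bit the guest must, on top of this, never return less than
the last value returned on any vCPU; with the bit set that global clamp can be
skipped (and Linux can expose the clock through the vDSO).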

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes since v2:
 - Add XEN_ prefix to pvclock flags.
 - Adapt time_calibration_rendezvous_tail to handle the case of setting master
 tsc/stime, and use it for the nop_rendezvous.
 - Removed hotplug CPU option that was added in v1
 - Prevent online of CPUs when clocksource is tsc.
 - Remove use_tsc_stable_bit, since the clocksource is only used to seed
 values. So instead we test if hotplug is possible, and prevent clocksource=tsc
 from being used.
 - Remove 1st paragraph of commit message since the behaviour described
   no longer applies since b64438c.

Changes since v1:
 - Change approach to skip std_rendezvous by introducing a
   nop_rendezvous
 - Change commit message reflecting the change above.
 - Use TSC_STABLE_BIT only if cpu hotplug isn't possible.
 - Add command line option to override it if no cpu hotplug is
 intended.
---
 xen/arch/x86/platform_hypercall.c |  3 +-
 xen/arch/x86/time.c               | 59 +++++++++++++++++++++++++++++++++++----
 xen/include/asm-x86/time.h        |  1 +
 3 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
index 780f22d..edef334 100644
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -631,7 +631,8 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
         if ( ret )
             break;
 
-        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) )
+        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) ||
+             host_tsc_is_clocksource() )
         {
             ret = -EINVAL;
             break;
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 57c1b47..81db255 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -480,6 +480,13 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
 
 static s64 __init init_tsctimer(struct platform_timesource *pts)
 {
+    if ( nr_cpu_ids != num_present_cpus() )
+    {
+        printk(XENLOG_INFO "TSC: CPU Hotplug intended,"
+                           "not using TSC as clocksource\n");
+        return 0;
+    }
+
     return pts->frequency;
 }
 
@@ -955,6 +962,8 @@ static void __update_vcpu_system_time(struct vcpu *v, int force)
 
     _u.tsc_timestamp = tsc_stamp;
     _u.system_time   = t->stamp.local_stime;
+    if ( host_tsc_is_clocksource() )
+        _u.flags    |= XEN_PVCLOCK_TSC_STABLE_BIT;
 
     if ( is_hvm_domain(d) )
         _u.tsc_timestamp += v->arch.hvm_vcpu.cache_tsc_offset;
@@ -1328,12 +1337,22 @@ struct calibration_rendezvous {
 };
 
 static void
-time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
+time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
+                                 bool_t master_tsc)
 {
     struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
 
-    c->local_tsc    = rdtsc_ordered();
-    c->local_stime  = get_s_time_fixed(c->local_tsc);
+    if ( master_tsc )
+    {
+        c->local_tsc    = r->master_tsc_stamp;
+        c->local_stime  = r->master_stime;
+    }
+    else
+    {
+        c->local_tsc    = rdtsc_ordered();
+        c->local_stime  = get_s_time_fixed(c->local_tsc);
+    }
+
     c->master_stime = r->master_stime;
 
     raise_softirq(TIME_CALIBRATE_SOFTIRQ);
@@ -1386,7 +1405,7 @@ static void time_calibration_tsc_rendezvous(void *_r)
         }
     }
 
-    time_calibration_rendezvous_tail(r);
+    time_calibration_rendezvous_tail(r, false);
 }
 
 /* Ordinary rendezvous function which does not modify TSC values. */
@@ -1411,7 +1430,18 @@ static void time_calibration_std_rendezvous(void *_r)
         smp_rmb(); /* receive signal /then/ read r->master_stime */
     }
 
-    time_calibration_rendezvous_tail(r);
+    time_calibration_rendezvous_tail(r, false);
+}
+
+/*
+ * Rendezvous function used when clocksource is TSC and
+ * no CPU hotplug will be performed.
+ */
+static void time_calibration_nop_rendezvous(void *rv)
+{
+    const struct calibration_rendezvous *r = rv;
+
+    time_calibration_rendezvous_tail(r, true);
 }
 
 static void (*time_calibration_rendezvous_fn)(void *) =
@@ -1423,6 +1453,13 @@ static void time_calibration(void *unused)
         .semaphore = ATOMIC_INIT(0)
     };
 
+    if ( host_tsc_is_clocksource() )
+    {
+        local_irq_disable();
+        r.master_stime = read_platform_stime(&r.master_tsc_stamp);
+        local_irq_enable();
+    }
+
     cpumask_copy(&r.cpu_calibration_map, &cpu_online_map);
 
     /* @wait=1 because we must wait for all cpus before freeing @r. */
@@ -1586,6 +1623,13 @@ static int __init verify_tsc_reliability(void)
             printk(XENLOG_INFO "Switched to Platform timer %s TSC\n",
                    freq_string(plt_src.frequency));
 
+            /*
+             * We won't do CPU Hotplug and TSC clocksource is being used which
+             * means we have a reliable TSC, plus we don't sync with any other
+             * clocksource so no need for rendezvous.
+             */
+            time_calibration_rendezvous_fn = time_calibration_nop_rendezvous;
+
             init_timer(&calibration_timer, time_calibration, NULL, 0);
             set_timer(&calibration_timer, NOW() + EPOCH);
         }
@@ -1885,6 +1929,11 @@ void pv_soft_rdtsc(struct vcpu *v, struct cpu_user_regs *regs, int rdtscp)
              (d->arch.tsc_mode == TSC_MODE_PVRDTSCP) ? d->arch.incarnation : 0;
 }
 
+bool_t host_tsc_is_clocksource(void)
+{
+    return plt_src.read_counter == read_tsc;
+}
+
 int host_tsc_is_safe(void)
 {
     return boot_cpu_has(X86_FEATURE_TSC_RELIABLE);
diff --git a/xen/include/asm-x86/time.h b/xen/include/asm-x86/time.h
index 971883a..bc3debc 100644
--- a/xen/include/asm-x86/time.h
+++ b/xen/include/asm-x86/time.h
@@ -69,6 +69,7 @@ void tsc_get_info(struct domain *d, uint32_t *tsc_mode, uint64_t *elapsed_nsec,
 
 void force_update_vcpu_system_time(struct vcpu *v);
 
+bool_t host_tsc_is_clocksource(void);
 int host_tsc_is_safe(void);
 void cpuid_time_leaf(uint32_t sub_idx, uint32_t *eax, uint32_t *ebx,
                      uint32_t *ecx, uint32_t *edx);
-- 
2.1.4



* [PATCH v3 6/6] docs: update clocksource option
  2016-08-24 12:43 [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
                   ` (4 preceding siblings ...)
  2016-08-24 12:43 ` [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT Joao Martins
@ 2016-08-24 12:43 ` Joao Martins
  2016-08-25 10:38   ` Jan Beulich
  2016-08-24 12:50 ` [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
  6 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-24 12:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Jan Beulich

Add TSC as another clocksource that can be used, plus
a mention of the maxcpus parameter and of the fact that
clocksource=tsc needs to be requested explicitly.
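
For example (illustrative values only), on a host with 8 present CPUs where
the firmware advertises additional hot-pluggable ones, the Xen boot line would
gain something like:

    clocksource=tsc maxcpus=8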

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>

New in v3.
---
 docs/misc/xen-command-line.markdown | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 3a250cb..210927f 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -264,9 +264,13 @@ minimum of 32M, subject to a suitably aligned and sized contiguous
 region of memory being available.
 
 ### clocksource
-> `= pit | hpet | acpi`
+> `= pit | hpet | acpi | tsc`
 
 If set, override Xen's default choice for the platform timer.
+Having TSC as platform timer requires being explicitly set.
+This is because TSC can only be safely used if CPU hotplug isn't performed on
+the system. In some platforms, "maxcpus" parameter may require further
+adjustment to the number of online cpus.
 
 ### cmci-threshold
 > `= <integer>`
-- 
2.1.4



* Re: [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support
  2016-08-24 12:43 [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
                   ` (5 preceding siblings ...)
  2016-08-24 12:43 ` [PATCH v3 6/6] docs: update clocksource option Joao Martins
@ 2016-08-24 12:50 ` Joao Martins
  6 siblings, 0 replies; 35+ messages in thread
From: Joao Martins @ 2016-08-24 12:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Jan Beulich

[Missed CC-ing the maintainers in the cover letter, my apologies]


* Re: [PATCH v3 1/6] x86/time: refactor init_platform_time()
  2016-08-24 12:43 ` [PATCH v3 1/6] x86/time: refactor init_platform_time() Joao Martins
@ 2016-08-25 10:03   ` Jan Beulich
  2016-08-26 14:54     ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-25 10:03 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
> And accomodate platform time source initialization in
> try_platform_time(). This is a preparatory patch for deferring
> TSC clocksource initialization to the stage where all CPUS are
> up (verify_tsc_reliability init call).
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

But not really useful to apply imo without the next patch.

Jan



* Re: [PATCH v3 2/6] x86/time: implement tsc as clocksource
  2016-08-24 12:43 ` [PATCH v3 2/6] x86/time: implement tsc as clocksource Joao Martins
@ 2016-08-25 10:06   ` Jan Beulich
  2016-08-26 15:11     ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-25 10:06 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
> This patch introduces support for using TSC as platform time source
> which is the highest resolution time and most performant to get (~20
> nsecs).

Is this indeed still the case with the recently added fences?

> Though there are also several problems associated with its
> usage, and there isn't a complete (and architecturally defined)
> guarantee that all machines will provide reliable and monotonic TSC in
> all cases (I believe Intel to be the only that can guarantee that?) For
> this reason it's set with less priority when compared to HPET unless
> adminstrator changes "clocksource" boot option to "tsc". Initializing
> TSC clocksource requires all CPUs up to have the tsc reliability checks
> performed. init_xen_time is called before all CPUs are up, and so we
> would start with HPET at boot time, and switch later to TSC. The switch
> then happens on verify_tsc_reliability initcall that is invoked when all
> CPUs are up. When attempting to initialize TSC we also check for time
> warps and if it has invariant TSC. And in case none of these conditions
> are met,

DYM "in case any of these conditions is not met"?

> we keep the clocksource that was previously initialized on
> init_xen_time. Note that while we deem reliable a CONSTANT_TSC with no
> deep C-states, it might not always be the case, so we're conservative
> and allow TSC to be used as platform timer only with invariant TSC.
> 
> Since b64438c7c ("x86/time: use correct (local) time stamp in
> constant-TSC calibration fast path") updates to cpu time use local
> stamps, which means platform timer is only used to seed the initial cpu
> time. With clocksource=tsc there is no need to be in sync with another
> clocksource, so we reseed the local/master stamps to be values of TSC
> and update the platform time stamps accordingly. Time calibration is set
> to 1sec after we switch to TSC, thus these stamps are reseeded to also
> ensure monotonic returning values right after the point we switch to
> TSC. This is also to avoid the possibility of having inconsistent
> readings in this short period (i.e. until calibration fires).
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> Changes since v2:
>  - Suggest "HPET switching to TSC" only as an example as otherwise it
>  would be misleading on platforms not having one.

What does this refer to? The description still unconditionally talks
about HPET.

>  - Change init_tsctimer to skip all the tests and assume it's called
>  only on reliable TSC conditions and no warps observed. Tidy
>  initialization on verify_tsc_reliability as suggested by Konrad.

Which means that contrary to what you say in the cover letter
there's nothing to prevent it from being used when CPU hotplug
is possible. I don't think you should skip basic sanity checks, and
place entirely on the admin the burden of knowing whether the
option is safe to use.

> +static struct platform_timesource __initdata plt_tsc =
> +{
> +    .id = "tsc",
> +    .name = "TSC",
> +    .read_counter = read_tsc,
> +    .counter_bits = 64,
> +    /* Not called by init_platform_timer as it is not on the plt_timers array */
> +    .init = init_tsctimer,

I consider this comment misleading: It took me several rounds of
looking until I understood that you don't mean to say the pointer
gets filled here only to not leave a NULL one around. I'd prefer if
you removed it.

> @@ -604,7 +647,9 @@ static u64 __init init_platform_timer(void)
>      unsigned int i;
>      s64 rc = -1;
>  
> -    if ( opt_clocksource[0] != '\0' )
> +    /* clocksource=tsc is initialized via __initcalls (when CPUs are up). */
> +    if ( (opt_clocksource[0] != '\0') &&
> +         (strcmp(opt_clocksource, "tsc")) )

Stray parentheses around a function call.

> @@ -1481,6 +1526,40 @@ static int __init verify_tsc_reliability(void)
>                     __func__);
>              setup_clear_cpu_cap(X86_FEATURE_TSC_RELIABLE);
>          }
> +        else if ( !strcmp(opt_clocksource, "tsc") )
> +        {
> +            int cpu;

unsigned int

> +            if ( try_platform_timer(&plt_tsc) <= 0 )
> +                return 0;

If you merged this into the if() above you could avoid this extra
return path, which in turn lowers the risk of something getting
added to the tail of the function and then not being taken care of
here.

> +            /*
> +             * Platform timer has changed and CPU time will only be updated
> +             * after we set again the calibration timer, which means we need to
> +             * seed again each local CPU time. At this stage TSC is known to be
> +             * reliable i.e. monotonically increasing across all CPUs so this
> +             * lets us remove the skew between platform timer and TSC, since
> +             * these are now effectively the same.
> +             */
> +            for_each_online_cpu( cpu )

Please decide whether you consider for_each_online_cpu a
control keyword like item. If you do, a blank is missing prior to the
opening parenthesis. If you don't, the blanks inside the parentheses
need to be dropped.

> +            {
> +                struct cpu_time *t = &per_cpu(cpu_time, cpu);
> +
> +                t->stamp.local_tsc = boot_tsc_stamp;
> +                t->stamp.local_stime = 0;
> +                t->stamp.local_stime = get_s_time_fixed(boot_tsc_stamp);
> +                t->stamp.master_stime = t->stamp.local_stime;
> +            }

Without any synchronization, how "good" are the chances that
this updating would race with something the other CPUs do?

> @@ -1528,6 +1607,7 @@ void __init early_time_init(void)
>  
>      preinit_pit();
>      tmp = init_platform_timer();
> +    plt_tsc.frequency = tmp;

How can this be the TSC frequency? init_platform_timer()
specifically skips the TSC. And I can see no other place where
plt_tsc.frequency gets initialized. Am I overlooking something?

Jan


* Re: [PATCH v3 3/6] x86/time: streamline platform time init on plt_update()
  2016-08-24 12:43 ` [PATCH v3 3/6] x86/time: streamline platform time init on plt_update() Joao Martins
@ 2016-08-25 10:13   ` Jan Beulich
  2016-08-26 15:12     ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-25 10:13 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
> And use to initialize platform time solely for clocksource=tsc,
> as opposed to initializing platform overflow timer, which would
> only fire in ~180 years (on 2.2 Ghz Broadwell processor).

Do we really want to risk this time period going down by two orders
of magnitude? Is there anything that's really expensive in setting the
overflow timer in the far distant future?

> Changes since v2:
>  - Remove pointless intializer and replace it with the
>  platform_time init return.

Does this really apply to this patch?

> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -526,17 +526,31 @@ static s_time_t __read_platform_stime(u64 
> platform_time)
>      return (stime_platform_stamp + scale_delta(diff, &plt_scale));
>  }
>  
> +static void __plt_update(void)

A single leading underscore only, please.

> @@ -630,10 +644,21 @@ static s64 __init try_platform_timer(struct platform_timesource *pts)
>  
>      set_time_scale(&plt_scale, pts->frequency);
>  
> -    plt_overflow_period = scale_delta(
> -        1ull << (pts->counter_bits - 1), &plt_scale);
>      plt_src = *pts;
>  
> +    if ( pts == &plt_tsc )
> +    {
> +        plt_update();
> +    }

Unnecessary braces.

Jan



* Re: [PATCH v3 4/6] x86/time: refactor read_platform_stime()
  2016-08-24 12:43 ` [PATCH v3 4/6] x86/time: refactor read_platform_stime() Joao Martins
@ 2016-08-25 10:17   ` Jan Beulich
  2016-08-26 15:13     ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-25 10:17 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
> To fetch the last read from the clocksource which was used to
> calculate system_time.

DYM "To allow the caller to fetch ..."?

> In the case of clocksource=tsc we will
> use it to set tsc_timestamp.

I don't see how this relates to the patch here.

The change itself looks reasonable assuming it's a prereq for a
subsequent patch.

Jan



* Re: [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT
  2016-08-24 12:43 ` [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT Joao Martins
@ 2016-08-25 10:37   ` Jan Beulich
  2016-08-26 15:44     ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-25 10:37 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
> This patch proposes relying on host TSC synchronization and
> passthrough to the guest, when running on a TSC-safe platform. On
> time_calibration we retrieve the platform time in ns and the counter
> read by the clocksource that was used to compute system time. We
> introduce a new rendezous function which doesn't require
> synchronization between master and slave CPUS and just reads
> calibration_rendezvous struct and writes it down the stime and stamp
> to the cpu_calibration struct to be used later on. We can guarantee that
> on a platform with a constant and reliable TSC, that the time read on
> vcpu B right after A is bigger independently of the VCPU calibration
> error. Since pvclock time infos are monotonic as seen by any vCPU set
> PVCLOCK_TSC_STABLE_BIT, which then enables usage of VDSO on Linux.
> IIUC, this is similar to how it's implemented on KVM.

Without any tools side change, how is it guaranteed that a guest
which observed the stable bit won't get migrated to a host not
providing that guarantee?

> Changes since v2:
>  - Add XEN_ prefix to pvclock flags.
>  - Adapter time_calibration_rendezvous_tail to have the case of setting master
>  tsc/stime and use it for the nop_rendezvous.
>  - Removed hotplug CPU option that was added in v1
>  - Prevent online of CPUs when clocksource is tsc.

Some of the above listed changes don't seem to belong here, but
rather in one of the earlier patches.

> --- a/xen/arch/x86/platform_hypercall.c
> +++ b/xen/arch/x86/platform_hypercall.c
> @@ -631,7 +631,8 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
>          if ( ret )
>              break;
>  
> -        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) )
> +        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) ||
> +             host_tsc_is_clocksource() )

I have to admit that I consider the "host" here confusing: What other
TSCs could one think of on x86? Maybe clocksource_is_tsc()?

> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -480,6 +480,13 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>  
>  static s64 __init init_tsctimer(struct platform_timesource *pts)
>  {
> +    if ( nr_cpu_ids != num_present_cpus() )
> +    {
> +        printk(XENLOG_INFO "TSC: CPU Hotplug intended,"
> +                           "not using TSC as clocksource\n");

Please don't split log messages across lines. Keep the XENLOG_INFO
on one line, and then the whole literal on the next.

Also you/we will need to take some measures to avoid this triggering
on systems where extra MADT entries exist just as dummies (perhaps
to ease firmware implementation, as was the case prior to commit
0875433389 ["hvmloader: limit CPUs exposed to guests"] with our
own ACPI tables we present to HVM guests).

> @@ -1328,12 +1337,22 @@ struct calibration_rendezvous {
>  };
>  
>  static void
> -time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
> +time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
> +                                 bool_t master_tsc)

Please use plain"bool" in new code. And then I'm not convinced this
refactoring is better than simply having the new rendezvous function
not call time_calibration_rendezvous_tail().

>  {
>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>  
> -    c->local_tsc    = rdtsc_ordered();
> -    c->local_stime  = get_s_time_fixed(c->local_tsc);
> +    if ( master_tsc )
> +    {
> +        c->local_tsc    = r->master_tsc_stamp;

Doesn't this require the TSCs to run in perfect sync (not even off
wrt each other by a single cycle)? Is such even possible on multi
socket systems? I.e. how would multiple chips get into such a
mode in the first place, considering their TSCs can't possibly start
ticking at exactly the same (perhaps even sub-)nanosecond?

> +static void time_calibration_nop_rendezvous(void *rv)
> +{
> +    const struct calibration_rendezvous *r = rv;
> +
> +    time_calibration_rendezvous_tail(r, true);
>  }

I don't see what you need the local variable for, unless you
stopped calling time_calibration_rendezvous_tail() as suggested
above.

Jan


* Re: [PATCH v3 6/6] docs: update clocksource option
  2016-08-24 12:43 ` [PATCH v3 6/6] docs: update clocksource option Joao Martins
@ 2016-08-25 10:38   ` Jan Beulich
  2016-08-26 15:13     ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-25 10:38 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
> Add TSC as another clocksource that can be used, plus
> a mention to the maxcpus parameter and that it requires
> being explicitly set.

This belongs in the patch introducing the option.

Jan



* Re: [PATCH v3 1/6] x86/time: refactor init_platform_time()
  2016-08-25 10:03   ` Jan Beulich
@ 2016-08-26 14:54     ` Joao Martins
  0 siblings, 0 replies; 35+ messages in thread
From: Joao Martins @ 2016-08-26 14:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel



On 08/25/2016 11:03 AM, Jan Beulich wrote:
>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>> And accomodate platform time source initialization in
>> try_platform_time(). This is a preparatory patch for deferring
>> TSC clocksource initialization to the stage where all CPUS are
>> up (verify_tsc_reliability init call).
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> But not really useful to apply imo without the next patch.
> 
I also agree with you. Thanks!


* Re: [PATCH v3 2/6] x86/time: implement tsc as clocksource
  2016-08-25 10:06   ` Jan Beulich
@ 2016-08-26 15:11     ` Joao Martins
  2016-08-29  9:36       ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-26 15:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 08/25/2016 11:06 AM, Jan Beulich wrote:
>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>> This patch introduces support for using TSC as platform time source
>> which is the highest resolution time and most performant to get (~20
>> nsecs).
> 
> Is this indeed still the case with the recently added fences?
Hmm, maybe not as fast with the added fences, but definitely still the fastest
time source. If you prefer I can remove the mention of the time taken.
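
For reference, the TSC read now effectively boils down to something like this
(sketch only; the real rdtsc_ordered() applies a suitable fence via
alternatives depending on the CPU):

    static inline uint64_t tsc_read_ordered(void)
    {
        uint32_t lo, hi;

        /* Keep the TSC read from being reordered before earlier loads. */
        asm volatile ( "lfence; rdtsc" : "=a" (lo), "=d" (hi) :: "memory" );
        return ((uint64_t)hi << 32) | lo;
    }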

> 
>> Though there are also several problems associated with its
>> usage, and there isn't a complete (and architecturally defined)
>> guarantee that all machines will provide reliable and monotonic TSC in
>> all cases (I believe Intel to be the only that can guarantee that?) For
>> this reason it's set with less priority when compared to HPET unless
>> adminstrator changes "clocksource" boot option to "tsc". Initializing
>> TSC clocksource requires all CPUs up to have the tsc reliability checks
>> performed. init_xen_time is called before all CPUs are up, and so we
>> would start with HPET at boot time, and switch later to TSC. The switch
>> then happens on verify_tsc_reliability initcall that is invoked when all
>> CPUs are up. When attempting to initialize TSC we also check for time
>> warps and if it has invariant TSC. And in case none of these conditions
>> are met,
> 
> DYM "in case any of these conditions is not met"?
Yeah, I will change it. My English isn't at its best these days.

> 
>> we keep the clocksource that was previously initialized on
>> init_xen_time. Note that while we deem reliable a CONSTANT_TSC with no
>> deep C-states, it might not always be the case, so we're conservative
>> and allow TSC to be used as platform timer only with invariant TSC.
>>
>> Since b64438c7c ("x86/time: use correct (local) time stamp in
>> constant-TSC calibration fast path") updates to cpu time use local
>> stamps, which means platform timer is only used to seed the initial cpu
>> time. With clocksource=tsc there is no need to be in sync with another
>> clocksource, so we reseed the local/master stamps to be values of TSC
>> and update the platform time stamps accordingly. Time calibration is set
>> to 1sec after we switch to TSC, thus these stamps are reseeded to also
>> ensure monotonic returning values right after the point we switch to
>> TSC. This is also to avoid the possibility of having inconsistent
>> readings in this short period (i.e. until calibration fires).
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> Cc: Jan Beulich <jbeulich@suse.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>> Changes since v2:
>>  - Suggest "HPET switching to TSC" only as an example as otherwise it
>>  would be misleading on platforms not having one.
> 
> What does this refer to? The description still unconditionally talks
> about HPET.
I remember adding "For example" in the beginning of the sentence. This was clearly a
distraction on my end (you also found another small mistake in the changelog of patch
3, despite having addressed the comment).

> 
>>  - Change init_tsctimer to skip all the tests and assume it's called
>>  only on reliable TSC conditions and no warps observed. Tidy
>>  initialization on verify_tsc_reliability as suggested by Konrad.
> 
> Which means that contrary to what you say in the cover letter
> there's nothing to prevent it from being used when CPU hotplug
> is possible.
Probably the cover letter wasn't completely clear on this. I mention that I split it
between Patch 2 and 5 (the intent was easier review), and you can see that I add the
check in patch 5, plus prevent onlining of CPUs too.

> I don't think you should skip basic sanity checks, and
> place entirely on the admin the burden of knowing whether the
> option is safe to use.
Neither do I think it should be skipped; we mainly don't duplicate these checks. In
the end I am only running init_tsctimer inside the TSC_RELIABLE block and only if no
warps are observed, so putting the checks in init_tsctimer would just duplicate the
ones already done. Or would you still prefer it as done in the previous version, i.e.
all checks in both init_tsctimer and verify_tsc_reliability?


>> +static struct platform_timesource __initdata plt_tsc =
>> +{
>> +    .id = "tsc",
>> +    .name = "TSC",
>> +    .read_counter = read_tsc,
>> +    .counter_bits = 64,
>> +    /* Not called by init_platform_timer as it is not on the plt_timers array */
>> +    .init = init_tsctimer,
> 
> I consider this comment misleading: It took me several rounds of
> looking until I understood that you don't mean to say the pointer
> gets filled here only to not leave a NULL one around. I'd prefer if
> you removed it.
> 
OK.

>> @@ -604,7 +647,9 @@ static u64 __init init_platform_timer(void)
>>      unsigned int i;
>>      s64 rc = -1;
>>  
>> -    if ( opt_clocksource[0] != '\0' )
>> +    /* clocksource=tsc is initialized via __initcalls (when CPUs are up). */
>> +    if ( (opt_clocksource[0] != '\0') &&
>> +         (strcmp(opt_clocksource, "tsc")) )
> 
> Stray parentheses around a function call.
Fixed.

> 
>> @@ -1481,6 +1526,40 @@ static int __init verify_tsc_reliability(void)
>>                     __func__);
>>              setup_clear_cpu_cap(X86_FEATURE_TSC_RELIABLE);
>>          }
>> +        else if ( !strcmp(opt_clocksource, "tsc") )
>> +        {
>> +            int cpu;
> 
> unsigned int
Fixed.

> 
>> +            if ( try_platform_timer(&plt_tsc) <= 0 )
>> +                return 0;
> 
> If you merged this into the if() above you could avoid this extra
> return path, which in turn lowers the risk of something getting
> added to the tail of the function and then not being taken care of
> here.
Good point, Fixed.

> 
>> +            /*
>> +             * Platform timer has changed and CPU time will only be updated
>> +             * after we set again the calibration timer, which means we need to
>> +             * seed again each local CPU time. At this stage TSC is known to be
>> +             * reliable i.e. monotonically increasing across all CPUs so this
>> +             * lets us remove the skew between platform timer and TSC, since
>> +             * these are now effectively the same.
>> +             */
>> +            for_each_online_cpu( cpu )
> 
> Please decide whether you consider for_each_online_cpu a
> control keyword like item. If you do, a blank is missing prior to the
> opening parenthesis. If you don't, the blanks inside the parentheses
> need to be dropped.
I will go with the control-keyword-like style.

>> +            {
>> +                struct cpu_time *t = &per_cpu(cpu_time, cpu);
>> +
>> +                t->stamp.local_tsc = boot_tsc_stamp;
>> +                t->stamp.local_stime = 0;
>> +                t->stamp.local_stime = get_s_time_fixed(boot_tsc_stamp);
>> +                t->stamp.master_stime = t->stamp.local_stime;
>> +            }
> 
> Without any synchronization, how "good" are the chances that
> this updating would race with something the other CPUs do?

I assumed that at this stage, while initcalls are still being invoked, it is safe to
update all CPUs' time infos, but maybe that's a misconception. I can move this part
into one function, run it via on_selected_cpus(), and wait until all cpu_time
structures get updated. Would that perhaps be better? A rough sketch of what I mean
is below.
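
Something along these lines (an untested sketch only; the helper name is made up,
the rest is taken from the patch context):

static void __init reseed_cpu_time(void *unused)
{
    struct cpu_time *t = &this_cpu(cpu_time);

    /* Seed local time from the TSC value recorded at boot. */
    t->stamp.local_tsc = boot_tsc_stamp;
    t->stamp.local_stime = 0;
    t->stamp.local_stime = get_s_time_fixed(boot_tsc_stamp);
    t->stamp.master_stime = t->stamp.local_stime;
}

and then in verify_tsc_reliability(), after try_platform_timer(&plt_tsc) succeeds:

    on_selected_cpus(&cpu_online_map, reseed_cpu_time, NULL, 1);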

> 
>> @@ -1528,6 +1607,7 @@ void __init early_time_init(void)
>>  
>>      preinit_pit();
>>      tmp = init_platform_timer();
>> +    plt_tsc.frequency = tmp;
> 
> How can this be the TSC frequency? init_platform_timer()
> specifically skips the TSC. And I can see no other place where
> plt_tsc.frequency gets initialized. Am I overlooking something?
> 
That's the calibrated TSC frequency. Previously I was setting pts->frequency in
init_tsctimer through a temporary variable called tsc_freq, so I thought I could just
drop the variable and set plt_tsc directly. The difference from previous versions,
though, is that since commit 9334029 this value is returned from the platform time
source's init() and calibrated against the platform timer, instead of always against
the PIT.

Joao


* Re: [PATCH v3 3/6] x86/time: streamline platform time init on plt_update()
  2016-08-25 10:13   ` Jan Beulich
@ 2016-08-26 15:12     ` Joao Martins
  2016-08-29  9:41       ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-26 15:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 08/25/2016 11:13 AM, Jan Beulich wrote:
>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>> And use to initialize platform time solely for clocksource=tsc,
>> as opposed to initializing platform overflow timer, which would
>> only fire in ~180 years (on 2.2 Ghz Broadwell processor).
> 
> Do we really want to risk this time period going down by two orders
> of magnitude? Is there anything that's really expensive in setting the
> overflow timer in the far distant future?
It wasn't about cost, but rather about setting the timer for such a distant future. I
could decrease it to a year, a month or a day. But do you think we really need that
overflow handling for TSC?

>> Changes since v2:
>>  - Remove pointless intializer and replace it with the
>>  platform_time init return.
> 
> Does this really apply to this patch?
Oh no, the comment should have been something like:

"Remove clocksource_is_tsc in favor of comparing pts against plt_tsc"

> 
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -526,17 +526,31 @@ static s_time_t __read_platform_stime(u64 
>> platform_time)
>>      return (stime_platform_stamp + scale_delta(diff, &plt_scale));
>>  }
>>  
>> +static void __plt_update(void)
> 
> A single leading underscore only, please.
Fixed.

> 
>> @@ -630,10 +644,21 @@ static s64 __init try_platform_timer(struct platform_timesource *pts)
>>  
>>      set_time_scale(&plt_scale, pts->frequency);
>>  
>> -    plt_overflow_period = scale_delta(
>> -        1ull << (pts->counter_bits - 1), &plt_scale);
>>      plt_src = *pts;
>>  
>> +    if ( pts == &plt_tsc )
>> +    {
>> +        plt_update();
>> +    }
> 
> Unnecessary braces.
Fixed.


* Re: [PATCH v3 4/6] x86/time: refactor read_platform_stime()
  2016-08-25 10:17   ` Jan Beulich
@ 2016-08-26 15:13     ` Joao Martins
  2016-08-29  9:42       ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-26 15:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel



On 08/25/2016 11:17 AM, Jan Beulich wrote:
>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>> To fetch the last read from the clocksource which was used to
>> calculate system_time.
> 
> DYM "To allow the caller to fetch ..."?
Yep, sounds better that way.

> 
>> In the case of clocksource=tsc we will
>> use it to set tsc_timestamp.
> 
> I don't see how this relates to the patch here.
> 
> The change itself looks reasonable assuming its a prereq for a
> subsequent patch.
OK, I can remove it, and could perhaps add this instead?

"This is a prerequisite for a subsequent patch that will use this last read."

Joao


* Re: [PATCH v3 6/6] docs: update clocksource option
  2016-08-25 10:38   ` Jan Beulich
@ 2016-08-26 15:13     ` Joao Martins
  0 siblings, 0 replies; 35+ messages in thread
From: Joao Martins @ 2016-08-26 15:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel



On 08/25/2016 11:38 AM, Jan Beulich wrote:
>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>> Add TSC as another clocksource that can be used, plus
>> a mention to the maxcpus parameter and that it requires
>> being explicitly set.
> 
> This belongs in the patch introducing the option.

OK, I merged this back into patch 2.

Joao


* Re: [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT
  2016-08-25 10:37   ` Jan Beulich
@ 2016-08-26 15:44     ` Joao Martins
  2016-08-29 10:06       ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-26 15:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>> This patch proposes relying on host TSC synchronization and
>> passthrough to the guest, when running on a TSC-safe platform. On
>> time_calibration we retrieve the platform time in ns and the counter
>> read by the clocksource that was used to compute system time. We
>> introduce a new rendezous function which doesn't require
>> synchronization between master and slave CPUS and just reads
>> calibration_rendezvous struct and writes it down the stime and stamp
>> to the cpu_calibration struct to be used later on. We can guarantee that
>> on a platform with a constant and reliable TSC, that the time read on
>> vcpu B right after A is bigger independently of the VCPU calibration
>> error. Since pvclock time infos are monotonic as seen by any vCPU set
>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of VDSO on Linux.
>> IIUC, this is similar to how it's implemented on KVM.
> 
> Without any tools side change, how is it guaranteed that a guest
> which observed the stable bit won't get migrated to a host not
> providing that guarantee?
Do you want to prevent migration in such cases? The worst that can happen is that the
guest might need to fall back to a system call when it sees this bit at 0, and it
would keep doing so for as long as the bit stays 0.

>> Changes since v2:
>>  - Add XEN_ prefix to pvclock flags.
>>  - Adapter time_calibration_rendezvous_tail to have the case of setting master
>>  tsc/stime and use it for the nop_rendezvous.
>>  - Removed hotplug CPU option that was added in v1
>>  - Prevent online of CPUs when clocksource is tsc.
> 
> Some of the above listed changes don't seem to belong here, but
> rather in one of the earlier patches.
OK - as mentioned in patch two this was intended. Will merge these changes into patch 2.

> 
>> --- a/xen/arch/x86/platform_hypercall.c
>> +++ b/xen/arch/x86/platform_hypercall.c
>> @@ -631,7 +631,8 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
>>          if ( ret )
>>              break;
>>  
>> -        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) )
>> +        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) ||
>> +             host_tsc_is_clocksource() )
> 
> I have to admit that I consider the "host" here confusing: What other
> TSCs could one think of on x86? Maybe clocksource_is_tsc()?
Hmm, I didn't think of any other TSC; I was merely following the existing naming
style in the header ("host_tsc_is_safe()"). I am also fine with clocksource_is_tsc().

> 
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -480,6 +480,13 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>>  
>>  static s64 __init init_tsctimer(struct platform_timesource *pts)
>>  {
>> +    if ( nr_cpu_ids != num_present_cpus() )
>> +    {
>> +        printk(XENLOG_INFO "TSC: CPU Hotplug intended,"
>> +                           "not using TSC as clocksource\n");
> 
> Please don't split log messages across lines. Keep the XENLOG_INFO
> on one line, and then the whole literal on the next.
> 
Fixed.

> Also you/we will need to take some measures to avoid this triggering
> on systems where extra MADT entries exist just as dummies (perhaps
> to ease firmware implementation, as was the case prior to commit
> 0875433389 ["hvmloader: limit CPUs exposed to guests"] with our
> own ACPI tables we present to HVM guests).
OK. I think my laptop might be one of those, but I need to check (at least I do need
to adjust maxcpus= to use clocksource=tsc there, but that's the only such case; on
other boxes I don't need to).

> 
>> @@ -1328,12 +1337,22 @@ struct calibration_rendezvous {
>>  };
>>  
>>  static void
>> -time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
>> +time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
>> +                                 bool_t master_tsc)
> 
> Please use plain"bool" in new code.
OK.

> And then I'm not convinced this
> refactoring is better than simply having the new rendezvous function
> not call time_calibration_rendezvous_tail().
I will move it to the nop_rendezvous.

> 
>>  {
>>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>>  
>> -    c->local_tsc    = rdtsc_ordered();
>> -    c->local_stime  = get_s_time_fixed(c->local_tsc);
>> +    if ( master_tsc )
>> +    {
>> +        c->local_tsc    = r->master_tsc_stamp;
> 
> Doesn't this require the TSCs to run in perfect sync (not even off
> wrt each other by a single cycle)? Is such even possible on multi
> socket systems? I.e. how would multiple chips get into such a
> mode in the first place, considering their TSCs can't possibly start
> ticking at exactly the same (perhaps even sub-)nanosecond?
They do need to be in sync across sockets, otherwise this wouldn't work. Invariant
TSC only refers to cores in a package, but multi-socket behaviour is up to board
vendors (no manuals mention this guarantee across sockets). That's one of the reasons
TSC is such a burden :(

Looking at datasheets (for the oldest processor I was testing this on), I found this note:

"In order to ensure Timestamp Counter (TSC) synchronization across sockets
in multi-socket systems, the RESET# deassertion edge should arrive at the same BCLK
rising edge at both sockets and should meet the Tsu and Th requirement of 600ps
relative to BCLK, as outlined in Table 2-26."

[0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63,
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5600-vol-1-datasheet.pdf

The BCLK looks to be the global reference clock shared across sockets IIUC used in
the PLLs in the individual packages (to generate the signal where the TSC is
extrapolated from). ( Please read it with a grain of salt, as I may be doing wrong
readings of these datasheets ). But If it was a box with TSC skewed among sockets,
wouldn't we see that at boot time in the tsc warp test? Or maybe TSC sync check isn't
potentially fast enough to catch any oddities? Our docs
(https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also seem to mention
something along these lines on multi-socket systems. And Linux tsc code seems to
assume that Intel boxes have synchronized TSCs across sockets [1] and that the
exceptions cases should mark tsc=skewed (we also have such parameter).

[1] arch/x86/kernel/tsc.c#L1094

As reassurance I've been running tests for days on end (currently almost a week on a
2-socket system) and I'll keep them running to see if they catch any issues or time
going backwards. I could also run them on the biggest boxes we have, with 8 sockets.
But still, that would represent only a tiny fraction of what x86 has available these
days.

Other than the things above I am not sure how to go about this :( Should we start
adjusting the TSCs if we find disparities or skew is observed in the long run? Or
only expose this flag when all vCPUs run on the same package? Hmm, what's your take
on this? I'd appreciate your feedback.

>> +static void time_calibration_nop_rendezvous(void *rv)
>> +{
>> +    const struct calibration_rendezvous *r = rv;
>> +
>> +    time_calibration_rendezvous_tail(r, true);
>>  }
> 
> I don't see what you need the local variable for, unless you
> stopped calling time_calibration_rendezvous_tail() as suggested
> above.
I like the one above as you suggested.


* Re: [PATCH v3 2/6] x86/time: implement tsc as clocksource
  2016-08-26 15:11     ` Joao Martins
@ 2016-08-29  9:36       ` Jan Beulich
  2016-08-30 12:08         ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-29  9:36 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 26.08.16 at 17:11, <joao.m.martins@oracle.com> wrote:
> On 08/25/2016 11:06 AM, Jan Beulich wrote:
>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>> This patch introduces support for using TSC as platform time source
>>> which is the highest resolution time and most performant to get (~20
>>> nsecs).
>> 
>> Is this indeed still the case with the recently added fences?
> Hmm, may be not as fast with the added fences, But definitely the fastest 
> time
> source. If you prefer I can remove the mention to time taken.

I'd say either you re-measure it with the added fences, or remove it
as potentially being stale/wrong.

>>>  - Change init_tsctimer to skip all the tests and assume it's called
>>>  only on reliable TSC conditions and no warps observed. Tidy
>>>  initialization on verify_tsc_reliability as suggested by Konrad.
>> 
>> Which means that contrary to what you say in the cover letter
>> there's nothing to prevent it from being used when CPU hotplug
>> is possible.
> Probably the cover letter wasn't completely clear on this. I mention that I split it
> between Patch 2 and 5 (intent was for easier review), and you can see that I add
> check in patch 5, plus preventing online of CPUs too.
> 
>> I don't think you should skip basic sanity checks, and
>> place entirely on the admin the burden of knowing whether the
>> option is safe to use.
> Neither do I think it should be skipped. We mainly don't duplicate these checks. In
> the end I am just running init_tsctimer, in the TSC_RELIABLE block and if no warps
> are observed. So putting that in init_tsctimer would just duplicate the previously
> done checks. Or would you still prefer as done in previous version i.e. all checks in
> both init_tsctimer and verify_tsc_reliability?

I'm not sure they're needed in both places; what you need to make
certain is that they're reliably gone through, and that this happens
early enough.

As to breaking this off into the later patch - please don't forget
that this (as any) series may get applied in pieces, so deferring a
crucial check to a later patch is not acceptable. If you mean things
to be split for easier review you may check whether you can find
a way to add the check in a prereq patch.

>>> +            {
>>> +                struct cpu_time *t = &per_cpu(cpu_time, cpu);
>>> +
>>> +                t->stamp.local_tsc = boot_tsc_stamp;
>>> +                t->stamp.local_stime = 0;
>>> +                t->stamp.local_stime = get_s_time_fixed(boot_tsc_stamp);
>>> +                t->stamp.master_stime = t->stamp.local_stime;
>>> +            }
>> 
>> Without any synchronization, how "good" are the chances that
>> this updating would race with something the other CPUs do?
> 
> I assumed that at this stage init calls are still being invoked that we update all
> cpus time infos, but maybe it's a misconception. I can have this part in one function
> and have it done on_selected_cpus() and wait until all cpu time structures get
> updated. That perhaps would be better?

I think so - even if the risk of there being a race right now is rather
low, that way you'd avoid this becoming a problem if secondary
CPUs are made to do something other than idling at this point in time.

>>> @@ -1528,6 +1607,7 @@ void __init early_time_init(void)
>>>  
>>>      preinit_pit();
>>>      tmp = init_platform_timer();
>>> +    plt_tsc.frequency = tmp;
>> 
>> How can this be the TSC frequency? init_platform_timer()
>> specifically skips the TSC. And I can see no other place where
>> plt_tsc.frequency gets initialized. Am I overlooking something?
>> 
> That's the calibrated TSC frequency. Before I was setting pts->frequency in
> init_tsctimer through a temporary variable called tsc_freq. So I thought I could just
> drop the variable and set plt_tsc directly. The difference though from previous
> versions is that since commit 9334029 this value is returned from platform time
> source init() and calibrated against platform timer, instead of always against PIT.

This doesn't seem to answer my primary question: Where does
plt_tsc.frequency get initialized now? try_platform_timer() and
init_platform_timer() don't - PIT and ACPI timer use static
initializers, and HPET gets taken care of in init_hpet(), and hence so
should init_tsctimer() do (instead of just returning this apparently
never initialized value). And that's unrelated to what ->init() returns.

Jan


* Re: [PATCH v3 3/6] x86/time: streamline platform time init on plt_update()
  2016-08-26 15:12     ` Joao Martins
@ 2016-08-29  9:41       ` Jan Beulich
  2016-08-30 12:10         ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-29  9:41 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 26.08.16 at 17:12, <joao.m.martins@oracle.com> wrote:
> On 08/25/2016 11:13 AM, Jan Beulich wrote:
>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>> And use to initialize platform time solely for clocksource=tsc,
>>> as opposed to initializing platform overflow timer, which would
>>> only fire in ~180 years (on 2.2 Ghz Broadwell processor).
>> 
>> Do we really want to risk this time period going down by two orders
>> of magnitude? Is there anything that's really expensive in setting the
>> overflow timer in the far distant future?
> It wasn't about cost but rather setting the timer in a so distant future. I could
> decrease to an year time, month or day. But do you think we really need that overflow
> handling for TSC?

I think we shouldn't introduce latent issues, no matter how unlikely
we may deem it for them to actually become active ones. Just
consider how unlikely it would have seemed to someone in the
i586 days (when the TSC got introduced) that clock speeds might
ever cross the 4GHz boundary. As a consequence long ago Linux
did use a 32-bit variable for it. It got noticed early enough before
processors got really close to that boundary, but it demonstrates
the fundamental issue. And yes, processor speeds no longer
grow steadily (at least in the x86 space), but that's not a rule
set in stone.

Jan



* Re: [PATCH v3 4/6] x86/time: refactor read_platform_stime()
  2016-08-26 15:13     ` Joao Martins
@ 2016-08-29  9:42       ` Jan Beulich
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2016-08-29  9:42 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 26.08.16 at 17:13, <joao.m.martins@oracle.com> wrote:
> On 08/25/2016 11:17 AM, Jan Beulich wrote:
>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>> In the case of clocksource=tsc we will
>>> use it to set tsc_timestamp.
>> 
>> I don't see how this relates to the patch here.
>> 
>> The change itself looks reasonable assuming its a prereq for a
>> subsequent patch.
> OK, I can remove it, and could perhaps add this instead?
> 
> "This is a prerequisite for a subsequent patch that will use this last 
> read."

Sounds good.

Jan



* Re: [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT
  2016-08-26 15:44     ` Joao Martins
@ 2016-08-29 10:06       ` Jan Beulich
  2016-08-30 12:26         ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-29 10:06 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 26.08.16 at 17:44, <joao.m.martins@oracle.com> wrote:
> On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>> This patch proposes relying on host TSC synchronization and
>>> passthrough to the guest, when running on a TSC-safe platform. On
>>> time_calibration we retrieve the platform time in ns and the counter
>>> read by the clocksource that was used to compute system time. We
>>> introduce a new rendezous function which doesn't require
>>> synchronization between master and slave CPUS and just reads
>>> calibration_rendezvous struct and writes it down the stime and stamp
>>> to the cpu_calibration struct to be used later on. We can guarantee that
>>> on a platform with a constant and reliable TSC, that the time read on
>>> vcpu B right after A is bigger independently of the VCPU calibration
>>> error. Since pvclock time infos are monotonic as seen by any vCPU set
>>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of VDSO on Linux.
>>> IIUC, this is similar to how it's implemented on KVM.
>> 
>> Without any tools side change, how is it guaranteed that a guest
>> which observed the stable bit won't get migrated to a host not
>> providing that guarantee?
> Do you want to prevent migration in such cases? The worst that can happen is that the
> guest might need to fallback to a system call if this bit is 0 and would keep doing
> so if the bit is 0.

Whether migration needs preventing I'm not sure; all I was trying
to indicate is that there seem to be pieces missing wrt migration.
As to the guest falling back to a system call - are guest kernels and
(as far as affected) applications required to cope with the flag
changing from 1 to 0 behind their back?

>>>  {
>>>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>>>  
>>> -    c->local_tsc    = rdtsc_ordered();
>>> -    c->local_stime  = get_s_time_fixed(c->local_tsc);
>>> +    if ( master_tsc )
>>> +    {
>>> +        c->local_tsc    = r->master_tsc_stamp;
>> 
>> Doesn't this require the TSCs to run in perfect sync (not even off
>> wrt each other by a single cycle)? Is such even possible on multi
>> socket systems? I.e. how would multiple chips get into such a
>> mode in the first place, considering their TSCs can't possibly start
>> ticking at exactly the same (perhaps even sub-)nanosecond?
> They do require to be in sync with multi-sockets, otherwise this wouldn't work.

"In sync" may mean two things: Ticking at exactly the same rate, or
(more strict) holding the exact same values at all times.

> Invariant TSC only refers to cores in a package, but multi-socket is up to board
> vendors (no manuals mention this guarantee across sockets). That one of the reasons
> TSC is such a burden :(
> 
> Looking at datasheets (on the oldest processor I was testing this) it 
> mentions this note:
> 
> "In order In order to ensure Timestamp Counter (TSC) synchronization across sockets
> in multi-socket systems, the RESET# deassertion edge should arrive at the same BCLK
> rising edge at both sockets and should meet the Tsu and Th requirement of 600ps
> relative to BCLK, as outlined in Table 2-26.".

Hmm, a dual socket system is certainly still one of the easier ones to
deal with. 600ps means 18cm difference in signaling paths, which on
larger systems (and namely ones composed of mostly independent
nodes) I could easily see getting exceeded. That can certainly be
compensated (e.g. by deasserting RESET# at different times for
different sockets), but I'd then still question the accuracy.

> [0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63,
> http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5 
> 600-vol-1-datasheet.pdf
> 
> The BCLK looks to be the global reference clock shared across sockets IIUC used in
> the PLLs in the individual packages (to generate the signal where the TSC is
> extrapolated from). ( Please read it with a grain of salt, as I may be doing wrong
> readings of these datasheets ). But If it was a box with TSC skewed among sockets,
> wouldn't we see that at boot time in the tsc warp test? Or maybe TSC sync check isn't
> potentially fast enough to catch any oddities?

That's my main fear: The check can't possibly determine whether TSCs
are in perfect sync, it can only check an approximation. Perhaps even
worse than the multi-node consideration here is hyper-threading, as
that makes it fundamentally impossible that all threads within one core
execute the same operation at exactly the same time. Not to speak of
the various odd cache effects which I did observe while doing the
measurements for my series (e.g. the second thread speculating the
TSC reads much farther than the primary ones, presumably because
the primary ones first needed to get the I-cache populated).

> Our docs
> (https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also seem to mention
> something along these lines on multi-socket systems. And Linux tsc code seems to
> assume that Intel boxes have synchronized TSCs across sockets [1] and that the
> exceptions cases should mark tsc=skewed (we also have such parameter).
> 
> [1] arch/x86/kernel/tsc.c#L1094

Referring back to what I've said above: Does this mean exact same
tick rate, or exact same values?

> As reassurance I've been running tests for days long (currently in almost a week on
> 2-socket system) and I'll keep running to see if it catches any issues or time going
> backwards. Could also run in the biggest boxes we have with 8 sockets. But still it
> would represent only a tiny fraction of what x86 has available these days.

A truly interesting case would be, as mentioned, a box composed of
individual nodes. Not sure whether that 8-socket one you mention
would meet that.

> Other than the things above I am not sure how to go about this :( Should we start
> adjusting the TSCs if we find disparities or skew is observed on the long run? Or
> allow only TSCs on vCPUS of the same package to expose this flag? Hmm, what's your
> take on this? Appreciate your feedback.

At least as an initial approach requiring affinities to be limited to a
single socket would seem like a good compromise, provided HT
aspects don't have a bad effect (in which case also excluding HT
may be required). I'd also be fine with command line options
allowing to further relax that, but a simple "clocksource=tsc"
should imo result in a setup which from all we can tell will work as
intended.

Jan


* Re: [PATCH v3 2/6] x86/time: implement tsc as clocksource
  2016-08-29  9:36       ` Jan Beulich
@ 2016-08-30 12:08         ` Joao Martins
  2016-08-30 12:30           ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-30 12:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 08/29/2016 10:36 AM, Jan Beulich wrote:
>>>> On 26.08.16 at 17:11, <joao.m.martins@oracle.com> wrote:
>> On 08/25/2016 11:06 AM, Jan Beulich wrote:
>>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>>> This patch introduces support for using TSC as platform time source
>>>> which is the highest resolution time and most performant to get (~20
>>>> nsecs).
>>>
>>> Is this indeed still the case with the recently added fences?
>> Hmm, may be not as fast with the added fences, But definitely the fastest 
>> time
>> source. If you prefer I can remove the mention to time taken.
> 
> I'd say either you re-measure it with the added fences, or remove it
> as potentially being stale/wrong.
OK.
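
For the number itself, something as crude as the below (illustrative only, not meant
for the series) should give a ballpark figure once system time is up:

static void __init measure_tsc_read_cost(void)
{
    unsigned int i;
    s_time_t t0, t1;

    t0 = NOW();
    for ( i = 0; i < 1000000; i++ )
        (void)rdtsc_ordered();      /* fenced read, not optimized away */
    t1 = NOW();

    printk("rdtsc_ordered(): ~%ld ns per read\n", (long)((t1 - t0) / 1000000));
}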

>>>>  - Change init_tsctimer to skip all the tests and assume it's called
>>>>  only on reliable TSC conditions and no warps observed. Tidy
>>>>  initialization on verify_tsc_reliability as suggested by Konrad.
>>>
>>> Which means that contrary to what you say in the cover letter
>>> there's nothing to prevent it from being used when CPU hotplug
>>> is possible.
>> Probably the cover letter wasn't completely clear on this. I mention that I split it
>> between Patch 2 and 5 (intent was for easier review), and you can see that I add
>> check in patch 5, plus preventing online of CPUs too.
>>
>>> I don't think you should skip basic sanity checks, and
>>> place entirely on the admin the burden of knowing whether the
>>> option is safe to use.
>> Neither do I think it should be skipped. We mainly don't duplicate these checks. In
>> the end I am just running init_tsctimer, in the TSC_RELIABLE block and if no warps
>> are observed. So putting that in init_tsctimer would just duplicate the previously
>> done checks. Or would you still prefer as done in previous version i.e. all checks in
>> both init_tsctimer and verify_tsc_reliability?
> 
> I'm not sure they're needed in both places; what you need to make
> certain is that they're reliably gone through, and that this happens
> early enough.
They are reliably gone through, and we get to avoid duplicating the checks. Unless
there's a preference to re-add these checks in init_tsctimer, I'll keep them as is:
verify_tsc_reliability(...) needs to perform these checks anyway, and init_tsctimer
is only called under these reliable-TSC conditions.

> As to breaking this off into the later patch - please don't forget
> that this (as any) series may get applied in pieces, so deferring a
> crucial check to a later patch is not acceptable. If you mean things
> to be split for easier review you may check whether you can find
> a way to add the check in q prereq patch.
OK, note taken. I'll get this check moved from patch 5 to here, as it's the place
where it should be.


>>>> +            {
>>>> +                struct cpu_time *t = &per_cpu(cpu_time, cpu);
>>>> +
>>>> +                t->stamp.local_tsc = boot_tsc_stamp;
>>>> +                t->stamp.local_stime = 0;
>>>> +                t->stamp.local_stime = get_s_time_fixed(boot_tsc_stamp);
>>>> +                t->stamp.master_stime = t->stamp.local_stime;
>>>> +            }
>>>
>>> Without any synchronization, how "good" are the chances that
>>> this updating would race with something the other CPUs do?
>>
>> I assumed that at this stage init calls are still being invoked that we update all
>> cpus time infos, but maybe it's a misconception. I can have this part in one function
>> and have it done on_selected_cpus() and wait until all cpu time structures get
>> updated. That perhaps would be better?
> 
> I think so - even if the risk of thee being a race right now is rather
> low, that way you'd avoid this becoming a problem if secondary
> CPUs get made do something other than idling at this point in time.
Indeed, I'll do it that way then.

>>>> @@ -1528,6 +1607,7 @@ void __init early_time_init(void)
>>>>  
>>>>      preinit_pit();
>>>>      tmp = init_platform_timer();
>>>> +    plt_tsc.frequency = tmp;
>>>
>>> How can this be the TSC frequency? init_platform_timer()
>>> specifically skips the TSC. And I can see no other place where
>>> plt_tsc.frequency gets initialized. Am I overlooking something?
>>>
>> That's the calibrated TSC frequency. Before I was setting pts->frequency in
>> init_tsctimer through a temporary variable called tsc_freq. So I thought I could just
>> drop the variable and set plt_tsc directly. The difference though from previous
>> versions is that since commit 9334029 this value is returned from platform time
>> source init() and calibrated against platform timer, instead of always against PIT.
> 
> This doesn't seem to answer my primary question: Where does
> plt_tsc.frequency get initialized now? try_platform_timer() and
> init_platform_timer() don't - PIT and ACPI timer use static
> initializers, and HEPT gets taken care of in init_hpet(), and hence so
> should init_tsctimer() do (instead of just returning this apparently
> never initialized value). And that's unrelated to what ->init() returns.

plt_tsc.frequency is certainly initialized in early_time_init(). And then on
try_platform_timer we have plt_src = *pts (pts would be a pointer to plt_tsc when
called from verify_tsc_reliability()).

IOW, effectively I changed from this:

#v2

static u64 tsc_freq;

static s64 __init init_tsctimer(struct platform_timesource *pts)
{
   ...
   pts->frequency = tsc_freq;
   return 1;
}

...

void __init early_time_init(void)
{
   u64 tmp = init_pit_and_calibrate_tsc();

   tsc_freq = tmp;
}

*To:*

#v3

static s64 __init init_tsctimer(struct platform_timesource *pts)
{
    return pts->frequency;
}


void __init early_time_init(void)
{
    ...
    tmp = init_platform_timer();
    plt_tsc.frequency = tmp;
}

Does this answer your question? Note that my purpose with the change was to remove
the tsc_freq temporary variable. If it makes things less clear (as in doing things
differently from the other platform timers), I can go back to the v2 approach here.

Joao


* Re: [PATCH v3 3/6] x86/time: streamline platform time init on plt_update()
  2016-08-29  9:41       ` Jan Beulich
@ 2016-08-30 12:10         ` Joao Martins
  2016-08-30 12:31           ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-30 12:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 08/29/2016 10:41 AM, Jan Beulich wrote:
>>>> On 26.08.16 at 17:12, <joao.m.martins@oracle.com> wrote:
>> On 08/25/2016 11:13 AM, Jan Beulich wrote:
>>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>>> And use to initialize platform time solely for clocksource=tsc,
>>>> as opposed to initializing platform overflow timer, which would
>>>> only fire in ~180 years (on 2.2 Ghz Broadwell processor).
>>>
>>> Do we really want to risk this time period going down by two orders
>>> of magnitude? Is there anything that's really expensive in setting the
>>> overflow timer in the far distant future?
>> It wasn't about cost but rather setting the timer in a so distant future. I could
>> decrease to an year time, month or day. But do you think we really need that overflow
>> handling for TSC?
> 
> I think we shouldn't introduce latent issues, no matter how unlikely
> we may deem it for them to actually become active ones. Just
> consider how unlikely it would have seemed to someone in the
> i586 days (when the TSC got introduced) that clock speeds might
> ever cross the 4GHz boundary. As a consequence long ago Linux
> did use a 32-bit variable for it. It got noticed early enough before
> processors got really close to that boundary, but it demonstrates
> the fundamental issue. And yes, processor speeds have stopped
> to always grow (at least in the x86 space), but that's not a rule
> set in stone.

Thanks for the insight.

What would be a preferred period - capping it to a day/hour/minutes perhaps? Or
keeping the current calculated value? The maximum right now among the existing
platform timers is a few minutes, depending on the HPET frequency.

Joao


* Re: [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT
  2016-08-29 10:06       ` Jan Beulich
@ 2016-08-30 12:26         ` Joao Martins
  2016-08-30 12:45           ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-08-30 12:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 08/29/2016 11:06 AM, Jan Beulich wrote:
>>>> On 26.08.16 at 17:44, <joao.m.martins@oracle.com> wrote:
>> On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>>> This patch proposes relying on host TSC synchronization and
>>>> passthrough to the guest, when running on a TSC-safe platform. On
>>>> time_calibration we retrieve the platform time in ns and the counter
>>>> read by the clocksource that was used to compute system time. We
>>>> introduce a new rendezous function which doesn't require
>>>> synchronization between master and slave CPUS and just reads
>>>> calibration_rendezvous struct and writes it down the stime and stamp
>>>> to the cpu_calibration struct to be used later on. We can guarantee that
>>>> on a platform with a constant and reliable TSC, that the time read on
>>>> vcpu B right after A is bigger independently of the VCPU calibration
>>>> error. Since pvclock time infos are monotonic as seen by any vCPU set
>>>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of VDSO on Linux.
>>>> IIUC, this is similar to how it's implemented on KVM.
>>>
>>> Without any tools side change, how is it guaranteed that a guest
>>> which observed the stable bit won't get migrated to a host not
>>> providing that guarantee?
>> Do you want to prevent migration in such cases? The worst that can happen is that the
>> guest might need to fallback to a system call if this bit is 0 and would keep doing
>> so if the bit is 0.
> 
> Whether migration needs preventing I'm not sure; all I was trying
> to indicate is that there seem to be pieces missing wrt migration.
> As to the guest falling back to a system call - are guest kernels and
> (as far as as affected) applications required to cope with the flag
> changing from 1 to 0 behind their back?
It's expected that they cope with this bit changing, AFAIK. The vDSO code (i.e.
applications) checks this bit on every read to decide whether to fall back to a
system call, and the pvclock code in the guest kernel does the same on every read, in
both Linux and FreeBSD, to decide whether or not to skip the monotonicity checks.
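
In rough terms (loosely modelled on the Linux pvclock helpers; simplified, with
illustrative names, not the exact upstream code) the guest-side read is:

static u64 pvclock_read(const struct vcpu_time_info *pvti, u64 *last_ns)
{
    u32 ver;
    u64 tsc, ns;
    u8 flags;

    do {
        ver = pvti->version;
        smp_rmb();
        flags = pvti->flags;
        tsc = rdtsc_ordered();
        ns = pvti->system_time +
             pvclock_scale_delta(tsc - pvti->tsc_timestamp,
                                 pvti->tsc_to_system_mul, pvti->tsc_shift);
        smp_rmb();
    } while ( (ver & 1) || ver != pvti->version );

    if ( flags & PVCLOCK_TSC_STABLE_BIT )
        return ns;                    /* no cross-vCPU clamping needed */

    /*
     * Without the bit, preserve monotonicity against the last value handed
     * out (in the real code last_ns is updated atomically across vCPUs).
     */
    if ( ns < *last_ns )
        ns = *last_ns;
    *last_ns = ns;
    return ns;
}

In the vDSO path the absence of the bit simply forces the fallback to the system
call instead of this clamping.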

>>>>  {
>>>>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>>>>  
>>>> -    c->local_tsc    = rdtsc_ordered();
>>>> -    c->local_stime  = get_s_time_fixed(c->local_tsc);
>>>> +    if ( master_tsc )
>>>> +    {
>>>> +        c->local_tsc    = r->master_tsc_stamp;
>>>
>>> Doesn't this require the TSCs to run in perfect sync (not even off
>>> wrt each other by a single cycle)? Is such even possible on multi
>>> socket systems? I.e. how would multiple chips get into such a
>>> mode in the first place, considering their TSCs can't possibly start
>>> ticking at exactly the same (perhaps even sub-)nanosecond?
>> They do require to be in sync with multi-sockets, otherwise this wouldn't work.
> 
> "In sync" may mean two things: Ticking at exactly the same rate, or
> (more strict) holding the exact same values at all times.
I meant the more strict one.

> 
>> Invariant TSC only refers to cores in a package, but multi-socket is up to board
>> vendors (no manuals mention this guarantee across sockets). That one of the reasons
>> TSC is such a burden :(
>>
>> Looking at datasheets (on the oldest processor I was testing this) it 
>> mentions this note:
>>
>> "In order In order to ensure Timestamp Counter (TSC) synchronization across sockets
>> in multi-socket systems, the RESET# deassertion edge should arrive at the same BCLK
>> rising edge at both sockets and should meet the Tsu and Th requirement of 600ps
>> relative to BCLK, as outlined in Table 2-26.".
> 
> Hmm, a dual socket system is certainly still one of the easier ones to
> deal with. 600ps means 18cm difference in signaling paths, which on
> larger systems (and namely ones composed of mostly independent
> nodes) I could easily seem getting exceeded. That can certainly be
> compensated (e.g. by deasserting RESET# at different times for
> different sockets), but I'd then still question the accuracy.
Interesting, good point. FWIW the Linux code doesn't deem multi-node systems TSC
invariant/reliable.

> 
>> [0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63,
>> http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5 
>> 600-vol-1-datasheet.pdf
>>
>> The BCLK looks to be the global reference clock shared across sockets IIUC used in
>> the PLLs in the individual packages (to generate the signal where the TSC is
>> extrapolated from). ( Please read it with a grain of salt, as I may be doing wrong
>> readings of these datasheets ). But If it was a box with TSC skewed among sockets,
>> wouldn't we see that at boot time in the tsc warp test? Or maybe TSC sync check isn't
>> potentially fast enough to catch any oddities?
> 
> That's my main fear: The check can't possibly determine whether TSCs
> are in perfect sync, it can only check an approximation. 
Indeed, and as we add more CPUs the TSC reliability check takes significantly longer,
which in turn narrows that approximation, unless there's a significant skew.

> Perhaps even
> worse than the multi-node consideration here is hyper-threading, as
> that makes it fundamentally impossible that all threads within one core
> execute the same operation at exactly the same time. Not to speak of
> the various odd cache effects which I did observe while doing the
> measurements for my series (e.g. the second thread speculating the
> TSC reads much farther than the primary ones, presumably because
> the primary ones first needed to get the I-cache populated).
Hmmm, not sure how we could cope with TSC HT issues. In this patch we propagate the
TSC read done on CPU 0 (together with the platform timer read) to the other CPUs, so
this would probably not be visible, as there aren't TSC reads being done on multiple
threads at approximately the same time?

>> Our docs
>> (https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also seem to mention
>> something along these lines on multi-socket systems. And Linux tsc code seems to
>> assume that Intel boxes have synchronized TSCs across sockets [1] and that the
>> exceptions cases should mark tsc=skewed (we also have such parameter).
>>
>> [1] arch/x86/kernel/tsc.c#L1094
> 
> Referring back to what I've said above: Does this mean exact same
> tick rate, or exact same values?
Here I also meant the invariant condition, i.e. the exact same values.

> 
>> As reassurance I've been running tests for days long (currently in almost a week on
>> 2-socket system) and I'll keep running to see if it catches any issues or time going
>> backwards. Could also run in the biggest boxes we have with 8 sockets. But still it
>> would represent only a tiny fraction of what x86 has available these days.
> 
> A truly interesting case would be, as mentioned, a box composed of
> individual nodes. Not sure whether that 8-socket one you mention
> would meet that.
It's not a multi-node machine - but within single-node machines it's potentially the
worst case scenario.

>> Other than the things above I am not sure how to go about this :( Should we start
>> adjusting the TSCs if we find disparities or skew is observed on the long run? Or
>> allow only TSCs on vCPUS of the same package to expose this flag? Hmm, what's your
>> take on this? Appreciate your feedback.
> 
> At least as an initial approach requiring affinities to be limited to a
> single socket would seem like a good compromise, provided HT
> aspects don't have a bad effect (in which case also excluding HT
> may be required). I'd also be fine with command line options
> allowing to further relax that, but a simple "clocksource=tsc"
> should imo result in a setup which from all we can tell will work as
> intended.
Sounds reasonable: unless command line options say otherwise, we disallow TSC as
clocksource on multi-socket systems. WRT command line options, how about extending
the "tsc" parameter to accept another possible value such as "global" or
"socketsafe"? Current values are "unstable" and "skewed".
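
Purely illustrative (the handler and flag names below are made up, and the existing
"unstable"/"skewed" handling is elided), the shape would be something like:

static bool __initdata opt_tsc_global;

static void __init parse_tsc_param(const char *s)
{
    if ( !strcmp(s, "unstable") )
    {
        /* existing handling */
    }
    else if ( !strcmp(s, "skewed") )
    {
        /* existing handling */
    }
    else if ( !strcmp(s, "global") )
        opt_tsc_global = true;   /* admin asserts TSCs are in sync across sockets */
}
custom_param("tsc", parse_tsc_param);

verify_tsc_reliability() would then only allow the multi-socket case when
opt_tsc_global is set.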

Thanks for all the comments so far!
Joao


* Re: [PATCH v3 2/6] x86/time: implement tsc as clocksource
  2016-08-30 12:08         ` Joao Martins
@ 2016-08-30 12:30           ` Jan Beulich
  2016-08-30 13:59             ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-30 12:30 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 30.08.16 at 14:08, <joao.m.martins@oracle.com> wrote:
> On 08/29/2016 10:36 AM, Jan Beulich wrote:
>>>>> On 26.08.16 at 17:11, <joao.m.martins@oracle.com> wrote:
>>> On 08/25/2016 11:06 AM, Jan Beulich wrote:
>>>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>>>>  - Change init_tsctimer to skip all the tests and assume it's called
>>>>>  only on reliable TSC conditions and no warps observed. Tidy
>>>>>  initialization on verify_tsc_reliability as suggested by Konrad.
>>>>
>>>> Which means that contrary to what you say in the cover letter
>>>> there's nothing to prevent it from being used when CPU hotplug
>>>> is possible.
>>> Probably the cover letter wasn't completely clear on this. I mention that I split it
>>> between Patch 2 and 5 (intent was for easier review), and you can see that I add
>>> check in patch 5, plus preventing online of CPUs too.
>>>
>>>> I don't think you should skip basic sanity checks, and
>>>> place entirely on the admin the burden of knowing whether the
>>>> option is safe to use.
>>> Neither do I think it should be skipped. We mainly don't duplicate these checks. In
>>> the end I am just running init_tsctimer, in the TSC_RELIABLE block and if no warps
>>> are observed. So putting that in init_tsctimer would just duplicate the previously
>>> done checks. Or would you still prefer as done in previous version i.e. all checks in
>>> both init_tsctimer and verify_tsc_reliability?
>> 
>> I'm not sure they're needed in both places; what you need to make
>> certain is that they're reliably gone through, and that this happens
>> early enough.
> They are reliably gone through and we get to avoid duplication of checks. Unless
> there's a preference to re-add these checks in init_tsctimer, I'll keep these as is.
> verify_tsc_reliability(...) needs to perform this checks and init_tsctimer is only
> called these reliable TSC conditions.

But please make sure there's a comment in (or ahead of)
init_tsctimer() pointing out where the apparently missing checks
are. This will help both review and future readers.

>>>>> @@ -1528,6 +1607,7 @@ void __init early_time_init(void)
>>>>>  
>>>>>      preinit_pit();
>>>>>      tmp = init_platform_timer();
>>>>> +    plt_tsc.frequency = tmp;
>>>>
>>>> How can this be the TSC frequency? init_platform_timer()
>>>> specifically skips the TSC. And I can see no other place where
>>>> plt_tsc.frequency gets initialized. Am I overlooking something?
>>>>
>>> That's the calibrated TSC frequency. Before I was setting pts->frequency in
>>> init_tsctimer through a temporary variable called tsc_freq. So I thought I could just
>>> drop the variable and set plt_tsc directly. The difference though from previous
>>> versions is that since commit 9334029 this value is returned from platform time
>>> source init() and calibrated against platform timer, instead of always against PIT.
>> 
>> This doesn't seem to answer my primary question: Where does
>> plt_tsc.frequency get initialized now? try_platform_timer() and
>> init_platform_timer() don't - PIT and ACPI timer use static
>> initializers, and HEPT gets taken care of in init_hpet(), and hence so
>> should init_tsctimer() do (instead of just returning this apparently
>> never initialized value). And that's unrelated to what ->init() returns.
> 
> plt_tsc.frequency is certainly initialized in early_time_init(). And then on
> try_platform_timer we have plt_src = *pts (pts would be a pointer to plt_tsc 
> when
> called from verify_tsc_reliability()).
> 
> IOW, effectively I changed from this:
> 
> #v2
> 
> static u64 tsc_freq;
> 
> static s64 __init init_tsctimer(struct platform_timesource *pts)
> {
>    ...
>    pts->frequency = tsc_freq;
>    return 1;
> }
> 
> ...
> 
> void __init early_time_init(void)
> {
>    u64 tmp = init_pit_and_calibrate_tsc();
> 
>    tsc_freq = tmp;
> }
> 
> *To:*
> 
> #v3
> 
> static s64 __init init_tsctimer(struct platform_timesource *pts)
> {
>     return pts->frequency;
> }
> 
> 
> void __init early_time_init(void)
> {
>     ...
>     tmp = init_platform_timer();
>     plt_tsc.frequency = tmp;
> }
> 
> Does this answer your question? Note that my purpose with the change, was to remove
> the tsc_freq temporary variable. If it makes things less clear (as in doing things
> differently from other platform timers) I can go back to v2 in this aspect.

Ah, I see now how I got confused. This once again depends on TSC
possibly becoming the platform timer only much later than when
early_time_init() runs.

Jan


* Re: [PATCH v3 3/6] x86/time: streamline platform time init on plt_update()
  2016-08-30 12:10         ` Joao Martins
@ 2016-08-30 12:31           ` Jan Beulich
  2016-09-09 16:32             ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-30 12:31 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 30.08.16 at 14:10, <joao.m.martins@oracle.com> wrote:
> What would be a preferred period - probably capping it to a day/hour/minutes? Or
> keeping the current calculated value? Maximum right now in current platform timers
> are few minutes depending on the HPET frequency.

Ideally I would see you just use the calculated value, unless that
causes problems elsewhere. What cap to use instead would then
depend on such possible problems.

Jan



* Re: [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT
  2016-08-30 12:26         ` Joao Martins
@ 2016-08-30 12:45           ` Jan Beulich
  2016-08-30 14:14             ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-08-30 12:45 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 30.08.16 at 14:26, <joao.m.martins@oracle.com> wrote:
> On 08/29/2016 11:06 AM, Jan Beulich wrote:
>>>>> On 26.08.16 at 17:44, <joao.m.martins@oracle.com> wrote:
>>> On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>>>> This patch proposes relying on host TSC synchronization and
>>>>> passthrough to the guest, when running on a TSC-safe platform. On
>>>>> time_calibration we retrieve the platform time in ns and the counter
>>>>> read by the clocksource that was used to compute system time. We
>>>>> introduce a new rendezous function which doesn't require
>>>>> synchronization between master and slave CPUS and just reads
>>>>> calibration_rendezvous struct and writes it down the stime and stamp
>>>>> to the cpu_calibration struct to be used later on. We can guarantee that
>>>>> on a platform with a constant and reliable TSC, that the time read on
>>>>> vcpu B right after A is bigger independently of the VCPU calibration
>>>>> error. Since pvclock time infos are monotonic as seen by any vCPU set
>>>>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of VDSO on Linux.
>>>>> IIUC, this is similar to how it's implemented on KVM.
>>>>
>>>> Without any tools side change, how is it guaranteed that a guest
>>>> which observed the stable bit won't get migrated to a host not
>>>> providing that guarantee?
>>> Do you want to prevent migration in such cases? The worst that can happen is that the
>>> guest might need to fall back to a system call if this bit is 0, and it would keep doing
>>> so while the bit stays 0.
>> 
>> Whether migration needs preventing I'm not sure; all I was trying
>> to indicate is that there seem to be pieces missing wrt migration.
>> As to the guest falling back to a system call - are guest kernels and
>> (as far as affected) applications required to cope with the flag
>> changing from 1 to 0 behind their back?
> It's expected they cope with this bit changing, AFAIK. The vdso code (i.e.
> applications) always checks this bit on every read to decide whether to fall back to a
> system call. The same goes for the pvclock code in the guest kernel, on every read, in
> both Linux and FreeBSD, to decide whether to skip the monotonicity checks.
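
A minimal sketch of the per-read check being described (assumed field names
and layout, not the exact Linux/FreeBSD structures):

	#include <stdint.h>

	#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

	struct vcpu_time_info {               /* simplified pvclock layout */
	    uint32_t version;                 /* odd while being updated   */
	    uint64_t tsc_timestamp;
	    uint64_t system_time;             /* ns at tsc_timestamp       */
	    uint32_t tsc_to_system_mul;
	    int8_t   tsc_shift;
	    uint8_t  flags;
	};

	static inline uint64_t rdtsc(void)
	{
	    uint32_t lo, hi;
	    __asm__ __volatile__ ( "rdtsc" : "=a" (lo), "=d" (hi) );
	    return ((uint64_t)hi << 32) | lo;
	}

	/* Returns 0 if the caller must fall back to a system call. */
	static int pvclock_try_read(const volatile struct vcpu_time_info *ti,
	                            uint64_t *ns)
	{
	    uint32_t ver;
	    uint64_t delta;

	    do {
	        ver = ti->version;            /* real code adds barriers */
	        if ( !(ti->flags & PVCLOCK_TSC_STABLE_BIT) )
	            return 0;                 /* bit may change behind our back */
	        delta = rdtsc() - ti->tsc_timestamp;
	        if ( ti->tsc_shift < 0 )
	            delta >>= -ti->tsc_shift;
	        else
	            delta <<= ti->tsc_shift;
	        /* simplified: the real code keeps a wider intermediate product */
	        *ns = ti->system_time + ((delta * ti->tsc_to_system_mul) >> 32);
	    } while ( (ver & 1) || ver != ti->version );

	    return 1;
	}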

Okay, but please make sure this is called out at least in the commit
message, if not in a code comment.

>> Perhaps even
>> worse than the multi-node consideration here is hyper-threading, as
>> that makes it fundamentally impossible that all threads within one core
>> execute the same operation at exactly the same time. Not to speak of
>> the various odd cache effects which I did observe while doing the
>> measurements for my series (e.g. the second thread speculating the
>> TSC reads much farther than the primary ones, presumably because
>> the primary ones first needed to get the I-cache populated).
> Hmmm, not sure how we could cope with TSC HT issues. In this patch we propagate the TSC
> read from the platform timer on CPU 0 into the other CPUs, so it would probably not be
> visible, as there aren't TSC reads being done on multiple threads at approximately the
> same time?

Right - much depends on parameters whose values we don't even have
an idea of, like how frequently hyperthreads get switched within a
core.

>>> Other than the things above I am not sure how to go about this :( Should we start
>>> adjusting the TSCs if we find disparities or skew is observed in the long run? Or
>>> allow only TSCs on vCPUs of the same package to expose this flag? Hmm, what's your
>>> take on this? Appreciate your feedback.
>> 
>> At least as an initial approach requiring affinities to be limited to a
>> single socket would seem like a good compromise, provided HT
>> aspects don't have a bad effect (in which case also excluding HT
>> may be required). I'd also be fine with command line options
>> allowing to further relax that, but a simple "clocksource=tsc"
>> should imo result in a setup which from all we can tell will work as
>> intended.
> Sounds reasonable, so unless command line options are specified we disallow TSC as
> clocksource on multi-socket systems. WRT command line options, how about extending
> the "tsc" parameter to accept another possible value such as "global" or "socketsafe"?
> Current values are "unstable" and "skewed".

What about "stable, "stable:socket" (and then perhaps also
"stable:node")?

Jan


* Re: [PATCH v3 2/6] x86/time: implement tsc as clocksource
  2016-08-30 12:30           ` Jan Beulich
@ 2016-08-30 13:59             ` Joao Martins
  0 siblings, 0 replies; 35+ messages in thread
From: Joao Martins @ 2016-08-30 13:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel



On 08/30/2016 01:30 PM, Jan Beulich wrote:
>>>> On 30.08.16 at 14:08, <joao.m.martins@oracle.com> wrote:
>> On 08/29/2016 10:36 AM, Jan Beulich wrote:
>>>>>> On 26.08.16 at 17:11, <joao.m.martins@oracle.com> wrote:
>>>> On 08/25/2016 11:06 AM, Jan Beulich wrote:
>>>>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>>>>>  - Change init_tsctimer to skip all the tests and assume it's called
>>>>>>  only under reliable TSC conditions with no warps observed. Tidy
>>>>>>  initialization in verify_tsc_reliability as suggested by Konrad.
>>>>>
>>>>> Which means that contrary to what you say in the cover letter
>>>>> there's nothing to prevent it from being used when CPU hotplug
>>>>> is possible.
>>>> Probably the cover letter wasn't completely clear on this. I mention that I split it
>>>> between Patch 2 and 5 (the intent was easier review), and you can see that I add the
>>>> check in patch 5, plus prevent onlining of CPUs too.
>>>>
>>>>> I don't think you should skip basic sanity checks, and
>>>>> place entirely on the admin the burden of knowing whether the
>>>>> option is safe to use.
>>>> Neither do I think it should be skipped. We mainly don't duplicate these checks. In
>>>> the end I am just running init_tsctimer, in the TSC_RELIABLE block and if no warps
>>>> are observed. So putting that in init_tsctimer would just duplicate the previously
>>>> done checks. Or would you still prefer it as done in the previous version, i.e. all checks in
>>>> both init_tsctimer and verify_tsc_reliability?
>>>
>>> I'm not sure they're needed in both places; what you need to make
>>> certain is that they're reliably gone through, and that this happens
>>> early enough.
>> They are reliably gone through and we get to avoid duplication of checks. Unless
>> there's a preference to re-add these checks in init_tsctimer, I'll keep these as is.
>> verify_tsc_reliability(...) needs to perform these checks, and init_tsctimer is only
>> called under these reliable TSC conditions.
> 
> But please make sure there's a comment in (or ahead of)
> init_tsctimer() pointing out where the apparently missing checks
> are. This will help both review and future readers.
Okay, I'll add a comment in init_tsctimer().
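
Something along these lines, as a sketch of the comment being asked for (the
wording is an assumption, the function body is the v3 one quoted below):

	/*
	 * Called exclusively from verify_tsc_reliability(), which has already
	 * established that the TSC is invariant/reliable and that no warps
	 * were observed, so the usual sanity checks are not repeated here.
	 */
	static s64 __init init_tsctimer(struct platform_timesource *pts)
	{
	    return pts->frequency;
	}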

>>>>>> @@ -1528,6 +1607,7 @@ void __init early_time_init(void)
>>>>>>  
>>>>>>      preinit_pit();
>>>>>>      tmp = init_platform_timer();
>>>>>> +    plt_tsc.frequency = tmp;
>>>>>
>>>>> How can this be the TSC frequency? init_platform_timer()
>>>>> specifically skips the TSC. And I can see no other place where
>>>>> plt_tsc.frequency gets initialized. Am I overlooking something?
>>>>>
>>>> That's the calibrated TSC frequency. Previously I was setting pts->frequency in
>>>> init_tsctimer through a temporary variable called tsc_freq, so I thought I could just
>>>> drop the variable and set plt_tsc directly. The difference from previous versions,
>>>> though, is that since commit 9334029 this value is returned from the platform time
>>>> source's init() and calibrated against the platform timer, instead of always against the PIT.
>>>
>>> This doesn't seem to answer my primary question: Where does
>>> plt_tsc.frequency get initialized now? try_platform_timer() and
>>> init_platform_timer() don't - PIT and ACPI timer use static
>>> initializers, and HPET gets taken care of in init_hpet(), and hence so
>>> should init_tsctimer() do (instead of just returning this apparently
>>> never initialized value). And that's unrelated to what ->init() returns.
>>
>> plt_tsc.frequency is certainly initialized in early_time_init(). And then in
>> try_platform_timer() we have plt_src = *pts (pts being a pointer to plt_tsc
>> when called from verify_tsc_reliability()).
>>
>> IOW, effectively I changed from this:
>>
>> #v2
>>
>> static u64 tsc_freq;
>>
>> static s64 __init init_tsctimer(struct platform_timesource *pts)
>> {
>>    ...
>>    pts->frequency = tsc_freq;
>>    return 1;
>> }
>>
>> ...
>>
>> void __init early_time_init(void)
>> {
>>    u64 tmp = init_pit_and_calibrate_tsc();
>>
>>    tsc_freq = tmp;
>> }
>>
>> *To:*
>>
>> #v3
>>
>> static s64 __init init_tsctimer(struct platform_timesource *pts)
>> {
>>     return pts->frequency;
>> }
>>
>>
>> void __init early_time_init(void)
>> {
>>     ...
>>     tmp = init_platform_timer();
>>     plt_tsc.frequency = tmp;
>> }
>>
>> Does this answer your question? Note that my purpose with the change was to remove
>> the tsc_freq temporary variable. If it makes things less clear (as in doing things
>> differently from other platform timers) I can go back to v2 in this aspect.
> 
> Ah, I see now how I got confused. This once again depends on the TSC
> possibly becoming the platform timer only much later than when
> early_time_init() runs.
>
True, but here we're only initializing the frequency; it's only used if the
reliable-TSC conditions are actually verified.
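
Roughly the intended flow, sketched from this discussion rather than copied
from the patch (the warp-tracking variable name is an assumption):

	/* early boot */
	void __init early_time_init(void)
	{
	    ...
	    tmp = init_platform_timer();   /* calibrated against the chosen platform timer */
	    plt_tsc.frequency = tmp;       /* recorded, but not used yet */
	    ...
	}

	/* much later, once APs are up and the warp test has run */
	static void __init verify_tsc_reliability(void)
	{
	    if ( boot_cpu_has(X86_FEATURE_TSC_RELIABLE) && !tsc_warps_observed )
	        /* only now can plt_tsc become the platform timer */
	        try_platform_timer(&plt_tsc);
	}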

Joao


* Re: [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT
  2016-08-30 12:45           ` Jan Beulich
@ 2016-08-30 14:14             ` Joao Martins
  0 siblings, 0 replies; 35+ messages in thread
From: Joao Martins @ 2016-08-30 14:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 08/30/2016 01:45 PM, Jan Beulich wrote:
>>>> On 30.08.16 at 14:26, <joao.m.martins@oracle.com> wrote:
>> On 08/29/2016 11:06 AM, Jan Beulich wrote:
>>>>>> On 26.08.16 at 17:44, <joao.m.martins@oracle.com> wrote:
>>>> On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>>>>>> On 24.08.16 at 14:43, <joao.m.martins@oracle.com> wrote:
>>>>>> This patch proposes relying on host TSC synchronization and
>>>>>> passthrough to the guest, when running on a TSC-safe platform. On
>>>>>> time_calibration we retrieve the platform time in ns and the counter
>>>>>> read by the clocksource that was used to compute system time. We
>>>>>> introduce a new rendezvous function which doesn't require
>>>>>> synchronization between master and slave CPUs and just reads the
>>>>>> calibration_rendezvous struct and writes the stime and stamp into
>>>>>> the cpu_calibration struct to be used later on. We can guarantee,
>>>>>> on a platform with a constant and reliable TSC, that the time read on
>>>>>> vCPU B right after vCPU A is greater, independently of the vCPU calibration
>>>>>> error. Since pvclock time infos are monotonic as seen by any vCPU, set
>>>>>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of the vDSO on Linux.
>>>>>> IIUC, this is similar to how it's implemented on KVM.
>>>>>
>>>>> Without any tools side change, how is it guaranteed that a guest
>>>>> which observed the stable bit won't get migrated to a host not
>>>>> providing that guarantee?
>>>> Do you want to prevent migration in such cases? The worst that can happen is that the
>>>> guest might need to fall back to a system call if this bit is 0, and it would keep doing
>>>> so while the bit stays 0.
>>>
>>> Whether migration needs preventing I'm not sure; all I was trying
>>> to indicate is that there seem to be pieces missing wrt migration.
>>> As to the guest falling back to a system call - are guest kernels and
>>> (as far as affected) applications required to cope with the flag
>>> changing from 1 to 0 behind their back?
>> It's expected they cope with this bit changing, AFAIK. The vdso code (i.e.
>> applications) always checks this bit on every read to decide whether to fall back to a
>> system call. The same goes for the pvclock code in the guest kernel, on every read, in
>> both Linux and FreeBSD, to decide whether to skip the monotonicity checks.
> 
> Okay, but please make sure this is called out at least in the commit
> message, if not in a code comment.
Got it.

>>>> Other than the things above I am not sure how to go about this :( Should we start
>>>> adjusting the TSCs if we find disparities or skew is observed in the long run? Or
>>>> allow only TSCs on vCPUs of the same package to expose this flag? Hmm, what's your
>>>> take on this? Appreciate your feedback.
>>>
>>> At least as an initial approach requiring affinities to be limited to a
>>> single socket would seem like a good compromise, provided HT
>>> aspects don't have a bad effect (in which case also excluding HT
>>> may be required). I'd also be fine with command line options
>>> allowing to further relax that, but a simple "clocksource=tsc"
>>> should imo result in a setup which from all we can tell will work as
>>> intended.
>> Sounds reasonable, so unless command line options are specified we disallow TSC as
>> clocksource on multi-socket systems. WRT command line options, how about extending
>> the "tsc" parameter to accept another possible value such as "global" or "socketsafe"?
>> Current values are "unstable" and "skewed".
> 
> What about "stable, "stable:socket" (and then perhaps also
> "stable:node")?
Hmm, much nicer. Let me add these two options, along with the docs update wrt
the tsc param. I'll probably do so in a separate patch in the series.


* Re: [PATCH v3 3/6] x86/time: streamline platform time init on plt_update()
  2016-08-30 12:31           ` Jan Beulich
@ 2016-09-09 16:32             ` Joao Martins
  2016-09-12  7:26               ` Jan Beulich
  0 siblings, 1 reply; 35+ messages in thread
From: Joao Martins @ 2016-09-09 16:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 08/30/2016 01:31 PM, Jan Beulich wrote:
>>>> On 30.08.16 at 14:10, <joao.m.martins@oracle.com> wrote:
>> What would be a preferred period - probably capping it to a day/hour/minutes? Or
>> keeping the current calculated value? The maximum right now for the current platform
>> timers is a few minutes, depending on the HPET frequency.
> 
> Ideally I would see you just use the calculated value, unless that
> causes problems elsewhere. What cap to use instead would depend on
> such possible problems.

While sending v4 out, I spotted some issues with platform timer overflow
handling when we get close to the u64 boundary. Specifically with clocksource=tsc,
when setting up plt_overflow, one would see at boot:

"Platform timer appears to have unexpectedly wrapped 10 or more times"

even though the counter doesn't actually wrap or stall at all.
But the current overflow checking in plt_overflow() goes roughly like this:

	count = plt_src.read_counter();
	plt_stamp64 += (count - plt_stamp) & plt_mask;

	now = NOW();
	plt_wrap = __read_platform_stime(plt_stamp64);
	plt_stamp = count;

	for ( i = 0; ...; i++ )
	{
	    plt_now = plt_wrap;
	    plt_wrap = __read_platform_stime(plt_stamp64 + plt_mask + 1);

@@	    if ( ABS(plt_wrap - now) > ABS(plt_now - now) )
	        break; /* didn't wrap around */

	    /* did wrap */
	    plt_stamp64 += plt_mask + 1;
	}

	if ( i != 0 )
	    printk("Platform timer appears to have unexpectedly wrapped ...");

Effectively the calculation in @@ doesn't handle the fact that, for
clocksource=tsc, "ABS(plt_wrap - now)" ends up equal to "ABS(plt_now - now)".
Not that it's correct for them to be equal, but the TSC is a 64-bit counter
and "plt_mask + 1" overflows the u64, turning the condition above into a
comparison of the same value. For <= 32-bit platform timers the current math
works fine, but at the 64-bit boundary it breaks down. And then handling the
wraparound further below with "plt_stamp64 += plt_mask + 1" is effectively
adding zero, which amounts to assuming that plt_stamp64/plt_stamp alone are
enough to avoid representing any discontinuity.
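
A tiny standalone illustration of the degenerate comparison described above
(plt_mask standing in for a full 64-bit counter mask):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
	    uint64_t plt_mask = ~0ULL;                 /* 64-bit counter */
	    uint64_t plt_stamp64 = 0x0123456789abcdefULL;

	    /* plt_mask + 1 wraps to 0, so both candidate stamps are equal and
	       the ABS() comparison in @@ can never detect a wrap. */
	    printf("%d\n", plt_stamp64 + plt_mask + 1 == plt_stamp64);   /* 1 */

	    /* With a narrower mask (e.g. 63 bits) the stamps differ as intended. */
	    plt_mask = (1ULL << 63) - 1;
	    printf("%d\n", plt_stamp64 + plt_mask + 1 == plt_stamp64);   /* 0 */

	    return 0;
	}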

I am not sure we shouldn't handle it this way, at least for clocksource=tsc: we
only see issues when the combined stamp value overflows a u64 (computed as:
plt_stamp64 + (read_counter() - plt_stamp)). If we keep those two stamps updated
more often, we won't see issues. When the wraparound happens (many, many years
away) we could not only update plt_stamp64 and plt_stamp, but also advance
stime_platform_stamp and platform_timer_stamp. This lets us compensate when the
stamps wrap around, since for stime_platform_stamp (in ns) that would only
happen after STIME_MAX. Or, as a simpler alternative, keep this patch and
additionally update plt_stamp64 (zeroing it out) plus plt_stamp in
platform_time_calibration().

Would that sound reasonable - am I overlooking something? To some extent this
might also be applicable to the general case, although the platform timer is now
only used for initial seeding, so it's probably not a visible issue.

Joao


* Re: [PATCH v3 3/6] x86/time: streamline platform time init on plt_update()
  2016-09-09 16:32             ` Joao Martins
@ 2016-09-12  7:26               ` Jan Beulich
  2016-09-12 10:35                 ` Joao Martins
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2016-09-12  7:26 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, xen-devel

>>> On 09.09.16 at 18:32, <joao.m.martins@oracle.com> wrote:
> Would that sound reasonable - am I overlooking something? To some extent this
> might also be applicable to the general case, although the platform timer is now
> only used for initial seeding, so it's probably not a visible issue.

Wouldn't it already help to simply make TSC a 62- or 63-bit counter,
masking off the high bit(s) during reads?

Jan



* Re: [PATCH v3 3/6] x86/time: streamline platform time init on plt_update()
  2016-09-12  7:26               ` Jan Beulich
@ 2016-09-12 10:35                 ` Joao Martins
  0 siblings, 0 replies; 35+ messages in thread
From: Joao Martins @ 2016-09-12 10:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 09/12/2016 08:26 AM, Jan Beulich wrote:
>>>> On 09.09.16 at 18:32, <joao.m.martins@oracle.com> wrote:
>> Would that sound reasonable - am I overlooking something? To some extent this
>> might also be applicable to the general case, although the platform timer is now
>> only used for initial seeding, so it's probably not a visible issue.
> 
> Wouldn't it already help to simply make TSC a 62- or 63-bit counter,
> masking off the high bit(s) during reads?

Yeah - that would indeed be the simplest solution, since we already mask the
difference between stamps anyway. I tested it and changed counter_bits to 63
(amended in patch 2).
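
A minimal sketch of that amendment (the rdtsc_ordered() helper name and the
initializer layout are assumptions, not quotes from patch 2):

	static u64 read_tsc(void)
	{
	    return rdtsc_ordered();
	}

	/* Treat the TSC as a 63-bit counter: counter deltas are masked with
	 * plt_mask, so dropping one bit is enough to keep "plt_mask + 1" from
	 * overflowing a u64 in the wrap detection above. */
	static struct platform_timesource __initdata plt_tsc =
	{
	    .id = "tsc",
	    .name = "TSC",
	    .read_counter = read_tsc,
	    .counter_bits = 63,
	    .init = init_tsctimer,
	};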

Joao


Thread overview: 35+ messages
2016-08-24 12:43 [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
2016-08-24 12:43 ` [PATCH v3 1/6] x86/time: refactor init_platform_time() Joao Martins
2016-08-25 10:03   ` Jan Beulich
2016-08-26 14:54     ` Joao Martins
2016-08-24 12:43 ` [PATCH v3 2/6] x86/time: implement tsc as clocksource Joao Martins
2016-08-25 10:06   ` Jan Beulich
2016-08-26 15:11     ` Joao Martins
2016-08-29  9:36       ` Jan Beulich
2016-08-30 12:08         ` Joao Martins
2016-08-30 12:30           ` Jan Beulich
2016-08-30 13:59             ` Joao Martins
2016-08-24 12:43 ` [PATCH v3 3/6] x86/time: streamline platform time init on plt_update() Joao Martins
2016-08-25 10:13   ` Jan Beulich
2016-08-26 15:12     ` Joao Martins
2016-08-29  9:41       ` Jan Beulich
2016-08-30 12:10         ` Joao Martins
2016-08-30 12:31           ` Jan Beulich
2016-09-09 16:32             ` Joao Martins
2016-09-12  7:26               ` Jan Beulich
2016-09-12 10:35                 ` Joao Martins
2016-08-24 12:43 ` [PATCH v3 4/6] x86/time: refactor read_platform_stime() Joao Martins
2016-08-25 10:17   ` Jan Beulich
2016-08-26 15:13     ` Joao Martins
2016-08-29  9:42       ` Jan Beulich
2016-08-24 12:43 ` [PATCH v3 5/6] x86/time: implement PVCLOCK_TSC_STABLE_BIT Joao Martins
2016-08-25 10:37   ` Jan Beulich
2016-08-26 15:44     ` Joao Martins
2016-08-29 10:06       ` Jan Beulich
2016-08-30 12:26         ` Joao Martins
2016-08-30 12:45           ` Jan Beulich
2016-08-30 14:14             ` Joao Martins
2016-08-24 12:43 ` [PATCH v3 6/6] docs: update clocksource option Joao Martins
2016-08-25 10:38   ` Jan Beulich
2016-08-26 15:13     ` Joao Martins
2016-08-24 12:50 ` [PATCH v3 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support Joao Martins
