* [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more)
@ 2016-06-15  9:50 Jan Beulich
  2016-06-15 10:26 ` [PATCH 1/8] " Jan Beulich
                   ` (8 more replies)
  0 siblings, 9 replies; 39+ messages in thread
From: Jan Beulich @ 2016-06-15  9:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Joao Martins

1: x86/time: adjust local system time initialization
2: x86: also generate assembler usable equates for synthesized features
3: x86/time: introduce and use rdtsc_ordered()
4: x86/time: calibrate TSC against platform timer
5: x86/time: correctly honor late clearing of TSC related feature flags
6: x86/time: support 32-bit wide ACPI PM timer
7: x86/time: fold recurring code
8: x86/time: group time stamps into a structure

The first patch was submitted on its own before; I'd like it to be
looked at in the context of this entire series though, and hence I'm
including it here again.

My expectations towards the improvements of what is now patch 4
didn't get fulfilled (it brought the skew down only by a factor of
two in my measurements), but, to "level" that, the improvements by
patch 3 were significantly larger than expected (hence the two
got switched in order, albeit I can't say whether the improvements
by 3 would also be as prominent without patch 4, as I simply never
measured 3 alone): I've observed the TSC value read without the
ordering fence to be up to several hundred clocks off compared to
the fenced variant.

I'd like to note though that even with the full series in place I'm
still seeing some skew, and measurement over about 24h suggests
that what is left now is actually increasing over time (which
obviously is bad). What looks particularly odd to me is that over
longer periods of time the maximum skew between hyperthreads
stays relatively stable on a certain core, but eventually another
core may "overtake". So far I don't have a good idea of what
might be causing this (short of suspecting constant TSC not really
being all that constant).
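
For context, the kind of check that makes such skew visible looks
roughly like this (an illustrative userspace-style sketch, not code
from this series; now_ns() is a hypothetical stand-in for NOW(), and
pinning the two functions to different CPUs is assumed):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    extern uint64_t now_ns(void);        /* hypothetical NOW() stand-in */

    static _Atomic uint64_t last_stamp;  /* shared between the two CPUs */

    void publisher(void)                 /* runs pinned to CPU A */
    {
        for ( ;; )
            atomic_store(&last_stamp, now_ns());
    }

    void checker(void)                   /* runs pinned to CPU B */
    {
        for ( ;; )
        {
            uint64_t seen = atomic_load(&last_stamp);
            uint64_t mine = now_ns();
            /* "seen" was published strictly before "mine" was read, so
               a globally monotonic clock requires mine >= seen. */
            if ( mine < seen )
                printf("cross-CPU skew: %llu ns\n",
                       (unsigned long long)(seen - mine));
        }
    }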

Another observed oddity is that with this debugging code

@@ -1259,6 +1327,14 @@ time_calibration_rendezvous_tail(const s
     c->local_tsc    = rdtsc_ordered();
     c->local_stime  = get_s_time_fixed(c->local_tsc);
     c->master_stime = r->master_stime;
+{//temp
+ s64 delta = rdtsc_ordered() - c->local_tsc;
+ static s64 delta_max = 100;
+ if(delta > delta_max) {
+  delta_max = delta;
+  printk("CPU%02u: delta=%ld\n", smp_processor_id(), delta);
+ }
+}
 
     raise_softirq(TIME_CALIBRATE_SOFTIRQ);
 }

added on top of the series, deltas beyond 20,000 get logged.
Afaict (I admit I didn't check) interrupts are off here, so I can't
see where such spikes would come from. Is a watchdog NMI a
possible explanation for this? If so, what other bad effects on
time calculations might result from such events? And how would
one explain spikes of both around 13,000 and around 22,000
(I'd expect individual watchdog NMIs to always take about the
same time)?

Signed-off-by: Jan Beulich <jbeulich@suse.com>



* [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-15  9:50 [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
@ 2016-06-15 10:26 ` Jan Beulich
  2016-06-15 10:32   ` Jan Beulich
                     ` (2 more replies)
  2016-06-15 10:26 ` [PATCH 2/8] x86: also generate assembler usable equates for synthesized features Jan Beulich
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 39+ messages in thread
From: Jan Beulich @ 2016-06-15 10:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Joao Martins


Using the bare return value from read_platform_stime() is not suitable
when local_time_calibration() is going to use its fast path: When
platform and local time don't stay in close sync, NOW() return values
on different CPUs can diverge by several dozen microseconds.

Latch local and platform time on the CPU initiating AP bringup, such
that the AP can use these values to seed its stime_local_stamp with as
small an error as possible. The boot CPU, otoh, can simply calculate
the correct initial value (other CPUs could do so too with even
greater accuracy than the approach being introduced, but that can
work only if all CPUs' TSCs start ticking at the same time, which
generally can't be assumed to be the case on multi-socket systems).

This slightly defers init_percpu_time() (moved ahead by commit
dd2658f966 ["x86/time: initialise time earlier during
start_secondary()"]) in order to reduce as much as possible the gap
between populating the stamps and consuming them.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
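
To illustrate the seeding arithmetic with made-up numbers (an editor's
sketch, not part of the patch):

    /* Illustration only -- all values invented. Suppose the latching
       CPU saw its local time running 50ns ahead of platform time: */
    s_time_t local  = 1000000;   /* ap_bringup_ref.local_stime  */
    s_time_t master =  999950;   /* ap_bringup_ref.master_stime */
    s_time_t now    = 1005000;   /* the AP's read_platform_stime() */

    now += local - master;       /* now = 1005050: the platform reading
                                    re-expressed on the local time scale,
                                    so local_time_calibration()'s fast
                                    path sees no discontinuity */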

--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -328,12 +328,12 @@ void start_secondary(void *unused)
 
     percpu_traps_init();
 
-    init_percpu_time();
-
     cpu_init();
 
     smp_callin();
 
+    init_percpu_time();
+
     setup_secondary_APIC_clock();
 
     /*
@@ -996,6 +996,8 @@ int __cpu_up(unsigned int cpu)
     if ( (ret = do_boot_cpu(apicid, cpu)) != 0 )
         return ret;
 
+    time_latch_stamps();
+
     set_cpu_state(CPU_STATE_ONLINE);
     while ( !cpu_online(cpu) )
     {
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1328,21 +1328,51 @@ static void time_calibration(void *unuse
                      &r, 1);
 }
 
+static struct {
+    s_time_t local_stime, master_stime;
+} ap_bringup_ref;
+
+void time_latch_stamps(void) {
+    unsigned long flags;
+    u64 tsc;
+
+    local_irq_save(flags);
+    ap_bringup_ref.master_stime = read_platform_stime();
+    tsc = rdtsc();
+    local_irq_restore(flags);
+
+    ap_bringup_ref.local_stime = get_s_time_fixed(tsc);
+}
+
 void init_percpu_time(void)
 {
     struct cpu_time *t = &this_cpu(cpu_time);
     unsigned long flags;
+    u64 tsc;
     s_time_t now;
 
     /* Initial estimate for TSC rate. */
     t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
 
     local_irq_save(flags);
-    t->local_tsc_stamp = rdtsc();
     now = read_platform_stime();
+    tsc = rdtsc();
     local_irq_restore(flags);
 
     t->stime_master_stamp = now;
+    /*
+     * To avoid a discontinuity (TSC and platform clock can't be expected
+     * to be in perfect sync), initialization here needs to match up with
+     * local_time_calibration()'s decision whether to use its fast path.
+     */
+    if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
+    {
+        if ( system_state < SYS_STATE_smp_boot )
+            now = get_s_time_fixed(tsc);
+        else
+            now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
+    }
+    t->local_tsc_stamp    = tsc;
     t->stime_local_stamp  = now;
 }
 
--- a/xen/include/asm-x86/time.h
+++ b/xen/include/asm-x86/time.h
@@ -40,6 +40,7 @@ int time_suspend(void);
 int time_resume(void);
 
 void init_percpu_time(void);
+void time_latch_stamps(void);
 
 struct ioreq;
 int hwdom_pit_access(struct ioreq *ioreq);





* [PATCH 2/8] x86: also generate assembler usable equates for synthesized features
  2016-06-15  9:50 [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
  2016-06-15 10:26 ` [PATCH 1/8] " Jan Beulich
@ 2016-06-15 10:26 ` Jan Beulich
  2016-06-20 12:50   ` Andrew Cooper
  2016-06-15 10:27 ` [PATCH 3/8] x86/time: introduce and use rdtsc_ordered() Jan Beulich
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-15 10:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Joao Martins


... to make it possible to base alternative instruction patching upon
such.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
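
The pattern being set up here, reduced to its essentials (a standalone
sketch, not the actual Xen headers): the feature list is kept as
XEN_CPUFEATURE(name, value) invocations, and each consumer defines the
macro before including the list. The switch from #value to
__stringify(value) in the hunk below matters because the synthesized
values are macro expressions like (FSCAPINTS+0)*32+0, which plain
stringization would emit unexpanded.

    /* features.h -- the single list, shared by C and assembly */
    XEN_CPUFEATURE(CONSTANT_TSC, (FSCAPINTS+0)*32+ 0)
    XEN_CPUFEATURE(NONSTOP_TSC,  (FSCAPINTS+0)*32+ 1)

    /* C consumer: one enumerator per feature */
    #define FSCAPINTS 10 /* placeholder; really FEATURESET_NR_ENTRIES */
    #define XEN_CPUFEATURE(name, value) X86_FEATURE_##name = value,
    enum {
    #include "features.h"
    };
    #undef XEN_CPUFEATURE

    /* assembly consumer: one .equ equate per feature */
    #define XEN_CPUFEATURE(name, value) .equ X86_FEATURE_##name, value
    #include "features.h"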

--- a/xen/arch/x86/sysctl.c
+++ b/xen/arch/x86/sysctl.c
@@ -219,7 +219,8 @@ long arch_do_sysctl(
         }
 
         /* Clip the number of entries. */
-        nr = min(sysctl->u.cpu_featureset.nr_features, FSCAPINTS);
+        nr = min_t(unsigned int, sysctl->u.cpu_featureset.nr_features,
+                   FSCAPINTS);
 
         /* Look up requested featureset. */
         if ( sysctl->u.cpu_featureset.index < ARRAY_SIZE(featureset_table) )
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -3,8 +3,23 @@
  *
  * Defines x86 CPU feature bits
  */
+#if defined(XEN_CPUFEATURE)
 
-#ifndef __ASM_I386_CPUFEATURE_H
+/* Other features, Xen-defined mapping. */
+/* This range is used for feature bits which conflict or are synthesized */
+XEN_CPUFEATURE(CONSTANT_TSC,    (FSCAPINTS+0)*32+ 0) /* TSC ticks at a constant rate */
+XEN_CPUFEATURE(NONSTOP_TSC,     (FSCAPINTS+0)*32+ 1) /* TSC does not stop in C states */
+XEN_CPUFEATURE(ARAT,            (FSCAPINTS+0)*32+ 2) /* Always running APIC timer */
+XEN_CPUFEATURE(ARCH_PERFMON,    (FSCAPINTS+0)*32+ 3) /* Intel Architectural PerfMon */
+XEN_CPUFEATURE(TSC_RELIABLE,    (FSCAPINTS+0)*32+ 4) /* TSC is known to be reliable */
+XEN_CPUFEATURE(XTOPOLOGY,       (FSCAPINTS+0)*32+ 5) /* cpu topology enum extensions */
+XEN_CPUFEATURE(CPUID_FAULTING,  (FSCAPINTS+0)*32+ 6) /* cpuid faulting */
+XEN_CPUFEATURE(CLFLUSH_MONITOR, (FSCAPINTS+0)*32+ 7) /* clflush reqd with monitor */
+XEN_CPUFEATURE(APERFMPERF,      (FSCAPINTS+0)*32+ 8) /* APERFMPERF */
+
+#define NCAPINTS (FSCAPINTS + 1) /* N 32-bit words worth of info */
+
+#elif !defined(__ASM_I386_CPUFEATURE_H)
 #ifndef X86_FEATURES_ONLY
 #define __ASM_I386_CPUFEATURE_H
 #endif
@@ -12,20 +28,6 @@
 #include <xen/const.h>
 #include <asm/cpuid.h>
 
-#define NCAPINTS (FSCAPINTS + 1) /* N 32-bit words worth of info */
-
-/* Other features, Xen-defined mapping. */
-/* This range is used for feature bits which conflict or are synthesized */
-#define X86_FEATURE_CONSTANT_TSC	((FSCAPINTS+0)*32+ 0) /* TSC ticks at a constant rate */
-#define X86_FEATURE_NONSTOP_TSC		((FSCAPINTS+0)*32+ 1) /* TSC does not stop in C states */
-#define X86_FEATURE_ARAT		((FSCAPINTS+0)*32+ 2) /* Always running APIC timer */
-#define X86_FEATURE_ARCH_PERFMON	((FSCAPINTS+0)*32+ 3) /* Intel Architectural PerfMon */
-#define X86_FEATURE_TSC_RELIABLE	((FSCAPINTS+0)*32+ 4) /* TSC is known to be reliable */
-#define X86_FEATURE_XTOPOLOGY		((FSCAPINTS+0)*32+ 5) /* cpu topology enum extensions */
-#define X86_FEATURE_CPUID_FAULTING	((FSCAPINTS+0)*32+ 6) /* cpuid faulting */
-#define X86_FEATURE_CLFLUSH_MONITOR	((FSCAPINTS+0)*32+ 7) /* clflush reqd with monitor */
-#define X86_FEATURE_APERFMPERF		((FSCAPINTS+0)*32+ 8) /* APERFMPERF */
-
 #define cpufeat_word(idx)	((idx) / 32)
 #define cpufeat_bit(idx)	((idx) % 32)
 #define cpufeat_mask(idx)	(_AC(1, U) << cpufeat_bit(idx))
--- a/xen/include/asm-x86/cpufeatureset.h
+++ b/xen/include/asm-x86/cpufeatureset.h
@@ -3,19 +3,25 @@
 
 #ifndef __ASSEMBLY__
 
+#include <xen/stringify.h>
+
 #define XEN_CPUFEATURE(name, value) X86_FEATURE_##name = value,
 enum {
 #include <public/arch-x86/cpufeatureset.h>
+#include <asm/cpufeature.h>
 };
 #undef XEN_CPUFEATURE
 
-#define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", " #value);
+#define XEN_CPUFEATURE(name, value) asm (".equ X86_FEATURE_" #name ", " \
+                                         __stringify(value));
 #include <public/arch-x86/cpufeatureset.h>
+#include <asm/cpufeature.h>
 
 #else /* !__ASSEMBLY__ */
 
 #define XEN_CPUFEATURE(name, value) .equ X86_FEATURE_##name, value
 #include <public/arch-x86/cpufeatureset.h>
+#include <asm/cpufeature.h>
 
 #endif /* __ASSEMBLY__ */
 
--- a/xen/include/asm-x86/cpuid.h
+++ b/xen/include/asm-x86/cpuid.h
@@ -1,12 +1,13 @@
 #ifndef __X86_CPUID_H__
 #define __X86_CPUID_H__
 
-#include <asm/cpufeatureset.h>
 #include <asm/cpuid-autogen.h>
-#include <asm/percpu.h>
 
 #define FSCAPINTS FEATURESET_NR_ENTRIES
 
+#include <asm/cpufeatureset.h>
+#include <asm/percpu.h>
+
 #define FEATURESET_1d     0 /* 0x00000001.edx      */
 #define FEATURESET_1c     1 /* 0x00000001.ecx      */
 #define FEATURESET_e1d    2 /* 0x80000001.edx      */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -291,7 +291,7 @@ def write_results(state):
 
     state.output.write(
 """
-#define FEATURESET_NR_ENTRIES %sU
+#define FEATURESET_NR_ENTRIES %s
 
 #define CPUID_COMMON_1D_FEATURES %s
 




* [PATCH 3/8] x86/time: introduce and use rdtsc_ordered()
  2016-06-15  9:50 [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
  2016-06-15 10:26 ` [PATCH 1/8] " Jan Beulich
  2016-06-15 10:26 ` [PATCH 2/8] x86: also generate assembler usable equates for synthesized features Jan Beulich
@ 2016-06-15 10:27 ` Jan Beulich
  2016-06-20 12:59   ` Andrew Cooper
  2016-06-15 10:28 ` [PATCH 4/8] x86/time: calibrate TSC against platform timer Jan Beulich
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-15 10:27 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Joao Martins


Matching Linux commit 03b9730b76 ("x86/asm/tsc: Add rdtsc_ordered() and
use it in trivial call sites") and the earlier ones it builds upon,
let's make sure timing loops don't have their rdtsc()-s re-ordered, as
that would harm the precision of the result (values were observed to be
several hundred clocks off without this adjustment).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
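
For reference, a minimal userspace approximation of the construct (a
sketch, not the Xen implementation: it hard-codes LFENCE rather than
patching in MFENCE on AMD via alternative() as the msr.h hunk below
does):

    #include <x86intrin.h>   /* _mm_lfence(), __rdtsc() */

    static inline unsigned long long rdtsc_ordered_sketch(void)
    {
        /* Keep the TSC read from being speculated ahead of
           preceding loads. */
        _mm_lfence();
        return __rdtsc();
    }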

--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -1137,7 +1137,7 @@ static int __init calibrate_APIC_clock(v
     /*
      * We wrapped around just now. Let's start:
      */
-    t1 = rdtsc();
+    t1 = rdtsc_ordered();
     tt1 = apic_read(APIC_TMCCT);
 
     /*
@@ -1147,7 +1147,7 @@ static int __init calibrate_APIC_clock(v
         wait_8254_wraparound();
 
     tt2 = apic_read(APIC_TMCCT);
-    t2 = rdtsc();
+    t2 = rdtsc_ordered();
 
     /*
      * The APIC bus clock counter is 32 bits only, it
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -541,6 +541,9 @@ static void init_amd(struct cpuinfo_x86
 			wrmsr_amd_safe(0xc001100d, l, h & ~1);
 	}
 
+	/* MFENCE stops RDTSC speculation */
+	__set_bit(X86_FEATURE_MFENCE_RDTSC, c->x86_capability);
+
 	switch(c->x86)
 	{
 	case 0xf ... 0x17:
--- a/xen/arch/x86/delay.c
+++ b/xen/arch/x86/delay.c
@@ -21,10 +21,10 @@ void __udelay(unsigned long usecs)
     unsigned long ticks = usecs * (cpu_khz / 1000);
     unsigned long s, e;
 
-    s = rdtsc();
+    s = rdtsc_ordered();
     do
     {
         rep_nop();
-        e = rdtsc();
+        e = rdtsc_ordered();
     } while ((e-s) < ticks);
 }
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -123,7 +123,7 @@ static void synchronize_tsc_master(unsig
 
     for ( i = 1; i <= 5; i++ )
     {
-        tsc_value = rdtsc();
+        tsc_value = rdtsc_ordered();
         wmb();
         atomic_inc(&tsc_count);
         while ( atomic_read(&tsc_count) != (i<<1) )
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -257,10 +257,10 @@ static u64 init_pit_and_calibrate_tsc(vo
     outb(CALIBRATE_LATCH & 0xff, PIT_CH2); /* LSB of count */
     outb(CALIBRATE_LATCH >> 8, PIT_CH2);   /* MSB of count */
 
-    start = rdtsc();
+    start = rdtsc_ordered();
     for ( count = 0; (inb(0x61) & 0x20) == 0; count++ )
         continue;
-    end = rdtsc();
+    end = rdtsc_ordered();
 
     /* Error if the CTC doesn't behave itself. */
     if ( count == 0 )
@@ -760,7 +760,7 @@ s_time_t get_s_time_fixed(u64 at_tsc)
     if ( at_tsc )
         tsc = at_tsc;
     else
-        tsc = rdtsc();
+        tsc = rdtsc_ordered();
     delta = tsc - t->local_tsc_stamp;
     now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
 
@@ -933,7 +933,7 @@ int cpu_frequency_change(u64 freq)
     /* TSC-extrapolated time may be bogus after frequency change. */
     /*t->stime_local_stamp = get_s_time();*/
     t->stime_local_stamp = t->stime_master_stamp;
-    curr_tsc = rdtsc();
+    curr_tsc = rdtsc_ordered();
     t->local_tsc_stamp = curr_tsc;
     set_time_scale(&t->tsc_scale, freq);
     local_irq_enable();
@@ -1248,7 +1248,7 @@ static void time_calibration_tsc_rendezv
             if ( r->master_stime == 0 )
             {
                 r->master_stime = read_platform_stime();
-                r->master_tsc_stamp = rdtsc();
+                r->master_tsc_stamp = rdtsc_ordered();
             }
             atomic_inc(&r->semaphore);
 
@@ -1274,7 +1274,7 @@ static void time_calibration_tsc_rendezv
         }
     }
 
-    c->local_tsc_stamp = rdtsc();
+    c->local_tsc_stamp = rdtsc_ordered();
     c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
     c->stime_master_stamp = r->master_stime;
 
@@ -1304,7 +1304,7 @@ static void time_calibration_std_rendezv
         mb(); /* receive signal /then/ read r->master_stime */
     }
 
-    c->local_tsc_stamp = rdtsc();
+    c->local_tsc_stamp = rdtsc_ordered();
     c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
     c->stime_master_stamp = r->master_stime;
 
@@ -1338,7 +1338,7 @@ void time_latch_stamps(void) {
 
     local_irq_save(flags);
     ap_bringup_ref.master_stime = read_platform_stime();
-    tsc = rdtsc();
+    tsc = rdtsc_ordered();
     local_irq_restore(flags);
 
     ap_bringup_ref.local_stime = get_s_time_fixed(tsc);
@@ -1356,7 +1356,7 @@ void init_percpu_time(void)
 
     local_irq_save(flags);
     now = read_platform_stime();
-    tsc = rdtsc();
+    tsc = rdtsc_ordered();
     local_irq_restore(flags);
 
     t->stime_master_stamp = now;
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -16,6 +16,7 @@ XEN_CPUFEATURE(XTOPOLOGY,       (FSCAPIN
 XEN_CPUFEATURE(CPUID_FAULTING,  (FSCAPINTS+0)*32+ 6) /* cpuid faulting */
 XEN_CPUFEATURE(CLFLUSH_MONITOR, (FSCAPINTS+0)*32+ 7) /* clflush reqd with monitor */
 XEN_CPUFEATURE(APERFMPERF,      (FSCAPINTS+0)*32+ 8) /* APERFMPERF */
+XEN_CPUFEATURE(MFENCE_RDTSC,    (FSCAPINTS+0)*32+ 9) /* MFENCE synchronizes RDTSC */
 
 #define NCAPINTS (FSCAPINTS + 1) /* N 32-bit words worth of info */
 
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -80,6 +80,22 @@ static inline uint64_t rdtsc(void)
     return ((uint64_t)high << 32) | low;
 }
 
+static inline uint64_t rdtsc_ordered(void)
+{
+	/*
+	 * The RDTSC instruction is not ordered relative to memory access.
+	 * The Intel SDM and the AMD APM are both vague on this point, but
+	 * empirically an RDTSC instruction can be speculatively executed
+	 * before prior loads.  An RDTSC immediately after an appropriate
+	 * barrier appears to be ordered as a normal load, that is, it
+	 * provides the same ordering guarantees as reading from a global
+	 * memory location that some other imaginary CPU is updating
+	 * continuously with a time stamp.
+	 */
+	alternative("lfence", "mfence", X86_FEATURE_MFENCE_RDTSC);
+	return rdtsc();
+}
+
 #define __write_tsc(val) wrmsrl(MSR_IA32_TSC, val)
 #define write_tsc(val) ({                                       \
     /* Reliable TSCs are in lockstep across all CPUs. We should \




* [PATCH 4/8] x86/time: calibrate TSC against platform timer
  2016-06-15  9:50 [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
                   ` (2 preceding siblings ...)
  2016-06-15 10:27 ` [PATCH 3/8] x86/time: introduce and use rdtsc_ordered() Jan Beulich
@ 2016-06-15 10:28 ` Jan Beulich
  2016-06-20 14:20   ` Andrew Cooper
  2016-06-15 10:28 ` [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags Jan Beulich
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-15 10:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Joao Martins


... instead of unconditionally against the PIT. This allows local
and master system times to remain in better sync (which matters even
though, on any modern system, the master time is really used only
during secondary CPU bringup, as the error between the two is in fact
noticeable in cross-CPU NOW() invocation monotonicity).

This involves moving the init_platform_timer() invocation into
early_time_init(), splitting out the few things which really need to be
done in init_xen_time().

In the course of this re-ordering also set the timer channel 2 gate low
after having finished calibration. This should be benign to overall
system operation, but appears to be the cleaner state.

Also do away with the open-coded 8254 register manipulation in the
8259 code.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
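
All three timer init routines below share one calibration shape;
condensed (an editor's sketch mirroring the HPET variant, with wrap
handling simplified to a full 32-bit counter):

    /* Count TSC cycles while the platform timer advances by
       1/CALIBRATE_FRAC of a second, then scale up to cycles/second. */
    static uint64_t calibrate_tsc_against(uint32_t (*read_counter)(void),
                                          uint64_t freq)
    {
        uint32_t count = read_counter();
        uint64_t start = rdtsc_ordered();
        uint32_t target = count + CALIBRATE_VALUE(freq);

        if ( target < count )                 /* target wrapped, ...   */
            while ( read_counter() >= count ) /* ... wait out the wrap */
                continue;
        while ( read_counter() < target )
            continue;

        return (rdtsc_ordered() - start) * CALIBRATE_FRAC;
    }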

--- a/xen/arch/x86/i8259.c
+++ b/xen/arch/x86/i8259.c
@@ -359,12 +359,7 @@ void __init init_IRQ(void)
 
     apic_intr_init();
 
-    /* Set the clock to HZ Hz */
-#define CLOCK_TICK_RATE 1193182 /* crystal freq (Hz) */
-#define LATCH (((CLOCK_TICK_RATE)+(HZ/2))/HZ)
-    outb_p(0x34, PIT_MODE);        /* binary, mode 2, LSB/MSB, ch 0 */
-    outb_p(LATCH & 0xff, PIT_CH0); /* LSB */
-    outb(LATCH >> 8, PIT_CH0);     /* MSB */
+    preinit_pit();
 
     setup_irq(2, 0, &cascade);
 }
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -59,7 +59,7 @@ struct platform_timesource {
     char *name;
     u64 frequency;
     u64 (*read_counter)(void);
-    int (*init)(struct platform_timesource *);
+    s64 (*init)(struct platform_timesource *);
     void (*resume)(struct platform_timesource *);
     int counter_bits;
 };
@@ -224,49 +224,18 @@ static struct irqaction __read_mostly ir
     timer_interrupt, "timer", NULL
 };
 
-/* ------ Calibrate the TSC ------- 
- * Return processor ticks per second / CALIBRATE_FRAC.
- */
-
 #define CLOCK_TICK_RATE 1193182 /* system crystal frequency (Hz) */
 #define CALIBRATE_FRAC  20      /* calibrate over 50ms */
-#define CALIBRATE_LATCH ((CLOCK_TICK_RATE+(CALIBRATE_FRAC/2))/CALIBRATE_FRAC)
+#define CALIBRATE_VALUE(freq) (((freq) + CALIBRATE_FRAC / 2) / CALIBRATE_FRAC)
 
-static u64 init_pit_and_calibrate_tsc(void)
+void preinit_pit(void)
 {
-    u64 start, end;
-    unsigned long count;
-
     /* Set PIT channel 0 to HZ Hz. */
 #define LATCH (((CLOCK_TICK_RATE)+(HZ/2))/HZ)
     outb_p(0x34, PIT_MODE);        /* binary, mode 2, LSB/MSB, ch 0 */
     outb_p(LATCH & 0xff, PIT_CH0); /* LSB */
     outb(LATCH >> 8, PIT_CH0);     /* MSB */
-
-    /* Set the Gate high, disable speaker */
-    outb((inb(0x61) & ~0x02) | 0x01, 0x61);
-
-    /*
-     * Now let's take care of CTC channel 2
-     *
-     * Set the Gate high, program CTC channel 2 for mode 0, (interrupt on
-     * terminal count mode), binary count, load 5 * LATCH count, (LSB and MSB)
-     * to begin countdown.
-     */
-    outb(0xb0, PIT_MODE);           /* binary, mode 0, LSB/MSB, Ch 2 */
-    outb(CALIBRATE_LATCH & 0xff, PIT_CH2); /* LSB of count */
-    outb(CALIBRATE_LATCH >> 8, PIT_CH2);   /* MSB of count */
-
-    start = rdtsc_ordered();
-    for ( count = 0; (inb(0x61) & 0x20) == 0; count++ )
-        continue;
-    end = rdtsc_ordered();
-
-    /* Error if the CTC doesn't behave itself. */
-    if ( count == 0 )
-        return 0;
-
-    return ((end - start) * (u64)CALIBRATE_FRAC);
+#undef LATCH
 }
 
 void set_time_scale(struct time_scale *ts, u64 ticks_per_sec)
@@ -327,10 +296,49 @@ static u64 read_pit_count(void)
     return count32;
 }
 
-static int __init init_pit(struct platform_timesource *pts)
+static s64 __init init_pit(struct platform_timesource *pts)
 {
+    u8 portb = inb(0x61);
+    u64 start, end;
+    unsigned long count;
+
     using_pit = 1;
-    return 1;
+
+    /* Set the Gate high, disable speaker. */
+    outb((portb & ~0x02) | 0x01, 0x61);
+
+    /*
+     * Now let's take care of CTC channel 2: mode 0, (interrupt on
+     * terminal count mode), binary count, load CALIBRATE_LATCH count,
+     * (LSB and MSB) to begin countdown.
+     */
+#define CALIBRATE_LATCH CALIBRATE_VALUE(CLOCK_TICK_RATE)
+    outb(0xb0, PIT_MODE);                  /* binary, mode 0, LSB/MSB, Ch 2 */
+    outb(CALIBRATE_LATCH & 0xff, PIT_CH2); /* LSB of count */
+    outb(CALIBRATE_LATCH >> 8, PIT_CH2);   /* MSB of count */
+#undef CALIBRATE_LATCH
+
+    start = rdtsc_ordered();
+    for ( count = 0; !(inb(0x61) & 0x20); ++count )
+        continue;
+    end = rdtsc_ordered();
+
+    /* Set the Gate low, disable speaker. */
+    outb(portb & ~0x03, 0x61);
+
+    /* Error if the CTC doesn't behave itself. */
+    if ( count == 0 )
+        return 0;
+
+    return (end - start) * CALIBRATE_FRAC;
+}
+
+static void resume_pit(struct platform_timesource *pts)
+{
+    /* Set CTC channel 2 to mode 0 again; initial value does not matter. */
+    outb(0xb0, PIT_MODE); /* binary, mode 0, LSB/MSB, Ch 2 */
+    outb(0, PIT_CH2);     /* LSB of count */
+    outb(0, PIT_CH2);     /* MSB of count */
 }
 
 static struct platform_timesource __initdata plt_pit =
@@ -340,7 +348,8 @@ static struct platform_timesource __init
     .frequency = CLOCK_TICK_RATE,
     .read_counter = read_pit_count,
     .counter_bits = 32,
-    .init = init_pit
+    .init = init_pit,
+    .resume = resume_pit
 };
 
 /************************************************************
@@ -352,15 +361,26 @@ static u64 read_hpet_count(void)
     return hpet_read32(HPET_COUNTER);
 }
 
-static int __init init_hpet(struct platform_timesource *pts)
+static s64 __init init_hpet(struct platform_timesource *pts)
 {
-    u64 hpet_rate = hpet_setup();
+    u64 hpet_rate = hpet_setup(), start;
+    u32 count, target;
 
     if ( hpet_rate == 0 )
         return 0;
 
     pts->frequency = hpet_rate;
-    return 1;
+
+    count = hpet_read32(HPET_COUNTER);
+    start = rdtsc_ordered();
+    target = count + CALIBRATE_VALUE(hpet_rate);
+    if ( target < count )
+        while ( hpet_read32(HPET_COUNTER) >= count )
+            continue;
+    while ( hpet_read32(HPET_COUNTER) < target )
+        continue;
+
+    return (rdtsc_ordered() - start) * CALIBRATE_FRAC;
 }
 
 static void resume_hpet(struct platform_timesource *pts)
@@ -392,12 +412,24 @@ static u64 read_pmtimer_count(void)
     return inl(pmtmr_ioport);
 }
 
-static int __init init_pmtimer(struct platform_timesource *pts)
+static s64 __init init_pmtimer(struct platform_timesource *pts)
 {
+    u64 start;
+    u32 count, target, mask = 0xffffff;
+
     if ( pmtmr_ioport == 0 )
         return 0;
 
-    return 1;
+    count = inl(pmtmr_ioport) & mask;
+    start = rdtsc_ordered();
+    target = count + CALIBRATE_VALUE(ACPI_PM_FREQUENCY);
+    if ( target < count )
+        while ( (inl(pmtmr_ioport) & mask) >= count )
+            continue;
+    while ( (inl(pmtmr_ioport) & mask) < target )
+        continue;
+
+    return (rdtsc_ordered() - start) * CALIBRATE_FRAC;
 }
 
 static struct platform_timesource __initdata plt_pmtimer =
@@ -533,14 +565,15 @@ static void resume_platform_timer(void)
     plt_stamp = plt_src.read_counter();
 }
 
-static void __init init_platform_timer(void)
+static u64 __init init_platform_timer(void)
 {
     static struct platform_timesource * __initdata plt_timers[] = {
         &plt_hpet, &plt_pmtimer, &plt_pit
     };
 
     struct platform_timesource *pts = NULL;
-    int i, rc = -1;
+    unsigned int i;
+    s64 rc = -1;
 
     if ( opt_clocksource[0] != '\0' )
     {
@@ -578,15 +611,12 @@ static void __init init_platform_timer(v
 
     plt_overflow_period = scale_delta(
         1ull << (pts->counter_bits-1), &plt_scale);
-    init_timer(&plt_overflow_timer, plt_overflow, NULL, 0);
     plt_src = *pts;
-    plt_overflow(NULL);
-
-    platform_timer_stamp = plt_stamp64;
-    stime_platform_stamp = NOW();
 
     printk("Platform timer is %s %s\n",
            freq_string(pts->frequency), pts->name);
+
+    return rc;
 }
 
 u64 stime2tsc(s_time_t stime)
@@ -1478,7 +1508,11 @@ int __init init_xen_time(void)
     /* NB. get_cmos_time() can take over one second to execute. */
     do_settime(get_cmos_time(), 0, NOW());
 
-    init_platform_timer();
+    /* Finish platform timer initialization. */
+    init_timer(&plt_overflow_timer, plt_overflow, NULL, 0);
+    plt_overflow(NULL);
+    platform_timer_stamp = plt_stamp64;
+    stime_platform_stamp = NOW();
 
     init_percpu_time();
 
@@ -1493,7 +1527,10 @@ int __init init_xen_time(void)
 void __init early_time_init(void)
 {
     struct cpu_time *t = &this_cpu(cpu_time);
-    u64 tmp = init_pit_and_calibrate_tsc();
+    u64 tmp;
+
+    preinit_pit();
+    tmp = init_platform_timer();
 
     set_time_scale(&t->tsc_scale, tmp);
     t->local_tsc_stamp = boot_tsc_stamp;
@@ -1602,7 +1639,7 @@ int time_suspend(void)
 
 int time_resume(void)
 {
-    init_pit_and_calibrate_tsc();
+    preinit_pit();
 
     resume_platform_timer();
 
--- a/xen/include/asm-x86/time.h
+++ b/xen/include/asm-x86/time.h
@@ -39,6 +39,7 @@ mktime (unsigned int year, unsigned int
 int time_suspend(void);
 int time_resume(void);
 
+void preinit_pit(void);
 void init_percpu_time(void);
 void time_latch_stamps(void);
 




* [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags
  2016-06-15  9:50 [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
                   ` (3 preceding siblings ...)
  2016-06-15 10:28 ` [PATCH 4/8] x86/time: calibrate TSC against platform timer Jan Beulich
@ 2016-06-15 10:28 ` Jan Beulich
  2016-06-20 14:32   ` Andrew Cooper
  2016-06-15 10:29 ` [PATCH 6/8] x86/time: support 32-bit wide ACPI PM timer Jan Beulich
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-15 10:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Joao Martins


As the clearing of such flags may have an impact on the selected
rendezvous function, handle it in a central place.

But don't allow such feature flags to be cleared during CPU hotplug:
Platform and local system times may have diverged significantly by
then, potentially causing noticeable (even if only temporary) strange
behavior. As we're anyway expecting only sufficiently similar CPUs to
appear during hotplug, this shouldn't introduce new limitations.

Reported-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
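
Condensed, the selection clear_tsc_cap() (re)establishes is (an
editor's summary of the hunk below, not new behavior):

    static void (*pick_rendezvous(void))(void *)
    {
        if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
             !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
            return time_calibration_tsc_rendezvous; /* re-sync TSCs */
        return time_calibration_std_rendezvous;     /* TSCs trusted */
    }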

--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -1135,8 +1135,8 @@ static int mwait_idle_cpu_init(struct no
 		}
 
 		if (state > 2 && !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
-		    !pm_idle_save)
-			setup_clear_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+		    !pm_idle_save && system_state < SYS_STATE_active)
+			clear_tsc_cap(X86_FEATURE_TSC_RELIABLE);
 
 		cx = dev->states + dev->count;
 		cx->type = state;
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1358,6 +1358,24 @@ static void time_calibration(void *unuse
                      &r, 1);
 }
 
+void __init clear_tsc_cap(unsigned int feature)
+{
+    void (*rendezvous_fn)(void *) = time_calibration_std_rendezvous;
+
+    if ( feature )
+        setup_clear_cpu_cap(feature);
+
+    /* If we have constant-rate TSCs then scale factor can be shared. */
+    if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
+    {
+        /* If TSCs are not marked as 'reliable', re-sync during rendezvous. */
+        if ( !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
+            rendezvous_fn = time_calibration_tsc_rendezvous;
+    }
+
+    time_calibration_rendezvous_fn = rendezvous_fn;
+}
+
 static struct {
     s_time_t local_stime, master_stime;
 } ap_bringup_ref;
@@ -1482,7 +1500,7 @@ static int __init verify_tsc_reliability
         {
             printk("%s: TSC warp detected, disabling TSC_RELIABLE\n",
                    __func__);
-            setup_clear_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+            clear_tsc_cap(X86_FEATURE_TSC_RELIABLE);
         }
     }
 
@@ -1495,13 +1513,7 @@ int __init init_xen_time(void)
 {
     tsc_check_writability();
 
-    /* If we have constant-rate TSCs then scale factor can be shared. */
-    if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
-    {
-        /* If TSCs are not marked as 'reliable', re-sync during rendezvous. */
-        if ( !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
-            time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
-    }
+    clear_tsc_cap(0);
 
     open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);
 
--- a/xen/include/asm-x86/time.h
+++ b/xen/include/asm-x86/time.h
@@ -71,6 +71,7 @@ void tsc_get_info(struct domain *d, uint
 void force_update_vcpu_system_time(struct vcpu *v);
 
 int host_tsc_is_safe(void);
+void clear_tsc_cap(unsigned int feature);
 void cpuid_time_leaf(uint32_t sub_idx, uint32_t *eax, uint32_t *ebx,
                      uint32_t *ecx, uint32_t *edx);
 




[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 6/8] x86/time: support 32-bit wide ACPI PM timer
  2016-06-15  9:50 [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
                   ` (4 preceding siblings ...)
  2016-06-15 10:28 ` [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags Jan Beulich
@ 2016-06-15 10:29 ` Jan Beulich
  2016-07-04 15:40   ` Andrew Cooper
  2016-06-15 10:30 ` [PATCH 7/8] x86/time: fold recurring code Jan Beulich
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-15 10:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Joao Martins

[-- Attachment #1: Type: text/plain, Size: 2414 bytes --]

I have no idea why we didn't do so from the beginning.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/acpi/boot.c
+++ b/xen/arch/x86/acpi/boot.c
@@ -481,22 +481,23 @@ static int __init acpi_parse_fadt(struct
 	if (fadt->header.revision >= FADT2_REVISION_ID) {
 		/* FADT rev. 2 */
 		if (fadt->xpm_timer_block.space_id ==
-		    ACPI_ADR_SPACE_SYSTEM_IO)
+		    ACPI_ADR_SPACE_SYSTEM_IO) {
 			pmtmr_ioport = fadt->xpm_timer_block.address;
-		/*
-		 * "X" fields are optional extensions to the original V1.0
-		 * fields, so we must selectively expand V1.0 fields if the
-		 * corresponding X field is zero.
-	 	 */
-		if (!pmtmr_ioport)
-			pmtmr_ioport = fadt->pm_timer_block;
-	} else {
-		/* FADT rev. 1 */
+			pmtmr_width = fadt->xpm_timer_block.bit_width;
+		}
+	}
+	/*
+	 * "X" fields are optional extensions to the original V1.0
+	 * fields, so we must selectively expand V1.0 fields if the
+	 * corresponding X field is zero.
+ 	 */
+	if (!pmtmr_ioport) {
 		pmtmr_ioport = fadt->pm_timer_block;
+		pmtmr_width = fadt->pm_timer_length == 4 ? 24 : 0;
 	}
 	if (pmtmr_ioport)
-		printk(KERN_INFO PREFIX "PM-Timer IO Port: %#x\n",
-		       pmtmr_ioport);
+		printk(KERN_INFO PREFIX "PM-Timer IO Port: %#x (%u bits)\n",
+		       pmtmr_ioport, pmtmr_width);
 #endif
 
 	acpi_smi_cmd       = fadt->smi_command;
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -403,6 +403,7 @@ static struct platform_timesource __init
  */
 
 u32 __read_mostly pmtmr_ioport;
+unsigned int __initdata pmtmr_width;
 
 /* ACPI PM timer ticks at 3.579545 MHz. */
 #define ACPI_PM_FREQUENCY 3579545
@@ -417,9 +418,15 @@ static s64 __init init_pmtimer(struct pl
     u64 start;
     u32 count, target, mask = 0xffffff;
 
-    if ( pmtmr_ioport == 0 )
+    if ( !pmtmr_ioport || !pmtmr_width )
         return 0;
 
+    if ( pmtmr_width == 32 )
+    {
+        pts->counter_bits = 32;
+        mask = 0xffffffff;
+    }
+
     count = inl(pmtmr_ioport) & mask;
     start = rdtsc_ordered();
     target = count + CALIBRATE_VALUE(ACPI_PM_FREQUENCY);
--- a/xen/include/asm-x86/acpi.h
+++ b/xen/include/asm-x86/acpi.h
@@ -146,6 +146,7 @@ extern u32 x86_acpiid_to_apicid[];
 #define INVALID_ACPIID		(-1U)
 
 extern u32 pmtmr_ioport;
+extern unsigned int pmtmr_width;
 
 int acpi_dmar_init(void);
 void acpi_mmcfg_init(void);




[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 7/8] x86/time: fold recurring code
  2016-06-15  9:50 [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
                   ` (5 preceding siblings ...)
  2016-06-15 10:29 ` [PATCH 6/8] x86/time: support 32-bit wide ACPI PM timer Jan Beulich
@ 2016-06-15 10:30 ` Jan Beulich
  2016-07-04 15:43   ` Andrew Cooper
  2016-06-15 10:30 ` [PATCH 8/8] x86/time: group time stamps into a structure Jan Beulich
  2016-07-01  7:44 ` Ping: [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
  8 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-15 10:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Joao Martins

[-- Attachment #1: Type: text/plain, Size: 2257 bytes --]

Common code between time_calibration_{std,tsc}_rendezvous() can better
live in a single place, eliminating the risk of adjusting one without
the other.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1263,6 +1263,18 @@ struct calibration_rendezvous {
     u64 master_tsc_stamp;
 };
 
+static void
+time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
+{
+    struct cpu_calibration *c = &this_cpu(cpu_calibration);
+
+    c->local_tsc_stamp = rdtsc_ordered();
+    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
+    c->stime_master_stamp = r->master_stime;
+
+    raise_softirq(TIME_CALIBRATE_SOFTIRQ);
+}
+
 /*
  * Keep TSCs in sync when they run at the same rate, but may stop in
  * deep-sleep C states.
@@ -1270,7 +1282,6 @@ struct calibration_rendezvous {
 static void time_calibration_tsc_rendezvous(void *_r)
 {
     int i;
-    struct cpu_calibration *c = &this_cpu(cpu_calibration);
     struct calibration_rendezvous *r = _r;
     unsigned int total_cpus = cpumask_weight(&r->cpu_calibration_map);
 
@@ -1311,17 +1322,12 @@ static void time_calibration_tsc_rendezv
         }
     }
 
-    c->local_tsc_stamp = rdtsc_ordered();
-    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
-    c->stime_master_stamp = r->master_stime;
-
-    raise_softirq(TIME_CALIBRATE_SOFTIRQ);
+    time_calibration_rendezvous_tail(r);
 }
 
 /* Ordinary rendezvous function which does not modify TSC values. */
 static void time_calibration_std_rendezvous(void *_r)
 {
-    struct cpu_calibration *c = &this_cpu(cpu_calibration);
     struct calibration_rendezvous *r = _r;
     unsigned int total_cpus = cpumask_weight(&r->cpu_calibration_map);
 
@@ -1341,11 +1347,7 @@ static void time_calibration_std_rendezv
         mb(); /* receive signal /then/ read r->master_stime */
     }
 
-    c->local_tsc_stamp = rdtsc_ordered();
-    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
-    c->stime_master_stamp = r->master_stime;
-
-    raise_softirq(TIME_CALIBRATE_SOFTIRQ);
+    time_calibration_rendezvous_tail(r);
 }
 
 static void (*time_calibration_rendezvous_fn)(void *) =




[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 8/8] x86/time: group time stamps into a structure
  2016-06-15  9:50 [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
                   ` (6 preceding siblings ...)
  2016-06-15 10:30 ` [PATCH 7/8] x86/time: fold recurring code Jan Beulich
@ 2016-06-15 10:30 ` Jan Beulich
  2016-07-04 15:57   ` Andrew Cooper
  2016-07-01  7:44 ` Ping: [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
  8 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-15 10:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Joao Martins

[-- Attachment #1: Type: text/plain, Size: 12028 bytes --]

If that had been done from the beginning, mistakes like the one
corrected in commit b64438c7c1 ("x86/time: use correct (local) time
stamp in constant-TSC calibration fast path") would likely never have
happened.

Also add a few "const" to make it more obvious when things aren't
expected to change.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -47,10 +47,14 @@ unsigned long __read_mostly cpu_khz;  /*
 DEFINE_SPINLOCK(rtc_lock);
 unsigned long pit0_ticks;
 
+struct cpu_time_stamp {
+    u64 local_tsc;
+    s_time_t local_stime;
+    s_time_t master_stime;
+};
+
 struct cpu_time {
-    u64 local_tsc_stamp;
-    s_time_t stime_local_stamp;
-    s_time_t stime_master_stamp;
+    struct cpu_time_stamp stamp;
     struct time_scale tsc_scale;
 };
 
@@ -118,7 +122,7 @@ static inline u32 mul_frac(u32 multiplic
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
  */
-u64 scale_delta(u64 delta, struct time_scale *scale)
+u64 scale_delta(u64 delta, const struct time_scale *scale)
 {
     u64 product;
 
@@ -635,11 +639,11 @@ u64 stime2tsc(s_time_t stime)
     t = &this_cpu(cpu_time);
     sys_to_tsc = scale_reciprocal(t->tsc_scale);
 
-    stime_delta = stime - t->stime_local_stamp;
+    stime_delta = stime - t->stamp.local_stime;
     if ( stime_delta < 0 )
         stime_delta = 0;
 
-    return t->local_tsc_stamp + scale_delta(stime_delta, &sys_to_tsc);
+    return t->stamp.local_tsc + scale_delta(stime_delta, &sys_to_tsc);
 }
 
 void cstate_restore_tsc(void)
@@ -793,7 +797,7 @@ static unsigned long get_cmos_time(void)
 
 s_time_t get_s_time_fixed(u64 at_tsc)
 {
-    struct cpu_time *t = &this_cpu(cpu_time);
+    const struct cpu_time *t = &this_cpu(cpu_time);
     u64 tsc, delta;
     s_time_t now;
 
@@ -801,8 +805,8 @@ s_time_t get_s_time_fixed(u64 at_tsc)
         tsc = at_tsc;
     else
         tsc = rdtsc_ordered();
-    delta = tsc - t->local_tsc_stamp;
-    now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
+    delta = tsc - t->stamp.local_tsc;
+    now = t->stamp.local_stime + scale_delta(delta, &t->tsc_scale);
 
     return now;
 }
@@ -821,7 +825,7 @@ uint64_t tsc_ticks2ns(uint64_t ticks)
 
 static void __update_vcpu_system_time(struct vcpu *v, int force)
 {
-    struct cpu_time       *t;
+    const struct cpu_time *t;
     struct vcpu_time_info *u, _u = {};
     struct domain *d = v->domain;
     s_time_t tsc_stamp;
@@ -834,7 +838,7 @@ static void __update_vcpu_system_time(st
 
     if ( d->arch.vtsc )
     {
-        s_time_t stime = t->stime_local_stamp;
+        s_time_t stime = t->stamp.local_stime;
 
         if ( is_hvm_domain(d) )
         {
@@ -856,20 +860,20 @@ static void __update_vcpu_system_time(st
     {
         if ( has_hvm_container_domain(d) && hvm_tsc_scaling_supported )
         {
-            tsc_stamp            = hvm_scale_tsc(d, t->local_tsc_stamp);
+            tsc_stamp            = hvm_scale_tsc(d, t->stamp.local_tsc);
             _u.tsc_to_system_mul = d->arch.vtsc_to_ns.mul_frac;
             _u.tsc_shift         = d->arch.vtsc_to_ns.shift;
         }
         else
         {
-            tsc_stamp            = t->local_tsc_stamp;
+            tsc_stamp            = t->stamp.local_tsc;
             _u.tsc_to_system_mul = t->tsc_scale.mul_frac;
             _u.tsc_shift         = t->tsc_scale.shift;
         }
     }
 
     _u.tsc_timestamp = tsc_stamp;
-    _u.system_time   = t->stime_local_stamp;
+    _u.system_time   = t->stamp.local_stime;
 
     if ( is_hvm_domain(d) )
         _u.tsc_timestamp += v->arch.hvm_vcpu.cache_tsc_offset;
@@ -969,12 +973,12 @@ int cpu_frequency_change(u64 freq)
 
     local_irq_disable();
     /* Platform time /first/, as we may be delayed by platform_timer_lock. */
-    t->stime_master_stamp = read_platform_stime();
-    /* TSC-extrapolated time may be bogus after frequency change. */
-    /*t->stime_local_stamp = get_s_time();*/
-    t->stime_local_stamp = t->stime_master_stamp;
+    t->stamp.master_stime = read_platform_stime();
     curr_tsc = rdtsc_ordered();
-    t->local_tsc_stamp = curr_tsc;
+    /* TSC-extrapolated time may be bogus after frequency change. */
+    /*t->stamp.local_stime = get_s_time_fixed(curr_tsc);*/
+    t->stamp.local_stime = t->stamp.master_stime;
+    t->stamp.local_tsc = curr_tsc;
     set_time_scale(&t->tsc_scale, freq);
     local_irq_enable();
 
@@ -991,28 +995,19 @@ int cpu_frequency_change(u64 freq)
 }
 
 /* Per-CPU communication between rendezvous IRQ and softirq handler. */
-struct cpu_calibration {
-    u64 local_tsc_stamp;
-    s_time_t stime_local_stamp;
-    s_time_t stime_master_stamp;
-};
-static DEFINE_PER_CPU(struct cpu_calibration, cpu_calibration);
+static DEFINE_PER_CPU(struct cpu_time_stamp, cpu_calibration);
 
 /* Softirq handler for per-CPU time calibration. */
 static void local_time_calibration(void)
 {
     struct cpu_time *t = &this_cpu(cpu_time);
-    struct cpu_calibration *c = &this_cpu(cpu_calibration);
+    const struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
 
     /*
-     * System timestamps, extrapolated from local and master oscillators,
-     * taken during this calibration and the previous calibration.
+     * System (extrapolated from local and master oscillators) and TSC
+     * timestamps, taken during this calibration and the previous one.
      */
-    s_time_t prev_local_stime, curr_local_stime;
-    s_time_t prev_master_stime, curr_master_stime;
-
-    /* TSC timestamps taken during this calibration and prev calibration. */
-    u64 prev_tsc, curr_tsc;
+    struct cpu_time_stamp prev, curr;
 
     /*
      * System time and TSC ticks elapsed during the previous calibration
@@ -1021,9 +1016,6 @@ static void local_time_calibration(void)
     u64 stime_elapsed64, tsc_elapsed64;
     u32 stime_elapsed32, tsc_elapsed32;
 
-    /* The accumulated error in the local estimate. */
-    u64 local_stime_err;
-
     /* Error correction to slow down a fast local clock. */
     u32 error_factor = 0;
 
@@ -1037,40 +1029,34 @@ static void local_time_calibration(void)
     {
         /* Atomically read cpu_calibration struct and write cpu_time struct. */
         local_irq_disable();
-        t->local_tsc_stamp    = c->local_tsc_stamp;
-        t->stime_local_stamp  = c->stime_local_stamp;
-        t->stime_master_stamp = c->stime_master_stamp;
+        t->stamp = *c;
         local_irq_enable();
         update_vcpu_system_time(current);
         goto out;
     }
 
-    prev_tsc          = t->local_tsc_stamp;
-    prev_local_stime  = t->stime_local_stamp;
-    prev_master_stime = t->stime_master_stamp;
+    prev = t->stamp;
 
     /* Disabling IRQs ensures we atomically read cpu_calibration struct. */
     local_irq_disable();
-    curr_tsc          = c->local_tsc_stamp;
-    curr_local_stime  = c->stime_local_stamp;
-    curr_master_stime = c->stime_master_stamp;
+    curr = *c;
     local_irq_enable();
 
 #if 0
     printk("PRE%d: tsc=%"PRIu64" stime=%"PRIu64" master=%"PRIu64"\n",
-           smp_processor_id(), prev_tsc, prev_local_stime, prev_master_stime);
+           smp_processor_id(), prev.local_tsc, prev.local_stime, prev.master_stime);
     printk("CUR%d: tsc=%"PRIu64" stime=%"PRIu64" master=%"PRIu64
            " -> %"PRId64"\n",
-           smp_processor_id(), curr_tsc, curr_local_stime, curr_master_stime,
-           curr_master_stime - curr_local_stime);
+           smp_processor_id(), curr.local_tsc, curr.local_stime, curr.master_stime,
+           curr.master_stime - curr.local_stime);
 #endif
 
     /* Local time warps forward if it lags behind master time. */
-    if ( curr_local_stime < curr_master_stime )
-        curr_local_stime = curr_master_stime;
+    if ( curr.local_stime < curr.master_stime )
+        curr.local_stime = curr.master_stime;
 
-    stime_elapsed64 = curr_master_stime - prev_master_stime;
-    tsc_elapsed64   = curr_tsc - prev_tsc;
+    stime_elapsed64 = curr.master_stime - prev.master_stime;
+    tsc_elapsed64   = curr.local_tsc - prev.local_tsc;
 
     /*
      * Weirdness can happen if we lose sync with the platform timer.
@@ -1084,9 +1070,10 @@ static void local_time_calibration(void)
      * clock (slow clocks are warped forwards). The scale factor is clamped
      * to >= 0.5.
      */
-    if ( curr_local_stime != curr_master_stime )
+    if ( curr.local_stime != curr.master_stime )
     {
-        local_stime_err = curr_local_stime - curr_master_stime;
+        u64 local_stime_err = curr.local_stime - curr.master_stime;
+
         if ( local_stime_err > EPOCH )
             local_stime_err = EPOCH;
         error_factor = div_frac(EPOCH, EPOCH + (u32)local_stime_err);
@@ -1139,9 +1126,7 @@ static void local_time_calibration(void)
     local_irq_disable();
     t->tsc_scale.mul_frac = calibration_mul_frac;
     t->tsc_scale.shift    = tsc_shift;
-    t->local_tsc_stamp    = curr_tsc;
-    t->stime_local_stamp  = curr_local_stime;
-    t->stime_master_stamp = curr_master_stime;
+    t->stamp              = curr;
     local_irq_enable();
 
     update_vcpu_system_time(current);
@@ -1269,11 +1254,11 @@ struct calibration_rendezvous {
 static void
 time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
 {
-    struct cpu_calibration *c = &this_cpu(cpu_calibration);
+    struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
 
-    c->local_tsc_stamp = rdtsc_ordered();
-    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
-    c->stime_master_stamp = r->master_stime;
+    c->local_tsc    = rdtsc_ordered();
+    c->local_stime  = get_s_time_fixed(c->local_tsc);
+    c->master_stime = r->master_stime;
 
     raise_softirq(TIME_CALIBRATE_SOFTIRQ);
 }
@@ -1388,20 +1373,17 @@ void __init clear_tsc_cap(unsigned int f
     time_calibration_rendezvous_fn = rendezvous_fn;
 }
 
-static struct {
-    s_time_t local_stime, master_stime;
-} ap_bringup_ref;
+static struct cpu_time_stamp ap_bringup_ref;
 
 void time_latch_stamps(void) {
     unsigned long flags;
-    u64 tsc;
 
     local_irq_save(flags);
     ap_bringup_ref.master_stime = read_platform_stime();
-    tsc = rdtsc_ordered();
+    ap_bringup_ref.local_tsc = rdtsc_ordered();
     local_irq_restore(flags);
 
-    ap_bringup_ref.local_stime = get_s_time_fixed(tsc);
+    ap_bringup_ref.local_stime = get_s_time_fixed(ap_bringup_ref.local_tsc);
 }
 
 void init_percpu_time(void)
@@ -1419,7 +1401,7 @@ void init_percpu_time(void)
     tsc = rdtsc_ordered();
     local_irq_restore(flags);
 
-    t->stime_master_stamp = now;
+    t->stamp.master_stime = now;
     /*
      * To avoid a discontinuity (TSC and platform clock can't be expected
      * to be in perfect sync), initialization here needs to match up with
@@ -1432,8 +1414,8 @@ void init_percpu_time(void)
         else
             now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
     }
-    t->local_tsc_stamp    = tsc;
-    t->stime_local_stamp  = now;
+    t->stamp.local_tsc   = tsc;
+    t->stamp.local_stime = now;
 }
 
 /*
@@ -1557,7 +1539,7 @@ void __init early_time_init(void)
     tmp = init_platform_timer();
 
     set_time_scale(&t->tsc_scale, tmp);
-    t->local_tsc_stamp = boot_tsc_stamp;
+    t->stamp.local_tsc = boot_tsc_stamp;
 
     do_div(tmp, 1000);
     cpu_khz = (unsigned long)tmp;
--- a/xen/include/asm-x86/time.h
+++ b/xen/include/asm-x86/time.h
@@ -79,6 +79,6 @@ u64 stime2tsc(s_time_t stime);
 
 struct time_scale;
 void set_time_scale(struct time_scale *ts, u64 ticks_per_sec);
-u64 scale_delta(u64 delta, struct time_scale *scale);
+u64 scale_delta(u64 delta, const struct time_scale *scale);
 
 #endif /* __X86_TIME_H__ */



[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-15 10:26 ` [PATCH 1/8] " Jan Beulich
@ 2016-06-15 10:32   ` Jan Beulich
  2016-06-15 22:51   ` Joao Martins
  2016-08-02 19:30   ` Andrew Cooper
  2 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2016-06-15 10:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Joao Martins

>>> On 15.06.16 at 12:26, <JBeulich@suse.com> wrote:

The title of this was (of course) meant to be

x86/time: adjust local system time initialization

Jan

> Using the bare return value from read_platform_stime() is not suitable
> when local_time_calibration() is going to use its fast path: Divergence
> of several dozen microseconds between NOW() return values on different
> CPUs results when platform and local time don't stay in close sync.
> 
> Latch local and platform time on the CPU initiating AP bringup, such
> that the AP can use these values to seed its stime_local_stamp with as
> little of an error as possible. The boot CPU, otoh, can simply
> calculate the correct initial value (other CPUs could do so too with
> even greater accuracy than the approach being introduced, but that can
> work only if all CPUs' TSCs start ticking at the same time, which
> generally can't be assumed to be the case on multi-socket systems).
> 
> This slightly defers init_percpu_time() (moved ahead by commit
> dd2658f966 ["x86/time: initialise time earlier during
> start_secondary()"]) in order to reduce as much as possible the gap
> between populating the stamps and consuming them.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -328,12 +328,12 @@ void start_secondary(void *unused)
>  
>      percpu_traps_init();
>  
> -    init_percpu_time();
> -
>      cpu_init();
>  
>      smp_callin();
>  
> +    init_percpu_time();
> +
>      setup_secondary_APIC_clock();
>  
>      /*
> @@ -996,6 +996,8 @@ int __cpu_up(unsigned int cpu)
>      if ( (ret = do_boot_cpu(apicid, cpu)) != 0 )
>          return ret;
>  
> +    time_latch_stamps();
> +
>      set_cpu_state(CPU_STATE_ONLINE);
>      while ( !cpu_online(cpu) )
>      {
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1328,21 +1328,51 @@ static void time_calibration(void *unuse
>                       &r, 1);
>  }
>  
> +static struct {
> +    s_time_t local_stime, master_stime;
> +} ap_bringup_ref;
> +
> +void time_latch_stamps(void) {
> +    unsigned long flags;
> +    u64 tsc;
> +
> +    local_irq_save(flags);
> +    ap_bringup_ref.master_stime = read_platform_stime();
> +    tsc = rdtsc();
> +    local_irq_restore(flags);
> +
> +    ap_bringup_ref.local_stime = get_s_time_fixed(tsc);
> +}
> +
>  void init_percpu_time(void)
>  {
>      struct cpu_time *t = &this_cpu(cpu_time);
>      unsigned long flags;
> +    u64 tsc;
>      s_time_t now;
>  
>      /* Initial estimate for TSC rate. */
>      t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
>  
>      local_irq_save(flags);
> -    t->local_tsc_stamp = rdtsc();
>      now = read_platform_stime();
> +    tsc = rdtsc();
>      local_irq_restore(flags);
>  
>      t->stime_master_stamp = now;
> +    /*
> +     * To avoid a discontinuity (TSC and platform clock can't be expected
> +     * to be in perfect sync), initialization here needs to match up with
> +     * local_time_calibration()'s decision whether to use its fast path.
> +     */
> +    if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
> +    {
> +        if ( system_state < SYS_STATE_smp_boot )
> +            now = get_s_time_fixed(tsc);
> +        else
> +            now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
> +    }
> +    t->local_tsc_stamp    = tsc;
>      t->stime_local_stamp  = now;
>  }
>  
> --- a/xen/include/asm-x86/time.h
> +++ b/xen/include/asm-x86/time.h
> @@ -40,6 +40,7 @@ int time_suspend(void);
>  int time_resume(void);
>  
>  void init_percpu_time(void);
> +void time_latch_stamps(void);
>  
>  struct ioreq;
>  int hwdom_pit_access(struct ioreq *ioreq);




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-15 10:26 ` [PATCH 1/8] " Jan Beulich
  2016-06-15 10:32   ` Jan Beulich
@ 2016-06-15 22:51   ` Joao Martins
  2016-06-16  8:27     ` Jan Beulich
  2016-08-02 19:30   ` Andrew Cooper
  2 siblings, 1 reply; 39+ messages in thread
From: Joao Martins @ 2016-06-15 22:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Dario Faggioli, Andrew Cooper

On 06/15/2016 11:26 AM, Jan Beulich wrote:
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1328,21 +1328,51 @@ static void time_calibration(void *unuse
>                       &r, 1);
>  }
>  
> +static struct {
> +    s_time_t local_stime, master_stime;
> +} ap_bringup_ref;
> +
> +void time_latch_stamps(void) {
                                ^ style: shouldn't this bracket be on a new line?

> +    unsigned long flags;
> +    u64 tsc;
> +
> +    local_irq_save(flags);
> +    ap_bringup_ref.master_stime = read_platform_stime();
> +    tsc = rdtsc();
> +    local_irq_restore(flags);
> +
> +    ap_bringup_ref.local_stime = get_s_time_fixed(tsc);
> +}
> +
>  void init_percpu_time(void)
>  {
>      struct cpu_time *t = &this_cpu(cpu_time);
>      unsigned long flags;
> +    u64 tsc;
>      s_time_t now;
>  
>      /* Initial estimate for TSC rate. */
>      t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
>  
>      local_irq_save(flags);
> -    t->local_tsc_stamp = rdtsc();
>      now = read_platform_stime();
Do we need to take another read from the platform timer here? We could
just reuse the one on ap_bringup_ref.master_stime. See also below.

> +    tsc = rdtsc();
>      local_irq_restore(flags);
>  
>      t->stime_master_stamp = now;
> +    /*
> +     * To avoid a discontinuity (TSC and platform clock can't be expected
> +     * to be in perfect sync), initialization here needs to match up with
> +     * local_time_calibration()'s decision whether to use its fast path.
> +     */
> +    if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
> +    {
> +        if ( system_state < SYS_STATE_smp_boot )
> +            now = get_s_time_fixed(tsc);
> +        else
> +            now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
> +    }
> +    t->local_tsc_stamp    = tsc;

As far as my current experimentation goes (both v1 of this patch and the whole
series on v2), the remaining warps could be fixed with this seeding. Though I
think this seeding might not yet be correct. Essentially, on the boot CPU you
do get_s_time_fixed(tsc), which gives you a value strictly calculated from the
TSC since boot_tsc_stamp. For the other CPUs you append the skew between TSC
and platform timer, calculated before AP bringup, to a value just read from
the platform timer - which I assume you take as an approximation of that skew.

So, before this patch cpu_time is filled like this:

CPU 0: M0 , T0
CPU 1: M1 , T1
CPU 2: M2 , T2

With this patch it goes like:

CPU 0: L0 , T0
CPU 1: L1 = (M1 + (L - M)), T1
CPU 2: L2 = (M2 + (L - M)), T2

Given that,

1) values separated by commas are, respectively, the local stime and tsc
written into cpu_time;
2) M2 > M1 > M0 are values read from the platform timer;
3) L and M are values calculated solely on the CPU doing AP bringup.

After this seeding, local_time_calibration will increment the deltas between
calibrations every second, based entirely on a monotonically increasing TSC.
Although I see two issues here: 1) you append to a new read from the platform
timer, which may already be offset from the one taken on the boot CPU, and
2) the skew you calculate doesn't account for the current tsc. This leaves
local stime and tsc still not being a valid pair for the secondary CPUs. So it
would be possible that get_s_time(S0) on CPU0, immediately followed by
get_s_time(S1) on CPU1 [S1 > S0], ends up returning a value backwards, IOW
L0 + scale(S0 - T0) is bigger than L1 + scale(S1 - T1). That is independent
of the deltas being added on every calibration.
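
To make that concrete, here is a toy model of the extrapolation (plain C,
names mine, and the scale factor deliberately simplified to 1 - not the
actual Xen code):

    typedef unsigned long long u64;

    /* Per-CPU seed: the (stime, tsc) pair written by init_percpu_time(). */
    struct seed { u64 stime, tsc; };

    /* Simplified shape of get_s_time_fixed(): extrapolate from the seed. */
    static u64 model_now(const struct seed *s, u64 tsc)
    {
        return s->stime + (tsc - s->tsc);
    }

Unless L1 - L0 exactly matches the (scaled) T1 - T0, there are instants where
model_now() on CPU1 returns less than a value CPU0 produced just before, i.e.
the pair mismatch alone is enough to produce a warp.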

So how about we do the seeding in another way?

1) Relying on individual CPU TSC like you do on CPU 0:

     t->stamp.master_stime = now;
+    t->stamp.local_tsc = 0;
+    t->stamp.local_stime = 0;
+
     if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
     {
         now = get_s_time_fixed(tsc);
     }

(...)

Or 2) taking into account the skew between platform timer and tsc we take on
init_percpu_time. Diff below based on this series:

@@ -1394,11 +1508,14 @@ void init_percpu_time(void)
     t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;

     local_irq_save(flags);
-    now = read_platform_stime();
+    now = ap_bringup_ref.master_stime;
     tsc = rdtsc_ordered();
     local_irq_restore(flags);

     t->stamp.master_stime = now;
+    t->stamp.local_tsc = boot_tsc_stamp;
+    t->stamp.local_stime = 0;
+

     /*
      * To avoid a discontinuity (TSC and platform clock can't be expected
      * to be in perfect sync), initialization here needs to match up with
@@ -1406,10 +1523,12 @@ void init_percpu_time(void)
      */
     if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
     {
+        u64 stime = get_s_time_fixed(tsc);
+
         if ( system_state < SYS_STATE_smp_boot )
-            now = get_s_time_fixed(tsc);
+            now = stime;
         else
-            now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
+            now += stime - ap_bringup_ref.master_stime;
     }

The second option follows thinking similar to this patch's, but either way the
local_stime will match the tsc in init_percpu_time, thus forming a matched
pair. I have a test similar to check_tsc_warp, but using get_s_time() (see
the sketch below), and I no longer observe time going backwards. But without
the above I still observe it, even over short periods.
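
Roughly, a minimal sketch of such a check (names, the loop count and the
per-CPU dispatch are placeholders, not the actual test), run concurrently on
all CPUs in the spirit of Linux's check_tsc_warp():

    static DEFINE_SPINLOCK(warp_lock);
    static s_time_t last_stime, max_warp;

    static void stime_warp_check(void)
    {
        unsigned int i;

        for ( i = 0; i < 1000000; i++ )
        {
            s_time_t now;

            spin_lock(&warp_lock);
            now = get_s_time();
            if ( now < last_stime )          /* time went backwards? */
                max_warp = max(max_warp, last_stime - now);
            last_stime = now;
            spin_unlock(&warp_lock);
        }
    }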
Thoughts? (Sorry for the long answer)

Joao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-15 22:51   ` Joao Martins
@ 2016-06-16  8:27     ` Jan Beulich
  2016-06-16 20:27       ` Joao Martins
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-16  8:27 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, Dario Faggioli, xen-devel

>>> On 16.06.16 at 00:51, <joao.m.martins@oracle.com> wrote:
> On 06/15/2016 11:26 AM, Jan Beulich wrote:
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -1328,21 +1328,51 @@ static void time_calibration(void *unuse
>>                       &r, 1);
>>  }
>>  
>> +static struct {
>> +    s_time_t local_stime, master_stime;
>> +} ap_bringup_ref;
>> +
>> +void time_latch_stamps(void) {
>                                 ^ style: shouldn't this bracket be on a new line?

Oh, yes, of course.

>> +    unsigned long flags;
>> +    u64 tsc;
>> +
>> +    local_irq_save(flags);
>> +    ap_bringup_ref.master_stime = read_platform_stime();
>> +    tsc = rdtsc();
>> +    local_irq_restore(flags);
>> +
>> +    ap_bringup_ref.local_stime = get_s_time_fixed(tsc);
>> +}
>> +
>>  void init_percpu_time(void)
>>  {
>>      struct cpu_time *t = &this_cpu(cpu_time);
>>      unsigned long flags;
>> +    u64 tsc;
>>      s_time_t now;
>>  
>>      /* Initial estimate for TSC rate. */
>>      t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
>>  
>>      local_irq_save(flags);
>> -    t->local_tsc_stamp = rdtsc();
>>      now = read_platform_stime();
> Do we need to take another read from the platform timer here? We could
> just reuse the one on ap_bringup_ref.master_stime. See also below.

Yes, for this model of approximation of the local stime delta by
master stime delta we obviously need to read the master clock
here again (or else we have no way to calculate the master
delta, which we want to use as the correcting value).
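
Schematically (my notation, not literal patch code), the AP's seed is

    local_AP = master_now + (local_ref - master_ref)
             = local_ref + (master_now - master_ref)

i.e. the latched local stamp advanced by the master (platform) delta, the
latter standing in for the unknown local delta since the latch point.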

>> +    tsc = rdtsc();
>>      local_irq_restore(flags);
>>  
>>      t->stime_master_stamp = now;
>> +    /*
>> +     * To avoid a discontinuity (TSC and platform clock can't be expected
>> +     * to be in perfect sync), initialization here needs to match up with
>> +     * local_time_calibration()'s decision whether to use its fast path.
>> +     */
>> +    if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>> +    {
>> +        if ( system_state < SYS_STATE_smp_boot )
>> +            now = get_s_time_fixed(tsc);
>> +        else
>> +            now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
>> +    }
>> +    t->local_tsc_stamp    = tsc;
> 
> As far as my current experimentation (both v1 of this patch and the whole 
> series on
> v2), the source of the remaining warps could be fixed with this seeding. 
> Though I
> think this seeding might not yet be correct. Essentially the boot CPU you do
> get_s_time_fixed(tsc), which basically gives you a value strictly calculated 
> with TSC
> since boot_tsc_stamp. For the other CPUs you append the time skew between 
> TSC and
> platform timer calculated before AP bringup, with a value just read on the 
> platform
> timer. Which I assume you take as an approximation of the skew between TSC and 
> platform
> timer.

Correct.

> So, before this patch cpu_time is filled like this:
> 
> CPU 0: M0 , T0
> CPU 1: M1 , T1
> CPU 2: M2 , T2
> 
> With this patch it goes like:
> 
> CPU 0: L0 , T0
> CPU 1: L1 = (M1 + (L - M)), T1
> CPU 2: L2 = (M2 + (L - M)), T2
> 
> Given that,
> 
> 1) values separated by commas are respectively local stime, tsc that are
> written in cpu_time
> 2) M2 > M1 > M0 as values read from platform timer.
> 3) L and M solely as values calculated on CPU doing AP bringup.
> 
> After this seeding, local_time_calibration will increment the deltas between
> calibrations every second, based entirely on a monotonically increasing TSC. 
> Although
> I see two issues here: 1) appending to a new read from platform time which 
> might
> already be offset by the one taken from the boot CPU and 2) the skew you 
> calculate
> doesn't account for the current tsc. Which leads to local stime and tsc still 
> not being
> a valid pair for the secondary CPUs.

That's true if platform clock and TSC, even over this small time
period, diverge enough. The calculated local time indeed is only an
approximation to match the TSC just read.

> So it would be possible that 
> get_s_time(S0) on
> CPU0 immediately followed by a get_s_time(S1) on CPU1 [S1 > S0] ends up 
> returning a
> value backwards IOW L0 + scale(S0 - T0) is bigger than L1 + scale(S1 - T1). 
> That is
> independently of the deltas being added on every calibration.
> 
> So how about we do the seeding in another way?
> 
> 1) Relying on individual CPU TSC like you do on CPU 0:
> 
>      t->stamp.master_stime = now;
> +    t->stamp.local_tsc = 0;
> +    t->stamp.local_stime = 0;
> +
>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>      {
>          now = get_s_time_fixed(tsc);
>      }

As said before elsewhere, such a model can only work if all TSCs
start ticking at the same time. This assumption breaks down at the
latest when a CPU gets physically hotplugged.

> Or 2) taking into account the skew between platform timer and tsc we take on
> init_percpu_time. Diff below based on this series:
> 
> @@ -1394,11 +1508,14 @@ void init_percpu_time(void)
>      t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
> 
>      local_irq_save(flags);
> -    now = read_platform_stime();
> +    now = ap_bringup_ref.master_stime;
>      tsc = rdtsc_ordered();
>      local_irq_restore(flags);
> 
>      t->stamp.master_stime = now;
> +    t->stamp.local_tsc = boot_tsc_stamp;

Again - this is invalid. It indeed works fine on a single-socket system
(where all TSCs start ticking at the same time), and gives really good
results. But due to hotplug (and to a lesser degree on multi-socket, and
likely to a somewhat higher degree on multi-node, systems) we just can't
use boot_tsc_stamp as a reference for APs.

> +    t->stamp.local_stime = 0;
> +
> 
>      /*
>       * To avoid a discontinuity (TSC and platform clock can't be expected
>       * to be in perfect sync), initialization here needs to match up with
> @@ -1406,10 +1523,12 @@ void init_percpu_time(void)
>       */
>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>      {
> +        u64 stime = get_s_time_fixed(tsc);
> +
>          if ( system_state < SYS_STATE_smp_boot )
> -            now = get_s_time_fixed(tsc);
> +            now = stime;
>          else
> -            now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
> +            now += stime - ap_bringup_ref.master_stime;

That seems to contradict your desire to keep the skew between
platform timer and TSC out of all calculations.

>      }
> 
> The second option follows thinking similar to this patch, but either way the
> local_stime will match the tsc on init_percpu_time, thus being a matched pair.
> I have a test similar to check_tsc_warp but with get_s_time(), and I no longer
> observe time going backwards. But without the above I still observe it even
> over short periods.

Right - I'm not claiming the series eliminates all backwards moves.
But I simply can't see a model to fully eliminate them. I.e. my plan
was, once the backwards moves are small enough, to maybe
indeed compensate them by maintaining a global variable tracking
the most recently returned value. There are issues with such an
approach too, though: HT effects can result in one hyperthread
making it just past that check of the global, then hardware
switching to the other hyperthread, NOW() producing a slightly
larger value there, and hardware switching back to the first
hyperthread only after the second one consumed the result of
NOW(). Dario's use would be unaffected by this aiui, as his NOW()
invocations are globally serialized through a spinlock, but arbitrary
NOW() invocations on two hyperthreads can't be made such that
the invoking party can be guaranteed to see strictly monotonic
values.

And btw., similar considerations apply for two fully independent
CPUs, if one runs at a much higher P-state than the other (i.e.
the faster one could overtake the slower one between the
monotonicity check in NOW() and the callers consuming the returned
values). So in the end I'm not sure it's worth the performance hit
such a global monotonicity check would incur, and therefore I didn't
make a respective patch part of this series.
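
For concreteness, a minimal sketch of what such a global check might
look like (hypothetical names; this deliberately ignores the HT and
P-state caveats above):

--
/* Hypothetical wrapper: track the most recently returned system time. */
static uint64_t last_now;

static s_time_t monotonic_now(void)
{
    uint64_t old, cur = NOW();

    do {
        old = read_atomic(&last_now);
        if ( cur <= old )
            return old;   /* another CPU already returned a later time */
    } while ( cmpxchg(&last_now, old, cur) != old );

    return cur;
}
--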

Jan


* Re: [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-16  8:27     ` Jan Beulich
@ 2016-06-16 20:27       ` Joao Martins
  2016-06-17  7:32         ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Joao Martins @ 2016-06-16 20:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Dario Faggioli, xen-devel

>>> +    unsigned long flags;
>>> +    u64 tsc;
>>> +
>>> +    local_irq_save(flags);
>>> +    ap_bringup_ref.master_stime = read_platform_stime();
>>> +    tsc = rdtsc();
>>> +    local_irq_restore(flags);
>>> +
>>> +    ap_bringup_ref.local_stime = get_s_time_fixed(tsc);
>>> +}
>>> +
>>>  void init_percpu_time(void)
>>>  {
>>>      struct cpu_time *t = &this_cpu(cpu_time);
>>>      unsigned long flags;
>>> +    u64 tsc;
>>>      s_time_t now;
>>>  
>>>      /* Initial estimate for TSC rate. */
>>>      t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
>>>  
>>>      local_irq_save(flags);
>>> -    t->local_tsc_stamp = rdtsc();
>>>      now = read_platform_stime();
>> Do we need to take another read from the platform timer here? We could
>> just reuse the one on ap_bringup_ref.master_stime. See also below.
> 
> Yes, for this model of approximation of the local stime delta by
> master stime delta we obviously need to read the master clock
> here again (or else we have no way to calculate the master
> delta, which we want to use as the correcting value).
Ah, correct.

>> So it would be possible that get_s_time(S0) on CPU0, immediately followed
>> by a get_s_time(S1) on CPU1 [S1 > S0], ends up returning a value backwards,
>> IOW L0 + scale(S0 - T0) is bigger than L1 + scale(S1 - T1). That is
>> independent of the deltas being added on every calibration.
>>
>> So how about we do the seeding in another way?
>>
>> 1) Relying on individual CPU TSC like you do on CPU 0:
>>
>>      t->stamp.master_stime = now;
>> +    t->stamp.local_tsc = 0;
>> +    t->stamp.local_stime = 0;
>> +
>>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>>      {
>>          now = get_s_time_fixed(tsc);
>>      }
> 
> As said before elsewhere, such a model can only work if all TSCs
> start ticking at the same time. This assumption breaks, at the latest,
> when a CPU gets physically hotplugged.
Agreed.

>> Or 2) taking into account the skew between platform timer and tsc we take on
>> init_percpu_time. Diff below based on this series:
>>
>> @@ -1394,11 +1508,14 @@ void init_percpu_time(void)
>>      t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
>>
>>      local_irq_save(flags);
>> -    now = read_platform_stime();
>> +    now = ap_bringup_ref.master_stime;
>>      tsc = rdtsc_ordered();
>>      local_irq_restore(flags);
>>
>>      t->stamp.master_stime = now;
>> +    t->stamp.local_tsc = boot_tsc_stamp;
> 
> Again - this is invalid. It indeed works fine on a single socket system
> (where all TSCs start ticking at the same time), and gives really good
> results. But due to hotplug (and to a lesser degree multi-socket, but
> likely to a somewhat higher degree multi-node, systems) we just can't
> use boot_tsc_stamp as a reference for APs.
It also gives really good results on my multi-socket systems, at least on my old
Intel box (Westmere-based; the invariant TSC was introduced on its predecessor,
Nehalem). Btw, Linux seems to treat Intel boxes as having synchronized TSCs across
cores and sockets [0][1], though CPU hotplug remains the troublesome case.

[0] arch/x86/kernel/tsc.c#L1082
[1] arch/x86/kernel/cpu/intel.c#L85

> 
>> +    t->stamp.local_stime = 0;
>> +
>>
>>      /*
>>       * To avoid a discontinuity (TSC and platform clock can't be expected
>>       * to be in perfect sync), initialization here needs to match up with
>> @@ -1406,10 +1523,12 @@ void init_percpu_time(void)
>>       */
>>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>>      {
>> +        u64 stime = get_s_time_fixed(tsc);
>> +
>>          if ( system_state < SYS_STATE_smp_boot )
>> -            now = get_s_time_fixed(tsc);
>> +            now = stime;
>>          else
>> -            now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
>> +            now += stime - ap_bringup_ref.master_stime;
> 
> That seems to contradict your desire to keep the skew between
> platform timer and TSC out of all calculations.
> 
That's indeed my desire (in my related series about using the TSC as clocksource
and the TSC stable bit), while being aware of the TSC's limitations: I was trying
to see if there was common ground between this seeding with the platform timer
and accounting for the potentially unreliable TSC of individual CPUs. But of course,
for any of these solutions to have get_s_time return monotonically increasing
values, it can only work on a truly invariant TSC box.

>>      }
>>
>> The second option follows thinking similar to this patch, but either way the
>> local_stime will match the tsc on init_percpu_time, thus being a matched pair.
>> I have a test similar to check_tsc_warp but with get_s_time(), and I no longer
>> observe time going backwards. But without the above I still observe it even
>> over short periods.
> 
> Right - I'm not claiming the series eliminates all backwards moves.
> But I simply can't see a model to fully eliminate them. 
Totally agree with you.

> I.e. my plan was, once the backwards moves are small enough, to maybe
> indeed compensate them by maintaining a global variable tracking
> the most recently returned value. There are issues with such an
> approach too, though: HT effects can result in one hyperthread
> making it just past that check of the global, then hardware
> switching to the other hyperthread, NOW() producing a slightly
> larger value there, and hardware switching back to the first
> hyperthread only after the second one consumed the result of
> NOW(). Dario's use would be unaffected by this aiui, as his NOW()
> invocations are globally serialized through a spinlock, but arbitrary
> NOW() invocations on two hyperthreads can't be made such that
>> the invoking party can be guaranteed to see strictly monotonic
> values.
> 
> And btw., similar considerations apply for two fully independent
> CPUs, if one runs at a much higher P-state than the other (i.e.
> the faster one could overtake the slower one between the
>> monotonicity check in NOW() and the callers consuming the returned
> values). So in the end I'm not sure it's worth the performance hit
>> such a global monotonicity check would incur, and therefore I didn't
> make a respective patch part of this series.
> 

Hm, guests' pvclock should have faced similar issues too, as their
local stamps for each vcpu diverge. Linux commit 489fb49 ("x86, paravirt: Add a
global synchronization point for pvclock") describes a fix for situations similar
to the scenarios you just described, which led to a global variable keeping track
of the most recent timestamp. One important chunk of that commit is pasted below
for convenience:

--
/*
 * Assumption here is that last_value, a global accumulator, always goes
 * forward. If we are less than that, we should not be much smaller.
 * We assume there is an error marging we're inside, and then the correction
 * does not sacrifice accuracy.
 *
 * For reads: global may have changed between test and return,
 * but this means someone else updated poked the clock at a later time.
 * We just need to make sure we are not seeing a backwards event.
 *
 * For updates: last_value = ret is not enough, since two vcpus could be
 * updating at the same time, and one of them could be slightly behind,
 * making the assumption that last_value always go forward fail to hold.
 */
 last = atomic64_read(&last_value);
 do {
     if (ret < last)
         return last;
     last = atomic64_cmpxchg(&last_value, last, ret);
 } while (unlikely(last != ret));
--

Though I am not sure what other options we would have with such a heavy reliance on
TSC right now.

Joao


* Re: [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-16 20:27       ` Joao Martins
@ 2016-06-17  7:32         ` Jan Beulich
  2016-06-21 12:05           ` Joao Martins
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-17  7:32 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, Dario Faggioli, xen-devel

>>> On 16.06.16 at 22:27, <joao.m.martins@oracle.com> wrote:
>> I.e. my plan was, once the backwards moves are small enough, to maybe
>> indeed compensate them by maintaining a global variable tracking
>> the most recently returned value. There are issues with such an
>> approach too, though: HT effects can result in one hyperthread
>> making it just past that check of the global, then hardware
>> switching to the other hyperthread, NOW() producing a slightly
>> larger value there, and hardware switching back to the first
>> hyperthread only after the second one consumed the result of
>> NOW(). Dario's use would be unaffected by this aiui, as his NOW()
>> invocations are globally serialized through a spinlock, but arbitrary
>> NOW() invocations on two hyperthreads can't be made such that
>> the invoking party can be guaranteed to see strictly montonic
>> values.
>> 
>> And btw., similar considerations apply for two fully independent
>> CPUs, if one runs at a much higher P-state than the other (i.e.
>> the faster one could overtake the slower one between the
>> montonicity check in NOW() and the callers consuming the returned
>> values). So in the end I'm not sure it's worth the performance hit
>> such a global montonicity check would incur, and therefore I didn't
>> make a respective patch part of this series.
>> 
> 
> Hm, guests' pvclock should have faced similar issues too, as their
> local stamps for each vcpu diverge. Linux commit 489fb49 ("x86, paravirt:
> Add a global synchronization point for pvclock") describes a fix for
> situations similar to the scenarios you just described, which led to a
> global variable keeping track of the most recent timestamp. One important
> chunk of that commit is pasted below for convenience:
> 
> --
> /*
>  * Assumption here is that last_value, a global accumulator, always goes
>  * forward. If we are less than that, we should not be much smaller.
>  * We assume there is an error marging we're inside, and then the correction
>  * does not sacrifice accuracy.
>  *
>  * For reads: global may have changed between test and return,
>  * but this means someone else updated poked the clock at a later time.
>  * We just need to make sure we are not seeing a backwards event.
>  *
>  * For updates: last_value = ret is not enough, since two vcpus could be
>  * updating at the same time, and one of them could be slightly behind,
>  * making the assumption that last_value always go forward fail to hold.
>  */
>  last = atomic64_read(&last_value);
>  do {
>      if (ret < last)
>          return last;
>      last = atomic64_cmpxchg(&last_value, last, ret);
>  } while (unlikely(last != ret));
> --

Meaning they decided it's worth the overhead. But (having read
through the entire description) they don't even discuss whether this
indeed eliminates _all_ apparent backward moves due to effects
like the ones named above.

Plus, the contention they're facing is limited to a single VM, i.e. likely
much more narrow than that on an entire physical system. So for
us to do the same in the hypervisor, quite a bit more of a win would
be needed to outweigh the cost.

Jan



* Re: [PATCH 2/8] x86: also generate assembler usable equates for synthesized features
  2016-06-15 10:26 ` [PATCH 2/8] x86: also generate assembler usable equates for synthesized features Jan Beulich
@ 2016-06-20 12:50   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2016-06-20 12:50 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Dario Faggioli, Joao Martins

On 15/06/16 11:26, Jan Beulich wrote:
> ... to make it possible to base alternative instruction patching upon
> such.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH 3/8] x86/time: introduce and use rdtsc_ordered()
  2016-06-15 10:27 ` [PATCH 3/8] x86/time: introduce and use rdtsc_ordered() Jan Beulich
@ 2016-06-20 12:59   ` Andrew Cooper
  2016-06-20 13:06     ` Jan Beulich
  2016-07-11 11:39     ` Dario Faggioli
  0 siblings, 2 replies; 39+ messages in thread
From: Andrew Cooper @ 2016-06-20 12:59 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Dario Faggioli, Joao Martins


On 15/06/16 11:27, Jan Beulich wrote:
> Matching Linux commit 03b9730b76 ("x86/asm/tsc: Add rdtsc_ordered() and
> use it in trivial call sites") and earlier ones it builds upon, let's
> make sure timing loops don't have their rdtsc()-s re-ordered, as that
> would harm precision of the result (values were observed to be several
> hundred clocks off without this adjustment).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
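
For reference, such an ordered read boils down to roughly the following;
this is a sketch only, assuming an LFENCE-based barrier (the actual
implementation would presumably select LFENCE vs. MFENCE via alternatives
patching, as Linux does):

--
/* Sketch: keep RDTSC from being reordered with earlier instructions. */
static inline uint64_t rdtsc_ordered(void)
{
    uint32_t lo, hi;

    asm volatile ( "lfence; rdtsc"
                   : "=a" (lo), "=d" (hi) :: "memory" );

    return ((uint64_t)hi << 32) | lo;
}
--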


* Re: [PATCH 3/8] x86/time: introduce and use rdtsc_ordered()
  2016-06-20 12:59   ` Andrew Cooper
@ 2016-06-20 13:06     ` Jan Beulich
  2016-06-20 13:07       ` Andrew Cooper
  2016-07-11 11:39     ` Dario Faggioli
  1 sibling, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-20 13:06 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Dario Faggioli, Joao Martins

>>> On 20.06.16 at 14:59, <andrew.cooper3@citrix.com> wrote:
> On 15/06/16 11:27, Jan Beulich wrote:
>> Matching Linux commit 03b9730b76 ("x86/asm/tsc: Add rdtsc_ordered() and
>> use it in trivial call sites") and earlier ones it builds upon, let's
>> make sure timing loops don't have their rdtsc()-s re-ordered, as that
>> would harm precision of the result (values were observed to be several
>> hundred clocks off without this adjustment).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

I have these two additional hunks for v2:

@@ -1124,16 +1124,13 @@ static void local_time_calibration(void)
  */
 static void check_tsc_warp(unsigned long tsc_khz, unsigned long *max_warp)
 {
-#define rdtsc_barrier() mb()
     static DEFINE_SPINLOCK(sync_lock);
     static cycles_t last_tsc;
 
     cycles_t start, now, prev, end;
     int i;
 
-    rdtsc_barrier();
-    start = get_cycles();
-    rdtsc_barrier();
+    start = rdtsc_ordered();
 
     /* The measurement runs for 20 msecs: */
     end = start + tsc_khz * 20ULL;
@@ -1148,9 +1145,7 @@ static void check_tsc_warp(unsigned long
          */
         spin_lock(&sync_lock);
         prev = last_tsc;
-        rdtsc_barrier();
-        now = get_cycles();
-        rdtsc_barrier();
+        now = rdtsc_ordered();
         last_tsc = now;
         spin_unlock(&sync_lock);
 

May I consider those covered as well?

Jan



* Re: [PATCH 3/8] x86/time: introduce and use rdtsc_ordered()
  2016-06-20 13:06     ` Jan Beulich
@ 2016-06-20 13:07       ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2016-06-20 13:07 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Dario Faggioli, Joao Martins

On 20/06/16 14:06, Jan Beulich wrote:
>>>> On 20.06.16 at 14:59, <andrew.cooper3@citrix.com> wrote:
>> On 15/06/16 11:27, Jan Beulich wrote:
>>> Matching Linux commit 03b9730b76 ("x86/asm/tsc: Add rdtsc_ordered() and
>>> use it in trivial call sites") and earlier ones it builds upon, let's
>>> make sure timing loops don't have their rdtsc()-s re-ordered, as that
>>> would harm precision of the result (values were observed to be several
>>> hundred clocks off without this adjustment).
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> I have these two additional hunks for v2:
>
> @@ -1124,16 +1124,13 @@ static void local_time_calibration(void)
>   */
>  static void check_tsc_warp(unsigned long tsc_khz, unsigned long *max_warp)
>  {
> -#define rdtsc_barrier() mb()
>      static DEFINE_SPINLOCK(sync_lock);
>      static cycles_t last_tsc;
>  
>      cycles_t start, now, prev, end;
>      int i;
>  
> -    rdtsc_barrier();
> -    start = get_cycles();
> -    rdtsc_barrier();
> +    start = rdtsc_ordered();
>  
>      /* The measurement runs for 20 msecs: */
>      end = start + tsc_khz * 20ULL;
> @@ -1148,9 +1145,7 @@ static void check_tsc_warp(unsigned long
>           */
>          spin_lock(&sync_lock);
>          prev = last_tsc;
> -        rdtsc_barrier();
> -        now = get_cycles();
> -        rdtsc_barrier();
> +        now = rdtsc_ordered();
>          last_tsc = now;
>          spin_unlock(&sync_lock);
>  
>
> May I consider those covered as well?

Yes.

I need to dust off my series removing most of the misuse of mb() in the
Xen codebase now that 4.8 is open.

~Andrew


* Re: [PATCH 4/8] x86/time: calibrate TSC against platform timer
  2016-06-15 10:28 ` [PATCH 4/8] x86/time: calibrate TSC against platform timer Jan Beulich
@ 2016-06-20 14:20   ` Andrew Cooper
  2016-06-20 15:19     ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2016-06-20 14:20 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Dario Faggioli, Joao Martins

On 15/06/16 11:28, Jan Beulich wrote:
> --- a/xen/arch/x86/i8259.c
> +++ b/xen/arch/x86/i8259.c
> @@ -359,12 +359,7 @@ void __init init_IRQ(void)
>  
>      apic_intr_init();
>  
> -    /* Set the clock to HZ Hz */
> -#define CLOCK_TICK_RATE 1193182 /* crystal freq (Hz) */
> -#define LATCH (((CLOCK_TICK_RATE)+(HZ/2))/HZ)
> -    outb_p(0x34, PIT_MODE);        /* binary, mode 2, LSB/MSB, ch 0 */
> -    outb_p(LATCH & 0xff, PIT_CH0); /* LSB */
> -    outb(LATCH >> 8, PIT_CH0);     /* MSB */
> +    preinit_pit();

This highlights the fact that we have two unconditional calls to
preinit_pit() during startup, which is one too many.

init_IRQ() is called rather earlier than early_time_init(), but I can't
spot anything in between the two calls which would require the PIT to be
set up.  AFAICT, it is safe to just drop the preinit_pit() call here.
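
(For reference, preinit_pit() presumably just wraps the lines removed
above, i.e. something like:)

--
#define CLOCK_TICK_RATE 1193182 /* crystal freq (Hz) */
#define LATCH (((CLOCK_TICK_RATE) + (HZ / 2)) / HZ)

void __init preinit_pit(void)
{
    /* Program PIT channel 0 as a rate generator ticking at HZ. */
    outb_p(0x34, PIT_MODE);        /* binary, mode 2, LSB/MSB, ch 0 */
    outb_p(LATCH & 0xff, PIT_CH0); /* LSB */
    outb(LATCH >> 8, PIT_CH0);     /* MSB */
}
--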

> @@ -340,7 +348,8 @@ static struct platform_timesource __init
>      .frequency = CLOCK_TICK_RATE,
>      .read_counter = read_pit_count,
>      .counter_bits = 32,
> -    .init = init_pit
> +    .init = init_pit,
> +    .resume = resume_pit

Please add a redundant comma at the end, to reduce the next diff to
change this block.

~Andrew


* Re: [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags
  2016-06-15 10:28 ` [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags Jan Beulich
@ 2016-06-20 14:32   ` Andrew Cooper
  2016-06-20 15:20     ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2016-06-20 14:32 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Dario Faggioli, Joao Martins

On 15/06/16 11:28, Jan Beulich wrote:
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1358,6 +1358,24 @@ static void time_calibration(void *unuse
>                       &r, 1);
>  }
>  
> +void __init clear_tsc_cap(unsigned int feature)
> +{
> +    void (*rendezvous_fn)(void *) = time_calibration_std_rendezvous;

This should read time_calibration_rendezvous_fn rather than assuming
time_calibration_std_rendezvous.

Otherwise, there is a risk that it could be reset back to
time_calibration_std_rendezvous.

~Andrew



* Re: [PATCH 4/8] x86/time: calibrate TSC against platform timer
  2016-06-20 14:20   ` Andrew Cooper
@ 2016-06-20 15:19     ` Jan Beulich
  2016-08-02 19:25       ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-20 15:19 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Dario Faggioli, Joao Martins

>>> On 20.06.16 at 16:20, <andrew.cooper3@citrix.com> wrote:
> On 15/06/16 11:28, Jan Beulich wrote:
>> --- a/xen/arch/x86/i8259.c
>> +++ b/xen/arch/x86/i8259.c
>> @@ -359,12 +359,7 @@ void __init init_IRQ(void)
>>  
>>      apic_intr_init();
>>  
>> -    /* Set the clock to HZ Hz */
>> -#define CLOCK_TICK_RATE 1193182 /* crystal freq (Hz) */
>> -#define LATCH (((CLOCK_TICK_RATE)+(HZ/2))/HZ)
>> -    outb_p(0x34, PIT_MODE);        /* binary, mode 2, LSB/MSB, ch 0 */
>> -    outb_p(LATCH & 0xff, PIT_CH0); /* LSB */
>> -    outb(LATCH >> 8, PIT_CH0);     /* MSB */
>> +    preinit_pit();
> 
> This highlights the fact that we have two unconditional calls to
> preinit_pit() during startup, which is one too many.
> 
> init_IRQ() is called rather earlier than early_time_init(), but I can't
> spot anything in between the two calls which would require the PIT to be
> set up.  AFAICT, it is safe to just drop the preinit_pit() call here.

LAPIC initialization makes use of the PIT, and I think that would
break when removing it here. And since doing it twice is benign,
I'd also like to not drop it from early_time_init().

>> @@ -340,7 +348,8 @@ static struct platform_timesource __init
>>      .frequency = CLOCK_TICK_RATE,
>>      .read_counter = read_pit_count,
>>      .counter_bits = 32,
>> -    .init = init_pit
>> +    .init = init_pit,
>> +    .resume = resume_pit
> 
> Please add a redundant comma at the end, to reduce the next diff to
> change this block.

Well, I'd like the three instances to remain consistent in this regard
(yet touching the others doesn't seem warranted). And a change
here isn't all that likely.

Jan



* Re: [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags
  2016-06-20 14:32   ` Andrew Cooper
@ 2016-06-20 15:20     ` Jan Beulich
  2016-07-04 15:39       ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-20 15:20 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Dario Faggioli, Joao Martins

>>> On 20.06.16 at 16:32, <andrew.cooper3@citrix.com> wrote:
> On 15/06/16 11:28, Jan Beulich wrote:
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -1358,6 +1358,24 @@ static void time_calibration(void *unuse
>>                       &r, 1);
>>  }
>>  
>> +void __init clear_tsc_cap(unsigned int feature)
>> +{
>> +    void (*rendezvous_fn)(void *) = time_calibration_std_rendezvous;
> 
> This should read time_calibration_rendezvous_fn rather than assuming
> time_calibration_std_rendezvous.
> 
> Otherwise, there is a risk that it could be reset back to
> time_calibration_std_rendezvous.

But that's the purpose: We may need to switch back.

Jan



* Re: [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-17  7:32         ` Jan Beulich
@ 2016-06-21 12:05           ` Joao Martins
  2016-06-21 12:28             ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Joao Martins @ 2016-06-21 12:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Dario Faggioli, xen-devel



On 06/17/2016 08:32 AM, Jan Beulich wrote:
>>>> On 16.06.16 at 22:27, <joao.m.martins@oracle.com> wrote:
>>> I.e. my plan was, once the backwards moves are small enough, to maybe
>>> indeed compensate them by maintaining a global variable tracking
>>> the most recently returned value. There are issues with such an
>>> approach too, though: HT effects can result in one hyperthread
>>> making it just past that check of the global, then hardware
>>> switching to the other hyperthread, NOW() producing a slightly
>>> larger value there, and hardware switching back to the first
>>> hyperthread only after the second one consumed the result of
>>> NOW(). Dario's use would be unaffected by this aiui, as his NOW()
>>> invocations are globally serialized through a spinlock, but arbitrary
>>> NOW() invocations on two hyperthreads can't be made such that
>>> the invoking party can be guaranteed to see strictly monotonic
>>> values.
>>>
>>> And btw., similar considerations apply for two fully independent
>>> CPUs, if one runs at a much higher P-state than the other (i.e.
>>> the faster one could overtake the slower one between the
>>> monotonicity check in NOW() and the callers consuming the returned
>>> values). So in the end I'm not sure it's worth the performance hit
>>> such a global monotonicity check would incur, and therefore I didn't
>>> make a respective patch part of this series.
>>>
>>
>> Hm, guests' pvclock should have faced similar issues too, as their
>> local stamps for each vcpu diverge. Linux commit 489fb49 ("x86, paravirt:
>> Add a global synchronization point for pvclock") describes a fix for
>> situations similar to the scenarios you just described, which led to a
>> global variable keeping track of the most recent timestamp. One important
>> chunk of that commit is pasted below for convenience:
>>
>> --
>> /*
>>  * Assumption here is that last_value, a global accumulator, always goes
>>  * forward. If we are less than that, we should not be much smaller.
>>  * We assume there is an error marging we're inside, and then the correction
>>  * does not sacrifice accuracy.
>>  *
>>  * For reads: global may have changed between test and return,
>>  * but this means someone else updated poked the clock at a later time.
>>  * We just need to make sure we are not seeing a backwards event.
>>  *
>>  * For updates: last_value = ret is not enough, since two vcpus could be
>>  * updating at the same time, and one of them could be slightly behind,
>>  * making the assumption that last_value always go forward fail to hold.
>>  */
>>  last = atomic64_read(&last_value);
>>  do {
>>      if (ret < last)
>>          return last;
>>      last = atomic64_cmpxchg(&last_value, last, ret);
>>  } while (unlikely(last != ret));
>> --
> 
> Meaning they decided it's worth the overhead. But (having read
> through the entire description) they don't even discuss whether this
> indeed eliminates _all_ apparent backward moves due to effects
> like the ones named above.
>
> Plus, the contention they're facing is limited to a single VM, i.e. likely
> much more narrow than that on an entire physical system. So for
> us to do the same in the hypervisor, quite a bit more of a win would
> be needed to outweigh the cost.
> 
Experimental details look very unclear too - likely running the time
warp test for 5 days would get some of these cases cleared out. But
as you say it should be much more narrow than that of an entire system.

BTW, it was implicit in the discussion, but my apologies for not
stating it formally/explicitly. So FWIW:

Tested-by: Joao Martins <joao.m.martins@oracle.com>

This series is certainly a way forward in improving cross-CPU monotonicity,
and I am indeed seeing fewer occurrences of time going backwards on my systems.
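
(The test is essentially check_tsc_warp with the TSC read replaced by
get_s_time(); roughly the following, with hypothetical names:)

--
    /* Sketch: cross-CPU warp check on get_s_time() values. */
    spin_lock(&sync_lock);
    prev = last_stime;
    now = get_s_time();
    last_stime = now;
    spin_unlock(&sync_lock);

    if ( prev > now )    /* time went backwards across CPUs */
        *max_warp = max(*max_warp, prev - now);
--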

Joao


* Re: [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-21 12:05           ` Joao Martins
@ 2016-06-21 12:28             ` Jan Beulich
  2016-06-21 13:57               ` Joao Martins
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-06-21 12:28 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, Dario Faggioli, xen-devel

>>> On 21.06.16 at 14:05, <joao.m.martins@oracle.com> wrote:

> 
> On 06/17/2016 08:32 AM, Jan Beulich wrote:
>>>>> On 16.06.16 at 22:27, <joao.m.martins@oracle.com> wrote:
>>>> I.e. my plan was, once the backwards moves are small enough, to maybe
>>>> indeed compensate them by maintaining a global variable tracking
>>>> the most recently returned value. There are issues with such an
>>>> approach too, though: HT effects can result in one hyperthread
>>>> making it just past that check of the global, then hardware
>>>> switching to the other hyperthread, NOW() producing a slightly
>>>> larger value there, and hardware switching back to the first
>>>> hyperthread only after the second one consumed the result of
>>>> NOW(). Dario's use would be unaffected by this aiui, as his NOW()
>>>> invocations are globally serialized through a spinlock, but arbitrary
>>>> NOW() invocations on two hyperthreads can't be made such that
>>>> the invoking party can be guaranteed to see strictly monotonic
>>>> values.
>>>>
>>>> And btw., similar considerations apply for two fully independent
>>>> CPUs, if one runs at a much higher P-state than the other (i.e.
>>>> the faster one could overtake the slower one between the
>>>> monotonicity check in NOW() and the callers consuming the returned
>>>> values). So in the end I'm not sure it's worth the performance hit
>>>> such a global monotonicity check would incur, and therefore I didn't
>>>> make a respective patch part of this series.
>>>>
>>>
>>> Hm, guests' pvclock should have faced similar issues too, as their
>>> local stamps for each vcpu diverge. Linux commit 489fb49 ("x86, paravirt:
>>> Add a global synchronization point for pvclock") describes a fix for
>>> situations similar to the scenarios you just described, which led to a
>>> global variable keeping track of the most recent timestamp. One important
>>> chunk of that commit is pasted below for convenience:
>>>
>>> --
>>> /*
>>>  * Assumption here is that last_value, a global accumulator, always goes
>>>  * forward. If we are less than that, we should not be much smaller.
>>>  * We assume there is an error marging we're inside, and then the correction
>>>  * does not sacrifice accuracy.
>>>  *
>>>  * For reads: global may have changed between test and return,
>>>  * but this means someone else updated poked the clock at a later time.
>>>  * We just need to make sure we are not seeing a backwards event.
>>>  *
>>>  * For updates: last_value = ret is not enough, since two vcpus could be
>>>  * updating at the same time, and one of them could be slightly behind,
>>>  * making the assumption that last_value always go forward fail to hold.
>>>  */
>>>  last = atomic64_read(&last_value);
>>>  do {
>>>      if (ret < last)
>>>          return last;
>>>      last = atomic64_cmpxchg(&last_value, last, ret);
>>>  } while (unlikely(last != ret));
>>> --
>> 
>> Meaning they decided it's worth the overhead. But (having read
>> through the entire description) they don't even discuss whether this
>> indeed eliminates _all_ apparent backward moves due to effects
>> like the ones named above.
>>
>> Plus, the contention they're facing is limited to a single VM, i.e. likely
>> much more narrow than that on an entire physical system. So for
>> us to do the same in the hypervisor, quite a bit more of a win would
>> be needed to outweigh the cost.
>> 
> Experimental details look very unclear too - likely running the time
> warp test for 5 days would get some of these cases cleared out. But
>> as you say it should be much more narrow than that of an entire system.
> 
>> BTW, it was implicit in the discussion, but my apologies for not
>> stating it formally/explicitly. So FWIW:
> 
> Tested-by: Joao Martins <joao.m.martins@oracle.com>

Thanks, but this ...

>> This series is certainly a way forward in improving cross-CPU monotonicity,
>> and I am indeed seeing fewer occurrences of time going backwards on my
>> systems.

... leaves me guessing whether the above was meant for just this
patch, or the entire series.

Jan


* Re: [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-21 12:28             ` Jan Beulich
@ 2016-06-21 13:57               ` Joao Martins
  0 siblings, 0 replies; 39+ messages in thread
From: Joao Martins @ 2016-06-21 13:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Dario Faggioli, xen-devel



On 06/21/2016 01:28 PM, Jan Beulich wrote:
>>>> On 21.06.16 at 14:05, <joao.m.martins@oracle.com> wrote:
> 
>>
>> On 06/17/2016 08:32 AM, Jan Beulich wrote:
>>>>>> On 16.06.16 at 22:27, <joao.m.martins@oracle.com> wrote:
>>>>> I.e. my plan was, once the backwards moves are small enough, to maybe
>>>>> indeed compensate them by maintaining a global variable tracking
>>>>> the most recently returned value. There are issues with such an
>>>>> approach too, though: HT effects can result in one hyperthread
>>>>> making it just past that check of the global, then hardware
>>>>> switching to the other hyperthread, NOW() producing a slightly
>>>>> larger value there, and hardware switching back to the first
>>>>> hyperthread only after the second one consumed the result of
>>>>> NOW(). Dario's use would be unaffected by this aiui, as his NOW()
>>>>> invocations are globally serialized through a spinlock, but arbitrary
>>>>> NOW() invocations on two hyperthreads can't be made such that
>>>>> the invoking party can be guaranteed to see strictly monotonic
>>>>> values.
>>>>>
>>>>> And btw., similar considerations apply for two fully independent
>>>>> CPUs, if one runs at a much higher P-state than the other (i.e.
>>>>> the faster one could overtake the slower one between the
>>>>> monotonicity check in NOW() and the callers consuming the returned
>>>>> values). So in the end I'm not sure it's worth the performance hit
>>>>> such a global monotonicity check would incur, and therefore I didn't
>>>>> make a respective patch part of this series.
>>>>>
>>>>
>>>> Hm, guests' pvclock should have faced similar issues too, as their
>>>> local stamps for each vcpu diverge. Linux commit 489fb49 ("x86, paravirt:
>>>> Add a global synchronization point for pvclock") describes a fix for
>>>> situations similar to the scenarios you just described, which led to a
>>>> global variable keeping track of the most recent timestamp. One important
>>>> chunk of that commit is pasted below for convenience:
>>>>
>>>> --
>>>> /*
>>>>  * Assumption here is that last_value, a global accumulator, always goes
>>>>  * forward. If we are less than that, we should not be much smaller.
>>>>  * We assume there is an error marging we're inside, and then the correction
>>>>  * does not sacrifice accuracy.
>>>>  *
>>>>  * For reads: global may have changed between test and return,
>>>>  * but this means someone else updated poked the clock at a later time.
>>>>  * We just need to make sure we are not seeing a backwards event.
>>>>  *
>>>>  * For updates: last_value = ret is not enough, since two vcpus could be
>>>>  * updating at the same time, and one of them could be slightly behind,
>>>>  * making the assumption that last_value always go forward fail to hold.
>>>>  */
>>>>  last = atomic64_read(&last_value);
>>>>  do {
>>>>      if (ret < last)
>>>>          return last;
>>>>      last = atomic64_cmpxchg(&last_value, last, ret);
>>>>  } while (unlikely(last != ret));
>>>> --
>>>
>>> Meaning they decided it's worth the overhead. But (having read
>>> through the entire description) they don't even discuss whether this
>>> indeed eliminates _all_ apparent backward moves due to effects
>>> like the ones named above.
>>>
>>> Plus, the contention they're facing is limited to a single VM, i.e. likely
>>> much more narrow than that on an entire physical system. So for
>>> us to do the same in the hypervisor, quite a bit more of a win would
>>> be needed to outweigh the cost.
>>>
>> Experimental details look very unclear too - likely running the time
>> warp test for 5 days would get some of these cases cleared out. But
>> as you say it should be much more narrow than that of an entire system.
>>
>> BTW, it was implicit in the discussion, but my apologies for not
>> stating it formally/explicitly. So FWIW:
>>
>> Tested-by: Joao Martins <joao.m.martins@oracle.com>
> 
> Thanks, but this ...
> 
>> This series is certainly a way forward in improving cross-CPU monotonicity,
>> and I am indeed seeing fewer occurrences of time going backwards on my
>> systems.
> 
> ... leaves me guessing whether the above was meant for just this
> patch, or the entire series.
> 
Ah sorry, a little ambiguous on my end - it is meant for the entire series.

Joao


* Ping: [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-15  9:50 [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
                   ` (7 preceding siblings ...)
  2016-06-15 10:30 ` [PATCH 8/8] x86/time: group time stamps into a structure Jan Beulich
@ 2016-07-01  7:44 ` Jan Beulich
  8 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2016-07-01  7:44 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Dario Faggioli, Joao Martins

>>> On 15.06.16 at 11:50, <JBeulich@suse.com> wrote:
> 1: x86/time: adjust local system time initialization
> 2: x86: also generate assembler usable equates for synthesized features
> 3: x86/time: introduce and use rdtsc_ordered()
> 4: x86/time: calibrate TSC against platform timer
> 5: x86/time: correctly honor late clearing of TSC related feature flags
> 6: x86/time: support 32-bit wide ACPI PM timer
> 7: x86/time: fold recurring code
> 8: x86/time: group time stamps into a structure

Patches 1 and 4...8 are still awaiting feedback from you.

Thanks, Jan



* Re: [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags
  2016-06-20 15:20     ` Jan Beulich
@ 2016-07-04 15:39       ` Andrew Cooper
  2016-07-04 15:53         ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2016-07-04 15:39 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Dario Faggioli, Joao Martins

On 20/06/16 16:20, Jan Beulich wrote:
>>>> On 20.06.16 at 16:32, <andrew.cooper3@citrix.com> wrote:
>> On 15/06/16 11:28, Jan Beulich wrote:
>>> --- a/xen/arch/x86/time.c
>>> +++ b/xen/arch/x86/time.c
>>> @@ -1358,6 +1358,24 @@ static void time_calibration(void *unuse
>>>                       &r, 1);
>>>  }
>>>  
>>> +void __init clear_tsc_cap(unsigned int feature)
>>> +{
>>> +    void (*rendezvous_fn)(void *) = time_calibration_std_rendezvous;
>> This should read time_calibration_rendezvous_fn rather than assuming
>> time_calibration_std_rendezvous.
>>
>> Otherwise, there is a risk that it could be reset back to
>> time_calibration_std_rendezvous.
> But that's the purpose: We may need to switch back.

Under what circumstances could we ever move from re-syncing back to not
re-syncing?

Even in a scenario with pcpu hotplug, where we lose the final pcpu which
was causing re-syncing, we don't know that one of the intermediate pcpus
hasn't gone and come back, with a different TSC base.

~Andrew


* Re: [PATCH 6/8] x86/time: support 32-bit wide ACPI PM timer
  2016-06-15 10:29 ` [PATCH 6/8] x86/time: support 32-bit wide ACPI PM timer Jan Beulich
@ 2016-07-04 15:40   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2016-07-04 15:40 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Dario Faggioli, Joao Martins


On 15/06/16 11:29, Jan Beulich wrote:
> I have no idea why we didn't do so from the beginning.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH 7/8] x86/time: fold recurring code
  2016-06-15 10:30 ` [PATCH 7/8] x86/time: fold recurring code Jan Beulich
@ 2016-07-04 15:43   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2016-07-04 15:43 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Dario Faggioli, Joao Martins


On 15/06/16 11:30, Jan Beulich wrote:
> Common code between time_calibration_{std,tsc}_rendezvous() can better
> live in a single place, eliminating the risk of adjusting one without
> the other.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags
  2016-07-04 15:39       ` Andrew Cooper
@ 2016-07-04 15:53         ` Jan Beulich
  2016-08-02 19:08           ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-07-04 15:53 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Dario Faggioli, Joao Martins

>>> On 04.07.16 at 17:39, <andrew.cooper3@citrix.com> wrote:
> On 20/06/16 16:20, Jan Beulich wrote:
>>>>> On 20.06.16 at 16:32, <andrew.cooper3@citrix.com> wrote:
>>> On 15/06/16 11:28, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/time.c
>>>> +++ b/xen/arch/x86/time.c
>>>> @@ -1358,6 +1358,24 @@ static void time_calibration(void *unuse
>>>>                       &r, 1);
>>>>  }
>>>>  
>>>> +void __init clear_tsc_cap(unsigned int feature)
>>>> +{
>>>> +    void (*rendezvous_fn)(void *) = time_calibration_std_rendezvous;
>>> This should read time_calibration_rendezvous_fn rather than assuming
>>> time_calibration_std_rendezvous.
>>>
>>> Otherwise, there is a risk that it could be reset back to
>>> time_calibration_std_rendezvous.
>> But that's the purpose: We may need to switch back.
> 
> Under what circumstances could we ever move from re-syncing back to not
> re-syncing?

verify_tsc_reliability() may result in X86_FEATURE_TSC_RELIABLE
getting cleared. That's an initcall, which means it runs after
init_xen_time(), and hence after the rendezvous function got
established initially.
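
To illustrate the intended flow (a reconstruction from the context
above, not necessarily the literal patch):

--
void __init clear_tsc_cap(unsigned int feature)
{
    void (*rendezvous_fn)(void *) = time_calibration_std_rendezvous;

    if ( feature )
        setup_clear_cpu_cap(feature);

    /* Not (or no longer) reliable? Use the re-syncing rendezvous. */
    if ( !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
        rendezvous_fn = time_calibration_tsc_rendezvous;

    time_calibration_rendezvous_fn = rendezvous_fn;
}
--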

Jan



* Re: [PATCH 8/8] x86/time: group time stamps into a structure
  2016-06-15 10:30 ` [PATCH 8/8] x86/time: group time stamps into a structure Jan Beulich
@ 2016-07-04 15:57   ` Andrew Cooper
  0 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2016-07-04 15:57 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Dario Faggioli, Joao Martins

On 15/06/16 11:30, Jan Beulich wrote:
> If that had been done from the beginning, mistakes like the one
> corrected in commit b64438c7c1 ("x86/time: use correct (local) time
> stamp in constant-TSC calibration fast path") would likely never have
> happened.
>
> Also add a few "const" to make more obvious when things aren't expected
> to change.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

With one further suggestion.

> @@ -1037,40 +1029,34 @@ static void local_time_calibration(void)
>      {
>          /* Atomically read cpu_calibration struct and write cpu_time struct. */
>          local_irq_disable();
> -        t->local_tsc_stamp    = c->local_tsc_stamp;
> -        t->stime_local_stamp  = c->stime_local_stamp;
> -        t->stime_master_stamp = c->stime_master_stamp;
> +        t->stamp = *c;
>          local_irq_enable();
>          update_vcpu_system_time(current);
>          goto out;
>      }
>  
> -    prev_tsc          = t->local_tsc_stamp;
> -    prev_local_stime  = t->stime_local_stamp;
> -    prev_master_stime = t->stime_master_stamp;
> +    prev = t->stamp;
>  
>      /* Disabling IRQs ensures we atomically read cpu_calibration struct. */
>      local_irq_disable();
> -    curr_tsc          = c->local_tsc_stamp;
> -    curr_local_stime  = c->stime_local_stamp;
> -    curr_master_stime = c->stime_master_stamp;
> +    curr = *c;
>      local_irq_enable();
>  
>  #if 0

Turning this #if 0 into an "if ( 0 ) /* For debugging. */" would cause
the printk()s to be parsed and discarded, and avoid the risk of printk
parameters bitrotting.

>      printk("PRE%d: tsc=%"PRIu64" stime=%"PRIu64" master=%"PRIu64"\n",
> -           smp_processor_id(), prev_tsc, prev_local_stime, prev_master_stime);
> +           smp_processor_id(), prev.local_tsc, prev.local_stime, prev.master_stime);
>      printk("CUR%d: tsc=%"PRIu64" stime=%"PRIu64" master=%"PRIu64
>             " -> %"PRId64"\n",
> -           smp_processor_id(), curr_tsc, curr_local_stime, curr_master_stime,
> -           curr_master_stime - curr_local_stime);
> +           smp_processor_id(), curr.local_tsc, curr.local_stime, curr.master_stime,
> +           curr.master_stime - curr.local_stime);
>  #endif
>  
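
(I.e., applied to the first printk() above, something like:)

--
    if ( 0 ) /* For debugging. */
    {
        printk("PRE%d: tsc=%"PRIu64" stime=%"PRIu64" master=%"PRIu64"\n",
               smp_processor_id(), prev.local_tsc, prev.local_stime,
               prev.master_stime);
    }
--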



* Re: [PATCH 3/8] x86/time: introduce and use rdtsc_ordered()
  2016-06-20 12:59   ` Andrew Cooper
  2016-06-20 13:06     ` Jan Beulich
@ 2016-07-11 11:39     ` Dario Faggioli
  1 sibling, 0 replies; 39+ messages in thread
From: Dario Faggioli @ 2016-07-11 11:39 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, xen-devel; +Cc: Joao Martins


On Mon, 2016-06-20 at 13:59 +0100, Andrew Cooper wrote:
> On 15/06/16 11:27, Jan Beulich wrote:
> > Matching Linux commit 03b9730b76 ("x86/asm/tsc: Add rdtsc_ordered()
> > and
> > use it in trivial call sites") and earlier ones it builds upon,
> > let's
> > make sure timing loops don't have their rdtsc()-s re-ordered, as
> > that
> > would harm precision of the result (values were observed to be
> > several
> > hundred clocks off without this adjustment).
> > 
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
>  
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
FWIW:

Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Tested-by: Dario Faggioli <dario.faggioli@citrix.com>

(or Reviewed-and-Tested-by: as you wish :-)).

FTR, during my own investigation, before raising the issue on the
mailing list, I also came to the conclusion that we'd need something
like this. I even tried doing something like this (in a much more hacky
way), and had the feeling that it was making a difference but, of
course, alone, without all the other issues that Jan found and fixed in
this series, it wasn't enough.

Thanks and regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags
  2016-07-04 15:53         ` Jan Beulich
@ 2016-08-02 19:08           ` Andrew Cooper
  2016-08-03  9:43             ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2016-08-02 19:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Dario Faggioli, Joao Martins

On 04/07/16 16:53, Jan Beulich wrote:
>>>> On 04.07.16 at 17:39, <andrew.cooper3@citrix.com> wrote:
>> On 20/06/16 16:20, Jan Beulich wrote:
>>>>>> On 20.06.16 at 16:32, <andrew.cooper3@citrix.com> wrote:
>>>> On 15/06/16 11:28, Jan Beulich wrote:
>>>>> --- a/xen/arch/x86/time.c
>>>>> +++ b/xen/arch/x86/time.c
>>>>> @@ -1358,6 +1358,24 @@ static void time_calibration(void *unuse
>>>>>                       &r, 1);
>>>>>  }
>>>>>  
>>>>> +void __init clear_tsc_cap(unsigned int feature)
>>>>> +{
>>>>> +    void (*rendezvous_fn)(void *) = time_calibration_std_rendezvous;
>>>> This should read time_calibration_rendezvous_fn rather than assuming
>>>> time_calibration_std_rendezvous.
>>>>
>>>> Otherwise, there is a risk that it could be reset back to
>>>> time_calibration_std_rendezvous.
>>> But that's the purpose: We may need to switch back.
>> Under what circumstances could we ever move from re-syncing back to not
>> re-syncing?
> verify_tsc_reliability() may result in X86_FEATURE_TSC_RELIABLE
> getting cleared. That's an initcall, which means it runs after
> init_xen_time(), and hence after the rendezvous function got
> established initially.

Right, but that isn't important.

There will never be a case where, once TSC_RELIABLE is cleared, it is
safe to revert back to std_rendezvous, even if TSC_RELIABLE is
subsequently re-set.

Therefore, this function must never accidentally revert
time_calibration_rendezvous_fn from time_calibration_tsc_rendezvous back
to time_calibration_std_rendezvous.

~Andrew


* Re: [PATCH 4/8] x86/time: calibrate TSC against platform timer
  2016-06-20 15:19     ` Jan Beulich
@ 2016-08-02 19:25       ` Andrew Cooper
  2016-08-03  9:32         ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2016-08-02 19:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Dario Faggioli, Joao Martins

On 20/06/16 16:19, Jan Beulich wrote:
>>>> On 20.06.16 at 16:20, <andrew.cooper3@citrix.com> wrote:
>> On 15/06/16 11:28, Jan Beulich wrote:
>>> --- a/xen/arch/x86/i8259.c
>>> +++ b/xen/arch/x86/i8259.c
>>> @@ -359,12 +359,7 @@ void __init init_IRQ(void)
>>>  
>>>      apic_intr_init();
>>>  
>>> -    /* Set the clock to HZ Hz */
>>> -#define CLOCK_TICK_RATE 1193182 /* crystal freq (Hz) */
>>> -#define LATCH (((CLOCK_TICK_RATE)+(HZ/2))/HZ)
>>> -    outb_p(0x34, PIT_MODE);        /* binary, mode 2, LSB/MSB, ch 0 */
>>> -    outb_p(LATCH & 0xff, PIT_CH0); /* LSB */
>>> -    outb(LATCH >> 8, PIT_CH0);     /* MSB */
>>> +    preinit_pit();
>> This highlights the fact that we have two unconditional calls to
>> preinit_pit() during startup, which is one too many.
>>
>> init_IRQ() is called rather earlier than early_time_init(), but I can't
>> spot anything in between the two calls which would require the PIT to be
>> set up.  AFAICT, it is safe to just drop the preinit_pit() call here.
> LAPIC initialization makes use of the PIT, and I think that would
> break when removing it here. And since doing it twice is benign,
> I'd also like to not drop it from early_time_init().

Where? LAPIC initialisation is before this point - it's higher up in
init_IRQ(), so it surely can't depend on this currently working.

As for benign, at best it is a waste of time, and on reduced
hardware platforms, wrong.  We shouldn't be propagating problems like these.

>
>>> @@ -340,7 +348,8 @@ static struct platform_timesource __init
>>>      .frequency = CLOCK_TICK_RATE,
>>>      .read_counter = read_pit_count,
>>>      .counter_bits = 32,
>>> -    .init = init_pit
>>> +    .init = init_pit,
>>> +    .resume = resume_pit
>> Please add a redundant comma at the end, to reduce the next diff to
>> change this block.
>> Well, I'd like the three instances to remain consistent in this regard
> (yet touching the others doesn't seem warranted). And a change
> here isn't all that likely.

This is just like any other bit of style - it should be fixed up while
editing even if the rest of the file isn't completely up to scratch.

~Andrew


* Re: [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)
  2016-06-15 10:26 ` [PATCH 1/8] " Jan Beulich
  2016-06-15 10:32   ` Jan Beulich
  2016-06-15 22:51   ` Joao Martins
@ 2016-08-02 19:30   ` Andrew Cooper
  2 siblings, 0 replies; 39+ messages in thread
From: Andrew Cooper @ 2016-08-02 19:30 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Dario Faggioli, Joao Martins


On 15/06/16 11:26, Jan Beulich wrote:
> Using the bare return value from read_platform_stime() is not suitable
> when local_time_calibration() is going to use its fast path: Divergence
> of several dozen microseconds between NOW() return values on different
> CPUs results when platform and local time don't stay in close sync.
>
> Latch local and platform time on the CPU initiating AP bringup, such
> that the AP can use these values to seed its stime_local_stamp with as
> small an error as possible. The boot CPU, otoh, can simply
> calculate the correct initial value (other CPUs could do so too with
> even greater accuracy than the approach being introduced, but that can
> work only if all CPUs' TSCs start ticking at the same time, which
> generally can't be assumed to be the case on multi-socket systems).
>
> This slightly defers init_percpu_time() (moved ahead by commit
> dd2658f966 ["x86/time: initialise time earlier during
> start_secondary()"]) in order to reduce as much as possible the gap
> between populating the stamps and consuming them.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Subject to the style issue spotted by Joao, Reviewed-by: Andrew Cooper
<andrew.cooper3@citrix.com>
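
As a rough illustration of the latching described above (all names
below are illustrative only, not taken from the patch):

/* Latched on the CPU initiating AP bringup, just before starting the AP. */
static struct {
    s_time_t master_stime; /* platform time at the latch point */
    s_time_t local_stime;  /* initiating CPU's system time at the same point */
} ap_bringup_ref;

static void latch_ap_bringup_ref(void)
{
    ap_bringup_ref.local_stime = NOW();
    ap_bringup_ref.master_stime = read_platform_stime();
}

/* The AP's init_percpu_time() then seeds stime_local_stamp from these
 * values instead of from a bare read_platform_stime() reading. */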


* Re: [PATCH 4/8] x86/time: calibrate TSC against platform timer
  2016-08-02 19:25       ` Andrew Cooper
@ 2016-08-03  9:32         ` Jan Beulich
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2016-08-03  9:32 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Dario Faggioli, Joao Martins

>>> On 02.08.16 at 21:25, <andrew.cooper3@citrix.com> wrote:
> On 20/06/16 16:19, Jan Beulich wrote:
>>>>> On 20.06.16 at 16:20, <andrew.cooper3@citrix.com> wrote:
>>> On 15/06/16 11:28, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/i8259.c
>>>> +++ b/xen/arch/x86/i8259.c
>>>> @@ -359,12 +359,7 @@ void __init init_IRQ(void)
>>>>  
>>>>      apic_intr_init();
>>>>  
>>>> -    /* Set the clock to HZ Hz */
>>>> -#define CLOCK_TICK_RATE 1193182 /* crystal freq (Hz) */
>>>> -#define LATCH (((CLOCK_TICK_RATE)+(HZ/2))/HZ)
>>>> -    outb_p(0x34, PIT_MODE);        /* binary, mode 2, LSB/MSB, ch 0 */
>>>> -    outb_p(LATCH & 0xff, PIT_CH0); /* LSB */
>>>> -    outb(LATCH >> 8, PIT_CH0);     /* MSB */
>>>> +    preinit_pit();
>>> This highlights the fact that we have two unconditional calls to
>>> preinit_pit() during startup, which is one too many.
>>>
>>> init_IRQ() is called rather earlier than early_time_init(), but I can't
>>> spot anything in between the two calls which would require the PIT to be
>>> set up.  AFAICT, it is safe to just drop the preinit_pit() call here.
>> LAPIC initialization makes use of the PIT, and I think that would
>> break when removing it here. And since doing it twice is benign,
>> I'd also like to not drop it from early_time_init().
> 
> Where? LAPIC initialisation happens before this point - it's higher up
> in init_IRQ() - so it surely can't depend on this call currently working.

Hmm, looks like my earlier flow analysis was wrong (or perhaps
based on the old placement of the call to init_platform_timer() in
init_xen_time()):

__start_xen()
-> [line 1388] init_IRQ()
-> [line 1429] early_time_init()
-> [line 1447] smp_prepare_cpus()
 -> setup_boot_APIC_clock()
  -> calibrate_APIC_clock()
-> [line 1456] init_xen_time();

Let me verify that removing the one here also works in practice.
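
(For context, the PIT dependency in that flow: calibrate_APIC_clock()
needs an interval of known length to count APIC ticks against, which on
this path is provided by the PIT as programmed by preinit_pit().
Conceptually something like the below, where the helper name is made up
and read_pit_count() stands for a raw 16-bit channel 0 read:)

/* Wait for PIT channel 0 to wrap, i.e. for one full period at HZ. */
static void wait_pit_wrap(void)
{
    uint16_t prev = read_pit_count();

    for ( ; ; )
    {
        uint16_t cur = read_pit_count();

        /* The counter counts down; an increase means it reloaded. */
        if ( cur > prev )
            break;
        prev = cur;
    }
}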

> As for benign: at best it is a waste of time, and on reduced hardware
> platforms it is wrong.  We shouldn't be propagating problems like these.
> 
>>
>>>> @@ -340,7 +348,8 @@ static struct platform_timesource __init
>>>>      .frequency = CLOCK_TICK_RATE,
>>>>      .read_counter = read_pit_count,
>>>>      .counter_bits = 32,
>>>> -    .init = init_pit
>>>> +    .init = init_pit,
>>>> +    .resume = resume_pit
>>> Please add a redundant comma at the end, to reduce the next diff to
>>> change this block.
>> Well, I'd like the three instances to remain consistent in this regard
>> (yet touching the others doesn't seem warranted). And a change
>> here isn't all that likely.
> 
> This is just like any other bit of style - it should be fixed up while
> editing even if the rest of the file isn't completely up to scratch.

Well, I actually had done that part already.

Jan



* Re: [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags
  2016-08-02 19:08           ` Andrew Cooper
@ 2016-08-03  9:43             ` Jan Beulich
  2016-08-31 13:42               ` Andrew Cooper
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Beulich @ 2016-08-03  9:43 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Dario Faggioli, Joao Martins

>>> On 02.08.16 at 21:08, <andrew.cooper3@citrix.com> wrote:
> On 04/07/16 16:53, Jan Beulich wrote:
>>>>> On 04.07.16 at 17:39, <andrew.cooper3@citrix.com> wrote:
>>> On 20/06/16 16:20, Jan Beulich wrote:
>>>>>>> On 20.06.16 at 16:32, <andrew.cooper3@citrix.com> wrote:
>>>>> On 15/06/16 11:28, Jan Beulich wrote:
>>>>>> --- a/xen/arch/x86/time.c
>>>>>> +++ b/xen/arch/x86/time.c
>>>>>> @@ -1358,6 +1358,24 @@ static void time_calibration(void *unuse
>>>>>>                       &r, 1);
>>>>>>  }
>>>>>>  
>>>>>> +void __init clear_tsc_cap(unsigned int feature)
>>>>>> +{
>>>>>> +    void (*rendezvous_fn)(void *) = time_calibration_std_rendezvous;
>>>>> This should read time_calibration_rendezvous_fn rather than assuming
>>>>> time_calibration_std_rendezvous.
>>>>>
>>>>> Otherwise, there is a risk that it could be reset back to
>>>>> time_calibration_std_rendezvous.
>>>> But that's the purpose: We may need to switch back.
>>> Under what circumstances could we ever move from re-syncing back to not
>>> re-syncing?
>> verify_tsc_reliability() may result in X86_FEATURE_TSC_RELIABLE
>> getting cleared. That's an initcall, which means it runs after
>> init_xen_time(), and hence after the rendezvous function got
>> established initially.
> 
> Right, but that isn't important.
> 
> There will never be a case where, once TSC_RELIABLE is cleared, it is
> safe to revert back to std_rendezvous, even if TSC_RELIABLE is
> subsequently re-set.

You've got this backwards: TSC_RELIABLE may get _cleared_ late.
Nothing can ever set it late, due to the use of setup_clear_cpu_cap().
Reverting back to time_calibration_std_rendezvous() would only be
possible if CONSTANT_TSC got cleared late, ...

> Therefore, this function must never accidentally revert
> time_calibration_rendezvous_fn from time_calibration_tsc_rendezvous back
> to time_calibration_std_rendezvous.

... yet I can't see what would be wrong with that, especially when
it still happens at boot time. Or else how would it be safe to
switch in the other direction? (If neither switch was safe, then all we
could do upon finding the TSC unreliable in verify_tsc_reliability()
would be to panic().)

Jan



* Re: [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags
  2016-08-03  9:43             ` Jan Beulich
@ 2016-08-31 13:42               ` Andrew Cooper
  2016-08-31 14:03                 ` Jan Beulich
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Cooper @ 2016-08-31 13:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Dario Faggioli, Joao Martins

On 03/08/16 10:43, Jan Beulich wrote:
>>>> On 02.08.16 at 21:08, <andrew.cooper3@citrix.com> wrote:
>> On 04/07/16 16:53, Jan Beulich wrote:
>>>>>> On 04.07.16 at 17:39, <andrew.cooper3@citrix.com> wrote:
>>>> On 20/06/16 16:20, Jan Beulich wrote:
>>>>>>>> On 20.06.16 at 16:32, <andrew.cooper3@citrix.com> wrote:
>>>>>> On 15/06/16 11:28, Jan Beulich wrote:
>>>>>>> --- a/xen/arch/x86/time.c
>>>>>>> +++ b/xen/arch/x86/time.c
>>>>>>> @@ -1358,6 +1358,24 @@ static void time_calibration(void *unuse
>>>>>>>                       &r, 1);
>>>>>>>  }
>>>>>>>  
>>>>>>> +void __init clear_tsc_cap(unsigned int feature)
>>>>>>> +{
>>>>>>> +    void (*rendezvous_fn)(void *) = time_calibration_std_rendezvous;
>>>>>> This should read time_calibration_rendezvous_fn rather than assuming
>>>>>> time_calibration_std_rendezvous.
>>>>>>
>>>>>> Otherwise, there is a risk that it could be reset back to
>>>>>> time_calibration_std_rendezvous.
>>>>> But that's the purpose: We may need to switch back.
>>>> Under what circumstances could we ever move from re-syncing back to not
>>>> re-syncing?
>>> verify_tsc_reliability() may result in X86_FEATURE_TSC_RELIABLE
>>> getting cleared. That's an initcall, which means it runs after
>>> init_xen_time(), and hence after the rendezvous function got
>>> established initially.
>> Right, but that isn't important.
>>
>> There will never be a case where, once TSC_RELIABLE is cleared, it is
>> safe to revert back to std_rendezvous, even if TSC_RELIABLE is
>> subsequently re-set.
> You've got this backwards: TSC_RELIABLE may get _cleared_ late.

Quite - I haven't got this backwards.

> Nothing can ever set it late, due to the use of setup_clear_cpu_cap().
> Reverting back to time_calibration_std_rendezvous() would only be
> possible if CONSTANT_TSC got cleared late, ...

time_calibration_rendezvous_fn defaults to
time_calibration_std_rendezvous(), i.e. defaults to the assumption that
the TSCs are invariant.

We then later call clear_caps(TSC_RELIABLE), and the default changes to
time_calibration_tsc_rendezvous().

We then later call clear_tsc_cap(CONSTANT_TSC) - or indeed CONSTANT_TSC
was never set in the first place - and the selection switches back to
time_calibration_std_rendezvous() because of the aforementioned bug.

Once the switch to time_calibration_tsc_rendezvous() is made, it is
never safe to switch back.

Therefore, your function must read time_calibration_rendezvous_fn and
not assume time_calibration_std_rendezvous().
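
I.e. something along these lines (a sketch only - everything beyond the
initialisation from the live pointer is illustrative):

void __init clear_tsc_cap(unsigned int feature)
{
    /* Start from the currently installed function, not the default. */
    void (*rendezvous_fn)(void *) = time_calibration_rendezvous_fn;

    setup_clear_cpu_cap(feature);

    /* Constant-rate but not 'reliable' TSCs need re-syncing at rendezvous. */
    if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
         !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
        rendezvous_fn = time_calibration_tsc_rendezvous;

    /* This can only ever switch std -> tsc, never back. */
    time_calibration_rendezvous_fn = rendezvous_fn;
}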

~Andrew


* Re: [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags
  2016-08-31 13:42               ` Andrew Cooper
@ 2016-08-31 14:03                 ` Jan Beulich
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Beulich @ 2016-08-31 14:03 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Dario Faggioli, Joao Martins

>>> On 31.08.16 at 15:42, <andrew.cooper3@citrix.com> wrote:
> On 03/08/16 10:43, Jan Beulich wrote:
>>>>> On 02.08.16 at 21:08, <andrew.cooper3@citrix.com> wrote:
>>> On 04/07/16 16:53, Jan Beulich wrote:
>>>>>>> On 04.07.16 at 17:39, <andrew.cooper3@citrix.com> wrote:
>>>>> On 20/06/16 16:20, Jan Beulich wrote:
>>>>>>>>> On 20.06.16 at 16:32, <andrew.cooper3@citrix.com> wrote:
>>>>>>> On 15/06/16 11:28, Jan Beulich wrote:
>>>>>>>> --- a/xen/arch/x86/time.c
>>>>>>>> +++ b/xen/arch/x86/time.c
>>>>>>>> @@ -1358,6 +1358,24 @@ static void time_calibration(void *unuse
>>>>>>>>                       &r, 1);
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +void __init clear_tsc_cap(unsigned int feature)
>>>>>>>> +{
>>>>>>>> +    void (*rendezvous_fn)(void *) = time_calibration_std_rendezvous;
>>>>>>> This should read time_calibration_rendezvous_fn rather than assuming
>>>>>>> time_calibration_std_rendezvous.
>>>>>>>
>>>>>>> Otherwise, there is a risk that it could be reset back to
>>>>>>> time_calibration_std_rendezvous.
>>>>>> But that's the purpose: We may need to switch back.
>>>>> Under what circumstances could we ever move from re-syncing back to not
>>>>> re-syncing?
>>>> verify_tsc_reliability() may result in X86_FEATURE_TSC_RELIABLE
>>>> getting cleared. That's an initcall, which means it runs after
>>>> init_xen_time(), and hence after the rendezvous function got
>>>> established initially.
>>> Right, but that isn't important.
>>>
>>> There will never be a case where, once TSC_RELIABLE is cleared, it is
>>> safe to revert back to std_rendezvous, even if TSC_RELIABLE is
>>> subsequently re-set.
>> You've got this backwards: TSC_RELIABLE may get _cleared_ late.
> 
> Quite - I haven't got this backwards.
> 
>> Nothing can ever set it late, due to the use of setup_clear_cpu_cap().
>> Reverting back to time_calibration_std_rendezvous() would only be
>> possible if CONSTANT_TSC got cleared late, ...
> 
> time_calibration_rendezvous_fn defaults to
> time_calibration_std_rendezvous(), i.e. defaults to the assumption that
> the TSCs are invariant.
> 
> We then later call clear_caps(TSC_RELIABLE), and the default changes to
> time_calibration_tsc_rendezvous().
> 
> We then later call clear_tsc_cap(CONSTANT_TSC) - or indeed CONSTANT_TSC
> was never set in the first place - and the selection switches back to
> time_calibration_std_rendezvous() because of the aforementioned bug.
> 
> Once the switch to time_calibration_tsc_rendezvous() is made, it is
> never safe to switch back.

You still don't explain why - I don't see what's wrong with doing
so, at least while not much skew has accumulated yet. Plus
I don't see why running with tsc_rendezvous would be fine when one of
the two pre-conditions for switching to it isn't met, but we find
that out only after having brought up APs.

Jan

> Therefore, your function must read time_calibration_rendezvous_fn and
> not assume time_calibration_std_rendezvous().
> 
> ~Andrew





Thread overview: 39+ messages
2016-06-15  9:50 [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
2016-06-15 10:26 ` [PATCH 1/8] " Jan Beulich
2016-06-15 10:32   ` Jan Beulich
2016-06-15 22:51   ` Joao Martins
2016-06-16  8:27     ` Jan Beulich
2016-06-16 20:27       ` Joao Martins
2016-06-17  7:32         ` Jan Beulich
2016-06-21 12:05           ` Joao Martins
2016-06-21 12:28             ` Jan Beulich
2016-06-21 13:57               ` Joao Martins
2016-08-02 19:30   ` Andrew Cooper
2016-06-15 10:26 ` [PATCH 2/8] x86: also generate assembler usable equates for synthesized features Jan Beulich
2016-06-20 12:50   ` Andrew Cooper
2016-06-15 10:27 ` [PATCH 3/8] x86/time: introduce and use rdtsc_ordered() Jan Beulich
2016-06-20 12:59   ` Andrew Cooper
2016-06-20 13:06     ` Jan Beulich
2016-06-20 13:07       ` Andrew Cooper
2016-07-11 11:39     ` Dario Faggioli
2016-06-15 10:28 ` [PATCH 4/8] x86/time: calibrate TSC against platform timer Jan Beulich
2016-06-20 14:20   ` Andrew Cooper
2016-06-20 15:19     ` Jan Beulich
2016-08-02 19:25       ` Andrew Cooper
2016-08-03  9:32         ` Jan Beulich
2016-06-15 10:28 ` [PATCH 5/8] x86/time: correctly honor late clearing of TSC related feature flags Jan Beulich
2016-06-20 14:32   ` Andrew Cooper
2016-06-20 15:20     ` Jan Beulich
2016-07-04 15:39       ` Andrew Cooper
2016-07-04 15:53         ` Jan Beulich
2016-08-02 19:08           ` Andrew Cooper
2016-08-03  9:43             ` Jan Beulich
2016-08-31 13:42               ` Andrew Cooper
2016-08-31 14:03                 ` Jan Beulich
2016-06-15 10:29 ` [PATCH 6/8] x86/time: support 32-bit wide ACPI PM timer Jan Beulich
2016-07-04 15:40   ` Andrew Cooper
2016-06-15 10:30 ` [PATCH 7/8] x86/time: fold recurring code Jan Beulich
2016-07-04 15:43   ` Andrew Cooper
2016-06-15 10:30 ` [PATCH 8/8] x86/time: group time stamps into a structure Jan Beulich
2016-07-04 15:57   ` Andrew Cooper
2016-07-01  7:44 ` Ping: [PATCH 0/8] x86/time: improve cross-CPU clock monotonicity (and more) Jan Beulich
