xen-devel.lists.xenproject.org archive mirror
* [PATCH v2 0/3] x86/time: calibration rendezvous adjustments
@ 2021-02-01 12:41 Jan Beulich
  2021-02-01 12:42 ` [PATCH v2 1/3] x86/time: change initiation of the calibration timer Jan Beulich
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Jan Beulich @ 2021-02-01 12:41 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Claudemir Todo Bom

The latter two patches are meant to address a regression reported on
the list under "Problems with APIC on versions 4.9 and later (4.8
works)". In the course of analyzing output from a debugging patch I
once again ran into an anomaly, which I thought I should finally try
to address. Hence patch 1.

While looking closely at the corresponding debugging patch's output I
noticed a suspicious drift between local and master stime: Measured not
very precisely, local was behind master by about 200ms in about half an
hour. Interestingly the recording of ->master_stime (and hence the not
really inexpensive invocation of read_platform_stime()) looks to be
pretty pointless when CONSTANT_TSC - I haven't been able to spot an
actual consumer. IOW the drift may not be a problem, and we might be
able to eliminate the platform timer reads. (When !CONSTANT_TSC, such
drift would get corrected anyway, by local_time_calibration().)

1: change initiation of the calibration timer
2: adjust time recording time_calibration_tsc_rendezvous()
3: don't move TSC backwards in time_calibration_tsc_rendezvous()

Jan



* [PATCH v2 1/3] x86/time: change initiation of the calibration timer
  2021-02-01 12:41 [PATCH v2 0/3] x86/time: calibration rendezvous adjustments Jan Beulich
@ 2021-02-01 12:42 ` Jan Beulich
  2021-02-05 16:00   ` Roger Pau Monné
  2021-02-01 12:43 ` [PATCH v2 2/3] x86/time: adjust time recording time_calibration_tsc_rendezvous() Jan Beulich
  2021-02-01 12:43 ` [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous() Jan Beulich
  2 siblings, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2021-02-01 12:42 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Claudemir Todo Bom

Setting the timer a second (EPOCH) into the future at a random point
during boot (prior to bringing up APs and prior to launching Dom0) does
not yield predictable results: The timer may expire while we're still
bringing up APs (too early) or when Dom0 already boots (too late).
Instead invoke the timer handler function explicitly at a predictable
point in time, once we've established the rendezvous function to use
(and hence also once all APs are online). This will, through the raising
and handling of TIMER_SOFTIRQ, then also have the effect of arming the
timer.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -854,9 +854,7 @@ static void resume_platform_timer(void)
 
 static void __init reset_platform_timer(void)
 {
-    /* Deactivate any timers running */
     kill_timer(&plt_overflow_timer);
-    kill_timer(&calibration_timer);
 
     /* Reset counters and stamps */
     spin_lock_irq(&platform_timer_lock);
@@ -1956,19 +1954,13 @@ static void __init reset_percpu_time(voi
     t->stamp.master_stime = t->stamp.local_stime;
 }
 
-static void __init try_platform_timer_tail(bool late)
+static void __init try_platform_timer_tail(void)
 {
     init_timer(&plt_overflow_timer, plt_overflow, NULL, 0);
     plt_overflow(NULL);
 
     platform_timer_stamp = plt_stamp64;
     stime_platform_stamp = NOW();
-
-    if ( !late )
-        init_percpu_time();
-
-    init_timer(&calibration_timer, time_calibration, NULL, 0);
-    set_timer(&calibration_timer, NOW() + EPOCH);
 }
 
 /* Late init function, after all cpus have booted */
@@ -2009,10 +2001,13 @@ static int __init verify_tsc_reliability
             time_calibration_rendezvous_fn = time_calibration_nop_rendezvous;
 
             /* Finish platform timer switch. */
-            try_platform_timer_tail(true);
+            try_platform_timer_tail();
 
             printk("Switched to Platform timer %s TSC\n",
                    freq_string(plt_src.frequency));
+
+            time_calibration(NULL);
+
             return 0;
         }
     }
@@ -2033,6 +2028,8 @@ static int __init verify_tsc_reliability
          !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
         time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
 
+    time_calibration(NULL);
+
     return 0;
 }
 __initcall(verify_tsc_reliability);
@@ -2048,7 +2045,11 @@ int __init init_xen_time(void)
     do_settime(get_wallclock_time(), 0, NOW());
 
     /* Finish platform timer initialization. */
-    try_platform_timer_tail(false);
+    try_platform_timer_tail();
+
+    init_percpu_time();
+
+    init_timer(&calibration_timer, time_calibration, NULL, 0);
 
     /*
      * Setup space to track per-socket TSC_ADJUST values. Don't fiddle with




* [PATCH v2 2/3] x86/time: adjust time recording time_calibration_tsc_rendezvous()
  2021-02-01 12:41 [PATCH v2 0/3] x86/time: calibration rendezvous adjustments Jan Beulich
  2021-02-01 12:42 ` [PATCH v2 1/3] x86/time: change initiation of the calibration timer Jan Beulich
@ 2021-02-01 12:43 ` Jan Beulich
  2021-02-05 16:15   ` Roger Pau Monné
  2021-02-01 12:43 ` [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous() Jan Beulich
  2 siblings, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2021-02-01 12:43 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Claudemir Todo Bom

The (stime,tsc) tuple is the basis for extrapolation by get_s_time().
Therefore the two better get taken as close to one another as possible.
This means two things: First, reading platform time is too early when
done on the first iteration. The closest we can get is on the last
iteration, immediately before telling other CPUs to write their TSCs
(and then also writing CPU0's). While at first glance it may seem
not overly relevant when exactly platform time is read (when assuming
that only stime is ever relevant anywhere, and hence the association
with the precise TSC values is of lower interest), both CPU frequency
changes and the effects of SMT make it unpredictable (between individual
rendezvous instances) how long the loop iterations will take. This will
in turn lead to a higher error than necessary in how close to linear
stime movement we can get.

Second, re-reading the TSC for local recording increases the overall
error as well, when we already know a more precise value - the one just
written.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
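
For reference, the extrapolation the description talks about is roughly
this (simplified sketch only; field and helper names from memory - the
real get_s_time_fixed() also handles an at_tsc of zero):

    s_time_t get_s_time_fixed(u64 at_tsc)                /* simplified */
    {
        const struct cpu_time *t = &this_cpu(cpu_time);
        /* Unsigned delta from the recorded stamp ... */
        u64 delta = at_tsc - t->stamp.local_tsc;

        /* ... scaled and added to the stamp's stime. */
        return t->stamp.local_stime + scale_delta(delta, &t->tsc_scale);
    }

Any imprecision in how (local_stime, local_tsc) were paired up at
rendezvous time is hence carried into every subsequent reading.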

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1662,11 +1662,12 @@ struct calibration_rendezvous {
 };
 
 static void
-time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
+time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
+                                 uint64_t tsc)
 {
     struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
 
-    c->local_tsc    = rdtsc_ordered();
+    c->local_tsc    = tsc;
     c->local_stime  = get_s_time_fixed(c->local_tsc);
     c->master_stime = r->master_stime;
 
@@ -1691,11 +1692,11 @@ static void time_calibration_tsc_rendezv
             while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
                 cpu_relax();
 
-            if ( r->master_stime == 0 )
-            {
-                r->master_stime = read_platform_stime(NULL);
+            if ( r->master_tsc_stamp == 0 )
                 r->master_tsc_stamp = rdtsc_ordered();
-            }
+            else if ( i == 0 )
+                r->master_stime = read_platform_stime(NULL);
+
             atomic_inc(&r->semaphore);
 
             if ( i == 0 )
@@ -1720,7 +1721,7 @@ static void time_calibration_tsc_rendezv
         }
     }
 
-    time_calibration_rendezvous_tail(r);
+    time_calibration_rendezvous_tail(r, r->master_tsc_stamp);
 }
 
 /* Ordinary rendezvous function which does not modify TSC values. */
@@ -1745,7 +1746,7 @@ static void time_calibration_std_rendezv
         smp_rmb(); /* receive signal /then/ read r->master_stime */
     }
 
-    time_calibration_rendezvous_tail(r);
+    time_calibration_rendezvous_tail(r, rdtsc_ordered());
 }
 
 /*




* [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous()
  2021-02-01 12:41 [PATCH v2 0/3] x86/time: calibration rendezvous adjustments Jan Beulich
  2021-02-01 12:42 ` [PATCH v2 1/3] x86/time: change initiation of the calibration timer Jan Beulich
  2021-02-01 12:43 ` [PATCH v2 2/3] x86/time: adjust time recording time_calibration_tsc_rendezvous() Jan Beulich
@ 2021-02-01 12:43 ` Jan Beulich
  2021-02-02  8:16   ` Jan Beulich
  2021-02-08  9:38   ` Roger Pau Monné
  2 siblings, 2 replies; 16+ messages in thread
From: Jan Beulich @ 2021-02-01 12:43 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Claudemir Todo Bom

While doing this for small amounts may be okay, the unconditional use
of CPU0's value here has been found to be a problem when the boot time
TSC of the BSP was behind that of all APs by more than a second. In
particular because of get_s_time_fixed() producing insane output when
the calculated delta is negative, we can't allow this to happen.
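
To illustrate with made-up numbers: the delta used by the extrapolation
is unsigned, so a TSC read that ends up behind the recorded stamp does
not yield a small negative correction but wraps around:

    uint64_t stamp_tsc = 10000000000ULL;  /* value just written         */
    uint64_t read_tsc  =  7000000000ULL;  /* value then read, ~1s behind
                                             at ~3GHz                   */
    uint64_t delta = read_tsc - stamp_tsc; /* wraps to 2^64 - 3000000000 */

Scaled into stime that is on the order of centuries, not minus one
second.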

On the first iteration have all other CPUs sort out the highest TSC
value any one of them has read. On the second iteration, if that
maximum is higher than CPU0's, update its recorded value from that
taken in the first iteration. Use the resulting value on the last
iteration to write everyone's TSCs.

To account for the possible discontinuity, have
time_calibration_rendezvous_tail() record the newly written value, but
extrapolate local stime using the value read.

Reported-by: Claudemir Todo Bom <claudemir@todobom.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Don't update r->master_stime by calculation. Re-base over new
    earlier patch. Make time_calibration_rendezvous_tail() take two TSC
    values.
---
Since CPU0 reads its TSC last on the first iteration, if TSCs were
perfectly sync-ed there shouldn't ever be a need to update. However,
even on the TSC-reliable system I first tested this on (using
"tsc=skewed" to get this rendezvous function into use in the first
place) updates by up to several thousand clocks did happen. I wonder
whether this points at some problem with the approach that I'm not (yet)
seeing.

Considering the sufficiently modern CPU it's using, I suspect the
reporter's system wouldn't even need to turn off TSC_RELIABLE, if only
there wasn't the boot time skew. Hence another approach might be to fix
this boot time skew. Of course to recognize whether the TSCs then still
aren't in sync we'd need to run tsc_check_reliability() sufficiently
long after that adjustment. That is in addition to the need to have this
"fixing" be precise enough for the TSCs to not look skewed anymore
afterwards.

As per the comment ahead of it, the original purpose of the function was
to deal with TSCs halted in deep C states. While this probably explains
why only forward moves were ever expected, I don't see how this could
have been reliable in case CPU0 was deep-sleeping for a sufficiently
long time. My only guess here is a hidden assumption of CPU0 never being
idle for long enough.

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1658,17 +1658,17 @@ struct calibration_rendezvous {
     cpumask_t cpu_calibration_map;
     atomic_t semaphore;
     s_time_t master_stime;
-    u64 master_tsc_stamp;
+    uint64_t master_tsc_stamp, max_tsc_stamp;
 };
 
 static void
 time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
-                                 uint64_t tsc)
+                                 uint64_t old_tsc, uint64_t new_tsc)
 {
     struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
 
-    c->local_tsc    = tsc;
-    c->local_stime  = get_s_time_fixed(c->local_tsc);
+    c->local_tsc    = new_tsc;
+    c->local_stime  = get_s_time_fixed(old_tsc ?: new_tsc);
     c->master_stime = r->master_stime;
 
     raise_softirq(TIME_CALIBRATE_SOFTIRQ);
@@ -1683,6 +1683,7 @@ static void time_calibration_tsc_rendezv
     int i;
     struct calibration_rendezvous *r = _r;
     unsigned int total_cpus = cpumask_weight(&r->cpu_calibration_map);
+    uint64_t tsc = 0;
 
     /* Loop to get rid of cache effects on TSC skew. */
     for ( i = 4; i >= 0; i-- )
@@ -1692,8 +1693,15 @@ static void time_calibration_tsc_rendezv
             while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
                 cpu_relax();
 
-            if ( r->master_tsc_stamp == 0 )
-                r->master_tsc_stamp = rdtsc_ordered();
+            if ( tsc == 0 )
+                r->master_tsc_stamp = tsc = rdtsc_ordered();
+            else if ( r->master_tsc_stamp < r->max_tsc_stamp )
+                /*
+                 * We want to avoid moving the TSC backwards for any CPU.
+                 * Use the largest value observed anywhere on the first
+                 * iteration.
+                 */
+                r->master_tsc_stamp = r->max_tsc_stamp;
             else if ( i == 0 )
                 r->master_stime = read_platform_stime(NULL);
 
@@ -1712,6 +1720,16 @@ static void time_calibration_tsc_rendezv
             while ( atomic_read(&r->semaphore) < total_cpus )
                 cpu_relax();
 
+            if ( tsc == 0 )
+            {
+                uint64_t cur;
+
+                tsc = rdtsc_ordered();
+                while ( tsc > (cur = r->max_tsc_stamp) )
+                    if ( cmpxchg(&r->max_tsc_stamp, cur, tsc) == cur )
+                        break;
+            }
+
             if ( i == 0 )
                 write_tsc(r->master_tsc_stamp);
 
@@ -1719,9 +1737,12 @@ static void time_calibration_tsc_rendezv
             while ( atomic_read(&r->semaphore) > total_cpus )
                 cpu_relax();
         }
+
+        /* Just in case a read above ended up reading zero. */
+        tsc += !tsc;
     }
 
-    time_calibration_rendezvous_tail(r, r->master_tsc_stamp);
+    time_calibration_rendezvous_tail(r, tsc, r->master_tsc_stamp);
 }
 
 /* Ordinary rendezvous function which does not modify TSC values. */
@@ -1746,7 +1767,7 @@ static void time_calibration_std_rendezv
         smp_rmb(); /* receive signal /then/ read r->master_stime */
     }
 
-    time_calibration_rendezvous_tail(r, rdtsc_ordered());
+    time_calibration_rendezvous_tail(r, 0, rdtsc_ordered());
 }
 
 /*




* Re: [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous()
  2021-02-01 12:43 ` [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous() Jan Beulich
@ 2021-02-02  8:16   ` Jan Beulich
  2021-02-08  9:38   ` Roger Pau Monné
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2021-02-02  8:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Claudemir Todo Bom

On 01.02.2021 13:43, Jan Beulich wrote:
> As per the comment ahead of it, the original purpose of the function was
> to deal with TSCs halted in deep C states. While this probably explains
> why only forward moves were ever expected, I don't see how this could
> have been reliable in case CPU0 was deep-sleeping for a sufficiently
> long time. My only guess here is a hidden assumption of CPU0 never being
> idle for long enough.

Furthermore that comment looks to be contradicting the actual use of
the function: It gets installed when !RELIABLE_TSC, while the comment
would suggest !NONSTOP_TSC. I suppose the comment is simply misleading,
because RELIABLE_TSC implies NONSTOP_TSC according to all the places
where either of the two feature bits gets played with. Plus in the
!NONSTOP_TSC case we write the TSC explicitly anyway when coming back
out of a (deep; see below) C-state.

As an implication from the above, mwait_idle_cpu_init() then looks to
pointlessly clear "reliable" when "nonstop" is clear.

It further looks odd that mwait_idle() (unlike acpi_processor_idle())
calls cstate_restore_tsc() independent of what C-state was active.

> @@ -1719,9 +1737,12 @@ static void time_calibration_tsc_rendezv
>              while ( atomic_read(&r->semaphore) > total_cpus )
>                  cpu_relax();
>          }
> +
> +        /* Just in case a read above ended up reading zero. */
> +        tsc += !tsc;
>      }
>  
> -    time_calibration_rendezvous_tail(r, r->master_tsc_stamp);
> +    time_calibration_rendezvous_tail(r, tsc, r->master_tsc_stamp);

This, in particular, wouldn't be valid when !NONSTOP_TSC without
cstate_restore_tsc(). We then wouldn't have a way to know whether
the observed gap is because of the TSC having been halted for a
while (as the comment ahead of the function - imo wrongly, as per
above - suggests), or whether - like in Claudemir's case - the
individual TSCs were offset against one another.

Jan



* Re: [PATCH v2 1/3] x86/time: change initiation of the calibration timer
  2021-02-01 12:42 ` [PATCH v2 1/3] x86/time: change initiation of the calibration timer Jan Beulich
@ 2021-02-05 16:00   ` Roger Pau Monné
  0 siblings, 0 replies; 16+ messages in thread
From: Roger Pau Monné @ 2021-02-05 16:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On Mon, Feb 01, 2021 at 01:42:35PM +0100, Jan Beulich wrote:
> Setting the timer a second (EPOCH) into the future at a random point
> during boot (prior to bringing up APs and prior to launching Dom0) does
> not yield predictable results: The timer may expire while we're still
> bringing up APs (too early) or when Dom0 already boots (too late).
> Instead invoke the timer handler function explicitly at a predictable
> point in time, once we've established the rendezvous function to use
> (and hence also once all APs are online). This will, through the raising
> and handling of TIMER_SOFTIRQ, then also have the effect of arming the
> timer.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.



* Re: [PATCH v2 2/3] x86/time: adjust time recording time_calibration_tsc_rendezvous()
  2021-02-01 12:43 ` [PATCH v2 2/3] x86/time: adjust time recording time_calibration_tsc_rendezvous() Jan Beulich
@ 2021-02-05 16:15   ` Roger Pau Monné
  2021-02-08 10:56     ` Jan Beulich
  0 siblings, 1 reply; 16+ messages in thread
From: Roger Pau Monné @ 2021-02-05 16:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On Mon, Feb 01, 2021 at 01:43:04PM +0100, Jan Beulich wrote:
> The (stime,tsc) tuple is the basis for extrapolation by get_s_time().
> Therefore the two better get taken as close to one another as possible.
> This means two things: First, reading platform time is too early when
> done on the first iteration. The closest we can get is on the last
> iteration, immediately before telling other CPUs to write their TSCs
> (and then also writing CPU0's). While at the first glance it may seem
> not overly relevant when exactly platform time is read (when assuming
> that only stime is ever relevant anywhere, and hence the association
> with the precise TSC values is of lower interest), both CPU frequency
> changes and the effects of SMT make it unpredictable (between individual
> rendezvous instances) how long the loop iterations will take. This will
> in turn lead to a higher error than necessary in how close to linear
> stime movement we can get.
> 
> Second, re-reading the TSC for local recording is increasing the overall
> error as well, when we already know a more precise value - the one just
> written.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

I've been thinking this all seems doomed when Xen runs in a virtualized
environment, and should likely be disabled. There's no point in trying
to sync the TSC over multiple vCPUs as the scheduling delay between
them will likely skew any calculations.

Thanks, Roger.



* Re: [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous()
  2021-02-01 12:43 ` [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous() Jan Beulich
  2021-02-02  8:16   ` Jan Beulich
@ 2021-02-08  9:38   ` Roger Pau Monné
  2021-02-08 11:22     ` Jan Beulich
  1 sibling, 1 reply; 16+ messages in thread
From: Roger Pau Monné @ 2021-02-08  9:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On Mon, Feb 01, 2021 at 01:43:28PM +0100, Jan Beulich wrote:
> While doing this for small amounts may be okay, the unconditional use
> of CPU0's value here has been found to be a problem when the boot time
> TSC of the BSP was behind that of all APs by more than a second. In
> particular because of get_s_time_fixed() producing insane output when
> the calculated delta is negative, we can't allow this to happen.
> 
> On the first iteration have all other CPUs sort out the highest TSC
> value any one of them has read. On the second iteration, if that
> maximum is higher than CPU0's, update its recorded value from that
> taken in the first iteration. Use the resulting value on the last
> iteration to write everyone's TSCs.
> 
> To account for the possible discontinuity, have
> time_calibration_rendezvous_tail() record the newly written value, but
> extrapolate local stime using the value read.
> 
> Reported-by: Claudemir Todo Bom <claudemir@todobom.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Don't update r->master_stime by calculation. Re-base over new
>     earlier patch. Make time_calibration_rendezvous_tail() take two TSC
>     values.
> ---
> Since CPU0 reads its TSC last on the first iteration, if TSCs were
> perfectly sync-ed there shouldn't ever be a need to update. However,
> even on the TSC-reliable system I first tested this on (using
> "tsc=skewed" to get this rendezvous function into use in the first
> place) updates by up to several thousand clocks did happen. I wonder
> whether this points at some problem with the approach that I'm not (yet)
> seeing.

I'm confused by this: so on a system that had reliable TSCs, from which
you forcibly removed the reliable flag, you then saw big
differences when doing the rendezvous?

That would seem to indicate that such system doesn't really have
reliable TSCs?

> Considering the sufficiently modern CPU it's using, I suspect the
> reporter's system wouldn't even need to turn off TSC_RELIABLE, if only
> there wasn't the boot time skew. Hence another approach might be to fix
> this boot time skew. Of course to recognize whether the TSCs then still
> aren't in sync we'd need to run tsc_check_reliability() sufficiently
> long after that adjustment. Which is besides the need to have this
> "fixing" be precise enough for the TSCs to not look skewed anymore
> afterwards.

Maybe it would make sense to do a TSC counter sync after APs are up
and then disable the rendezvous if the next calibration rendezvous
shows no skew?

I also wonder, we test for skew just after the APs have been booted,
and decide at that point whether we need a calibration rendezvous.

Maybe we could do a TSC sync just after APs are up (to hopefully bring
them in sync), and then do the tsc_check_reliability just before Xen
ends booting (ie: before handing control to dom0?)

What we do right now (ie: do the tsc_check_reliability so early) is
also likely to miss small skews that will only show up after APs have
been running for a while?

> As per the comment ahead of it, the original purpose of the function was
> to deal with TSCs halted in deep C states. While this probably explains
> why only forward moves were ever expected, I don't see how this could
> have been reliable in case CPU0 was deep-sleeping for a sufficiently
> long time. My only guess here is a hidden assumption of CPU0 never being
> idle for long enough.
> 
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1658,17 +1658,17 @@ struct calibration_rendezvous {
>      cpumask_t cpu_calibration_map;
>      atomic_t semaphore;
>      s_time_t master_stime;
> -    u64 master_tsc_stamp;
> +    uint64_t master_tsc_stamp, max_tsc_stamp;
>  };
>  
>  static void
>  time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
> -                                 uint64_t tsc)
> +                                 uint64_t old_tsc, uint64_t new_tsc)
>  {
>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>  
> -    c->local_tsc    = tsc;
> -    c->local_stime  = get_s_time_fixed(c->local_tsc);
> +    c->local_tsc    = new_tsc;
> +    c->local_stime  = get_s_time_fixed(old_tsc ?: new_tsc);
>      c->master_stime = r->master_stime;
>  
>      raise_softirq(TIME_CALIBRATE_SOFTIRQ);
> @@ -1683,6 +1683,7 @@ static void time_calibration_tsc_rendezv
>      int i;
>      struct calibration_rendezvous *r = _r;
>      unsigned int total_cpus = cpumask_weight(&r->cpu_calibration_map);
> +    uint64_t tsc = 0;
>  
>      /* Loop to get rid of cache effects on TSC skew. */
>      for ( i = 4; i >= 0; i-- )
> @@ -1692,8 +1693,15 @@ static void time_calibration_tsc_rendezv
>              while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
>                  cpu_relax();
>  
> -            if ( r->master_tsc_stamp == 0 )
> -                r->master_tsc_stamp = rdtsc_ordered();
> +            if ( tsc == 0 )
> +                r->master_tsc_stamp = tsc = rdtsc_ordered();
> +            else if ( r->master_tsc_stamp < r->max_tsc_stamp )
> +                /*
> +                 * We want to avoid moving the TSC backwards for any CPU.
> +                 * Use the largest value observed anywhere on the first
> +                 * iteration.
> +                 */
> +                r->master_tsc_stamp = r->max_tsc_stamp;
>              else if ( i == 0 )
>                  r->master_stime = read_platform_stime(NULL);
>  
> @@ -1712,6 +1720,16 @@ static void time_calibration_tsc_rendezv
>              while ( atomic_read(&r->semaphore) < total_cpus )
>                  cpu_relax();
>  
> +            if ( tsc == 0 )
> +            {
> +                uint64_t cur;
> +
> +                tsc = rdtsc_ordered();
> +                while ( tsc > (cur = r->max_tsc_stamp) )
> +                    if ( cmpxchg(&r->max_tsc_stamp, cur, tsc) == cur )
> +                        break;

I think you could avoid reading cur explicitly for each loop and
instead do?

cur = ACCESS_ONCE(r->max_tsc_stamp);
while ( tsc > cur )
    cur = cmpxchg(&r->max_tsc_stamp, cur, tsc);

> +            }
> +
>              if ( i == 0 )
>                  write_tsc(r->master_tsc_stamp);
>  
> @@ -1719,9 +1737,12 @@ static void time_calibration_tsc_rendezv
>              while ( atomic_read(&r->semaphore) > total_cpus )
>                  cpu_relax();
>          }
> +
> +        /* Just in case a read above ended up reading zero. */
> +        tsc += !tsc;

Won't that be worthy of an ASSERT_UNREACHABLE? I'm not sure I see how
tsc could be 0 on a healthy system after the loop above.


Thanks, Roger.



* Re: [PATCH v2 2/3] x86/time: adjust time recording time_calibration_tsc_rendezvous()
  2021-02-05 16:15   ` Roger Pau Monné
@ 2021-02-08 10:56     ` Jan Beulich
  2021-02-08 11:05       ` Roger Pau Monné
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2021-02-08 10:56 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On 05.02.2021 17:15, Roger Pau Monné wrote:
> On Mon, Feb 01, 2021 at 01:43:04PM +0100, Jan Beulich wrote:
>> The (stime,tsc) tuple is the basis for extrapolation by get_s_time().
>> Therefore the two better get taken as close to one another as possible.
>> This means two things: First, reading platform time is too early when
>> done on the first iteration. The closest we can get is on the last
>> iteration, immediately before telling other CPUs to write their TSCs
>> (and then also writing CPU0's). While at the first glance it may seem
>> not overly relevant when exactly platform time is read (when assuming
>> that only stime is ever relevant anywhere, and hence the association
>> with the precise TSC values is of lower interest), both CPU frequency
>> changes and the effects of SMT make it unpredictable (between individual
>> rendezvous instances) how long the loop iterations will take. This will
>> in turn lead to a higher error than necessary in how close to linear
>> stime movement we can get.
>>
>> Second, re-reading the TSC for local recording is increasing the overall
>> error as well, when we already know a more precise value - the one just
>> written.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> I've been thinking this all seems doomed when Xen runs in a virtualized
> environment, and should likely be disabled. There's no point in trying
> to sync the TSC over multiple vCPUs as the scheduling delay between
> them will likely skew any calculations.

We may want to consider forcing the equivalent of
"clocksource=tsc" in that case. Otoh a well behaved hypervisor
underneath shouldn't lead to us finding a need to clear
TSC_RELIABLE, at which point this logic wouldn't get engaged
in the first place.

Jan



* Re: [PATCH v2 2/3] x86/time: adjust time recording time_calibration_tsc_rendezvous()
  2021-02-08 10:56     ` Jan Beulich
@ 2021-02-08 11:05       ` Roger Pau Monné
  2021-02-08 11:50         ` Jan Beulich
  0 siblings, 1 reply; 16+ messages in thread
From: Roger Pau Monné @ 2021-02-08 11:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On Mon, Feb 08, 2021 at 11:56:01AM +0100, Jan Beulich wrote:
> On 05.02.2021 17:15, Roger Pau Monné wrote:
> > On Mon, Feb 01, 2021 at 01:43:04PM +0100, Jan Beulich wrote:
> >> The (stime,tsc) tuple is the basis for extrapolation by get_s_time().
> >> Therefore the two better get taken as close to one another as possible.
> >> This means two things: First, reading platform time is too early when
> >> done on the first iteration. The closest we can get is on the last
> >> iteration, immediately before telling other CPUs to write their TSCs
> >> (and then also writing CPU0's). While at the first glance it may seem
> >> not overly relevant when exactly platform time is read (when assuming
> >> that only stime is ever relevant anywhere, and hence the association
> >> with the precise TSC values is of lower interest), both CPU frequency
> >> changes and the effects of SMT make it unpredictable (between individual
> >> rendezvous instances) how long the loop iterations will take. This will
> >> in turn lead to a higher error than necessary in how close to linear
> >> stime movement we can get.
> >>
> >> Second, re-reading the TSC for local recording is increasing the overall
> >> error as well, when we already know a more precise value - the one just
> >> written.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> > I've been thinking this all seems doomed when Xen runs in a virtualized
> > environment, and should likely be disabled. There's no point in trying
> > to sync the TSC over multiple vCPUs as the scheduling delay between
> > them will likely skew any calculations.
> 
> We may want to consider to force the equivalent of
> "clocksource=tsc" in that case. Otoh a well behaved hypervisor
> underneath shouldn't lead to us finding a need to clear
> TSC_RELIABLE, at which point this logic wouldn't get engaged
> in the first place.

I got the impression that on a loaded system, guests with a non-trivial
number of vCPUs might have trouble getting them all scheduled
close enough for the rendezvous to not report a big skew, and thus
disable TSC_RELIABLE?

Thanks, Roger.



* Re: [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous()
  2021-02-08  9:38   ` Roger Pau Monné
@ 2021-02-08 11:22     ` Jan Beulich
  2021-02-08 13:19       ` Roger Pau Monné
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2021-02-08 11:22 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On 08.02.2021 10:38, Roger Pau Monné wrote:
> On Mon, Feb 01, 2021 at 01:43:28PM +0100, Jan Beulich wrote:
>> ---
>> Since CPU0 reads its TSC last on the first iteration, if TSCs were
>> perfectly sync-ed there shouldn't ever be a need to update. However,
>> even on the TSC-reliable system I first tested this on (using
>> "tsc=skewed" to get this rendezvous function into use in the first
>> place) updates by up to several thousand clocks did happen. I wonder
>> whether this points at some problem with the approach that I'm not (yet)
>> seeing.
> 
> I'm confused by this, so on a system that had reliable TSCs, which
> you forced to remove the reliable flag, and then you saw big
> differences when doing the rendezvous?
> 
> That would seem to indicate that such system doesn't really have
> reliable TSCs?

I don't think so, no. This can easily be a timing effect from the
heavy cache line bouncing involved here.

What I'm worried about here, seeing these updates, is that I might still
be moving TSCs backwards in ways observable to the rest of the
system (i.e. beyond the inherent property of the approach), and
this then getting corrected by a subsequent rendezvous. But as
said - I can't see what this could result from, and hence I'm
inclined to assume these are merely effects I've not found a
good explanation for so far.

>> Considering the sufficiently modern CPU it's using, I suspect the
>> reporter's system wouldn't even need to turn off TSC_RELIABLE, if only
>> there wasn't the boot time skew. Hence another approach might be to fix
>> this boot time skew. Of course to recognize whether the TSCs then still
>> aren't in sync we'd need to run tsc_check_reliability() sufficiently
>> long after that adjustment. Which is besides the need to have this
>> "fixing" be precise enough for the TSCs to not look skewed anymore
>> afterwards.
> 
> Maybe it would make sense to do a TSC counter sync after APs are up
> and then disable the rendezvous if the next calibration rendezvous
> shows no skew?

Yes, that's what I was hinting at with the above. For the next
rendezvous to not observe any skew, our adjustment would need to
be far more precise than it is today, though.

> I also wonder, we test for skew just after the APs have been booted,
> and decide at that point whether we need a calibration rendezvous.
> 
> Maybe we could do a TSC sync just after APs are up (to hopefully bring
> them in sync), and then do the tsc_check_reliability just before Xen
> ends booting (ie: before handing control to dom0?)
> 
> What we do right now (ie: do the tsc_check_reliability so early) is
> also likely to miss small skews that will only show up after APs have
> been running for a while?

The APs' TSCs will have been running for about as long as the
BSP's, as INIT does not affect them (and in fact they ought to
be running for _exactly_ as long, or else tsc_check_reliability()
would end up turning off TSC_RELIABLE). So I expect skews to be
large enough at this point to be recognizable.

>> @@ -1712,6 +1720,16 @@ static void time_calibration_tsc_rendezv
>>              while ( atomic_read(&r->semaphore) < total_cpus )
>>                  cpu_relax();
>>  
>> +            if ( tsc == 0 )
>> +            {
>> +                uint64_t cur;
>> +
>> +                tsc = rdtsc_ordered();
>> +                while ( tsc > (cur = r->max_tsc_stamp) )
>> +                    if ( cmpxchg(&r->max_tsc_stamp, cur, tsc) == cur )
>> +                        break;
> 
> I think you could avoid reading cur explicitly for each loop and
> instead do?
> 
> cur = ACCESS_ONCE(r->max_tsc_stamp)
> while ( tsc > cur )
>     cur = cmpxchg(&r->max_tsc_stamp, cur, tsc);

Ah yes. I tried something similar, but not quite the same,
and it looked wrong, so I gave up re-arranging.

>> @@ -1719,9 +1737,12 @@ static void time_calibration_tsc_rendezv
>>              while ( atomic_read(&r->semaphore) > total_cpus )
>>                  cpu_relax();
>>          }
>> +
>> +        /* Just in case a read above ended up reading zero. */
>> +        tsc += !tsc;
> 
> Won't that be worthy of an ASSERT_UNREACHABLE? I'm not sure I see how
> tsc could be 0 on a healthy system after the loop above.

It's not forbidden for the firmware to set the TSCs to some
huge negative value. Considering the effect TSC_ADJUST has on
the actual value read by RDTSC, I think I did actually observe
a system coming up this way, because of (not very helpful)
TSC_ADJUST setting by firmware. So no, no ASSERT_UNREACHABLE()
here.

Jan



* Re: [PATCH v2 2/3] x86/time: adjust time recording time_calibration_tsc_rendezvous()
  2021-02-08 11:05       ` Roger Pau Monné
@ 2021-02-08 11:50         ` Jan Beulich
  2021-02-08 16:39           ` Roger Pau Monné
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2021-02-08 11:50 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On 08.02.2021 12:05, Roger Pau Monné wrote:
> On Mon, Feb 08, 2021 at 11:56:01AM +0100, Jan Beulich wrote:
>> On 05.02.2021 17:15, Roger Pau Monné wrote:
>>> I've been thinking this all seems doomed when Xen runs in a virtualized
>>> environment, and should likely be disabled. There's no point in trying
>>> to sync the TSC over multiple vCPUs as the scheduling delay between
>>> them will likely skew any calculations.
>>
>> We may want to consider to force the equivalent of
>> "clocksource=tsc" in that case. Otoh a well behaved hypervisor
>> underneath shouldn't lead to us finding a need to clear
>> TSC_RELIABLE, at which point this logic wouldn't get engaged
>> in the first place.
> 
> I got the impression that on a loaded system guests with a non-trivial
> amount of vCPUs might be in trouble to be able to schedule them all
> close enough for the rendezvous to not report a big skew, and thus
> disable TSC_RELIABLE?

No, check_tsc_warp() / tsc_check_reliability() don't have a
problem there. Every CPU reads the shared "most advanced"
stamp before reading its local one. So it doesn't matter how
large the gaps are here. In fact the possible bad effect is
the other way around here - if the scheduling effects are
too heavy, we may mistakenly consider TSCs reliable when
they aren't.
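
In outline (a simplified sketch of that property, not the actual code):

    static uint64_t last_tsc;   /* most advanced stamp seen so far */
    static uint64_t max_warp;
    static DEFINE_SPINLOCK(sync_lock);

    static void observe_warp(void)
    {
        uint64_t prev, now;

        spin_lock(&sync_lock);
        prev = last_tsc;                 /* shared stamp first ...  */
        now = rdtsc_ordered();           /* ... then the local TSC  */
        if ( now > last_tsc )
            last_tsc = now;
        if ( now < prev && prev - now > max_warp )
            max_warp = prev - now;       /* TSC warped backwards    */
        spin_unlock(&sync_lock);
    }

However long a CPU gets delayed between the two reads, the delay can
only make its local value look more advanced, never falsely behind.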

A problem of the kind you describe exists in the actual
rendezvous function. And actually any problem of this kind
can, on a smaller scale, already be observed with SMT,
because the individual hyperthreads of a core can't
possibly all run at the same time.

As occurs to me only now, I think we can improve accuracy
some (in particular on big systems) by making sure
struct calibration_rendezvous's master_tsc_stamp is not
sharing its cache line with semaphore and master_stime. The
latter get written by (at least) the BSP, while
master_tsc_stamp is stable after the 2nd loop iteration.
Hence on the 3rd and 4th iterations we could even prefetch
it to reduce the delay on the last one.
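
Something along these lines, perhaps (illustrative only, not part of
this series; the exact placement and attribute would need deciding):

    struct calibration_rendezvous {
        cpumask_t cpu_calibration_map;
        atomic_t semaphore;              /* bounced by every CPU      */
        s_time_t master_stime;           /* written by the BSP        */
        /* Stable after the 2nd iteration - keep away from the above. */
        uint64_t master_tsc_stamp __aligned(SMP_CACHE_BYTES);
        uint64_t max_tsc_stamp;
    };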

Jan



* Re: [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous()
  2021-02-08 11:22     ` Jan Beulich
@ 2021-02-08 13:19       ` Roger Pau Monné
  2021-02-08 13:59         ` Jan Beulich
  0 siblings, 1 reply; 16+ messages in thread
From: Roger Pau Monné @ 2021-02-08 13:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On Mon, Feb 08, 2021 at 12:22:25PM +0100, Jan Beulich wrote:
> On 08.02.2021 10:38, Roger Pau Monné wrote:
> > On Mon, Feb 01, 2021 at 01:43:28PM +0100, Jan Beulich wrote:
> >> ---
> >> Since CPU0 reads its TSC last on the first iteration, if TSCs were
> >> perfectly sync-ed there shouldn't ever be a need to update. However,
> >> even on the TSC-reliable system I first tested this on (using
> >> "tsc=skewed" to get this rendezvous function into use in the first
> >> place) updates by up to several thousand clocks did happen. I wonder
> >> whether this points at some problem with the approach that I'm not (yet)
> >> seeing.
> > 
> > I'm confused by this, so on a system that had reliable TSCs, which
> > you forced to remove the reliable flag, and then you saw big
> > differences when doing the rendezvous?
> > 
> > That would seem to indicate that such system doesn't really have
> > reliable TSCs?
> 
> I don't think so, no. This can easily be a timing effect from the
> heavy cache line bouncing involved here.
> 
> What I'm worried here seeing these updates is that I might still
> be moving TSCs backwards in ways observable to the rest of the
> system (i.e. beyond the inherent property of the approach), and
> this then getting corrected by a subsequent rendezvous. But as
> said - I can't see what this could result from, and hence I'm
> inclined to assume these are merely effects I've not found a
> good explanation for so far.

I'm slightly worried by this, maybe because I'm misunderstanding part
of the TSC sync stuff.

So you forced a system that Xen would otherwise consider to have a
reliable TSC (one that doesn't need a calibration rendezvous) into
doing the calibration rendezvous, and then the skew seen is quite big.
I would expect such skew to be minimal? As we would otherwise consider
the system to not need calibration at all.

This makes me wonder if the system does indeed need such calibration
(which I don't think it does), or if the calibration that we actually try
to do is quite bogus?

> >> @@ -1719,9 +1737,12 @@ static void time_calibration_tsc_rendezv
> >>              while ( atomic_read(&r->semaphore) > total_cpus )
> >>                  cpu_relax();
> >>          }
> >> +
> >> +        /* Just in case a read above ended up reading zero. */
> >> +        tsc += !tsc;
> > 
> > Won't that be worthy of an ASSERT_UNREACHABLE? I'm not sure I see how
> > tsc could be 0 on a healthy system after the loop above.
> 
> It's not forbidden for the firmware to set the TSCs to some
> huge negative value. Considering the effect TSC_ADJUST has on
> the actual value read by RDTSC, I think I did actually observe
> a system coming up this way, because of (not very helpful)
> TSC_ADJUST setting by firmware. So no, no ASSERT_UNREACHABLE()
> here.

But then the code here will loop 5 times, and it's not possible for
those 5 loops to read a TSC value of 0? I could see it reading 0 on
one of the iterations but not in all of them.

Thanks, Roger.



* Re: [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous()
  2021-02-08 13:19       ` Roger Pau Monné
@ 2021-02-08 13:59         ` Jan Beulich
  2021-02-08 16:33           ` Roger Pau Monné
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2021-02-08 13:59 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On 08.02.2021 14:19, Roger Pau Monné wrote:
> On Mon, Feb 08, 2021 at 12:22:25PM +0100, Jan Beulich wrote:
>> On 08.02.2021 10:38, Roger Pau Monné wrote:
>>> On Mon, Feb 01, 2021 at 01:43:28PM +0100, Jan Beulich wrote:
>>>> ---
>>>> Since CPU0 reads its TSC last on the first iteration, if TSCs were
>>>> perfectly sync-ed there shouldn't ever be a need to update. However,
>>>> even on the TSC-reliable system I first tested this on (using
>>>> "tsc=skewed" to get this rendezvous function into use in the first
>>>> place) updates by up to several thousand clocks did happen. I wonder
>>>> whether this points at some problem with the approach that I'm not (yet)
>>>> seeing.
>>>
>>> I'm confused by this, so on a system that had reliable TSCs, which
>>> you forced to remove the reliable flag, and then you saw big
>>> differences when doing the rendezvous?
>>>
>>> That would seem to indicate that such system doesn't really have
>>> reliable TSCs?
>>
>> I don't think so, no. This can easily be a timing effect from the
>> heavy cache line bouncing involved here.
>>
>> What I'm worried here seeing these updates is that I might still
>> be moving TSCs backwards in ways observable to the rest of the
>> system (i.e. beyond the inherent property of the approach), and
>> this then getting corrected by a subsequent rendezvous. But as
>> said - I can't see what this could result from, and hence I'm
>> inclined to assume these are merely effects I've not found a
>> good explanation for so far.
> 
> I'm slightly worried by this, maybe because I'm misunderstanding part
> of the TSC sync stuff.
> 
> So you forced a system that Xen would otherwise consider to have a
> reliable TSC (one that doesn't need a calibration rendezvous) into
> doing the calibration rendezvous, and then the skew seen is quite big.
> I would expect such skew to be minimal? As we would otherwise consider
> the system to not need calibration at all.
> 
> This makes me wonder if the system does indeed need such calibration
> (which I don't think so), or if the calibration that we actually try
> to do is quite bogus?

I wouldn't call it bogus, but it's not very precise. Hence me
saying that if we wanted to make the problematic system seen as
TSC_RELIABLE (and hence be able to switch from "tsc" to "std"
rendezvous), we'd first need to improve accuracy here quite a bit.
(I suspect sufficient accuracy can only be achieved by making use
of TSC_ADJUST, but that's not available on the reporter's hardware,
so of no immediate interest here.)

>>>> @@ -1719,9 +1737,12 @@ static void time_calibration_tsc_rendezv
>>>>              while ( atomic_read(&r->semaphore) > total_cpus )
>>>>                  cpu_relax();
>>>>          }
>>>> +
>>>> +        /* Just in case a read above ended up reading zero. */
>>>> +        tsc += !tsc;
>>>
>>> Won't that be worthy of an ASSERT_UNREACHABLE? I'm not sure I see how
>>> tsc could be 0 on a healthy system after the loop above.
>>
>> It's not forbidden for the firmware to set the TSCs to some
>> huge negative value. Considering the effect TSC_ADJUST has on
>> the actual value read by RDTSC, I think I did actually observe
>> a system coming up this way, because of (not very helpful)
>> TSC_ADJUST setting by firmware. So no, no ASSERT_UNREACHABLE()
>> here.
> 
> But then the code here will loop 5 times, and it's not possible for
> those 5 loops to read a TSC value of 0? I could see it reading 0 on
> one of the iterations but not in all of them.

Sure, we can read zero at most once here. Yet the "if ( tsc == 0 )"
conditionals get executed on every iteration, while they must yield
"true" only on the first (from the variable's initializer); we
absolutely need to go the "else if()" path on CPU0 on the 2nd
iteration, and we also need to skip that part on later iterations
on the other CPUs (for CPU0 to then take the 2nd "else if()" path
from no later than the 3rd iteration onwards; the body of this of
course will only be executed on the last iteration).

The arrangement of all of this could be changed of course, but I'd
prefer to retain the property of there not being any dependency on
the exact number of iterations the loop header specifies, as long as
it's no less than 2 (before this series) / 3 (after this series).
I.e. I wouldn't want to identify e.g. the 1st iteration by "i == 4".
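
Spelled out against the v2 code, with i counting down as in the loop
header, the BSP ends up taking:

    i == 4: tsc == 0 (initializer)          -> read own TSC, seed master_tsc_stamp
    i == 3: master_tsc_stamp < max_tsc_stamp
            (if an AP read a larger value)  -> adopt max_tsc_stamp
    i == 2, 1: all conditions false         -> nothing
    i == 0: 2nd "else if"                   -> read_platform_stime(), and then
                                               everyone's write_tsc()

while the APs enter their "if ( tsc == 0 )" block (the max_tsc_stamp
update) on their first iteration only.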

Jan



* Re: [PATCH v2 3/3] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous()
  2021-02-08 13:59         ` Jan Beulich
@ 2021-02-08 16:33           ` Roger Pau Monné
  0 siblings, 0 replies; 16+ messages in thread
From: Roger Pau Monné @ 2021-02-08 16:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On Mon, Feb 08, 2021 at 02:59:55PM +0100, Jan Beulich wrote:
> On 08.02.2021 14:19, Roger Pau Monné wrote:
> > On Mon, Feb 08, 2021 at 12:22:25PM +0100, Jan Beulich wrote:
> >> On 08.02.2021 10:38, Roger Pau Monné wrote:
> >>> On Mon, Feb 01, 2021 at 01:43:28PM +0100, Jan Beulich wrote:
> >>>> ---
> >>>> Since CPU0 reads its TSC last on the first iteration, if TSCs were
> >>>> perfectly sync-ed there shouldn't ever be a need to update. However,
> >>>> even on the TSC-reliable system I first tested this on (using
> >>>> "tsc=skewed" to get this rendezvous function into use in the first
> >>>> place) updates by up to several thousand clocks did happen. I wonder
> >>>> whether this points at some problem with the approach that I'm not (yet)
> >>>> seeing.
> >>>
> >>> I'm confused by this, so on a system that had reliable TSCs, which
> >>> you forced to remove the reliable flag, and then you saw big
> >>> differences when doing the rendezvous?
> >>>
> >>> That would seem to indicate that such system doesn't really have
> >>> reliable TSCs?
> >>
> >> I don't think so, no. This can easily be a timing effect from the
> >> heavy cache line bouncing involved here.
> >>
> >> What I'm worried here seeing these updates is that I might still
> >> be moving TSCs backwards in ways observable to the rest of the
> >> system (i.e. beyond the inherent property of the approach), and
> >> this then getting corrected by a subsequent rendezvous. But as
> >> said - I can't see what this could result from, and hence I'm
> >> inclined to assume these are merely effects I've not found a
> >> good explanation for so far.
> > 
> > I'm slightly worried by this, maybe because I'm misunderstanding part
> > of the TSC sync stuff.
> > 
> > So you forced a system that Xen would otherwise consider to have a
> > reliable TSC (one that doesn't need a calibration rendezvous) into
> > doing the calibration rendezvous, and then the skew seen is quite big.
> > I would expect such skew to be minimal? As we would otherwise consider
> > the system to not need calibration at all.
> > 
> > This makes me wonder if the system does indeed need such calibration
> > (which I don't think so), or if the calibration that we actually try
> > to do is quite bogus?
> 
> I wouldn't call it bogus, but it's not very precise. Hence me
> saying that if we wanted to make the problematic system seen as
> TSC_RELIABLE (and hence be able to switch from "tsc" to "std"
> rendezvous), we'd first need to improve accuracy here quite a bit.
> (I suspect sufficient accuracy can only be achieved by making use
> of TSC_ADJUST, but that's not available on the reporter's hardware,
> so of no immediate interest here.)

Right, TSC_ADJUST does indeed seem to be a better way to adjust TSC,
and to notice if firmware has skewed them.

> 
> >>>> @@ -1719,9 +1737,12 @@ static void time_calibration_tsc_rendezv
> >>>>              while ( atomic_read(&r->semaphore) > total_cpus )
> >>>>                  cpu_relax();
> >>>>          }
> >>>> +
> >>>> +        /* Just in case a read above ended up reading zero. */
> >>>> +        tsc += !tsc;
> >>>
> >>> Won't that be worthy of an ASSERT_UNREACHABLE? I'm not sure I see how
> >>> tsc could be 0 on a healthy system after the loop above.
> >>
> >> It's not forbidden for the firmware to set the TSCs to some
> >> huge negative value. Considering the effect TSC_ADJUST has on
> >> the actual value read by RDTSC, I think I did actually observe
> >> a system coming up this way, because of (not very helpful)
> >> TSC_ADJUST setting by firmware. So no, no ASSERT_UNREACHABLE()
> >> here.
> > 
> > But then the code here will loop 5 times, and it's not possible for
> > those 5 loops to read a TSC value of 0? I could see it reading 0 on
> > one of the iterations but not in all of them.
> 
> Sure, we can read zero at most once here. Yet the "if ( tsc == 0 )"
> conditionals get executed on every iteration, while they must yield
> "true" only on the first (from the variable's initializer); we
> absolutely need to go the "else if()" path on CPU0 on the 2nd
> iteration, and we also need to skip that part on later iterations
> on the other CPUs (for CPU0 to then take the 2nd "else if()" path
> from no later than the 3rd iteration onwards; the body of this of
> course will only be executed on the last iteration).

Oh, I see. Then I think I have no further comments. If you agree to
adjust the cmpxchg please add my R-b.

Thanks, Roger.



* Re: [PATCH v2 2/3] x86/time: adjust time recording time_calibration_tsc_rendezvous()
  2021-02-08 11:50         ` Jan Beulich
@ 2021-02-08 16:39           ` Roger Pau Monné
  0 siblings, 0 replies; 16+ messages in thread
From: Roger Pau Monné @ 2021-02-08 16:39 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Claudemir Todo Bom

On Mon, Feb 08, 2021 at 12:50:09PM +0100, Jan Beulich wrote:
> On 08.02.2021 12:05, Roger Pau Monné wrote:
> > On Mon, Feb 08, 2021 at 11:56:01AM +0100, Jan Beulich wrote:
> >> On 05.02.2021 17:15, Roger Pau Monné wrote:
> >>> I've been thinking this all seems doomed when Xen runs in a virtualized
> >>> environment, and should likely be disabled. There's no point in trying
> >>> to sync the TSC over multiple vCPUs as the scheduling delay between
> >>> them will likely skew any calculations.
> >>
> >> We may want to consider to force the equivalent of
> >> "clocksource=tsc" in that case. Otoh a well behaved hypervisor
> >> underneath shouldn't lead to us finding a need to clear
> >> TSC_RELIABLE, at which point this logic wouldn't get engaged
> >> in the first place.
> > 
> > I got the impression that on a loaded system guests with a non-trivial
> > amount of vCPUs might be in trouble to be able to schedule them all
> > close enough for the rendezvous to not report a big skew, and thus
> > disable TSC_RELIABLE?
> 
> No, check_tsc_warp() / tsc_check_reliability() don't have a
> problem there. Every CPU reads the shared "most advanced"
> stamp before reading its local one. So it doesn't matter how
> large the gaps are here. In fact the possible bad effect is
> the other way around here - if the scheduling effects are
> too heavy, we may mistakenly consider TSCs reliable when
> they aren't.
> 
> A problem of the kind you describe exists in the actual
> rendezvous function. And actually any problem of this kind
> can, on a smaller scale, already be observed with SMT,
> because the individual hyperthreads of a core can't
> possibly all run at the same time.

Indeed, I got confused between tsc_check_reliability and the actual
rendezvous function, so IMO it's likely the adjustments done by the
rendezvous are pointless when running virtualized, due to the
unlikelihood of all the vCPUs being scheduled at once to execute the
rendezvous.

> As occurs to me only now, I think we can improve accuracy
> some (in particular on big systems) by making sure
> struct calibration_rendezvous's master_tsc_stamp is not
> sharing its cache line with semaphore and master_stime. The
> latter get written by (at least) the BSP, while
> master_tsc_stamp is stable after the 2nd loop iteration.
> Hence on the 3rd and 4th iterations we could even prefetch
> it to reduce the delay on the last one.

Seems like a possibility indeed.

Thanks, Roger.


