* [PATCH 0/2] x86/time: calibration rendezvous adjustments
@ 2021-01-29 16:18 Jan Beulich
  2021-01-29 16:19 ` [PATCH 1/2] x86/time: change initiation of the calibration timer Jan Beulich
  2021-01-29 16:20 ` [PATCH RFC 2/2] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous() Jan Beulich
  0 siblings, 2 replies; 4+ messages in thread
From: Jan Beulich @ 2021-01-29 16:18 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

The 2nd patch (RFC for now) is meant to address a regression
reported on the list under "Problems with APIC on versions 4.9
and later (4.8 works)". In the course of analyzing output from
a debugging patch I once more ran into a separate anomaly, which I
thought I should finally try to address. Hence patch 1.

1: change initiation of the calibration timer
2: don't move TSC backwards in time_calibration_tsc_rendezvous()

Jan


* [PATCH 1/2] x86/time: change initiation of the calibration timer
  2021-01-29 16:18 [PATCH 0/2] x86/time: calibration rendezvous adjustments Jan Beulich
@ 2021-01-29 16:19 ` Jan Beulich
  2021-01-29 16:20 ` [PATCH RFC 2/2] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous() Jan Beulich
  1 sibling, 0 replies; 4+ messages in thread
From: Jan Beulich @ 2021-01-29 16:19 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Setting the timer a second (EPOCH) into the future at a random point
during boot (prior to bringing up APs and prior to launching Dom0) does
not yield predictable results: The timer may expire while we're still
bringing up APs (too early) or once Dom0 is already booting (too late).
Instead invoke the timer handler function explicitly at a predictable
point in time, once we've established the rendezvous function to use
(and hence also once all APs are online). Through the raising and
handling of TIMER_SOFTIRQ, this then also has the effect of arming the
timer.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -854,9 +854,7 @@ static void resume_platform_timer(void)
 
 static void __init reset_platform_timer(void)
 {
-    /* Deactivate any timers running */
     kill_timer(&plt_overflow_timer);
-    kill_timer(&calibration_timer);
 
     /* Reset counters and stamps */
     spin_lock_irq(&platform_timer_lock);
@@ -1956,19 +1954,13 @@ static void __init reset_percpu_time(voi
     t->stamp.master_stime = t->stamp.local_stime;
 }
 
-static void __init try_platform_timer_tail(bool late)
+static void __init try_platform_timer_tail(void)
 {
     init_timer(&plt_overflow_timer, plt_overflow, NULL, 0);
     plt_overflow(NULL);
 
     platform_timer_stamp = plt_stamp64;
     stime_platform_stamp = NOW();
-
-    if ( !late )
-        init_percpu_time();
-
-    init_timer(&calibration_timer, time_calibration, NULL, 0);
-    set_timer(&calibration_timer, NOW() + EPOCH);
 }
 
 /* Late init function, after all cpus have booted */
@@ -2009,10 +2001,13 @@ static int __init verify_tsc_reliability
             time_calibration_rendezvous_fn = time_calibration_nop_rendezvous;
 
             /* Finish platform timer switch. */
-            try_platform_timer_tail(true);
+            try_platform_timer_tail();
 
             printk("Switched to Platform timer %s TSC\n",
                    freq_string(plt_src.frequency));
+
+            time_calibration(NULL);
+
             return 0;
         }
     }
@@ -2033,6 +2028,8 @@ static int __init verify_tsc_reliability
          !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
         time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
 
+    time_calibration(NULL);
+
     return 0;
 }
 __initcall(verify_tsc_reliability);
@@ -2048,7 +2045,11 @@ int __init init_xen_time(void)
     do_settime(get_wallclock_time(), 0, NOW());
 
     /* Finish platform timer initialization. */
-    try_platform_timer_tail(false);
+    try_platform_timer_tail();
+
+    init_percpu_time();
+
+    init_timer(&calibration_timer, time_calibration, NULL, 0);
 
     /*
      * Setup space to track per-socket TSC_ADJUST values. Don't fiddle with



* [PATCH RFC 2/2] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous()
  2021-01-29 16:18 [PATCH 0/2] x86/time: calibration rendezvous adjustments Jan Beulich
  2021-01-29 16:19 ` [PATCH 1/2] x86/time: change initiation of the calibration timer Jan Beulich
@ 2021-01-29 16:20 ` Jan Beulich
  2021-02-01  7:37   ` Jan Beulich
  1 sibling, 1 reply; 4+ messages in thread
From: Jan Beulich @ 2021-01-29 16:20 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Claudemir Todo Bom

While moving the TSC backwards by a small amount may be okay, the
unconditional use of CPU0's value here has been found to be a problem
when the boot time TSC of the BSP was behind that of all APs by more
than a second. In particular, since get_s_time_fixed() produces insane
output when the calculated delta is negative, we can't allow this to
happen.
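
As a standalone illustration of the failure mode (hypothetical values,
not the actual get_s_time_fixed()/scale_delta() code): the delta is
computed in unsigned arithmetic, so a TSC reading behind the recorded
stamp wraps around instead of going negative:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t stamp_tsc = 1000000; /* hypothetical recorded TSC stamp */
    uint64_t cur_tsc   =  999000; /* TSC ended up 1000 cycles behind */
    uint64_t delta     = cur_tsc - stamp_tsc; /* wraps to 2^64 - 1000 */

    /* Scaling such a huge delta yields a system time far in the future. */
    printf("delta = %" PRIu64 "\n", delta);
    return 0;
}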

On the first iteration, have all other CPUs determine the highest TSC
value any one of them has read. On the second iteration, if that
maximum is higher than CPU0's, replace CPU0's recorded value with it
and bump the recorded system time accordingly. On the last iteration,
use the resulting value to write everyone's TSCs.

Reported-by: Claudemir Todo Bom <claudemir@todobom.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Since CPU0 reads its TSC last on the first iteration, if TSCs were
perfectly sync-ed there shouldn't ever be a need to update. However,
even on the TSC-reliable system I first tested this on (using
"tsc=skewed" to get this rendezvous function into use in the first
place) updates by up to several thousand clocks did happen. I wonder
whether this points at some problem with the approach that I'm not (yet)
seeing.

Considering the sufficiently modern CPU the affected system uses, I
suspect it wouldn't even need to turn off TSC_RELIABLE if it weren't
for the boot time skew. Hence another approach might be to fix that
boot time skew itself. Of course, to recognize whether the TSCs still
aren't in sync afterwards, we'd need to run tsc_check_reliability()
sufficiently long after that adjustment.

The above and the desire to have the change tested by the reporter are
the reasons for the RFC.

As per the comment ahead of it, the original purpose of the function was
to deal with TSCs halted in deep C states. While this probably explains
why only forward moves were ever expected, I don't see how this could
have been reliable in case CPU0 was deep-sleeping for a sufficiently
long time. My only guess here is a hidden assumption of CPU0 never being
idle for long enough.
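
For reference, the max_tsc_stamp update in the hunk below uses the
usual lock-free "record the maximum" pattern. A minimal standalone
sketch of that pattern (using C11 atomics in place of Xen's cmpxchg();
the names here are made up):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t max_tsc_stamp;

/*
 * Each CPU retries until its value is either published or no longer
 * the largest seen; on CAS failure "cur" is refreshed automatically.
 */
static void record_max(uint64_t tsc)
{
    uint64_t cur = atomic_load(&max_tsc_stamp);

    while ( tsc > cur )
        if ( atomic_compare_exchange_weak(&max_tsc_stamp, &cur, tsc) )
            break;
}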

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1658,7 +1658,7 @@ struct calibration_rendezvous {
     cpumask_t cpu_calibration_map;
     atomic_t semaphore;
     s_time_t master_stime;
-    u64 master_tsc_stamp;
+    uint64_t master_tsc_stamp, max_tsc_stamp;
 };
 
 static void
@@ -1696,6 +1696,21 @@ static void time_calibration_tsc_rendezv
                 r->master_stime = read_platform_stime(NULL);
                 r->master_tsc_stamp = rdtsc_ordered();
             }
+            else if ( r->master_tsc_stamp < r->max_tsc_stamp )
+            {
+                /*
+                 * We want to avoid moving the TSC backwards for any CPU.
+                 * Use the largest value observed anywhere on the first
+                 * iteration and bump up our previously recorded system
+                 * time accordingly.
+                 */
+                uint64_t delta = r->max_tsc_stamp - r->master_tsc_stamp;
+
+                r->master_stime += scale_delta(delta,
+                                               &this_cpu(cpu_time).tsc_scale);
+                r->master_tsc_stamp = r->max_tsc_stamp;
+            }
+
             atomic_inc(&r->semaphore);
 
             if ( i == 0 )
@@ -1711,6 +1726,17 @@ static void time_calibration_tsc_rendezv
             while ( atomic_read(&r->semaphore) < total_cpus )
                 cpu_relax();
 
+            if ( _r )
+            {
+                uint64_t tsc = rdtsc_ordered(), cur;
+
+                while ( tsc > (cur = r->max_tsc_stamp) )
+                    if ( cmpxchg(&r->max_tsc_stamp, cur, tsc) == cur )
+                        break;
+
+                _r = NULL;
+            }
+
             if ( i == 0 )
                 write_tsc(r->master_tsc_stamp);
 



* Re: [PATCH RFC 2/2] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous()
  2021-01-29 16:20 ` [PATCH RFC 2/2] x86/time: don't move TSC backwards in time_calibration_tsc_rendezvous() Jan Beulich
@ 2021-02-01  7:37   ` Jan Beulich
  0 siblings, 0 replies; 4+ messages in thread
From: Jan Beulich @ 2021-02-01  7:37 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Claudemir Todo Bom

On 29.01.2021 17:20, Jan Beulich wrote:
> @@ -1696,6 +1696,21 @@ static void time_calibration_tsc_rendezv
>                  r->master_stime = read_platform_stime(NULL);
>                  r->master_tsc_stamp = rdtsc_ordered();
>              }
> +            else if ( r->master_tsc_stamp < r->max_tsc_stamp )
> +            {
> +                /*
> +                 * We want to avoid moving the TSC backwards for any CPU.
> +                 * Use the largest value observed anywhere on the first
> +                 * iteration and bump up our previously recorded system
> +                 * time accordingly.
> +                 */
> +                uint64_t delta = r->max_tsc_stamp - r->master_tsc_stamp;
> +
> +                r->master_stime += scale_delta(delta,
> +                                               &this_cpu(cpu_time).tsc_scale);
> +                r->master_tsc_stamp = r->max_tsc_stamp;
> +            }

I went too far here - adjusting ->master_stime like this is
a mistake. Especially in extreme cases like Claudemir's, it
can result in the read_platform_stime() visible in the context
above returning a value behind the previously recorded one,
and hence in NOW() (temporarily) moving backwards.

Instead of this I think I will want to move the call to
read_platform_stime() to the last iteration, such that the
gap between the point in time when it was taken and the
point in time the TSCs start counting from their new values
gets minimized. In fact I intend that to also do away with
the unnecessary reading back of the TSC in
time_calibration_rendezvous_tail() - we already know the
closest TSC value we can get hold of (without calculations),
which is the one we wrote a few cycles back.
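
A minimal sketch of that direction (purely hypothetical, not an actual
patch; the declarations below merely stand in for the real Xen
primitives and types):

#include <stddef.h>
#include <stdint.h>

typedef int64_t s_time_t;

struct rendezvous_sketch {
    s_time_t master_stime;
    uint64_t master_tsc_stamp;
};

/* Stand-ins for the real primitives. */
extern s_time_t read_platform_stime(void *);
extern void write_tsc(uint64_t tsc);

/*
 * On the final iteration CPU0 would sample the platform clock as late
 * as possible, immediately before writing the agreed-upon TSC value.
 */
static void cpu0_final_iteration(struct rendezvous_sketch *r)
{
    r->master_stime = read_platform_stime(NULL);
    write_tsc(r->master_tsc_stamp);
    /*
     * The tail can then pair r->master_stime with r->master_tsc_stamp
     * directly instead of reading the TSC back.
     */
}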

Jan

