* [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
@ 2022-01-13 13:41 Jan Beulich
  2022-01-18  8:50 ` Jan Beulich
  2022-01-18 12:45 ` Roger Pau Monné
  0 siblings, 2 replies; 10+ messages in thread
From: Jan Beulich @ 2022-01-13 13:41 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Calibration logic assumes that the platform timer (HPET or ACPI PM
timer) and the TSC are read at about the same time. This assumption may
not hold when a long latency event (e.g. SMI or NMI) occurs between the
two reads. Reduce the risk of reading uncorrelated values by doing at
least four pairs of reads, using the tuple where the delta between the
enclosing TSC reads was smallest. From the fourth iteration onwards bail
if the new TSC delta isn't better (smaller) than the best earlier one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Obviously (I think) instead of having both read_{hpet,pmtmr}_and_tsc()
the calibration logic could be folded between HPET and PMTMR, at the
expense of a couple more indirect calls (which may not be that much of a
problem as this is all boot-time only). Really such folding would have
been possible already before, just that the amount of duplicate code
hasn't been this large so far. IOW if so desired, then perhaps the
folding should be a separate prereq patch.

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -392,9 +392,36 @@ static u64 read_hpet_count(void)
     return hpet_read32(HPET_COUNTER);
 }
 
+static uint32_t __init read_hpet_and_tsc(uint64_t *tsc)
+{
+    uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0;
+    uint32_t best = best;
+    unsigned int i;
+
+    for ( i = 0; ; ++i )
+    {
+        uint32_t hpet = hpet_read32(HPET_COUNTER);
+        uint64_t tsc_cur = rdtsc_ordered();
+        uint64_t tsc_delta = tsc_cur - tsc_prev;
+
+        if ( tsc_delta < tsc_min )
+        {
+            tsc_min = tsc_delta;
+            *tsc = tsc_cur;
+            best = hpet;
+        }
+        else if ( i > 2 )
+            break;
+
+        tsc_prev = tsc_cur;
+    }
+
+    return best;
+}
+
 static int64_t __init init_hpet(struct platform_timesource *pts)
 {
-    uint64_t hpet_rate, start;
+    uint64_t hpet_rate, start, end;
     uint32_t count, target, elapsed;
     /*
      * Allow HPET to be setup, but report a frequency of 0 so it's not selected
@@ -466,13 +493,13 @@ static int64_t __init init_hpet(struct p
 
     pts->frequency = hpet_rate;
 
-    count = hpet_read32(HPET_COUNTER);
-    start = rdtsc_ordered();
+    count = read_hpet_and_tsc(&start);
     target = CALIBRATE_VALUE(hpet_rate);
     while ( (elapsed = hpet_read32(HPET_COUNTER) - count) < target )
         continue;
+    elapsed = read_hpet_and_tsc(&end) - count;
 
-    return adjust_elapsed(rdtsc_ordered() - start, elapsed, target);
+    return adjust_elapsed(end - start, elapsed, target);
 }
 
 static void resume_hpet(struct platform_timesource *pts)
@@ -505,9 +532,36 @@ static u64 read_pmtimer_count(void)
     return inl(pmtmr_ioport);
 }
 
+static uint32_t __init read_pmtmr_and_tsc(uint64_t *tsc)
+{
+    uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0;
+    uint32_t best = best;
+    unsigned int i;
+
+    for ( i = 0; ; ++i )
+    {
+        uint32_t pmtmr = inl(pmtmr_ioport);
+        uint64_t tsc_cur = rdtsc_ordered();
+        uint64_t tsc_delta = tsc_cur - tsc_prev;
+
+        if ( tsc_delta < tsc_min )
+        {
+            tsc_min = tsc_delta;
+            *tsc = tsc_cur;
+            best = pmtmr;
+        }
+        else if ( i > 2 )
+            break;
+
+        tsc_prev = tsc_cur;
+    }
+
+    return best;
+}
+
 static s64 __init init_pmtimer(struct platform_timesource *pts)
 {
-    uint64_t start;
+    uint64_t start, end;
     uint32_t count, target, mask, elapsed;
 
     if ( !pmtmr_ioport || (pmtmr_width != 24 && pmtmr_width != 32) )
@@ -516,13 +570,13 @@ static s64 __init init_pmtimer(struct pl
     pts->counter_bits = pmtmr_width;
     mask = 0xffffffff >> (32 - pmtmr_width);
 
-    count = inl(pmtmr_ioport);
-    start = rdtsc_ordered();
+    count = read_pmtmr_and_tsc(&start);
     target = CALIBRATE_VALUE(ACPI_PM_FREQUENCY);
-    while ( (elapsed = (inl(pmtmr_ioport) - count) & mask) < target )
+    while ( ((inl(pmtmr_ioport) - count) & mask) < target )
         continue;
+    elapsed = read_pmtmr_and_tsc(&end) - count;
 
-    return adjust_elapsed(rdtsc_ordered() - start, elapsed, target);
+    return adjust_elapsed(end - start, elapsed, target);
 }
 
 static struct platform_timesource __initdata plt_pmtimer =




* Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
  2022-01-13 13:41 [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy Jan Beulich
@ 2022-01-18  8:50 ` Jan Beulich
  2022-01-18 11:37   ` David Vrabel
  2022-01-18 12:45 ` Roger Pau Monné
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Beulich @ 2022-01-18  8:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

On 13.01.2022 14:41, Jan Beulich wrote:
> Calibration logic assumes that the platform timer (HPET or ACPI PM
> timer) and the TSC are read at about the same time. This assumption may
> not hold when a long latency event (e.g. SMI or NMI) occurs between the
> two reads. Reduce the risk of reading uncorrelated values by doing at
> least four pairs of reads, using the tuple where the delta between the
> enclosing TSC reads was smallest. From the fourth iteration onwards bail
> if the new TSC delta isn't better (smaller) than the best earlier one.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

When running virtualized, scheduling in the host would also constitute
long latency events. I wonder whether, to compensate for that, we'd want
more than 3 "base" iterations, as I would expect scheduling events to
occur more frequently than e.g. SMI (and with a higher probability of
multiple ones occurring in close succession).

Jan




* Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
  2022-01-18  8:50 ` Jan Beulich
@ 2022-01-18 11:37   ` David Vrabel
  2022-01-18 13:30     ` Jan Beulich
  0 siblings, 1 reply; 10+ messages in thread
From: David Vrabel @ 2022-01-18 11:37 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné



On 18/01/2022 08:50, Jan Beulich wrote:
> On 13.01.2022 14:41, Jan Beulich wrote:
>> Calibration logic assumes that the platform timer (HPET or ACPI PM
>> timer) and the TSC are read at about the same time. This assumption may
>> not hold when a long latency event (e.g. SMI or NMI) occurs between the
>> two reads. Reduce the risk of reading uncorrelated values by doing at
>> least four pairs of reads, using the tuple where the delta between the
>> enclosing TSC reads was smallest. From the fourth iteration onwards bail
>> if the new TSC delta isn't better (smaller) than the best earlier one.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> When running virtualized, scheduling in the host would also constitute
> long latency events. I wonder whether, to compensate for that, we'd want
> more than 3 "base" iterations, as I would expect scheduling events to
> occur more frequently than e.g. SMI (and with a higher probability of
> multiple ones occurring in close succession).

Should Xen be continually or periodically recalibrating? Rather than 
trying to get it perfect at the start of day?

You may also be able to find inspiration from the design or 
implementation of the Precision Time Protocol which has to similarly 
filter out outliers due to transmission delays.

David



* Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
  2022-01-13 13:41 [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy Jan Beulich
  2022-01-18  8:50 ` Jan Beulich
@ 2022-01-18 12:45 ` Roger Pau Monné
  2022-01-18 13:39   ` Jan Beulich
  1 sibling, 1 reply; 10+ messages in thread
From: Roger Pau Monné @ 2022-01-18 12:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Jan 13, 2022 at 02:41:31PM +0100, Jan Beulich wrote:
> Calibration logic assumes that the platform timer (HPET or ACPI PM
> timer) and the TSC are read at about the same time. This assumption may
> not hold when a long latency event (e.g. SMI or NMI) occurs between the
> two reads. Reduce the risk of reading uncorrelated values by doing at
> least four pairs of reads, using the tuple where the delta between the
> enclosing TSC reads was smallest. From the fourth iteration onwards bail
> if the new TSC delta isn't better (smaller) than the best earlier one.

Do you have some measurements as to whether this makes the frequency
detection any more accurate?

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Obviously (I think) instead of having both read_{hpet,pmtmr}_and_tsc()
> the calibration logic could be folded between HPET and PMTMR, at the

As an intermediate solution you could have a read_counter_and_tsc I
would think?

> expense of a couple more indirect calls (which may not be that much of a
> problem as this is all boot-time only). Really such folding would have
> been possible already before, just that the amount of duplicate code
> hasn't been this large so far. IOW if so desired, then perhaps the
> folding should be a separate prereq patch.

You could even pass a platform_timesource as a parameter to that
generic read counter and TSC helper.

> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -392,9 +392,36 @@ static u64 read_hpet_count(void)
>      return hpet_read32(HPET_COUNTER);
>  }
>  
> +static uint32_t __init read_hpet_and_tsc(uint64_t *tsc)
> +{
> +    uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0;
> +    uint32_t best = best;
> +    unsigned int i;
> +
> +    for ( i = 0; ; ++i )
> +    {
> +        uint32_t hpet = hpet_read32(HPET_COUNTER);
> +        uint64_t tsc_cur = rdtsc_ordered();
> +        uint64_t tsc_delta = tsc_cur - tsc_prev;
> +
> +        if ( tsc_delta < tsc_min )
> +        {
> +            tsc_min = tsc_delta;
> +            *tsc = tsc_cur;
> +            best = hpet;
> +        }
> +        else if ( i > 2 )
> +            break;
> +
> +        tsc_prev = tsc_cur;

Would it be better to set tsc_prev to the current TSC value here in
order to minimize the window you are measuring? Or else tsc_delta will
also account for the time in the loop code, and there's a higher
likelihood of being interrupted?

I guess being interrupted four times so that all loops are bad is very
unlikely.

Since this loop could in theory run multiple times, do we need to
check that the counter doesn't overflow? Or else the elapsed counter
value would be miscalculated.

Thanks, Roger.



* Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
  2022-01-18 11:37   ` David Vrabel
@ 2022-01-18 13:30     ` Jan Beulich
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Beulich @ 2022-01-18 13:30 UTC (permalink / raw)
  To: David Vrabel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, xen-devel

On 18.01.2022 12:37, David Vrabel wrote:
> 
> 
> On 18/01/2022 08:50, Jan Beulich wrote:
>> On 13.01.2022 14:41, Jan Beulich wrote:
>>> Calibration logic assumes that the platform timer (HPET or ACPI PM
>>> timer) and the TSC are read at about the same time. This assumption may
>>> not hold when a long latency event (e.g. SMI or NMI) occurs between the
>>> two reads. Reduce the risk of reading uncorrelated values by doing at
>>> least four pairs of reads, using the tuple where the delta between the
>>> enclosing TSC reads was smallest. From the fourth iteration onwards bail
>>> if the new TSC delta isn't better (smaller) than the best earlier one.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> When running virtualized, scheduling in the host would also constitute
>> long latency events. I wonder whether, to compensate for that, we'd want
>> more than 3 "base" iterations, as I would expect scheduling events to
>> occur more frequently than e.g. SMI (and with a higher probability of
>> multiple ones occurring in close succession).
> 
> Should Xen be continually or periodically recalibrating? Rather than 
> trying to get it perfect at the start of day?

I wouldn't call dealing with bad samples "getting it perfect". IOW I
think recalibrating later may be an option, but independent of what
I'm doing here.

> You may also be able to find inspiration from the design or 
> implementation of the Precision Time Protocol which has to similarly 
> filter out outliers due to transmission delays.

Thanks for the pointer.

Jan




* Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
  2022-01-18 12:45 ` Roger Pau Monné
@ 2022-01-18 13:39   ` Jan Beulich
  2022-01-18 14:05     ` Roger Pau Monné
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Beulich @ 2022-01-18 13:39 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 18.01.2022 13:45, Roger Pau Monné wrote:
> On Thu, Jan 13, 2022 at 02:41:31PM +0100, Jan Beulich wrote:
>> Calibration logic assumes that the platform timer (HPET or ACPI PM
>> timer) and the TSC are read at about the same time. This assumption may
>> not hold when a long latency event (e.g. SMI or NMI) occurs between the
>> two reads. Reduce the risk of reading uncorrelated values by doing at
>> least four pairs of reads, using the tuple where the delta between the
>> enclosing TSC reads was smallest. From the fourth iteration onwards bail
>> if the new TSC delta isn't better (smaller) than the best earlier one.
> 
> Do you have some measurements as to whether this makes the frequency
> detection any more accurate?

In the normal case (no SMI or alike) I haven't observed any further
improvement on top of the earlier patch.

>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Obviously (I think) instead of having both read_{hpet,pmtmr}_and_tsc()
>> the calibration logic could be folded between HPET and PMTMR, at the
> 
> As an intermediate solution you could have a read_counter_and_tsc I
> would think?

Not sure in which way you view this as "intermediate".

>> expense of a couple more indirect calls (which may not be that much of a
>> problem as this is all boot-time only). Really such folding would have
>> been possible already before, just that the amount of duplicate code
>> hasn't been this large so far. IOW if so desired, then perhaps the
>> folding should be a separate prereq patch.
> 
> You could even pass a platform_timesource as a parameter to that
> generic read counter and TSC helper.

This was the plan, yes, in case I was asked to follow that route (or
if I decided myself that I'd prefer that significantly over the
redundancy).

>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -392,9 +392,36 @@ static u64 read_hpet_count(void)
>>      return hpet_read32(HPET_COUNTER);
>>  }
>>  
>> +static uint32_t __init read_hpet_and_tsc(uint64_t *tsc)
>> +{
>> +    uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0;
>> +    uint32_t best = best;
>> +    unsigned int i;
>> +
>> +    for ( i = 0; ; ++i )
>> +    {
>> +        uint32_t hpet = hpet_read32(HPET_COUNTER);
>> +        uint64_t tsc_cur = rdtsc_ordered();
>> +        uint64_t tsc_delta = tsc_cur - tsc_prev;
>> +
>> +        if ( tsc_delta < tsc_min )
>> +        {
>> +            tsc_min = tsc_delta;
>> +            *tsc = tsc_cur;
>> +            best = hpet;
>> +        }
>> +        else if ( i > 2 )
>> +            break;
>> +
>> +        tsc_prev = tsc_cur;
> 
> Would it be better to set tsc_prev to the current TSC value here in
> order to minimize the window you are measuring? Or else tsc_delta will
> also account for the time in the loop code, and there's a higher
> likelihood of being interrupted?

I did consider this (or some variant thereof), but did for the moment
conclude that it wouldn't make enough of a difference. If you think
it would be meaningfully better that way, I'll switch.

Both here and for the aspect above, please express your preference (i.e.
not in the form of a question).

> I guess being interrupted four times so that all loops are bad is very
> unlikely.
> 
> Since this loop could in theory run multiple times, do we need to
> check that the counter doesn't overflow? Or else the elapsed counter
> value would be miscalculated.

I don't think I see any issue with overflow here. It's the callers
who need to care about that. And I don't think this function will run
for long enough such that a counter would not only wrap, but also
reach its initial count again.

Jan




* Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
  2022-01-18 13:39   ` Jan Beulich
@ 2022-01-18 14:05     ` Roger Pau Monné
  2022-01-18 14:26       ` Jan Beulich
  0 siblings, 1 reply; 10+ messages in thread
From: Roger Pau Monné @ 2022-01-18 14:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Tue, Jan 18, 2022 at 02:39:03PM +0100, Jan Beulich wrote:
> On 18.01.2022 13:45, Roger Pau Monné wrote:
> > On Thu, Jan 13, 2022 at 02:41:31PM +0100, Jan Beulich wrote:
> >> Calibration logic assumes that the platform timer (HPET or ACPI PM
> >> timer) and the TSC are read at about the same time. This assumption may
> >> not hold when a long latency event (e.g. SMI or NMI) occurs between the
> >> two reads. Reduce the risk of reading uncorrelated values by doing at
> >> least four pairs of reads, using the tuple where the delta between the
> >> enclosing TSC reads was smallest. From the fourth iteration onwards bail
> >> if the new TSC delta isn't better (smaller) than the best earlier one.
> > 
> > Do you have some measurements as to whether this makes the frequency
> > detection any more accurate?
> 
> In the normal case (no SMI or alike) I haven't observed any further
> improvement on top of the earlier patch.
> 
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> Obviously (I think) instead of having both read_{hpet,pmtmr}_and_tsc()
> >> the calibration logic could be folded between HPET and PMTMR, at the
> > 
> > As an intermediate solution you could have a read_counter_and_tsc I
> > would think?
> 
> Not sure in which way you view this as "intermediate".

As in not unifying the two calibration paths into a single function, but
rather just using a generic function to read the counter and TSC.

With your original remark I assumed that you wanted to unify all the
calibration code in init_hpet and init_pmtimer into a generic
function, but maybe I've misunderstood.

> >> expense of a couple more indirect calls (which may not be that much of a
> >> problem as this is all boot-time only). Really such folding would have
> >> been possible already before, just that the amount of duplicate code
> >> hasn't been this large so far. IOW if so desired, then perhaps the
> >> folding should be a separate prereq patch.
> > 
> > You could even pass a platform_timesource as a parameter to that
> > generic read counter and TSC helper.
> 
> This was the plan, yes, in case I was asked to follow that route (or
> if I decided myself that I'd prefer that significantly over the
> redundancy).

I think having a generic read_counter_and_tsc would be better than
having read_{hpet,pmtmr}_and_tsc.

> 
> >> --- a/xen/arch/x86/time.c
> >> +++ b/xen/arch/x86/time.c
> >> @@ -392,9 +392,36 @@ static u64 read_hpet_count(void)
> >>      return hpet_read32(HPET_COUNTER);
> >>  }
> >>  
> >> +static uint32_t __init read_hpet_and_tsc(uint64_t *tsc)
> >> +{
> >> +    uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0;
> >> +    uint32_t best = best;
> >> +    unsigned int i;
> >> +
> >> +    for ( i = 0; ; ++i )
> >> +    {
> >> +        uint32_t hpet = hpet_read32(HPET_COUNTER);
> >> +        uint64_t tsc_cur = rdtsc_ordered();
> >> +        uint64_t tsc_delta = tsc_cur - tsc_prev;
> >> +
> >> +        if ( tsc_delta < tsc_min )
> >> +        {
> >> +            tsc_min = tsc_delta;
> >> +            *tsc = tsc_cur;
> >> +            best = hpet;
> >> +        }
> >> +        else if ( i > 2 )
> >> +            break;
> >> +
> >> +        tsc_prev = tsc_cur;
> > 
> > Would it be better to set tsc_prev to the current TSC value here in
> > order to minimize the window you are measuring? Or else tsc_delta will
> > also account for the time in the loop code, and there's a higher
> > likelihood of being interrupted?
> 
> I did consider this (or some variant thereof), but did for the moment
> conclude that it wouldn't make enough of a difference. If you think
> it would be meaningfully better that way, I'll switch.
> 
> Both here and for the aspect above, please express your preference (i.e.
> not in the form of a question).

Let's leave it as-is (just one TSC read per loop iteration).

> > I guess being interrupted four times so that all loops are bad is very
> > unlikely.
> > 
> > Since this loop could in theory run multiple times, do we need to
> > check that the counter doesn't overflow? Or else the elapsed counter
> > value would be miscalculated.
> 
> I don't think I see any issue with overflow here. It's the callers
> who need to care about that. And I don't think this function will run
> for long enough such that a counter would not only wrap, but also
> reach its initial count again.

Right, I haven't expressed myself correctly. It's not overflow that's
the issue, but rather wrapping back to the start count value. Limiting the
iterations to some fixed value (10?) in order to remove that
possibility completely would be OK for me.

Thanks, Roger.



* Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
  2022-01-18 14:05     ` Roger Pau Monné
@ 2022-01-18 14:26       ` Jan Beulich
  2022-01-19  8:46         ` Roger Pau Monné
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Beulich @ 2022-01-18 14:26 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 18.01.2022 15:05, Roger Pau Monné wrote:
> On Tue, Jan 18, 2022 at 02:39:03PM +0100, Jan Beulich wrote:
>> On 18.01.2022 13:45, Roger Pau Monné wrote:
>>> On Thu, Jan 13, 2022 at 02:41:31PM +0100, Jan Beulich wrote:
>>>> Calibration logic assumes that the platform timer (HPET or ACPI PM
>>>> timer) and the TSC are read at about the same time. This assumption may
>>>> not hold when a long latency event (e.g. SMI or NMI) occurs between the
>>>> two reads. Reduce the risk of reading uncorrelated values by doing at
>>>> least four pairs of reads, using the tuple where the delta between the
>>>> enclosing TSC reads was smallest. From the fourth iteration onwards bail
>>>> if the new TSC delta isn't better (smaller) than the best earlier one.
>>>
>>> Do you have some measurements as to whether this makes the frequency
>>> detection any more accurate?
>>
>> In the normal case (no SMI or alike) I haven't observed any further
>> improvement on top of the earlier patch.
>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> Obviously (I think) instead of having both read_{hpet,pmtmr}_and_tsc()
>>>> the calibration logic could be folded between HPET and PMTMR, at the
>>>
>>> As an intermediate solution you could have a read_counter_and_tsc I
>>> would think?
>>
>> Not sure in which way you view this as "intermediate".
> 
> As in not unifying the two calibration paths into a single function, but
> rather just using a generic function to read the counter and TSC.
> 
> With your original remark I assumed that you wanted to unify all the
> calibration code in init_hpet and init_pmtimer into a generic
> function, but maybe I've misunderstood.

Oh, I see. I have to admit that I see little value (at this point) to
break out just the pair-read logic. While I did say we have similar
issues elsewhere, my initial analysis has left me with the impression
that those other cases are sufficiently different for such a helper to
be of no use there.

>>>> expense of a couple more indirect calls (which may not be that much of a
>>>> problem as this is all boot-time only). Really such folding would have
>>>> been possible already before, just that the amount of duplicate code
>>>> hasn't been this large so far. IOW if so desired, then perhaps the
>>>> folding should be a separate prereq patch.
>>>
>>> You could even pass a platform_timesource as a parameter to that
>>> generic read counter and TSC helper.
>>
>> This was the plan, yes, in case I was asked to follow that route (or
>> if I decided myself that I'd prefer that significantly over the
>> redundancy).
> 
> I think having a generic read_counter_and_tsc would be better than
> having read_{hpet,pmtmr}_and_tsc.
> 
>>
>>>> --- a/xen/arch/x86/time.c
>>>> +++ b/xen/arch/x86/time.c
>>>> @@ -392,9 +392,36 @@ static u64 read_hpet_count(void)
>>>>      return hpet_read32(HPET_COUNTER);
>>>>  }
>>>>  
>>>> +static uint32_t __init read_hpet_and_tsc(uint64_t *tsc)
>>>> +{
>>>> +    uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0;
>>>> +    uint32_t best = best;
>>>> +    unsigned int i;
>>>> +
>>>> +    for ( i = 0; ; ++i )
>>>> +    {
>>>> +        uint32_t hpet = hpet_read32(HPET_COUNTER);
>>>> +        uint64_t tsc_cur = rdtsc_ordered();
>>>> +        uint64_t tsc_delta = tsc_cur - tsc_prev;
>>>> +
>>>> +        if ( tsc_delta < tsc_min )
>>>> +        {
>>>> +            tsc_min = tsc_delta;
>>>> +            *tsc = tsc_cur;
>>>> +            best = hpet;
>>>> +        }
>>>> +        else if ( i > 2 )
>>>> +            break;
>>>> +
>>>> +        tsc_prev = tsc_cur;
>>>
>>> Would it be better to set tsc_prev to the current TSC value here in
>>> order to minimize the window you are measuring? Or else tsc_delta will
>>> also account for the time in the loop code, and there's a higher
>>> likelihood of being interrupted?
>>
>> I did consider this (or some variant thereof), but did for the moment
>> conclude that it wouldn't make enough of a difference. If you think
>> it would be meaningfully better that way, I'll switch.
>>
>> Both here and for the aspect above, please express your preference (i.e.
>> not in the form of a question).
> 
> Let's leave it as-is (just one TSC read per loop iteration).
> 
>>> I guess being interrupted four times so that all loops are bad is very
>>> unlikely.
>>>
>>> Since this loop could in theory run multiple times, do we need to
>>> check that the counter doesn't overflow? Or else the elapsed counter
>>> value would be miscalculated.
>>
>> I don't think I see any issue with overflow here. It's the callers
>> who need to care about that. And I don't think this function will run
>> for long enough such that a counter would not only wrap, but also
>> reach its initial count again.
> 
> Right, I haven't expressed myself correctly. It's not overflow that's
> the issue, but rather wrapping back to the start count value. Limiting the
> iterations to some fixed value (10?) in order to remove that
> possibility completely would be OK for me.

Hmm, I was in fact hoping (and trying) to get away without any arbitrarily
chosen numbers. The loop cannot be infinite in any event ... Please let me
know how strongly you feel about putting in place such an arbitrary limit.

Jan




* Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
  2022-01-18 14:26       ` Jan Beulich
@ 2022-01-19  8:46         ` Roger Pau Monné
  2022-01-19  8:50           ` Jan Beulich
  0 siblings, 1 reply; 10+ messages in thread
From: Roger Pau Monné @ 2022-01-19  8:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Tue, Jan 18, 2022 at 03:26:36PM +0100, Jan Beulich wrote:
> On 18.01.2022 15:05, Roger Pau Monné wrote:
> > On Tue, Jan 18, 2022 at 02:39:03PM +0100, Jan Beulich wrote:
> >> On 18.01.2022 13:45, Roger Pau Monné wrote:
> >>> On Thu, Jan 13, 2022 at 02:41:31PM +0100, Jan Beulich wrote:
> >>>> Calibration logic assumes that the platform timer (HPET or ACPI PM
> >>>> timer) and the TSC are read at about the same time. This assumption may
> >>>> not hold when a long latency event (e.g. SMI or NMI) occurs between the
> >>>> two reads. Reduce the risk of reading uncorrelated values by doing at
> >>>> least four pairs of reads, using the tuple where the delta between the
> >>>> enclosing TSC reads was smallest. From the fourth iteration onwards bail
> >>>> if the new TSC delta isn't better (smaller) than the best earlier one.
> >>>
> >>> Do you have some measurements as to whether this makes the frequency
> >>> detection any more accurate?
> >>
> >> In the normal case (no SMI or alike) I haven't observed any further
> >> improvement on top of the earlier patch.
> >>
> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>> ---
> >>>> Obviously (I think) instead of having both read_{hpet,pmtmr}_and_tsc()
> >>>> the calibration logic could be folded between HPET and PMTMR, at the
> >>>
> >>> As an intermediate solution you could have a read_counter_and_tsc I
> >>> would think?
> >>
> >> Not sure in which way you view this as "intermediate".
> > 
> > As in not unifying the two calibration paths into a single function, but
> > rather just using a generic function to read the counter and TSC.
> > 
> > With your original remark I assumed that you wanted to unify all the
> > calibration code in init_hpet and init_pmtimer into a generic
> > function, but maybe I've misunderstood.
> 
> Oh, I see. I have to admit that I see little value (at this point) to
> break out just the pair-read logic. While I did say we have similar
> issues elsewhere, my initial analysis has left me with the impression
> that those other cases are sufficiently different for such a helper to
> be of no use there.
> 
> >>>> expense of a couple more indirect calls (which may not be that much of a
> >>>> problem as this is all boot-time only). Really such folding would have
> >>>> been possible already before, just that the amount of duplicate code
> >>>> hasn't been this large so far. IOW if so desired, then perhaps the
> >>>> folding should be a separate prereq patch.
> >>>
> >>> You could even pass a platform_timesource as a parameter to that
> >>> generic read counter and TSC helper.
> >>
> >> This was the plan, yes, in case I was asked to follow that route (or
> >> if I decided myself that I'd prefer that significantly over the
> >> redundancy).
> > 
> > I think having a generic read_counter_and_tsc would be better than
> > having read_{hpet,pmtmr}_and_tsc.
> > 
> >>
> >>>> --- a/xen/arch/x86/time.c
> >>>> +++ b/xen/arch/x86/time.c
> >>>> @@ -392,9 +392,36 @@ static u64 read_hpet_count(void)
> >>>>      return hpet_read32(HPET_COUNTER);
> >>>>  }
> >>>>  
> >>>> +static uint32_t __init read_hpet_and_tsc(uint64_t *tsc)
> >>>> +{
> >>>> +    uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0;
> >>>> +    uint32_t best = best;
> >>>> +    unsigned int i;
> >>>> +
> >>>> +    for ( i = 0; ; ++i )
> >>>> +    {
> >>>> +        uint32_t hpet = hpet_read32(HPET_COUNTER);
> >>>> +        uint64_t tsc_cur = rdtsc_ordered();
> >>>> +        uint64_t tsc_delta = tsc_cur - tsc_prev;
> >>>> +
> >>>> +        if ( tsc_delta < tsc_min )
> >>>> +        {
> >>>> +            tsc_min = tsc_delta;
> >>>> +            *tsc = tsc_cur;
> >>>> +            best = hpet;
> >>>> +        }
> >>>> +        else if ( i > 2 )
> >>>> +            break;
> >>>> +
> >>>> +        tsc_prev = tsc_cur;
> >>>
> >>> Would it be better to set tsc_prev to the current TSC value here in
> >>> order to minimize the window you are measuring? Or else tsc_delta will
> >>> also account for the time in the loop code, and there's a higher
> >>> likelihood of being interrupted?
> >>
> >> I did consider this (or some variant thereof), but did for the moment
> >> conclude that it wouldn't make enough of a difference. If you think
> >> it would be meaningfully better that way, I'll switch.
> >>
> >> Both here and for the aspect above, please express your preference (i.e.
> >> not in the form of a question).
> > 
> > Let's leave it as-is (just one TSC read per loop iteration).
> > 
> >>> I guess being interrupted four times so that all loops are bad is very
> >>> unlikely.
> >>>
> >>> Since this loop could in theory run multiple times, do we need to
> >>> check that the counter doesn't overflow? Or else the elapsed counter
> >>> value would be miscalculated.
> >>
> >> I don't think I see any issue with overflow here. It's the callers
> >> who need to care about that. And I don't think this function will run
> >> for long enough such that a counter would not only wrap, but also
> >> reach its initial count again.
> > 
> > Right, I haven't expressed myself correctly. It's not overflow that's the
> > issue, but rather wrapping back to the start count value. Limiting the
> > iterations to some fixed value (10?) in order to remove that
> > possibility completely would be OK for me.
> 
> Hmm, I was in fact hoping (and trying) to get away without any arbitrarily
> chosen numbers. The loop cannot be infinite in any event ... Please let me
> know how strongly you feel about putting in place such an arbitrary limit.

It was mostly for safety reasons; I wouldn't expect that loop to need
more than 4 iterations really (which is also an arbitrarily chosen
number). Let's leave it without any limit then for the time being.

Roger.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy
  2022-01-19  8:46         ` Roger Pau Monné
@ 2022-01-19  8:50           ` Jan Beulich
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Beulich @ 2022-01-19  8:50 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 19.01.2022 09:46, Roger Pau Monné wrote:
> On Tue, Jan 18, 2022 at 03:26:36PM +0100, Jan Beulich wrote:
>> On 18.01.2022 15:05, Roger Pau Monné wrote:
>>> On Tue, Jan 18, 2022 at 02:39:03PM +0100, Jan Beulich wrote:
>>>> On 18.01.2022 13:45, Roger Pau Monné wrote:
>>>>> On Thu, Jan 13, 2022 at 02:41:31PM +0100, Jan Beulich wrote:
>>>>>> Calibration logic assumes that the platform timer (HPET or ACPI PM
>>>>>> timer) and the TSC are read at about the same time. This assumption may
>>>>>> not hold when a long latency event (e.g. SMI or NMI) occurs between the
>>>>>> two reads. Reduce the risk of reading uncorrelated values by doing at
>>>>>> least four pairs of reads, using the tuple where the delta between the
>>>>>> enclosing TSC reads was smallest. From the fourth iteration onwards bail
>>>>>> if the new TSC delta isn't better (smaller) than the best earlier one.
>>>>>
>>>>> Do you have some measurements as to whether this makes the frequency
>>>>> detection any more accurate?
>>>>
>>>> In the normal case (no SMI or alike) I haven't observed any further
>>>> improvement on top of the earlier patch.
>>>>
>>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>>> ---
>>>>>> Obviously (I think) instead of having both read_{hpet,pmtmr}_and_tsc()
>>>>>> the calibration logic could be folded between HPET and PMTMR, at the
>>>>>
>>>>> As an intermediate solution you could have a read_counter_and_tsc I
>>>>> would think?
>>>>
>>>> Not sure in which way you view this as "intermediate".
>>>
>>> As in not unifying both calibration logic into a single function, but
>>> rather just use a generic function to read the counter and TSC.
>>>
>>> With your original remark I assumed that you wanted to unify all the
>>> calibration code in init_hpet and init_pmtimer into a generic
>>> function, but maybe I've misunderstood.
>>
>> Oh, I see. I have to admit that I see little value (at this point) to
>> break out just the pair-read logic. While I did say we have similar
>> issues elsewhere, my initial analysis has left me with the impression
>> that those other cases are sufficiently different for such a helper to
>> be of no use there.
>>
>>>>>> expense of a couple more indirect calls (which may not be that much of a
>>>>>> problem as this is all boot-time only). Really such folding would have
>>>>>> been possible already before, just that the amount of duplicate code
>>>>>> hasn't been this large so far. IOW if so desired, then perhaps the
>>>>>> folding should be a separate prereq patch.
>>>>>
>>>>> You could even pass a platform_timesource as a parameter to that
>>>>> generic read counter and TSC helper.
>>>>
>>>> This was the plan, yes, in case I was asked to follow that route (or
>>>> if I decided myself that I'd prefer that significantly over the
>>>> redundancy).
>>>
>>> I think having a generic read_counter_and_tsc would be better than
>>> having read_{hpet,pmtmr}_and_tsc.

Btw, I think I will want to break out the full calibration logic, to then
further generalize it to make it suitable for LAPIC timer calibration (then
no longer against the PIT, but against the earlier chosen platform timer).

>>>>>> --- a/xen/arch/x86/time.c
>>>>>> +++ b/xen/arch/x86/time.c
>>>>>> @@ -392,9 +392,36 @@ static u64 read_hpet_count(void)
>>>>>>      return hpet_read32(HPET_COUNTER);
>>>>>>  }
>>>>>>  
>>>>>> +static uint32_t __init read_hpet_and_tsc(uint64_t *tsc)
>>>>>> +{
>>>>>> +    uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0;
>>>>>> +    uint32_t best = best;
>>>>>> +    unsigned int i;
>>>>>> +
>>>>>> +    for ( i = 0; ; ++i )
>>>>>> +    {
>>>>>> +        uint32_t hpet = hpet_read32(HPET_COUNTER);
>>>>>> +        uint64_t tsc_cur = rdtsc_ordered();
>>>>>> +        uint64_t tsc_delta = tsc_cur - tsc_prev;
>>>>>> +
>>>>>> +        if ( tsc_delta < tsc_min )
>>>>>> +        {
>>>>>> +            tsc_min = tsc_delta;
>>>>>> +            *tsc = tsc_cur;
>>>>>> +            best = hpet;
>>>>>> +        }
>>>>>> +        else if ( i > 2 )
>>>>>> +            break;
>>>>>> +
>>>>>> +        tsc_prev = tsc_cur;
>>>>>
>>>>> Would it be better to set tsc_prev to the current TSC value here in
>>>>> order to minimize the window you are measuring? Or else tsc_delta will
>>>>> also account for the time in the loop code, and there's a higher
>>>>> likelihood of being interrupted?
>>>>
>>>> I did consider this (or some variant thereof), but did for the moment
>>>> conclude that it wouldn't make enough of a difference. If you think
>>>> it would be meaningfully better that way, I'll switch.
>>>>
>>>> Both here and for the aspect above, please express your preference (i.e.
>>>> not in the form of a question).
>>>
>>> Let's leave it as-is (just one TSC read per loop iteration).
>>>
>>>>> I guess being interrupted four times so that all loops are bad is very
>>>>> unlikely.
>>>>>
>>>>> Since this loop could in theory run multiple times, do we need to
>>>>> check that the counter doesn't overflow? Or else the elapsed counter
>>>>> value would be miscalculated.
>>>>
>>>> I don't think I see any issue with overflow here. It's the callers
>>>> who need to care about that. And I don't think this function will run
>>>> for long enough such that a counter would not only wrap, but also
>>>> reach its initial count again.
>>>
>>> Right, I haven't expressed myself correctly. It's not overflow that's the
>>> issue, but rather wrapping back to the start count value. Limiting the
>>> iterations to some fixed value (10?) in order to remove that
>>> possibility completely would be OK for me.
>>
>> Hmm, I was in fact hoping (and trying) to get away without any arbitrarily
>> chosen numbers. The loop cannot be infinite in any event ... Please let me
>> know how strongly you feel about putting in place such an arbitrary limit.
> 
> It was mostly for safety reasons; I wouldn't expect that loop to need
> more than 4 iterations really (which is also an arbitrarily chosen
> number).

Fair point - I should have said "without further arbitrarily chosen
numbers".

> Let's leave it without any limit then for the time being.

Thanks.

Jan




end of thread, other threads:[~2022-01-19  8:51 UTC | newest]

Thread overview: 10+ messages
-- links below jump to the message on this page --
2022-01-13 13:41 [PATCH] x86/time: further improve TSC / CPU freq calibration accuracy Jan Beulich
2022-01-18  8:50 ` Jan Beulich
2022-01-18 11:37   ` David Vrabel
2022-01-18 13:30     ` Jan Beulich
2022-01-18 12:45 ` Roger Pau Monné
2022-01-18 13:39   ` Jan Beulich
2022-01-18 14:05     ` Roger Pau Monné
2022-01-18 14:26       ` Jan Beulich
2022-01-19  8:46         ` Roger Pau Monné
2022-01-19  8:50           ` Jan Beulich
