x86/tsc: Add tsc_tuned_baseclk flag disabling CPUID.16h use for tsc calibration

Message ID 9rN6HvBfpUYE7XjHYSTKXKkKOUHQd_skSYGqjXlI0jTIk4nqLoLUloev1jgSayOdvzmkXgRNP8j_mgcikMJy6L_JN_vJhUJn9vD9xm_ueSo=@protonmail.com
State New

Commit Message

Krzysztof Piecuch Jan. 17, 2020, 3:13 p.m. UTC
Changing base clock frequency directly impacts tsc hz but not CPUID.16h
values. An overclocked CPU supporting CPUID.16h and partial CPUID.15h
support will set tsc hz according to "best guess" given by CPUID.16h
relying on tsc_refine_calibration_work to give better numbers later.
tsc_refine_calibration_work will refuse to do its work when the outcome is
off the early tsc hz value by more than 1% which is certain to happen on an
overclocked system.

Fix this by adding tsc_tuned_baseclk command line parameter that makes
the kernel ignore CPUID.16h data during TSC calibration.

Signed-off-by: Krzysztof Piecuch <piecuch@protonmail.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 11 +++++++++++
 arch/x86/kernel/tsc.c                           | 16 ++++++++++++++--
 2 files changed, 25 insertions(+), 2 deletions(-)

--
2.20.1

Comments

Andy Lutomirski Jan. 17, 2020, 4:37 p.m. UTC | #1
> On Jan 17, 2020, at 7:21 AM, Krzysztof Piecuch <piecuch@protonmail.com> wrote:
> 
> Changing base clock frequency directly impacts tsc hz but not CPUID.16h
> values. An overclocked CPU supporting CPUID.16h and partial CPUID.15h
> support will set tsc hz according to "best guess" given by CPUID.16h
> relying on tsc_refine_calibration_work to give better numbers later.
> tsc_refine_calibration_work will refuse to do its work when the outcome is
> off the early tsc hz value by more than 1% which is certain to happen on an
> overclocked system.
> 

Wouldn’t it be better to have an option tsc_max_refinement= to increase the 1%?
Krzysztof Piecuch Jan. 20, 2020, 11:15 a.m. UTC | #2
On Friday, January 17, 2020 4:37 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Wouldn’t it be better to have an option tsc_max_refinement= to increase the 1%?

All the comments say about it is:

 * If there are any calibration anomalies (too many SMIs, etc),
 * or the refined calibration is off by 1% of the fast early
 * calibration, we throw out the new calibration and use the
 * early calibration.

I still don't fully understand why the "1% rule" exists.

Ideally it would be better to get the early calibration right than risk getting
it wrong because of an "anomaly".
OTOH if your system doesn't support any of the early calibration methods other
than CPUID.16h (mine doesn't support either PIT or MSR) "tsc_max_refinement"
would allow you to control max tsc_hz error.

If you think that would be better please let me know.
Thomas Gleixner Jan. 20, 2020, 1:42 p.m. UTC | #3
Krzysztof,

Krzysztof Piecuch <piecuch@protonmail.com> writes:
> On Friday, January 17, 2020 4:37 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> Wouldn’t it be better to have an option tsc_max_refinement= to increase the 1%?
>
> All the comments say about it is:
>
>  * If there are any calibration anomalies (too many SMIs, etc),
>  * or the refined calibration is off by 1% of the fast early
>  * calibration, we throw out the new calibration and use the
>  * early calibration.
>
> I still don't fully understand why the "1% rule" exists.

Simply because all of this is horribly fragile and if you put virt into
the picture it gets even worse.

The initial calibration via PIT/HPET is halfway accurate in most cases
and we use the 1% as a sanity check.

> Ideally it would be better to get the early calibration right than
> risk getting it wrong because of an "anomaly".

Ideally we would just have a way to read the stupid frequency from some
reliable place, but there is no such thing.

Guess why we have all this code, surely not because we have nothing
better to do than dreaming up a variety of weird ways to figure out that
frequency.

> OTOH if your system doesn't support any of the early calibration
> methods other than CPUID.16h (mine doesn't support either PIT or MSR)
> "tsc_max_refinement" would allow you to control max tsc_hz error.

Widening the error window here is clearly a hack. As you have to supply
a valid number there, why not just provide the frequency itself
on the command line? That would at least make most sense and would avoid
using completely wrong data in the early boot stage.

Thanks,

        tglx
Krzysztof Piecuch Jan. 20, 2020, 2:20 p.m. UTC | #4
On Monday, January 20, 2020 1:42 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Simply because all of this is horribly fragile and if you put virt into
> the picture it gets even worse.
>
> The initial calibration via PIT/HPET is halfway accurate in most cases
> and we use the 1% as a sanity check.
>
> > Ideally it would be better to get the early calibration right than
> > risk getting it wrong because of an "anomaly".
>
> Ideally we would just have a way to read the stupid frequency from some
> reliable place, but there is no such thing.
>
> Guess why we have all this code, surely not because we have nothing
> better to do than dreaming up a variety of weird ways to figure out that
> frequency.

Thank you for the explanation.

> Widening the error window here is clearly a hack. As you have to supply
> a valid number there, why not just provide the frequency itself
> on the command line? That would at least make most sense and would avoid
> using completely wrong data in the early boot stage.

That sounds good.
I'll assume the user is expected to provide a tsc_early_hz= parameter
so that refine_calibration_work can get better numbers while still doing
the 1% sanity check.

I'll send a patch this week.

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ade4e6ec23e0..b251169692a8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4905,6 +4905,17 @@ 
 			interruptions from clocksource watchdog are not
 			acceptable).

+	tsc_tuned_baseclk=
+			[X86,INTEL] Ignore data provided by CPUID.16h during
+			early tsc calibration. Useful when changing base clock
+			frequency (overclocking).
+			Warning: in case your system does not provide
+			alternatives to determine cpu speed (HPET, PIT, complete
+			CPUID.15h support, MSR) the kernel will fail to
+			calibrate the clocksource and local APIC.
+			Format: <bool> (1/Y/y=enabled, 0/N/n=disabled)
+			default: disabled
+
 	tsx=		[X86] Control Transactional Synchronization
 			Extensions (TSX) feature in Intel processors that
 			support TSX control.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 7e322e2daaf5..c9b638dd8f4d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -59,6 +59,17 @@  struct cyc2ns {

 static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns);

+static bool __read_mostly tsc_tuned_baseclk;
+static int __init tsc_tuned_baseclk_setup(char *buf)
+{
+	int ret = strtobool(buf, &tsc_tuned_baseclk);
+
+	if (tsc_tuned_baseclk)
+		pr_warn("tsc_tuned_baseclk: This will allow your CPU to use TSC with an overclocked base clock but your system will require some means of TSC calibration other than CPUID 16h.\n");
+	return ret;
+}
+early_param("tsc_tuned_baseclk", tsc_tuned_baseclk_setup);
+
 __always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
 {
 	int seq, idx;
@@ -654,7 +665,8 @@  unsigned long native_calibrate_tsc(void)
 	 * clock, but we can easily calculate it to a high degree of accuracy
 	 * by considering the crystal ratio and the CPU speed.
 	 */
-	if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= 0x16) {
+	if (crystal_khz == 0 && !tsc_tuned_baseclk &&
+		boot_cpu_data.cpuid_level >= 0x16) {
 		unsigned int eax_base_mhz, ebx, ecx, edx;

 		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
@@ -692,7 +704,7 @@  static unsigned long cpu_khz_from_cpuid(void)
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
 		return 0;

-	if (boot_cpu_data.cpuid_level < 0x16)
+	if (boot_cpu_data.cpuid_level < 0x16 || tsc_tuned_baseclk)
 		return 0;

 	eax_base_mhz = ebx_max_mhz = ecx_bus_mhz = edx = 0;