* [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
@ 2017-02-19  0:31 Jason Vas Dias
  2017-02-19 15:35 ` Jason Vas Dias
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-19  0:31 UTC (permalink / raw)
  To: kernel-janitors, linux-kernel, prarit

[-- Attachment #1: Type: text/plain, Size: 3499 bytes --]

I originally reported this issue on bugzilla.kernel.org as bug #194609
( https://bugzilla.kernel.org/show_bug.cgi?id=194609 ), but it was not
posted to the list.

My CPU reports 'model name' as "Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz",
has 4 physical / 8 hyperthreaded cores with a frequency scalable from
800000 to 3900000
(/sys/devices/system/cpu/cpu0/cpufreq/scaling_{min,max}_freq), and flags:

  fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
  clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp
  lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology
  nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx
  smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe
  popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb
  tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2
  smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$

CPUID:15H is available in user space, returning the integers (7, 832, 832)
in EAX:EBX:ECX, yet boot_cpu_data.cpuid_level is 13, so in detect_art() in
tsc.c Linux does not think ART is enabled, and does not set the synthesized
CPUID bit ((3*32)+10), so a program looking at /dev/cpu/0/cpuid would not
see this bit set. If an e1000 NIC card had been installed, PTP would not
be available.
Also, if the MSR TSC_ADJUST has not yet been written - and it seems to be
written nowhere else in Linux - the code will always conclude that
X86_FEATURE_ART is 0, because the CPU will always get a fault reading an
MSR that has never been written.

So the attached patch makes tsc.c set X86_FEATURE_ART correctly, and sets
TSC_ADJUST to 0 if the rdmsr gets an error. Please consider applying it to
a future Linux version.

It would be nice if user-space programs that want to use the TSC with the
rdtsc / rdtscp instructions, such as the demo program attached to the bug
report, could have confidence that Linux is actually generating the
results of clock_gettime(CLOCK_MONOTONIC_RAW, &timespec) in a predictable
way from the TSC, by looking at the /dev/cpu/0/cpuid bit ((3*32)+10)
before enabling user-space use of TSC values, so that they can correlate
TSC values with Linux clock_gettime() values.

The patch applies to the Linux v4.8 & v4.9.10 git tags, the kernels build
and run, and the demo program produces results like:

$ ./ttsc1
has tsc: 1 constant: 1
832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
Hooray! TSC is enabled with KHz: 2893300
ts2 - ts1: 261
ts3 - ts2: 211 ns1: 0.000000146 ns2: 0.000001629
ts3 - ts2: 27 ns1: 0.000000168
ts3 - ts2: 20 ns1: 0.000000147
ts3 - ts2: 14 ns1: 0.000000152
ts3 - ts2: 15 ns1: 0.000000151
ts3 - ts2: 15 ns1: 0.000000153
ts3 - ts2: 15 ns1: 0.000000150
ts3 - ts2: 20 ns1: 0.000000148
ts3 - ts2: 19 ns1: 0.000000164
ts3 - ts2: 19 ns1: 0.000000164
ts3 - ts2: 19 ns1: 0.000000160
t1 - t0: 52901 - ns2: 0.000053951

The 'ts3 - ts2' value is the number of nanoseconds measured by successive
calls to 'rdtscp'; the 'ns1' value is the number of nanoseconds (shown as
decimal seconds) measured by
clock_gettime(CLOCK_MONOTONIC_RAW, &ts2) - clock_gettime(CLOCK_MONOTONIC_RAW, &ts1),
casting each {ts.tv_sec, ts.tv_nsec} to a 128-bit integer.
It shows that a user-space program can read the TSC with a latency of
~20ns, but can only measure times >= ~140ns using Linux clock_gettime()
on this CPU.

[-- Attachment #2: x86_kernel_tsc-bz194609.patch --]
[-- Type: application/octet-stream, Size: 2993 bytes --]

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 46b2f41..f76cca8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1030,6 +1030,7 @@ core_initcall(cpufreq_register_tsc_scaling);
 #endif /* CONFIG_CPU_FREQ */
 
 #define ART_CPUID_LEAF (0x15)
+#define MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART (0x80000008)
 #define ART_MIN_DENOMINATOR (1)
 
@@ -1038,24 +1039,43 @@ core_initcall(cpufreq_register_tsc_scaling);
  */
 static void detect_art(void)
 {
-	unsigned int unused[2];
-
-	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
-		return;
-
-	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
-	      &art_to_tsc_numerator, unused, unused+1);
-
+	unsigned int v[2];
+
+	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF) {
+		if (boot_cpu_data.extended_cpuid_level >= MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART) {
+			pr_info("Would normally not use ART - cpuid_level:%d < %d - but extended_cpuid_level is: %x, so probing for ART support.\n",
+				boot_cpu_data.cpuid_level, ART_CPUID_LEAF,
+				boot_cpu_data.extended_cpuid_level);
+		} else
+			return;
+	}
+
+	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
+	      &art_to_tsc_numerator, v, v+1);
+
 	/* Don't enable ART in a VM, non-stop TSC required */
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
-	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
-	    art_to_tsc_denominator < ART_MIN_DENOMINATOR)
-		return;
-
-	if (rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))
-		return;
-
+	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+	    art_to_tsc_denominator < ART_MIN_DENOMINATOR) {
+		pr_info("Not using Intel ART for TSC - HYPERVISOR:%d NO NONSTOP_TSC:%d bad TSC/Crystal ratio denominator: %d.",
+			boot_cpu_has(X86_FEATURE_HYPERVISOR),
+			!boot_cpu_has(X86_FEATURE_NONSTOP_TSC),
+			art_to_tsc_denominator);
+		return;
+	}
+	/* will get fault on first read if nothing written yet */
+	if ((v[0] = rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset)) != 0) {
+		if ((v[1] = wrmsrl_safe(MSR_IA32_TSC_ADJUST, 0)) != 0) {
+			pr_info("Not using Intel ART for TSC - failed to initialize TSC_ADJUST: %d %d.\n",
+				v[0], v[1]);
+			return;
+		} else {
+			art_to_tsc_offset = 0; /* perhaps initialize to -1 * current rdtsc value ? */
+			pr_info("Using Intel ART for TSC - TSC_ADJUST initialized to %llu.\n",
+				art_to_tsc_offset);
+		}
+	}
+	/* Make this sticky over multiple CPU init calls */
+	pr_info("Using Intel Always Running Timer (ART) feature %x for TSC on all CPUs - TSC/CCC: %d/%d offset: %llu.\n",
+		X86_FEATURE_ART, art_to_tsc_numerator,
+		art_to_tsc_denominator, art_to_tsc_offset);
 	setup_force_cpu_cap(X86_FEATURE_ART);
 }

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-19  0:31 [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 Jason Vas Dias
@ 2017-02-19 15:35 ` Jason Vas Dias
  2017-02-20 21:49   ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-19 15:35 UTC (permalink / raw)
  To: kernel-janitors, linux-kernel, Thomas Gleixner, Ingo Molnar,
      H. Peter Anvin, Prarit Bhargava, x86

[-- Attachment #1: Type: text/plain, Size: 7467 bytes --]

This patch makes tsc.c set X86_FEATURE_ART and set up the TSC_ADJUST MSR
correctly on my "i7-4910MQ" CPU, which reports
( boot_cpu_data.cpuid_level == 0x13 &&
  boot_cpu_data.extended_cpuid_level == 0x80000008 ),
so the code didn't think it supported CPUID:15H, but it does.

Patch:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 46b2f41..f76cca8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1030,6 +1030,7 @@ core_initcall(cpufreq_register_tsc_scaling);
 #endif /* CONFIG_CPU_FREQ */
 
 #define ART_CPUID_LEAF (0x15)
+#define MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART (0x80000008)
 #define ART_MIN_DENOMINATOR (1)
 
@@ -1038,24 +1039,43 @@ core_initcall(cpufreq_register_tsc_scaling);
  */
 static void detect_art(void)
 {
-	unsigned int unused[2];
-
-	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
-		return;
-
-	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
-	      &art_to_tsc_numerator, unused, unused+1);
-
+	unsigned int v[2];
+
+	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF) {
+		if (boot_cpu_data.extended_cpuid_level >= MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART) {
+			pr_info("Would normally not use ART - cpuid_level:%d < %d - but extended_cpuid_level is: %x, so probing for ART support.\n",
+				boot_cpu_data.cpuid_level, ART_CPUID_LEAF,
+				boot_cpu_data.extended_cpuid_level);
+		} else
+			return;
+	}
+
+	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
+	      &art_to_tsc_numerator, v, v+1);
+
 	/* Don't enable ART in a VM, non-stop TSC required */
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
-	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
-	    art_to_tsc_denominator < ART_MIN_DENOMINATOR)
-		return;
-
-	if (rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))
-		return;
-
+	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+	    art_to_tsc_denominator < ART_MIN_DENOMINATOR) {
+		pr_info("Not using Intel ART for TSC - HYPERVISOR:%d NO NONSTOP_TSC:%d bad TSC/Crystal ratio denominator: %d.",
+			boot_cpu_has(X86_FEATURE_HYPERVISOR),
+			!boot_cpu_has(X86_FEATURE_NONSTOP_TSC),
+			art_to_tsc_denominator);
+		return;
+	}
+	/* will get fault on first read if nothing written yet */
+	if ((v[0] = rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset)) != 0) {
+		if ((v[1] = wrmsrl_safe(MSR_IA32_TSC_ADJUST, 0)) != 0) {
+			pr_info("Not using Intel ART for TSC - failed to initialize TSC_ADJUST: %d %d.\n",
+				v[0], v[1]);
+			return;
+		} else {
+			art_to_tsc_offset = 0; /* perhaps initialize to -1 * current rdtsc value ? */
+			pr_info("Using Intel ART for TSC - TSC_ADJUST initialized to %llu.\n",
+				art_to_tsc_offset);
+		}
+	}
+	/* Make this sticky over multiple CPU init calls */
+	pr_info("Using Intel Always Running Timer (ART) feature %x for TSC on all CPUs - TSC/CCC: %d/%d offset: %llu.\n",
+		X86_FEATURE_ART, art_to_tsc_numerator,
+		art_to_tsc_denominator, art_to_tsc_offset);
 	setup_force_cpu_cap(X86_FEATURE_ART);
 }

I originally reported this issue on bugzilla.kernel.org as bug #194609
( https://bugzilla.kernel.org/show_bug.cgi?id=194609 ), but it was not
posted to the list; I then posted it to the list, but Julia Lawall
<julia.lawall@lip6.fr> kindly suggested I should re-post with the patch
inline and include extra recipients, including the last person to modify
tsc.c (Prarit), so am doing so.
My CPU reports 'model name' as "Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz",
has 4 physical / 8 hyperthreaded cores with a frequency scalable from
800000 to 3900000
(/sys/devices/system/cpu/cpu0/cpufreq/scaling_{min,max}_freq), and flags:

  fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
  clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp
  lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology
  nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx
  smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe
  popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb
  tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2
  smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$

CPUID:15H is available in user space, returning the integers (7, 832, 832)
in EAX:EBX:ECX, yet boot_cpu_data.cpuid_level is 13, so in detect_art() in
tsc.c Linux does not think ART is enabled, and does not set the synthesized
CPUID bit ((3*32)+10), so a program looking at /dev/cpu/0/cpuid would not
see this bit set. If an e1000 NIC card had been installed, PTP would not
be available.

Also, if the MSR TSC_ADJUST has not yet been written - and it seems to be
written nowhere else in Linux - the code will always conclude that
X86_FEATURE_ART is 0, because the CPU will always get a fault reading an
MSR that has never been written.

So the attached patch makes tsc.c set X86_FEATURE_ART correctly, and sets
TSC_ADJUST to 0 if the rdmsr gets an error. Please consider applying it to
a future Linux version.

It would be nice if user-space programs that want to use the TSC with the
rdtsc / rdtscp instructions, such as the demo program attached to the bug
report, could have confidence that Linux is actually generating the
results of clock_gettime(CLOCK_MONOTONIC_RAW, &timespec) in a predictable
way from the TSC, by looking at the /dev/cpu/0/cpuid bit ((3*32)+10)
before enabling user-space use of TSC values, so that they can correlate
TSC values with Linux clock_gettime() values.

The patch applies to the Linux v4.8 & v4.9.10 git tags, the kernels build
and run, and the demo program produces results like:

$ ./ttsc1
has tsc: 1 constant: 1
832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
Hooray! TSC is enabled with KHz: 2893300
ts2 - ts1: 261
ts3 - ts2: 211 ns1: 0.000000146 ns2: 0.000001629
ts3 - ts2: 27 ns1: 0.000000168
ts3 - ts2: 20 ns1: 0.000000147
ts3 - ts2: 14 ns1: 0.000000152
ts3 - ts2: 15 ns1: 0.000000151
ts3 - ts2: 15 ns1: 0.000000153
ts3 - ts2: 15 ns1: 0.000000150
ts3 - ts2: 20 ns1: 0.000000148
ts3 - ts2: 19 ns1: 0.000000164
ts3 - ts2: 19 ns1: 0.000000164
ts3 - ts2: 19 ns1: 0.000000160
t1 - t0: 52901 - ns2: 0.000053951

The 'ts3 - ts2' value is the number of nanoseconds measured by successive
calls to 'rdtscp'; the 'ns1' value is the number of nanoseconds (shown as
decimal seconds) measured by
clock_gettime(CLOCK_MONOTONIC_RAW, &ts2) - clock_gettime(CLOCK_MONOTONIC_RAW, &ts1),
casting each {ts.tv_sec, ts.tv_nsec} to a 128-bit integer.

It shows that a user-space program can read the TSC with a latency of
~20ns, but can only measure times >= ~140ns using Linux clock_gettime()
on this CPU.

Please make Linux provide some help to programs that want to use the TSC
from user space by applying the patch, so they can determine with
confidence that Linux is supplying TSC values with a predictable
conversion function in response to clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
system calls.
It would also be nice if Linux actually exported the refined calibrated
TSC frequency (tsc_khz) in sysfs, but that's another story.

Best Regards,
Jason

[-- Attachment #2: x86_kernel_tsc-bz194609.patch --]
[-- Type: application/octet-stream, Size: 2993 bytes --]
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-19 15:35 ` Jason Vas Dias
@ 2017-02-20 21:49   ` Thomas Gleixner
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2017-02-20 21:49 UTC (permalink / raw)
  To: Jason Vas Dias
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
      Prarit Bhargava, x86

On Sun, 19 Feb 2017, Jason Vas Dias wrote:
> CPUID:15H is available in user-space, returning the integers : ( 7,
> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
> in detect_art() in tsc.c,

By some definition of available. You can feed CPUID random leaf numbers
and it will return something, usually the value of the last valid CPUID
leaf, which is 13 on your CPU. A similar CPU model has

  0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000

i.e. 7, 832, 832, 0

Looks familiar, right?

You can verify that with 'cpuid -1 -r' on your machine.

> Linux does not think ART is enabled, and does not set the synthesized
> CPUID + ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid
> would not see this bit set .

Rightfully so. This is a Haswell Core model.

> if an e1000 NIC card had been installed, PTP would not be available.

PTP is independent of the ART kernel feature. ART just provides enhanced
PTP features. You are confusing things here.

The ART feature as the kernel sees it is a hardware extension which feeds
the ART clock to peripherals for timestamping and time correlation
purposes. The ratio between ART and TSC is described by CPUID leaf 0x15
so the kernel can make use of that correlation, e.g. for enhanced PTP
accuracy.

It's correct that the NONSTOP_TSC feature depends on the availability of
ART, but that has nothing to do with the feature bit, which solely
describes the ratio between TSC and the ART frequency which is exposed to
peripherals. That frequency is not necessarily the real ART frequency.

> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be
> nowhere else in Linux, the code will always think X86_FEATURE_ART is 0
> because the CPU will always get a fault reading the MSR since it has
> never been written.

Huch? If an access to the TSC ADJUST MSR faults, then something is really
wrong. And writing it unconditionally to 0 is not going to happen. 4.10
has new code which utilizes the TSC_ADJUST MSR.

> It would be nice for user-space programs that want to use the TSC with
> rdtsc / rdtscp instructions, such as the demo program attached to the
> bug report,
> could have confidence that Linux is actually generating the results of
> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
> in a predictable way from the TSC by looking at the
> /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
> use of TSC values, so that they can correlate TSC values with linux
> clock_gettime() values.

What has ART to do with correct CLOCK_MONOTONIC_RAW values?

Nothing at all, really.

The kernel makes use of the proper information values already.

The TSC frequency is determined from:

 1) CPUID(0x16) if available
 2) MSRs if available
 3) By calibration against a known clock

If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values
are correct whether that machine has ART exposed to peripherals or not.

> has tsc: 1 constant: 1
> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1

And that voodoo math tells us what? That you found a way to correlate
CPUID(0xd) to the TSC frequency on that machine.

Now I'm curious how you do that on this other machine which returns for
cpuid(15): 1, 1, 1

You can't because all of this is completely wrong.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-20 21:49 ` Thomas Gleixner
@ 2017-02-21 23:39   ` Jason Vas Dias
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-21 23:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
      Prarit Bhargava, x86

Thank you for enlightening me - I was just having a hard time believing
that Intel would ship a chip that features a monotonic, fixed-frequency
timestamp counter without specifying, in documentation, on-chip, or in
ACPI, what precisely that hard-wired frequency is, but I now know that to
be the case for the unfortunate i7-4910MQ. The CPU does assert
CPUID:80000007[8] (InvariantTSC), which is difficult to reconcile with
this statement in the SDM:

  17.16.4 Invariant Time-Keeping

  The invariant TSC is based on the invariant timekeeping hardware
  (called Always Running Timer or ART), that runs at the core crystal
  clock frequency. The ratio defined by CPUID leaf 15H expresses the
  frequency relationship between the ART hardware and TSC.

  If CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1,
  the following linearity relationship holds between TSC and the ART
  hardware:

    TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K

  Where 'K' is an offset that can be adjusted by a privileged agent.
  When ART hardware is reset, both invariant TSC and K are also reset.

So I'm just trying to figure out what CPUID.15H:EBX[31:0] and
CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) that the
"Nominal TSC Frequency" formulae in the manual must apply to all CPUs
with InvariantTSC.

Do I understand correctly that, since I do have InvariantTSC, the
TSC_Value is in fact calculated according to the above formula, but with
a "hidden" ART value, core crystal clock frequency, and ratio to the TSC
frequency?
It was obvious this nominal TSC frequency had nothing to do with the
actual TSC frequency used by Linux, which is 'tsc_khz'. I guess wishful
thinking led me to believe CPUID:15H was actually supported somehow,
because I thought InvariantTSC meant the chip had ART hardware.

I do strongly suggest that Linux export its calibrated TSC kHz somewhere
to user space. I think the best long-term solution would be to allow
programs to somehow read the TSC without invoking
clock_gettime(CLOCK_MONOTONIC_RAW, &ts) and having to enter the kernel,
which incurs an overhead of > 120ns on my system.

Couldn't Linux export its 'tsc_khz' and / or 'clocksource->mult' and
'clocksource->shift' values to sysfs somehow? For instance, only if the
'current_clocksource' is 'tsc', these values could be exported as:

  /sys/devices/system/clocksource/clocksource0/shift
  /sys/devices/system/clocksource/clocksource0/mult
  /sys/devices/system/clocksource/clocksource0/freq

So user-space programs could know that the value returned by
clock_gettime(CLOCK_MONOTONIC_RAW) would be

  { .tv_sec  = ((rdtsc() * mult) >> shift) / NSEC_PER_SEC,
    .tv_nsec = ((rdtsc() * mult) >> shift) % NSEC_PER_SEC }

and that this represents ticks of period (1.0 / (freq * 1000)) seconds.
That would save user-space programs from having to learn 'tsc_khz' by
parsing the 'Refined TSC' frequency from log files, or by examining the
running kernel with objdump to obtain this value and figure out 'mult'
and 'shift' themselves.

And why not a /sys/devices/system/clocksource/clocksource0/value file
that actually prints this ((rdtsc() * mult) >> shift) expression as a
long integer? And perhaps a /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
file that prints out the number of real-time nanoseconds since the
contents of the existing /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
files, using the current TSC value?
To read the rtc0/{date,time} files is already faster for scripts than
entering the kernel to call clock_gettime(CLOCK_REALTIME, &ts) and
converting to an integer. I will work on developing a patch to this
effect if no-one else is.

Also, am I right in assuming that the maximum granularity of the
real-time clock on my system is 1/64th of a second? :

$ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
64

This is the maximum granularity that can be stored in CMOS, not returned
by the TSC? Couldn't we have something similar that gave an accurate idea
of the TSC frequency and the precise formula applied to the TSC value to
get the clock_gettime(CLOCK_MONOTONIC_RAW) value?

Regards,
Jason

This code does produce good timestamps with a latency of ~20ns that
correlate well with clock_gettime(CLOCK_MONOTONIC_RAW, &ts) values, but
it depends on a global variable that is initialized to the 'tsc_khz'
value computed by the running kernel, parsed from objdump /proc/kcore
output:

static inline __attribute__((always_inline)) U64_t IA64_tsc_now(void)
{
	if (!(_ia64_invariant_tsc_enabled ||
	      ((_cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL, NULL)))) {
		fprintf(stderr, __FILE__ ":%d:(%s): must be called with invariant TSC enabled.\n",
			__LINE__, __func__);
		return 0;
	}
	U32_t tsc_hi, tsc_lo;
	register UL_t tsc;
	asm volatile
	( "rdtscp\n\t"
	  "mov %%edx, %0\n\t"
	  "mov %%eax, %1\n\t"
	  "mov %%ecx, %2\n\t"
	  : "=m" (tsc_hi), "=m" (tsc_lo), "=m" (_ia64_tsc_user_cpu)
	  : : "%eax", "%ecx", "%edx"
	);
	tsc = (((UL_t)tsc_hi) << 32) | ((UL_t)tsc_lo);
	return tsc;
}

__thread U64_t _ia64_first_tsc = 0xffffffffffffffffUL;

static inline __attribute__((always_inline)) U64_t IA64_tsc_ticks_since_start(void)
{
	if (_ia64_first_tsc == 0xffffffffffffffffUL) {
		_ia64_first_tsc = IA64_tsc_now();
		return 0;
	}
	return IA64_tsc_now() - _ia64_first_tsc;
}

static inline __attribute__((always_inline)) void
ia64_tsc_calc_mult_shift(register U32_t *mult, register U32_t *shift)
{
	/* Paraphrases Linux clocksource.c's clocks_calc_mult_shift()
	 * function: calculates the second + nanosecond mult + shift the
	 * same way Linux does. We want to be compatible with what Linux
	 * returns in struct timespec ts after a call to
	 * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
	 */
	const U32_t scale = 1000U;
	register U32_t from = IA64_tsc_khz();
	register U32_t to = NSEC_PER_SEC / scale;
	register U64_t sec = (~0UL / from) / scale;
	sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
	register U64_t maxsec = sec * scale;
	UL_t tmp;
	U32_t sft, sftacc = 32;
	/*
	 * Calculate the shift factor which is limiting the conversion
	 * range:
	 */
	tmp = (maxsec * from) >> 32;
	while (tmp) {
		tmp >>= 1;
		sftacc--;
	}
	/*
	 * Find the conversion shift/mult pair which has the best
	 * accuracy and fits the maxsec conversion range:
	 */
	for (sft = 32; sft > 0; sft--) {
		tmp = ((UL_t)to) << sft;
		tmp += from / 2;
		tmp = tmp / from;
		if ((tmp >> sftacc) == 0)
			break;
	}
	*mult = tmp;
	*shift = sft;
}

__thread U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift = ~0U;

static inline __attribute__((always_inline)) U64_t IA64_s_ns_since_start(void)
{
	if ((_ia64_tsc_mult == ~0U) || (_ia64_tsc_shift == ~0U))
		ia64_tsc_calc_mult_shift(&_ia64_tsc_mult, &_ia64_tsc_shift);
	register U64_t cycles = IA64_tsc_ticks_since_start();
	register U64_t ns = (cycles * ((UL_t)_ia64_tsc_mult)) >> _ia64_tsc_shift;
	return (((ns / NSEC_PER_SEC) & 0xffffffffUL) << 32)
	     | ((ns % NSEC_PER_SEC) & 0x3fffffffUL);
	/* Yes, we are purposefully ignoring durations of more than
	 * 4.2 billion seconds here! */
}

I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
somehow; then user-space libraries could have more confidence in using
'rdtsc' or 'rdtscp' if Linux's current_clocksource is 'tsc'.

Regards,
Jason

On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>
>> CPUID:15H is available in user-space, returning the integers : ( 7,
>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>> in detect_art() in tsc.c,
>
> By some definition of available.
You can feed CPUID random leaf numbers and > it will return something, usually the value of the last valid CPUID leaf, > which is 13 on your CPU. A similar CPU model has > > 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 > edx=0x00000000 > > i.e. 7, 832, 832, 0 > > Looks familiar, right? > > You can verify that with 'cpuid -1 -r' on your machine. > >> Linux does not think ART is enabled, and does not set the synthesized >> CPUID + >> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not >> see this bit set . > > Rightfully so. This is a Haswell Core model. > >> if an e1000 NIC card had been installed, PTP would not be available. > > PTP is independent of the ART kernel feature . ART just provides enhanced > PTP features. You are confusing things here. > > The ART feature as the kernel sees it is a hardware extension which feeds > the ART clock to peripherals for timestamping and time correlation > purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 so > the kernel can make use of that correlation, e.g. for enhanced PTP > accuracy. > > It's correct, that the NONSTOP_TSC feature depends on the availability of > ART, but that has nothing to do with the feature bit, which solely > describes the ratio between TSC and the ART frequency which is exposed to > peripherals. That frequency is not necessarily the real ART frequency. > >> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be >> nowhere else in Linux, the code will always think X86_FEATURE_ART is 0 >> because the CPU will always get a fault reading the MSR since it has >> never been written. > > Huch? If an access to the TSC ADJUST MSR faults, then something is really > wrong. And writing it unconditionally to 0 is not going to happen. 4.10 has > new code which utilizes the TSC_ADJUST MSR. 
> >> It would be nice for user-space programs that want to use the TSC with >> rdtsc / rdtscp instructions, such as the demo program attached to the >> bug report, >> could have confidence that Linux is actually generating the results of >> clock_gettime(CLOCK_MONOTONIC_RAW, ×pec) >> in a predictable way from the TSC by looking at the >> /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space >> use of TSC values, so that they can correlate TSC values with linux >> clock_gettime() values. > > What has ART to do with correct CLOCK_MONOTONIC_RAW values? > > Nothing at all, really. > > The kernel makes use of the proper information values already. > > The TSC frequency is determined from: > > 1) CPUID(0x16) if available > 2) MSRs if available > 3) By calibration against a known clock > > If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values are > correct whether that machine has ART exposed to peripherals or not. > >> has tsc: 1 constant: 1 >> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1 > > And that voodoo math tells us what? That you found a way to correlate > CPUID(0xd) to the TSC frequency on that machine. > > Now I'm curious how you do that on this other machine which returns for > cpuid(15): 1, 1, 1 > > You can't because all of this is completely wrong. > > Thanks, > > tglx > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 @ 2017-02-21 23:39 ` Jason Vas Dias 0 siblings, 0 replies; 17+ messages in thread From: Jason Vas Dias @ 2017-02-21 23:39 UTC (permalink / raw) To: Thomas Gleixner Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin, Prarit Bhargava, x86 Thank You for enlightening me - I was just having a hard time believing that Intel would ship a chip that features a monotonic, fixed frequency timestamp counter without specifying in either documentation or on-chip or in ACPI what precisely that hard-wired frequency is, but I now know that to be the case for the unfortunate i7-4910MQ - I mean, how can the CPU assert CPUID:80000007[8] ( InvariantTSC ), which it does, when that is so difficult to reconcile with the statement in the SDM :
17.16.4 Invariant Time-Keeping
The invariant TSC is based on the invariant timekeeping hardware (called Always Running Timer or ART), that runs at the core crystal clock frequency. The ratio defined by CPUID leaf 15H expresses the frequency relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity relationship holds between TSC and the ART hardware:
TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] ) / CPUID.15H:EAX[31:0] + K
Where 'K' is an offset that can be adjusted by a privileged agent*2. When ART hardware is reset, both invariant TSC and K are also reset.
So I'm just trying to figure out what CPUID.15H:EBX[31:0] and CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) that the "Nominal TSC Frequency" formulae in the manual must apply to all CPUs with InvariantTSC . Do I understand correctly , that since I do have InvariantTSC , the TSC_Value is in fact calculated according to the above formula, but with a "hidden" ART Value, & Core Crystal Clock frequency & its ratio to TSC frequency ? 
It was obvious this nominal TSC Frequency had nothing to do with the actual TSC frequency used by Linux, which is 'tsc_khz' . I guess wishful thinking led me to believe CPUID:15h was actually supported somehow , because I thought InvariantTSC meant it had ART hardware . I do strongly suggest that Linux exports its calibrated TSC kHz somewhere to user space . I think the best long-term solution would be to allow programs to somehow read the TSC without invoking clock_gettime(CLOCK_MONOTONIC_RAW,&ts), & having to enter the kernel, which incurs an overhead of > 120ns on my system . Couldn't Linux export its 'tsc_khz' and / or 'clocksource->mult' and 'clocksource->shift' values to /sysfs somehow ? For instance , only if the 'current_clocksource' is 'tsc', then these values could be exported as:
/sys/devices/system/clocksource/clocksource0/shift
/sys/devices/system/clocksource/clocksource0/mult
/sys/devices/system/clocksource/clocksource0/freq
So user-space programs could know that the value returned by clock_gettime(CLOCK_MONOTONIC_RAW) would be
{ .tv_sec = ( ( rdtsc() * mult ) >> shift ) >> 32,
  .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U
}
and that represents ticks of period (1.0 / ( freq * 1000 )) S. That would save user-space programs from having to know 'tsc_khz' by parsing the 'Refined TSC' frequency from log files or by examining the running kernel with objdump to obtain this value & figure out 'mult' & 'shift' themselves. And why not a
/sys/devices/system/clocksource/clocksource0/value
file that actually prints this ( ( rdtsc() * mult ) >> shift ) expression as a long integer? And perhaps a
/sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
file that actually prints out the number of real-time nano-seconds since the contents of the existing /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch} files using the current TSC value? 
To read the rtc0/{date,time} files is already faster than entering the kernel to call clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts. I will work on developing a patch to this effect if no-one else is. Also, am I right in assuming that the maximum granularity of the real-time clock on my system is 1/64th of a second ? :
$ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
64
This is the maximum granularity that can be stored in CMOS , not returned by TSC? Couldn't we have something similar that gave an accurate idea of TSC frequency and the precise formula applied to TSC value to get clock_gettime(CLOCK_MONOTONIC_RAW) value ?

Regards, Jason

This code does produce good timestamps with a latency of @20ns that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts) values, but it depends on a global variable that is initialized to the 'tsc_khz' value computed by the running kernel, parsed from objdump /proc/kcore output :

static inline __attribute__((always_inline))
U64_t
IA64_tsc_now()
{ if(!( _ia64_invariant_tsc_enabled
      ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL))
      )
    )
  { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant TSC enabled.\n", __LINE__, __func__);
    return 0;
  }
  U32_t tsc_hi, tsc_lo;
  register UL_t tsc;
  asm volatile
  ( "rdtscp\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "mov %%ecx, %2\n\t"
    : "=m" (tsc_hi) ,
      "=m" (tsc_lo) ,
      "=m" (_ia64_tsc_user_cpu) :
    : "%eax","%ecx","%edx"
  );
  tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
  return tsc;
}

__thread
U64_t _ia64_first_tsc = 0xffffffffffffffffUL;

static inline __attribute__((always_inline))
U64_t IA64_tsc_ticks_since_start()
{ if(_ia64_first_tsc == 0xffffffffffffffffUL)
  { _ia64_first_tsc = IA64_tsc_now();
    return 0;
  }
  return (IA64_tsc_now() - _ia64_first_tsc) ;
}

static inline __attribute__((always_inline))
void
ia64_tsc_calc_mult_shift
( register U32_t *mult,
  register U32_t *shift
)
{ /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() function:
   * calculates second + nanosecond mult + shift in same way linux does.
   * we want to be compatible with what linux returns in struct timespec ts
   * after call to clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
   */
  const U32_t scale=1000U;
  register U32_t from= IA64_tsc_khz();
  register U32_t to = NSEC_PER_SEC / scale;
  register U64_t sec = ( ~0UL / from ) / scale;
  sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
  register U64_t maxsec = sec * scale;
  UL_t tmp;
  U32_t sft, sftacc=32;
  /*
   * Calculate the shift factor which is limiting the conversion
   * range:
   */
  tmp = (maxsec * from) >> 32;
  while (tmp)
  { tmp >>=1;
    sftacc--;
  }
  /*
   * Find the conversion shift/mult pair which has the best
   * accuracy and fits the maxsec conversion range:
   */
  for (sft = 32; sft > 0; sft--)
  { tmp = ((UL_t) to) << sft;
    tmp += from / 2;
    tmp = tmp / from;
    if ((tmp >> sftacc) == 0)
      break;
  }
  *mult = tmp;
  *shift = sft;
}

__thread
U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;

static inline __attribute__((always_inline))
U64_t IA64_s_ns_since_start()
{ if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
    ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
  register U64_t cycles = IA64_tsc_ticks_since_start();
  register U64_t ns = ((cycles *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
  return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns % NSEC_PER_SEC)&0x3fffffffUL) );
  /* Yes, we are purposefully ignoring durations of more than 4.2 billion seconds here! */
}

I think Linux should export the 'tsc_khz', 'mult' and 'shift' values somehow, then user-space libraries could have more confidence in using 'rdtsc' or 'rdtscp' if Linux's current_clocksource is 'tsc'.

Regards, Jason

On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote: > On Sun, 19 Feb 2017, Jason Vas Dias wrote: > >> CPUID:15H is available in user-space, returning the integers : ( 7, >> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so >> in detect_art() in tsc.c, > > By some definition of available. 
You can feed CPUID random leaf numbers and > it will return something, usually the value of the last valid CPUID leaf, > which is 13 on your CPU. A similar CPU model has > > 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 > edx=0x00000000 > > i.e. 7, 832, 832, 0 > > Looks familiar, right? > > You can verify that with 'cpuid -1 -r' on your machine. > >> Linux does not think ART is enabled, and does not set the synthesized >> CPUID + >> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not >> see this bit set . > > Rightfully so. This is a Haswell Core model. > >> if an e1000 NIC card had been installed, PTP would not be available. > > PTP is independent of the ART kernel feature . ART just provides enhanced > PTP features. You are confusing things here. > > The ART feature as the kernel sees it is a hardware extension which feeds > the ART clock to peripherals for timestamping and time correlation > purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 so > the kernel can make use of that correlation, e.g. for enhanced PTP > accuracy. > > It's correct, that the NONSTOP_TSC feature depends on the availability of > ART, but that has nothing to do with the feature bit, which solely > describes the ratio between TSC and the ART frequency which is exposed to > peripherals. That frequency is not necessarily the real ART frequency. > >> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be >> nowhere else in Linux, the code will always think X86_FEATURE_ART is 0 >> because the CPU will always get a fault reading the MSR since it has >> never been written. > > Huch? If an access to the TSC ADJUST MSR faults, then something is really > wrong. And writing it unconditionally to 0 is not going to happen. 4.10 has > new code which utilizes the TSC_ADJUST MSR. 
> >> It would be nice for user-space programs that want to use the TSC with >> rdtsc / rdtscp instructions, such as the demo program attached to the >> bug report, >> could have confidence that Linux is actually generating the results of >> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec) >> in a predictable way from the TSC by looking at the >> /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space >> use of TSC values, so that they can correlate TSC values with linux >> clock_gettime() values. > > What has ART to do with correct CLOCK_MONOTONIC_RAW values? > > Nothing at all, really. > > The kernel makes use of the proper information values already. > > The TSC frequency is determined from: > > 1) CPUID(0x16) if available > 2) MSRs if available > 3) By calibration against a known clock > > If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values are > correct whether that machine has ART exposed to peripherals or not. > >> has tsc: 1 constant: 1 >> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1 > > And that voodoo math tells us what? That you found a way to correlate > CPUID(0xd) to the TSC frequency on that machine. > > Now I'm curious how you do that on this other machine which returns for > cpuid(15): 1, 1, 1 > > You can't because all of this is completely wrong. > > Thanks, > > tglx > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 2017-02-21 23:39 ` Jason Vas Dias (?) @ 2017-02-22 16:07 ` Jason Vas Dias 2017-02-22 16:18 ` Jason Vas Dias -1 siblings, 1 reply; 17+ messages in thread From: Jason Vas Dias @ 2017-02-22 16:07 UTC (permalink / raw) To: Thomas Gleixner Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin, Prarit Bhargava, x86 [-- Attachment #1: Type: text/plain, Size: 13010 bytes --] RE: >> 4.10 has new code which utilizes the TSC_ADJUST MSR. I just built an unpatched linux v4.10 with tglx's TSC improvements - much else improved in this kernel (like iwlwifi) - thanks! I have attached an updated version of the test program which doesn't print the bogus "Nominal TSC Frequency" (the previous version printed it, but equally ignored it). The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : $ uname -r 4.10.0 $ ./ttsc1 max_extended_leaf: 80000008 has tsc: 1 constant: 1 Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599 ts3 - ts2: 178 ns1: 0.000000592 ts3 - ts2: 14 ns1: 0.000000577 ts3 - ts2: 14 ns1: 0.000000651 ts3 - ts2: 17 ns1: 0.000000625 ts3 - ts2: 17 ns1: 0.000000677 ts3 - ts2: 17 ns1: 0.000000626 ts3 - ts2: 17 ns1: 0.000000627 ts3 - ts2: 17 ns1: 0.000000627 ts3 - ts2: 18 ns1: 0.000000655 ts3 - ts2: 17 ns1: 0.000000631 t1 - t0: 89067 - ns2: 0.000091411 I think this is because under Linux 4.8, the CPU got a fault every time it read the TSC_ADJUST MSR. But user programs wanting to use the TSC and correlate its value to clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above program still have to dig the TSC frequency value out of the kernel with objdump - this was really the point of the bug #194609. I would still like to investigate exporting 'tsc_khz' & 'mult' + 'shift' values via sysfs. Regards, Jason. 
On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: > Thank You for enlightening me - > > I was just having a hard time believing that Intel would ship a chip > that features a monotonic, fixed frequency timestamp counter > without specifying in either documentation or on-chip or in ACPI what > precisely that hard-wired frequency is, but I now know that to > be the case for the unfortunate i7-4910MQ - I mean, how can the CPU > assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is > difficult to reconcile with the statement in the SDM : > 17.16.4 Invariant Time-Keeping > The invariant TSC is based on the invariant timekeeping hardware > (called Always Running Timer or ART), that runs at the core crystal > clock > frequency. The ratio defined by CPUID leaf 15H expresses the frequency > relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] != > 0 > and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity > relationship holds between TSC and the ART hardware: > TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] ) > / CPUID.15H:EAX[31:0] + K > Where 'K' is an offset that can be adjusted by a privileged agent*2. > When ART hardware is reset, both invariant TSC and K are also reset. > > So I'm just trying to figure out what CPUID.15H:EBX[31:0] and > CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) > that > the "Nominal TSC Frequency" formulae in the manul must apply to all > CPUs with InvariantTSC . > > Do I understand correctly , that since I do have InvariantTSC , the > TSC_Value is in fact calculated according to the above formula, but with > a "hidden" ART Value, & Core Crystal Clock frequency & its ratio to > TSC frequency ? > It was obvious this nominal TSC Frequency had nothing to do with the > actual TSC frequency used by Linux, which is 'tsc_khz' . > I guess wishful thinking led me to believe CPUID:15h was actually > supported somehow , because I thought InvariantTSC meant it had ART > hardware . 
> > I do strongly suggest that Linux exports its calibrated TSC Khz > somewhere to user > space . > > I think the best long-term solution would be to allow programs to > somehow read the TSC without invoking > clock_gettime(CLOCK_MONOTONIC_RAW,&ts), & > having to enter the kernel, which incurs an overhead of > 120ns on my system > . > > > Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and > 'clocksource->shift' values to /sysfs somehow ? > > For instance , only if the 'current_clocksource' is 'tsc', then these > values could be exported as: > /sys/devices/system/clocksource/clocksource0/shift > /sys/devices/system/clocksource/clocksource0/mult > /sys/devices/system/clocksource/clocksource0/freq > > So user-space programs could know that the value returned by > clock_gettime(CLOCK_MONOTONIC_RAW) > would be > { .tv_sec = ( ( rdtsc() * mult ) >> shift ) >> 32, > , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) >> &~0U > } > and that represents ticks of period (1.0 / ( freq * 1000 )) S. > > That would save user-space programs from having to know 'tsc_khz' by > parsing the 'Refined TSC' frequency from log files or by examining the > running kernel with objdump to obtain this value & figure out 'mult' & > 'shift' themselves. > > And why not a > /sys/devices/system/clocksource/clocksource0/value > file that actually prints this ( ( rdtsc() * mult ) >> shift ) > expression as a long integer? > And perhaps a > /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds > file that actually prints out the number of real-time nano-seconds since > the > contents of the existing > /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch} > files using the current TSC value? > To read the rtc0/{date,time} files is already faster than entering the > kernel to call > clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts. > > I will work on developing a patch to this effect if no-one else is. 
> > Also, am I right in assuming that the maximum granularity of the real-time > clock > on my system is 1/64th of a second ? : > $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq > 64 > This is the maximum granularity that can be stored in CMOS , not > returned by TSC? Couldn't we have something similar that gave an > accurate idea of TSC frequency and the precise formula applied to TSC > value to get clock_gettime > (CLOCK_MONOTONIC_RAW) value ? > > Regards, > Jason > > > This code does produce good timestamps with a latency of @20ns > that correlate well with clock_gettIme(CLOCK_MONOTONIC_RAW,&ts) > values, but it depends on a global variable that is initialized to > the 'tsc_khz' value > computed by running kernel parsed from objdump /proc/kcore output : > > static inline __attribute__((always_inline)) > U64_t > IA64_tsc_now() > { if(!( _ia64_invariant_tsc_enabled > ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL)) > ) > ) > { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant > TSC enabled.\n"); > return 0; > } > U32_t tsc_hi, tsc_lo; > register UL_t tsc; > asm volatile > ( "rdtscp\n\t" > "mov %%edx, %0\n\t" > "mov %%eax, %1\n\t" > "mov %%ecx, %2\n\t" > : "=m" (tsc_hi) , > "=m" (tsc_lo) , > "=m" (_ia64_tsc_user_cpu) : > : "%eax","%ecx","%edx" > ); > tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo); > return tsc; > } > > __thread > U64_t _ia64_first_tsc = 0xffffffffffffffffUL; > > static inline __attribute__((always_inline)) > U64_t IA64_tsc_ticks_since_start() > { if(_ia64_first_tsc == 0xffffffffffffffffUL) > { _ia64_first_tsc = IA64_tsc_now(); > return 0; > } > return (IA64_tsc_now() - _ia64_first_tsc) ; > } > > static inline __attribute__((always_inline)) > void > ia64_tsc_calc_mult_shift > ( register U32_t *mult, > register U32_t *shift > ) > { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() function: > * calculates second + nanosecond mult + shift in same way linux does. 
> * we want to be compatible with what linux returns in struct > timespec ts after call to > * clock_gettime(CLOCK_MONOTONIC_RAW, &ts). > */ > const U32_t scale=1000U; > register U32_t from= IA64_tsc_khz(); > register U32_t to = NSEC_PER_SEC / scale; > register U64_t sec = ( ~0UL / from ) / scale; > sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1); > register U64_t maxsec = sec * scale; > UL_t tmp; > U32_t sft, sftacc=32; > /* > * Calculate the shift factor which is limiting the conversion > * range: > */ > tmp = (maxsec * from) >> 32; > while (tmp) > { tmp >>=1; > sftacc--; > } > /* > * Find the conversion shift/mult pair which has the best > * accuracy and fits the maxsec conversion range: > */ > for (sft = 32; sft > 0; sft--) > { tmp = ((UL_t) to) << sft; > tmp += from / 2; > tmp = tmp / from; > if ((tmp >> sftacc) == 0) > break; > } > *mult = tmp; > *shift = sft; > } > > __thread > U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U; > > static inline __attribute__((always_inline)) > U64_t IA64_s_ns_since_start() > { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) ) > ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift); > register U64_t cycles = IA64_tsc_ticks_since_start(); > register U64_t ns = ((cycles *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift); > return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns % > NSEC_PER_SEC)&0x3fffffffUL) ); > /* Yes, we are purposefully ignoring durations of more than 4.2 > billion seconds here! */ > } > > > I think Linux should export the 'tsc_khz', 'mult' and 'shift' values > somehow, > then user-space libraries could have more confidence in using 'rdtsc' > or 'rdtscp' > if Linux's current_clocksource is 'tsc'. 
> > Regards, > Jason > > > > On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote: >> On Sun, 19 Feb 2017, Jason Vas Dias wrote: >> >>> CPUID:15H is available in user-space, returning the integers : ( 7, >>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so >>> in detect_art() in tsc.c, >> >> By some definition of available. You can feed CPUID random leaf numbers >> and >> it will return something, usually the value of the last valid CPUID leaf, >> which is 13 on your CPU. A similar CPU model has >> >> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 >> edx=0x00000000 >> >> i.e. 7, 832, 832, 0 >> >> Looks familiar, right? >> >> You can verify that with 'cpuid -1 -r' on your machine. >> >>> Linux does not think ART is enabled, and does not set the synthesized >>> CPUID + >>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not >>> see this bit set . >> >> Rightfully so. This is a Haswell Core model. >> >>> if an e1000 NIC card had been installed, PTP would not be available. >> >> PTP is independent of the ART kernel feature . ART just provides enhanced >> PTP features. You are confusing things here. >> >> The ART feature as the kernel sees it is a hardware extension which feeds >> the ART clock to peripherals for timestamping and time correlation >> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 >> so >> the kernel can make use of that correlation, e.g. for enhanced PTP >> accuracy. >> >> It's correct, that the NONSTOP_TSC feature depends on the availability of >> ART, but that has nothing to do with the feature bit, which solely >> describes the ratio between TSC and the ART frequency which is exposed to >> peripherals. That frequency is not necessarily the real ART frequency. 
>> >>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be >>> nowhere else in Linux, the code will always think X86_FEATURE_ART is 0 >>> because the CPU will always get a fault reading the MSR since it has >>> never been written. >> >> Huch? If an access to the TSC ADJUST MSR faults, then something is really >> wrong. And writing it unconditionally to 0 is not going to happen. 4.10 >> has >> new code which utilizes the TSC_ADJUST MSR. >> >>> It would be nice for user-space programs that want to use the TSC with >>> rdtsc / rdtscp instructions, such as the demo program attached to the >>> bug report, >>> could have confidence that Linux is actually generating the results of >>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec) >>> in a predictable way from the TSC by looking at the >>> /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space >>> use of TSC values, so that they can correlate TSC values with linux >>> clock_gettime() values. >> >> What has ART to do with correct CLOCK_MONOTONIC_RAW values? >> >> Nothing at all, really. >> >> The kernel makes use of the proper information values already. >> >> The TSC frequency is determined from: >> >> 1) CPUID(0x16) if available >> 2) MSRs if available >> 3) By calibration against a known clock >> >> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values >> are >> correct whether that machine has ART exposed to peripherals or not. >> >>> has tsc: 1 constant: 1 >>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1 >> >> And that voodoo math tells us what? That you found a way to correlate >> CPUID(0xd) to the TSC frequency on that machine. >> >> Now I'm curious how you do that on this other machine which returns for >> cpuid(15): 1, 1, 1 >> >> You can't because all of this is completely wrong. >> >> Thanks, >> >> tglx >> > [-- Attachment #2: ttsc.tar --] [-- Type: application/x-tar, Size: 30720 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 2017-02-22 16:07 ` Jason Vas Dias @ 2017-02-22 16:18 ` Jason Vas Dias 0 siblings, 0 replies; 17+ messages in thread From: Jason Vas Dias @ 2017-02-22 16:18 UTC (permalink / raw) To: Thomas Gleixner Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin, Prarit Bhargava, x86 On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: > RE: >>> 4.10 has new code which utilizes the TSC_ADJUST MSR. > > I just built an unpatched linux v4.10 with tglx's TSC improvements - > much else improved in this kernel (like iwlwifi) - thanks! > > I have attached an updated version of the test program which > doesn't print the bogus "Nominal TSC Frequency" (the previous > version printed it, but equally ignored it). > > The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by > a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : > > $ uname -r > 4.10.0 > $ ./ttsc1 > max_extended_leaf: 80000008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. > ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599 > ts3 - ts2: 178 ns1: 0.000000592 > ts3 - ts2: 14 ns1: 0.000000577 > ts3 - ts2: 14 ns1: 0.000000651 > ts3 - ts2: 17 ns1: 0.000000625 > ts3 - ts2: 17 ns1: 0.000000677 > ts3 - ts2: 17 ns1: 0.000000626 > ts3 - ts2: 17 ns1: 0.000000627 > ts3 - ts2: 17 ns1: 0.000000627 > ts3 - ts2: 18 ns1: 0.000000655 > ts3 - ts2: 17 ns1: 0.000000631 > t1 - t0: 89067 - ns2: 0.000091411 > Oops, going blind in my old age. These latencies are actually 3 times greater than under 4.8 !! 
Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as shown in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:: ts3 - ts2: 24 ns1: 0.000000162 ts3 - ts2: 17 ns1: 0.000000143 ts3 - ts2: 17 ns1: 0.000000146 ts3 - ts2: 17 ns1: 0.000000149 ts3 - ts2: 17 ns1: 0.000000141 ts3 - ts2: 16 ns1: 0.000000142 now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ 600ns, @ 4 times more than under 4.8 . But I'm glad the TSC_ADJUST problems are fixed. Will programs reading : $ cat /sys/devices/msr/events/tsc event=0x00 read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the TSC ? > I think this is because under Linux 4.8, the CPU got a fault every > time it read the TSC_ADJUST MSR. maybe it still is! > But user programs wanting to use the TSC and correlate its value to > clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above > program still have to dig the TSC frequency value out of the kernel > with objdump - this was really the point of the bug #194609. > > I would still like to investigate exporting 'tsc_khz' & 'mult' + > 'shift' values via sysfs. > > Regards, > Jason. > > > > > > On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >> Thank You for enlightening me - >> >> I was just having a hard time believing that Intel would ship a chip >> that features a monotonic, fixed frequency timestamp counter >> without specifying in either documentation or on-chip or in ACPI what >> precisely that hard-wired frequency is, but I now know that to >> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU >> assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is >> difficult to reconcile with the statement in the SDM : >> 17.16.4 Invariant Time-Keeping >> The invariant TSC is based on the invariant timekeeping hardware >> (called Always Running Timer or ART), that runs at the core crystal >> clock >> frequency. 
The ratio defined by CPUID leaf 15H expresses the >> frequency >> relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] >> != >> 0 >> and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity >> relationship holds between TSC and the ART hardware: >> TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] ) >> / CPUID.15H:EAX[31:0] + K >> Where 'K' is an offset that can be adjusted by a privileged agent*2. >> When ART hardware is reset, both invariant TSC and K are also reset. >> >> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and >> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) >> that >> the "Nominal TSC Frequency" formulae in the manul must apply to all >> CPUs with InvariantTSC . >> >> Do I understand correctly , that since I do have InvariantTSC , the >> TSC_Value is in fact calculated according to the above formula, but with >> a "hidden" ART Value, & Core Crystal Clock frequency & its ratio to >> TSC frequency ? >> It was obvious this nominal TSC Frequency had nothing to do with the >> actual TSC frequency used by Linux, which is 'tsc_khz' . >> I guess wishful thinking led me to believe CPUID:15h was actually >> supported somehow , because I thought InvariantTSC meant it had ART >> hardware . >> >> I do strongly suggest that Linux exports its calibrated TSC Khz >> somewhere to user >> space . >> >> I think the best long-term solution would be to allow programs to >> somehow read the TSC without invoking >> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), & >> having to enter the kernel, which incurs an overhead of > 120ns on my >> system >> . >> >> >> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and >> 'clocksource->shift' values to /sysfs somehow ? 
>> >> For instance , only if the 'current_clocksource' is 'tsc', then these >> values could be exported as: >> /sys/devices/system/clocksource/clocksource0/shift >> /sys/devices/system/clocksource/clocksource0/mult >> /sys/devices/system/clocksource/clocksource0/freq >> >> So user-space programs could know that the value returned by >> clock_gettime(CLOCK_MONOTONIC_RAW) >> would be >> { .tv_sec = ( ( rdtsc() * mult ) >> shift ) >> 32, >> , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) >> &~0U >> } >> and that represents ticks of period (1.0 / ( freq * 1000 )) S. >> >> That would save user-space programs from having to know 'tsc_khz' by >> parsing the 'Refined TSC' frequency from log files or by examining the >> running kernel with objdump to obtain this value & figure out 'mult' & >> 'shift' themselves. >> >> And why not a >> /sys/devices/system/clocksource/clocksource0/value >> file that actually prints this ( ( rdtsc() * mult ) >> shift ) >> expression as a long integer? >> And perhaps a >> /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds >> file that actually prints out the number of real-time nano-seconds since >> the >> contents of the existing >> /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch} >> files using the current TSC value? >> To read the rtc0/{date,time} files is already faster than entering the >> kernel to call >> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts. >> >> I will work on developing a patch to this effect if no-one else is. >> >> Also, am I right in assuming that the maximum granularity of the >> real-time >> clock >> on my system is 1/64th of a second ? : >> $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq >> 64 >> This is the maximum granularity that can be stored in CMOS , not >> returned by TSC? Couldn't we have something similar that gave an >> accurate idea of TSC frequency and the precise formula applied to TSC >> value to get clock_gettime >> (CLOCK_MONOTONIC_RAW) value ? 
>> >> Regards, >> Jason >> >> >> This code does produce good timestamps with a latency of @20ns >> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts) >> values, but it depends on a global variable that is initialized to >> the 'tsc_khz' value >> computed by the running kernel, parsed from objdump /proc/kcore output : >> >> static inline __attribute__((always_inline)) >> U64_t >> IA64_tsc_now() >> { if(!( _ia64_invariant_tsc_enabled >> ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL)) >> ) >> ) >> { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant >> TSC enabled.\n", __LINE__, __func__); >> return 0; >> } >> U32_t tsc_hi, tsc_lo; >> register UL_t tsc; >> asm volatile >> ( "rdtscp\n\t" >> "mov %%edx, %0\n\t" >> "mov %%eax, %1\n\t" >> "mov %%ecx, %2\n\t" >> : "=m" (tsc_hi) , >> "=m" (tsc_lo) , >> "=m" (_ia64_tsc_user_cpu) : >> : "%eax","%ecx","%edx" >> ); >> tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo); >> return tsc; >> } >> >> __thread >> U64_t _ia64_first_tsc = 0xffffffffffffffffUL; >> >> static inline __attribute__((always_inline)) >> U64_t IA64_tsc_ticks_since_start() >> { if(_ia64_first_tsc == 0xffffffffffffffffUL) >> { _ia64_first_tsc = IA64_tsc_now(); >> return 0; >> } >> return (IA64_tsc_now() - _ia64_first_tsc) ; >> } >> >> static inline __attribute__((always_inline)) >> void >> ia64_tsc_calc_mult_shift >> ( register U32_t *mult, >> register U32_t *shift >> ) >> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() function: >> * calculates second + nanosecond mult + shift in same way linux does. >> * we want to be compatible with what linux returns in struct >> timespec ts after call to >> * clock_gettime(CLOCK_MONOTONIC_RAW, &ts). >> */ >> const U32_t scale=1000U; >> register U32_t from= IA64_tsc_khz(); >> register U32_t to = NSEC_PER_SEC / scale; >> register U64_t sec = ( ~0UL / from ) / scale; >> sec = (sec > 600) ? 600 : ((sec > 0) ? 
sec : 1); >> register U64_t maxsec = sec * scale; >> UL_t tmp; >> U32_t sft, sftacc=32; >> /* >> * Calculate the shift factor which is limiting the conversion >> * range: >> */ >> tmp = (maxsec * from) >> 32; >> while (tmp) >> { tmp >>=1; >> sftacc--; >> } >> /* >> * Find the conversion shift/mult pair which has the best >> * accuracy and fits the maxsec conversion range: >> */ >> for (sft = 32; sft > 0; sft--) >> { tmp = ((UL_t) to) << sft; >> tmp += from / 2; >> tmp = tmp / from; >> if ((tmp >> sftacc) == 0) >> break; >> } >> *mult = tmp; >> *shift = sft; >> } >> >> __thread >> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U; >> >> static inline __attribute__((always_inline)) >> U64_t IA64_s_ns_since_start() >> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) ) >> ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift); >> register U64_t cycles = IA64_tsc_ticks_since_start(); >> register U64_t ns = ((cycles >> *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift); >> return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns % >> NSEC_PER_SEC)&0x3fffffffUL) ); >> /* Yes, we are purposefully ignoring durations of more than 4.2 >> billion seconds here! */ >> } >> >> >> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values >> somehow, >> then user-space libraries could have more confidence in using 'rdtsc' >> or 'rdtscp' >> if Linux's current_clocksource is 'tsc'. >> >> Regards, >> Jason >> >> >> >> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote: >>> On Sun, 19 Feb 2017, Jason Vas Dias wrote: >>> >>>> CPUID:15H is available in user-space, returning the integers : ( 7, >>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so >>>> in detect_art() in tsc.c, >>> >>> By some definition of available. You can feed CPUID random leaf numbers >>> and >>> it will return something, usually the value of the last valid CPUID >>> leaf, >>> which is 13 on your CPU. 
A similar CPU model has >>> >>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 >>> edx=0x00000000 >>> >>> i.e. 7, 832, 832, 0 >>> >>> Looks familiar, right? >>> >>> You can verify that with 'cpuid -1 -r' on your machine. >>> >>>> Linux does not think ART is enabled, and does not set the synthesized >>>> CPUID + >>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not >>>> see this bit set . >>> >>> Rightfully so. This is a Haswell Core model. >>> >>>> if an e1000 NIC card had been installed, PTP would not be available. >>> >>> PTP is independent of the ART kernel feature . ART just provides >>> enhanced >>> PTP features. You are confusing things here. >>> >>> The ART feature as the kernel sees it is a hardware extension which >>> feeds >>> the ART clock to peripherals for timestamping and time correlation >>> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 >>> so >>> the kernel can make use of that correlation, e.g. for enhanced PTP >>> accuracy. >>> >>> It's correct, that the NONSTOP_TSC feature depends on the availability >>> of >>> ART, but that has nothing to do with the feature bit, which solely >>> describes the ratio between TSC and the ART frequency which is exposed >>> to >>> peripherals. That frequency is not necessarily the real ART frequency. >>> >>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be >>>> nowhere else in Linux, the code will always think X86_FEATURE_ART is 0 >>>> because the CPU will always get a fault reading the MSR since it has >>>> never been written. >>> >>> Huch? If an access to the TSC ADJUST MSR faults, then something is >>> really >>> wrong. And writing it unconditionally to 0 is not going to happen. 4.10 >>> has >>> new code which utilizes the TSC_ADJUST MSR. 
>>> >>>> It would be nice for user-space programs that want to use the TSC with >>>> rdtsc / rdtscp instructions, such as the demo program attached to the >>>> bug report, >>>> could have confidence that Linux is actually generating the results of >>>> clock_gettime(CLOCK_MONOTONIC_RAW, ×pec) >>>> in a predictable way from the TSC by looking at the >>>> /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space >>>> use of TSC values, so that they can correlate TSC values with linux >>>> clock_gettime() values. >>> >>> What has ART to do with correct CLOCK_MONOTONIC_RAW values? >>> >>> Nothing at all, really. >>> >>> The kernel makes use of the proper information values already. >>> >>> The TSC frequency is determined from: >>> >>> 1) CPUID(0x16) if available >>> 2) MSRs if available >>> 3) By calibration against a known clock >>> >>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values >>> are >>> correct whether that machine has ART exposed to peripherals or not. >>> >>>> has tsc: 1 constant: 1 >>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1 >>> >>> And that voodoo math tells us what? That you found a way to correlate >>> CPUID(0xd) to the TSC frequency on that machine. >>> >>> Now I'm curious how you do that on this other machine which returns for >>> cpuid(15): 1, 1, 1 >>> >>> You can't because all of this is completely wrong. >>> >>> Thanks, >>> >>> tglx >>> >> > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 @ 2017-02-22 16:18 ` Jason Vas Dias 0 siblings, 0 replies; 17+ messages in thread From: Jason Vas Dias @ 2017-02-22 16:18 UTC (permalink / raw) To: Thomas Gleixner Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin, Prarit Bhargava, x86 On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: > RE: >>> 4.10 has new code which utilizes the TSC_ADJUST MSR. > > I just built an unpatched linux v4.10 with tglx's TSC improvements - > much else improved in this kernel (like iwlwifi) - thanks! > > I have attached an updated version of the test program which > doesn't print the bogus "Nominal TSC Frequency" (the previous > version printed it, but equally ignored it). > > The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by > a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : > > $ uname -r > 4.10.0 > $ ./ttsc1 > max_extended_leaf: 80000008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. > ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599 > ts3 - ts2: 178 ns1: 0.000000592 > ts3 - ts2: 14 ns1: 0.000000577 > ts3 - ts2: 14 ns1: 0.000000651 > ts3 - ts2: 17 ns1: 0.000000625 > ts3 - ts2: 17 ns1: 0.000000677 > ts3 - ts2: 17 ns1: 0.000000626 > ts3 - ts2: 17 ns1: 0.000000627 > ts3 - ts2: 17 ns1: 0.000000627 > ts3 - ts2: 18 ns1: 0.000000655 > ts3 - ts2: 17 ns1: 0.000000631 > t1 - t0: 89067 - ns2: 0.000091411 > Oops, going blind in my old age. These latencies are actually 3 times greater than under 4.8 !! 
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 2017-02-22 16:18 ` Jason Vas Dias @ 2017-02-22 17:27 ` Jason Vas Dias -1 siblings, 0 replies; 17+ messages in thread From: Jason Vas Dias @ 2017-02-22 17:27 UTC (permalink / raw) To: Thomas Gleixner Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin, Prarit Bhargava, x86 Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is read or written . It is probably because it genuinely does not support any cpuid > 13 , or the modern TSC_ADJUST interface . This is probably why my clock_gettime() latencies are so bad. Now I have to develop a patch to disable all access to TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 . I really have an unlucky CPU :-) . But really, I think this issue goes deeper into the fundamental limits of time measurement on Linux : it is never going to be possible to measure minimum times with clock_gettime() comparable with those returned by the rdtscp instruction - the time taken to enter the kernel through the VDSO, queue an access to vsyscall_gtod_data via a workqueue, access it & do computations & copy value to user-space is NEVER going to be up to the job of measuring small real-time durations of the order of 10-20 TSC ticks . I think the best way to solve this problem going forward would be to store the entire vsyscall_gtod_data data structure representing the current clocksource in a shared page which is memory-mappable (read-only) by user-space . I think user-space programs should be able to do something like : int fd = open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY); size_t psz = getpagesize(); void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 ); msync(gtod,psz,MS_SYNC); Then they could all read the real-time clock values as they are updated in real-time by the kernel, and know exactly how to interpret them . 
I also think that all mktime() / gmtime() / localtime() timezone handling functionality should be moved to user-space, and that the kernel should actually load and link in some /lib/libtzdata.so library, provided by glibc / libc implementations, that is exactly the same library used by glibc code to parse tzdata ; tzdata should be loaded at boot time by the kernel from the same places glibc loads it, and both the kernel and glibc should use identical mktime(), gmtime(), etc. functions to access it, and glibc-using code would not need to enter the kernel at all for any time-handling code. This tzdata library could be automatically loaded into process images the same way the vdso region is , and the whole system could access only one copy of it and the 'gtod.page' in memory. That's just my two-cents worth, and how I'd like to eventually get things working on my system. All the best, Regards, Jason On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: > On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >> RE: >>>> 4.10 has new code which utilizes the TSC_ADJUST MSR. >> >> I just built an unpatched linux v4.10 with tglx's TSC improvements - >> much else improved in this kernel (like iwlwifi) - thanks! >> >> I have attached an updated version of the test program which >> doesn't print the bogus "Nominal TSC Frequency" (the previous >> version printed it, but equally ignored it). >> >> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by >> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : >> >> $ uname -r >> 4.10.0 >> $ ./ttsc1 >> max_extended_leaf: 80000008 >> has tsc: 1 constant: 1 >> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. 
>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599 >> ts3 - ts2: 178 ns1: 0.000000592 >> ts3 - ts2: 14 ns1: 0.000000577 >> ts3 - ts2: 14 ns1: 0.000000651 >> ts3 - ts2: 17 ns1: 0.000000625 >> ts3 - ts2: 17 ns1: 0.000000677 >> ts3 - ts2: 17 ns1: 0.000000626 >> ts3 - ts2: 17 ns1: 0.000000627 >> ts3 - ts2: 17 ns1: 0.000000627 >> ts3 - ts2: 18 ns1: 0.000000655 >> ts3 - ts2: 17 ns1: 0.000000631 >> t1 - t0: 89067 - ns2: 0.000091411 >> > > > Oops, going blind in my old age. These latencies are actually 3 times > greater than under 4.8 !! > > Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as > shown > in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:: > > ts3 - ts2: 24 ns1: 0.000000162 > ts3 - ts2: 17 ns1: 0.000000143 > ts3 - ts2: 17 ns1: 0.000000146 > ts3 - ts2: 17 ns1: 0.000000149 > ts3 - ts2: 17 ns1: 0.000000141 > ts3 - ts2: 16 ns1: 0.000000142 > > now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ > 600ns, @ 4 times more than under 4.8 . > But I'm glad the TSC_ADJUST problems are fixed. > > Will programs reading : > $ cat /sys/devices/msr/events/tsc > event=0x00 > read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the > TSC ? > >> I think this is because under Linux 4.8, the CPU got a fault every >> time it read the TSC_ADJUST MSR. > > maybe it still is! > > >> But user programs wanting to use the TSC and correlate its value to >> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above >> program still have to dig the TSC frequency value out of the kernel >> with objdump - this was really the point of the bug #194609. >> >> I would still like to investigate exporting 'tsc_khz' & 'mult' + >> 'shift' values via sysfs. >> >> Regards, >> Jason. 
>> >> >> >> >> >> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >>> Thank You for enlightening me - >>> >>> I was just having a hard time believing that Intel would ship a chip >>> that features a monotonic, fixed frequency timestamp counter >>> without specifying in either documentation or on-chip or in ACPI what >>> precisely that hard-wired frequency is, but I now know that to >>> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU >>> assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is >>> difficult to reconcile with the statement in the SDM : >>> 17.16.4 Invariant Time-Keeping >>> The invariant TSC is based on the invariant timekeeping hardware >>> (called Always Running Timer or ART), that runs at the core crystal >>> clock >>> frequency. The ratio defined by CPUID leaf 15H expresses the >>> frequency >>> relationship between the ART hardware and TSC. If >>> CPUID.15H:EBX[31:0] >>> != >>> 0 >>> and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity >>> relationship holds between TSC and the ART hardware: >>> TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] ) >>> / CPUID.15H:EAX[31:0] + K >>> Where 'K' is an offset that can be adjusted by a privileged agent*2. >>> When ART hardware is reset, both invariant TSC and K are also >>> reset. >>> >>> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and >>> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) >>> that >>> the "Nominal TSC Frequency" formula in the manual must apply to all >>> CPUs with InvariantTSC . >>> >>> Do I understand correctly , that since I do have InvariantTSC , the >>> TSC_Value is in fact calculated according to the above formula, but with >>> a "hidden" ART Value, & Core Crystal Clock frequency & its ratio to >>> TSC frequency ? >>> It was obvious this nominal TSC Frequency had nothing to do with the >>> actual TSC frequency used by Linux, which is 'tsc_khz' . 
>>> I guess wishful thinking led me to believe CPUID:15h was actually >>> supported somehow , because I thought InvariantTSC meant it had ART >>> hardware . >>> >>> I do strongly suggest that Linux exports its calibrated TSC Khz >>> somewhere to user >>> space . >>> >>> I think the best long-term solution would be to allow programs to >>> somehow read the TSC without invoking >>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), & >>> having to enter the kernel, which incurs an overhead of > 120ns on my >>> system >>> . >>> >>> >>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and >>> 'clocksource->shift' values to /sysfs somehow ? >>> >>> For instance , only if the 'current_clocksource' is 'tsc', then these >>> values could be exported as: >>> /sys/devices/system/clocksource/clocksource0/shift >>> /sys/devices/system/clocksource/clocksource0/mult >>> /sys/devices/system/clocksource/clocksource0/freq >>> >>> So user-space programs could know that the value returned by >>> clock_gettime(CLOCK_MONOTONIC_RAW) >>> would be >>> { .tv_sec = ( ( rdtsc() * mult ) >> shift ) >> 32, >>> , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) >> &~0U >>> } >>> and that represents ticks of period (1.0 / ( freq * 1000 )) S. >>> >>> That would save user-space programs from having to know 'tsc_khz' by >>> parsing the 'Refined TSC' frequency from log files or by examining the >>> running kernel with objdump to obtain this value & figure out 'mult' & >>> 'shift' themselves. >>> >>> And why not a >>> /sys/devices/system/clocksource/clocksource0/value >>> file that actually prints this ( ( rdtsc() * mult ) >> shift ) >>> expression as a long integer? >>> And perhaps a >>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds >>> file that actually prints out the number of real-time nano-seconds since >>> the >>> contents of the existing >>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch} >>> files using the current TSC value? 
>>> To read the rtc0/{date,time} files is already faster than entering the >>> kernel to call >>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts. >>> >>> I will work on developing a patch to this effect if no-one else is. >>> >>> Also, am I right in assuming that the maximum granularity of the >>> real-time >>> clock >>> on my system is 1/64th of a second ? : >>> $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq >>> 64 >>> This is the maximum granularity that can be stored in CMOS , not >>> returned by TSC? Couldn't we have something similar that gave an >>> accurate idea of TSC frequency and the precise formula applied to TSC >>> value to get clock_gettime >>> (CLOCK_MONOTONIC_RAW) value ? >>> >>> Regards, >>> Jason >>> >>> >>> This code does produce good timestamps with a latency of @20ns >>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts) >>> values, but it depends on a global variable that is initialized to >>> the 'tsc_khz' value >>> computed by the running kernel, parsed from objdump /proc/kcore output : >>> >>> static inline __attribute__((always_inline)) >>> U64_t >>> IA64_tsc_now() >>> { if(!( _ia64_invariant_tsc_enabled >>> ||(( _cpu0id_fd == -1) && >>> IA64_invariant_tsc_is_enabled(NULL,NULL)) >>> ) >>> ) >>> { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant >>> TSC enabled.\n", __LINE__, __func__); >>> return 0; >>> } >>> U32_t tsc_hi, tsc_lo; >>> register UL_t tsc; >>> asm volatile >>> ( "rdtscp\n\t" >>> "mov %%edx, %0\n\t" >>> "mov %%eax, %1\n\t" >>> "mov %%ecx, %2\n\t" >>> : "=m" (tsc_hi) , >>> "=m" (tsc_lo) , >>> "=m" (_ia64_tsc_user_cpu) : >>> : "%eax","%ecx","%edx" >>> ); >>> tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo); >>> return tsc; >>> } >>> >>> __thread >>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL; >>> >>> static inline __attribute__((always_inline)) >>> U64_t IA64_tsc_ticks_since_start() >>> { if(_ia64_first_tsc == 0xffffffffffffffffUL) >>> { _ia64_first_tsc = IA64_tsc_now(); >>> return 0; >>> } >>> return 
(IA64_tsc_now() - _ia64_first_tsc) ; >>> } >>> >>> static inline __attribute__((always_inline)) >>> void >>> ia64_tsc_calc_mult_shift >>> ( register U32_t *mult, >>> register U32_t *shift >>> ) >>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() >>> function: >>> * calculates second + nanosecond mult + shift in same way linux does. >>> * we want to be compatible with what linux returns in struct >>> timespec ts after call to >>> * clock_gettime(CLOCK_MONOTONIC_RAW, &ts). >>> */ >>> const U32_t scale=1000U; >>> register U32_t from= IA64_tsc_khz(); >>> register U32_t to = NSEC_PER_SEC / scale; >>> register U64_t sec = ( ~0UL / from ) / scale; >>> sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1); >>> register U64_t maxsec = sec * scale; >>> UL_t tmp; >>> U32_t sft, sftacc=32; >>> /* >>> * Calculate the shift factor which is limiting the conversion >>> * range: >>> */ >>> tmp = (maxsec * from) >> 32; >>> while (tmp) >>> { tmp >>=1; >>> sftacc--; >>> } >>> /* >>> * Find the conversion shift/mult pair which has the best >>> * accuracy and fits the maxsec conversion range: >>> */ >>> for (sft = 32; sft > 0; sft--) >>> { tmp = ((UL_t) to) << sft; >>> tmp += from / 2; >>> tmp = tmp / from; >>> if ((tmp >> sftacc) == 0) >>> break; >>> } >>> *mult = tmp; >>> *shift = sft; >>> } >>> >>> __thread >>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U; >>> >>> static inline __attribute__((always_inline)) >>> U64_t IA64_s_ns_since_start() >>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) ) >>> ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift); >>> register U64_t cycles = IA64_tsc_ticks_since_start(); >>> register U64_t ns = ((cycles >>> *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift); >>> return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns % >>> NSEC_PER_SEC)&0x3fffffffUL) ); >>> /* Yes, we are purposefully ignoring durations of more than 4.2 >>> billion seconds here! 
*/ >>> } >>> >>> >>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values >>> somehow, >>> then user-space libraries could have more confidence in using 'rdtsc' >>> or 'rdtscp' >>> if Linux's current_clocksource is 'tsc'. >>> >>> Regards, >>> Jason >>> >>> >>> >>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote: >>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote: >>>> >>>>> CPUID:15H is available in user-space, returning the integers : ( 7, >>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so >>>>> in detect_art() in tsc.c, >>>> >>>> By some definition of available. You can feed CPUID random leaf numbers >>>> and >>>> it will return something, usually the value of the last valid CPUID >>>> leaf, >>>> which is 13 on your CPU. A similar CPU model has >>>> >>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 >>>> edx=0x00000000 >>>> >>>> i.e. 7, 832, 832, 0 >>>> >>>> Looks familiar, right? >>>> >>>> You can verify that with 'cpuid -1 -r' on your machine. >>>> >>>>> Linux does not think ART is enabled, and does not set the synthesized >>>>> CPUID + >>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not >>>>> see this bit set . >>>> >>>> Rightfully so. This is a Haswell Core model. >>>> >>>>> if an e1000 NIC card had been installed, PTP would not be available. >>>> >>>> PTP is independent of the ART kernel feature . ART just provides >>>> enhanced >>>> PTP features. You are confusing things here. >>>> >>>> The ART feature as the kernel sees it is a hardware extension which >>>> feeds >>>> the ART clock to peripherals for timestamping and time correlation >>>> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 >>>> so >>>> the kernel can make use of that correlation, e.g. for enhanced PTP >>>> accuracy. 
>>>> >>>> It's correct, that the NONSTOP_TSC feature depends on the availability >>>> of >>>> ART, but that has nothing to do with the feature bit, which solely >>>> describes the ratio between TSC and the ART frequency which is exposed >>>> to >>>> peripherals. That frequency is not necessarily the real ART frequency. >>>> >>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to >>>>> be >>>>> nowhere else in Linux, the code will always think X86_FEATURE_ART is >>>>> 0 >>>>> because the CPU will always get a fault reading the MSR since it has >>>>> never been written. >>>> >>>> Huch? If an access to the TSC ADJUST MSR faults, then something is >>>> really >>>> wrong. And writing it unconditionally to 0 is not going to happen. 4.10 >>>> has >>>> new code which utilizes the TSC_ADJUST MSR. >>>> >>>>> It would be nice for user-space programs that want to use the TSC with >>>>> rdtsc / rdtscp instructions, such as the demo program attached to the >>>>> bug report, >>>>> could have confidence that Linux is actually generating the results of >>>>> clock_gettime(CLOCK_MONOTONIC_RAW, ×pec) >>>>> in a predictable way from the TSC by looking at the >>>>> /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space >>>>> use of TSC values, so that they can correlate TSC values with linux >>>>> clock_gettime() values. >>>> >>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values? >>>> >>>> Nothing at all, really. >>>> >>>> The kernel makes use of the proper information values already. >>>> >>>> The TSC frequency is determined from: >>>> >>>> 1) CPUID(0x16) if available >>>> 2) MSRs if available >>>> 3) By calibration against a known clock >>>> >>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values >>>> are >>>> correct whether that machine has ART exposed to peripherals or not. >>>> >>>>> has tsc: 1 constant: 1 >>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1 >>>> >>>> And that voodoo math tells us what? 
That you found a way to correlate >>>> CPUID(0xd) to the TSC frequency on that machine. >>>> >>>> Now I'm curious how you do that on this other machine which returns for >>>> cpuid(15): 1, 1, 1 >>>> >>>> You can't because all of this is completely wrong. >>>> >>>> Thanks, >>>> >>>> tglx >>>> >>> >> > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 @ 2017-02-22 17:27 ` Jason Vas Dias 0 siblings, 0 replies; 17+ messages in thread From: Jason Vas Dias @ 2017-02-22 17:27 UTC (permalink / raw) To: Thomas Gleixner Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin, Prarit Bhargava, x86 Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is read or written . It is probably because it genuinely does not support any cpuid > 13 , or the modern TSC_ADJUST interface . This is probably why my clock_gettime() latencies are so bad. Now I have to develop a patch to disable all access to TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 . I really have an unlucky CPU :-) . But really, I think this issue goes deeper into the fundamental limits of time measurement on Linux : it is never going to be possible to measure minimum times with clock_gettime() comparable with those returned by rdtscp instruction - the time taken to enter the kernel through the VDSO, queue an access to vsyscall_gtod_data via a workqueue, access it & do computations & copy value to user-space is NEVER going to be up to the job of measuring small real-time durations of the order of 10-20 TSC ticks . I think the best way to solve this problem going forward would be to store the entire vsyscall_gtod_data data structure representing the current clocksource in a shared page which is memory-mappable (read-only) by user-space . I think user-space programs should be able to do something like : int fd = open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY); size_t psz = getpagesize(); void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 ); msync(gtod,psz,MS_SYNC); Then they could all read the real-time clock values as they are updated in real-time by the kernel, and know exactly how to interpret them . 
I also think that all mktime() / gmtime() / localtime() timezone handling functionality should be moved to user-space, and that the kernel should actually load and link in some /lib/libtzdata.so library, provided by glibc / libc implementations, that is exactly the same library used by glibc code to parse tzdata ; tzdata should be loaded at boot time by the kernel from the same places glibc loads it, and both the kernel and glibc should use identical mktime(), gmtime(), etc. functions to access it, and glibc-using code would not need to enter the kernel at all for any time-handling code. This tzdata-library code could be automatically loaded into process images the same way the vdso region is , and the whole system could access only one copy of it and the 'gtod.page' in memory. That's just my two cents' worth, and how I'd like to eventually get things working on my system. All the best, Regards, Jason On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: > On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >> RE: >>>> 4.10 has new code which utilizes the TSC_ADJUST MSR. >> >> I just built an unpatched linux v4.10 with tglx's TSC improvements - >> much else improved in this kernel (like iwlwifi) - thanks! >> >> I have attached an updated version of the test program which >> doesn't print the bogus "Nominal TSC Frequency" (the previous >> version printed it, but equally ignored it). >> >> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by >> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : >> >> $ uname -r >> 4.10.0 >> $ ./ttsc1 >> max_extended_leaf: 80000008 >> has tsc: 1 constant: 1 >> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. 
>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599 >> ts3 - ts2: 178 ns1: 0.000000592 >> ts3 - ts2: 14 ns1: 0.000000577 >> ts3 - ts2: 14 ns1: 0.000000651 >> ts3 - ts2: 17 ns1: 0.000000625 >> ts3 - ts2: 17 ns1: 0.000000677 >> ts3 - ts2: 17 ns1: 0.000000626 >> ts3 - ts2: 17 ns1: 0.000000627 >> ts3 - ts2: 17 ns1: 0.000000627 >> ts3 - ts2: 18 ns1: 0.000000655 >> ts3 - ts2: 17 ns1: 0.000000631 >> t1 - t0: 89067 - ns2: 0.000091411 >> > > > Oops, going blind in my old age. These latencies are actually 3 times > greater than under 4.8 !! > > Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as > shown > in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:: > > ts3 - ts2: 24 ns1: 0.000000162 > ts3 - ts2: 17 ns1: 0.000000143 > ts3 - ts2: 17 ns1: 0.000000146 > ts3 - ts2: 17 ns1: 0.000000149 > ts3 - ts2: 17 ns1: 0.000000141 > ts3 - ts2: 16 ns1: 0.000000142 > > now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ > 600ns, @ 4 times more than under 4.8 . > But I'm glad the TSC_ADJUST problems are fixed. > > Will programs reading : > $ cat /sys/devices/msr/events/tsc > event=0x00 > read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the > TSC ? > >> I think this is because under Linux 4.8, the CPU got a fault every >> time it read the TSC_ADJUST MSR. > > maybe it still is! > > >> But user programs wanting to use the TSC and correlate its value to >> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above >> program still have to dig the TSC frequency value out of the kernel >> with objdump - this was really the point of the bug #194609. >> >> I would still like to investigate exporting 'tsc_khz' & 'mult' + >> 'shift' values via sysfs. >> >> Regards, >> Jason. 
^ permalink raw reply	[flat|nested] 17+ messages in thread
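[Editor's note: the 'gtod.page', 'mult', and 'shift' sysfs files proposed in the message above are hypothetical, but the active clocksource is already exported through sysfs, so a program can at least refuse raw-rdtsc timing when the kernel is not using the TSC. A defensive shell sketch:]

```shell
#!/bin/sh
# Report the clocksource the kernel is actually using; 'tsc' here is a
# precondition for correlating rdtsc values with clock_gettime() results.
f=/sys/devices/system/clocksource/clocksource0/current_clocksource
if [ -r "$f" ]; then
    cs=$(cat "$f")
    echo "current clocksource: $cs"
else
    # e.g. non-Linux system, or sysfs not mounted
    echo "clocksource sysfs interface not available"
fi
```

The sibling file `available_clocksource` in the same directory lists the alternatives the kernel could switch to.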
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 2017-02-22 17:27 ` Jason Vas Dias @ 2017-02-22 19:53 ` Thomas Gleixner 0 siblings, 0 replies; 17+ messages in thread From: Thomas Gleixner @ 2017-02-22 19:53 UTC (permalink / raw) To: Jason Vas Dias Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin, Prarit Bhargava, x86 On Wed, 22 Feb 2017, Jason Vas Dias wrote: > Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is > read or written . It is probably because it genuinely does not support > any cpuid > 13 , or the modern TSC_ADJUST interface. Err no. TSC_ADJUST is available when CPUID(7).EBX has bit 1 set. Please provide the output of: # cpuid -1 -r for that machine > This is probably why my clock_gettime() latencies are so bad. Now I have > to develop a patch to disable all access to TSC_ADJUST MSR if > boot_cpu_data.cpuid_level <= 13 . I really have an unlucky CPU :-) . Can you just try to boot Linux 4.10 on that machine and report whether it works? It will touch the TSC_ADJUST MSR when the feature bit is set. > But really, I think this issue goes deeper into the fundamental limits of > time measurement on Linux : it is never going to be possible to measure > minimum times with clock_gettime() comparable with those returned by > rdtscp instruction - the time taken to enter the kernel through the VDSO, > queue an access to vsyscall_gtod_data via a workqueue, access it & do > computations & copy value to user-space Sorry, that's not how the VDSO works. It does not involve workqueues, copy to user space and whatever. VDSO is mapped into user space and only goes into the kernel when TSC is not working or the VDSO access is disabled or you want to access a CLOCKID which is not supported in the VDSO. > is NEVER going to be up to the job of measuring small real-time durations > of the order of 10-20 TSC ticks . 
clock_gettime(CLOCK_MONOTONIC) via VDSO takes ~20ns on my haswell laptop > I think the best way to solve this problem going forward would be to store > the entire vsyscall_gtod_data data structure representing the current > clocksource > in a shared page which is memory-mappable (read-only) by user-space . This is what VDSO does. It provides the data R/O to user space and it also provides the accessor functions. CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_MONOTONIC_COARSE and CLOCK_REALTIME_COARSE are handled in the VDSO (user space) and never enter the kernel. I really have a hard time understanding what you are trying to solve. Thanks, tglx ^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 2017-02-22 17:27 ` Jason Vas Dias @ 2017-02-22 20:15 ` Jason Vas Dias 0 siblings, 0 replies; 17+ messages in thread From: Jason Vas Dias @ 2017-02-22 20:15 UTC (permalink / raw) To: Thomas Gleixner Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin, Prarit Bhargava, x86 I actually tried adding a 'notsc_adjust' kernel option to disable any setting or access to the TSC_ADJUST MSR, but then I see the problems - a big disparity in values depending on which CPU the thread is scheduled on - and no improvement in clock_gettime() latency. So I don't think the new TSC_ADJUST code in tsc_sync.c itself is the issue - but something added @ 460ns onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 . As I don't think fixing the clock_gettime() latency issue is my problem or even possible with the current clock architecture, it is a non-issue. But please, can anyone tell me if there are any plans to move the time infrastructure out of the kernel and into glibc along the lines outlined in my previous mail - if not, I am going to concentrate on this more radical overhaul approach for my own systems . At least, I think mapping the clocksource information structure itself in some kind of sharable page makes sense . Processes could map that page copy-on-write so they could start off with all the timing parameters preloaded, then keep their copy updated using the rdtscp instruction , or msync() (read-only) with the kernel's single copy to get the latest time any process has requested. All real-time parameters & adjustments could be stored in that page , & eventually a single copy of the tzdata could be used by both kernel & user-space. That is what I am working towards. Any plans to make the Linux real-time TSC clock user-friendly ? 
On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: > Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is > read or written . It is probably because it genuinely does not > support any cpuid > 13 , > or the modern TSC_ADJUST interface . This is probably why my > clock_gettime() > latencies are so bad. Now I have to develop a patch to disable all access > to > TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 . > I really have an unlucky CPU :-) . > > But really, I think this issue goes deeper into the fundamental limits of > time measurement on Linux : it is never going to be possible to measure > minimum times with clock_gettime() comparable with those returned by > rdtscp instruction - the time taken to enter the kernel through the VDSO, > queue an access to vsyscall_gtod_data via a workqueue, access it & do > computations & copy value to user-space is NEVER going to be up to the > job of measuring small real-time durations of the order of 10-20 TSC ticks > . > > I think the best way to solve this problem going forward would be to store > the entire vsyscall_gtod_data data structure representing the current > clocksource > in a shared page which is memory-mappable (read-only) by user-space . > I think user-space programs should be able to do something like : > int fd = > open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY); > size_t psz = getpagesize(); > void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 ); > msync(gtod,psz,MS_SYNC); > > Then they could all read the real-time clock values as they are updated > in real-time by the kernel, and know exactly how to interpret them . 
> > I also think that all mktime() / gmtime() / localtime() timezone handling > functionality should be > moved to user-space, and that the kernel should actually load and link in > some > /lib/libtzdata.so > library, provided by glibc / libc implementations, that is exactly the > same library > used by glibc() code to parse tzdata ; tzdata should be loaded at boot time > by the kernel from the same places glibc loads it, and both the kernel and > glibc should use identical mktime(), gmtime(), etc. functions to access it, > and > glibc using code would not need to enter the kernel at all for any > time-handling > code. This tzdata-library code be automatically loaded into process images > the > same way the vdso region is , and the whole system could access only one > copy of it and the 'gtod.page' in memory. > > That's just my two-cents worth, and how I'd like to eventually get > things working > on my system. > > All the best, Regards, > Jason > > > > > > > > > > > > > > On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >>> RE: >>>>> 4.10 has new code which utilizes the TSC_ADJUST MSR. >>> >>> I just built an unpatched linux v4.10 with tglx's TSC improvements - >>> much else improved in this kernel (like iwlwifi) - thanks! >>> >>> I have attached an updated version of the test program which >>> doesn't print the bogus "Nominal TSC Frequency" (the previous >>> version printed it, but equally ignored it). >>> >>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by >>> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : >>> >>> $ uname -r >>> 4.10.0 >>> $ ./ttsc1 >>> max_extended_leaf: 80000008 >>> has tsc: 1 constant: 1 >>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. 
>>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599 >>> ts3 - ts2: 178 ns1: 0.000000592 >>> ts3 - ts2: 14 ns1: 0.000000577 >>> ts3 - ts2: 14 ns1: 0.000000651 >>> ts3 - ts2: 17 ns1: 0.000000625 >>> ts3 - ts2: 17 ns1: 0.000000677 >>> ts3 - ts2: 17 ns1: 0.000000626 >>> ts3 - ts2: 17 ns1: 0.000000627 >>> ts3 - ts2: 17 ns1: 0.000000627 >>> ts3 - ts2: 18 ns1: 0.000000655 >>> ts3 - ts2: 17 ns1: 0.000000631 >>> t1 - t0: 89067 - ns2: 0.000091411 >>> >> >> >> Oops, going blind in my old age. These latencies are actually 3 times >> greater than under 4.8 !! >> >> Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as >> shown >> in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:: >> >> ts3 - ts2: 24 ns1: 0.000000162 >> ts3 - ts2: 17 ns1: 0.000000143 >> ts3 - ts2: 17 ns1: 0.000000146 >> ts3 - ts2: 17 ns1: 0.000000149 >> ts3 - ts2: 17 ns1: 0.000000141 >> ts3 - ts2: 16 ns1: 0.000000142 >> >> now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ >> 600ns, @ 4 times more than under 4.8 . >> But I'm glad the TSC_ADJUST problems are fixed. >> >> Will programs reading : >> $ cat /sys/devices/msr/events/tsc >> event=0x00 >> read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the >> TSC ? >> >>> I think this is because under Linux 4.8, the CPU got a fault every >>> time it read the TSC_ADJUST MSR. >> >> maybe it still is! >> >> >>> But user programs wanting to use the TSC and correlate its value to >>> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above >>> program still have to dig the TSC frequency value out of the kernel >>> with objdump - this was really the point of the bug #194609. >>> >>> I would still like to investigate exporting 'tsc_khz' & 'mult' + >>> 'shift' values via sysfs. >>> >>> Regards, >>> Jason. 
>>> >>> >>> >>> >>> >>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >>>> Thank You for enlightening me - >>>> >>>> I was just having a hard time believing that Intel would ship a chip >>>> that features a monotonic, fixed frequency timestamp counter >>>> without specifying in either documentation or on-chip or in ACPI what >>>> precisely that hard-wired frequency is, but I now know that to >>>> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU >>>> assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is >>>> difficult to reconcile with the statement in the SDM : >>>> 17.16.4 Invariant Time-Keeping >>>> The invariant TSC is based on the invariant timekeeping hardware >>>> (called Always Running Timer or ART), that runs at the core crystal >>>> clock >>>> frequency. The ratio defined by CPUID leaf 15H expresses the >>>> frequency >>>> relationship between the ART hardware and TSC. If >>>> CPUID.15H:EBX[31:0] >>>> != >>>> 0 >>>> and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity >>>> relationship holds between TSC and the ART hardware: >>>> TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] ) >>>> / CPUID.15H:EAX[31:0] + K >>>> Where 'K' is an offset that can be adjusted by a privileged >>>> agent*2. >>>> When ART hardware is reset, both invariant TSC and K are also >>>> reset. >>>> >>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and >>>> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) >>>> that >>>> the "Nominal TSC Frequency" formulae in the manul must apply to all >>>> CPUs with InvariantTSC . >>>> >>>> Do I understand correctly , that since I do have InvariantTSC , the >>>> TSC_Value is in fact calculated according to the above formula, but >>>> with >>>> a "hidden" ART Value, & Core Crystal Clock frequency & its ratio to >>>> TSC frequency ? >>>> It was obvious this nominal TSC Frequency had nothing to do with the >>>> actual TSC frequency used by Linux, which is 'tsc_khz' . 
>>>> I guess wishful thinking led me to believe CPUID:15h was actually >>>> supported somehow , because I thought InvariantTSC meant it had ART >>>> hardware . >>>> >>>> I do strongly suggest that Linux exports its calibrated TSC Khz >>>> somewhere to user >>>> space . >>>> >>>> I think the best long-term solution would be to allow programs to >>>> somehow read the TSC without invoking >>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), & >>>> having to enter the kernel, which incurs an overhead of > 120ns on my >>>> system >>>> . >>>> >>>> >>>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and >>>> 'clocksource->shift' values to /sysfs somehow ? >>>> >>>> For instance , only if the 'current_clocksource' is 'tsc', then these >>>> values could be exported as: >>>> /sys/devices/system/clocksource/clocksource0/shift >>>> /sys/devices/system/clocksource/clocksource0/mult >>>> /sys/devices/system/clocksource/clocksource0/freq >>>> >>>> So user-space programs could know that the value returned by >>>> clock_gettime(CLOCK_MONOTONIC_RAW) >>>> would be >>>> { .tv_sec = ( ( rdtsc() * mult ) >> shift ) >> 32, >>>> , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) >> &~0U >>>> } >>>> and that represents ticks of period (1.0 / ( freq * 1000 )) S. >>>> >>>> That would save user-space programs from having to know 'tsc_khz' by >>>> parsing the 'Refined TSC' frequency from log files or by examining the >>>> running kernel with objdump to obtain this value & figure out 'mult' & >>>> 'shift' themselves. >>>> >>>> And why not a >>>> /sys/devices/system/clocksource/clocksource0/value >>>> file that actually prints this ( ( rdtsc() * mult ) >> shift ) >>>> expression as a long integer? >>>> And perhaps a >>>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds >>>> file that actually prints out the number of real-time nano-seconds >>>> since >>>> the >>>> contents of the existing >>>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch} >>>> files using the current TSC value? 
>>>> To read the rtc0/{date,time} files is already faster than entering the >>>> kernel to call >>>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts. >>>> >>>> I will work on developing a patch to this effect if no-one else is. >>>> >>>> Also, am I right in assuming that the maximum granularity of the >>>> real-time >>>> clock >>>> on my system is 1/64th of a second ? : >>>> $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq >>>> 64 >>>> This is the maximum granularity that can be stored in CMOS , not >>>> returned by TSC? Couldn't we have something similar that gave an >>>> accurate idea of TSC frequency and the precise formula applied to TSC >>>> value to get clock_gettime >>>> (CLOCK_MONOTONIC_RAW) value ? >>>> >>>> Regards, >>>> Jason >>>> >>>> >>>> This code does produce good timestamps with a latency of @20ns >>>> that correlate well with clock_gettIme(CLOCK_MONOTONIC_RAW,&ts) >>>> values, but it depends on a global variable that is initialized to >>>> the 'tsc_khz' value >>>> computed by running kernel parsed from objdump /proc/kcore output : >>>> >>>> static inline __attribute__((always_inline)) >>>> U64_t >>>> IA64_tsc_now() >>>> { if(!( _ia64_invariant_tsc_enabled >>>> ||(( _cpu0id_fd == -1) && >>>> IA64_invariant_tsc_is_enabled(NULL,NULL)) >>>> ) >>>> ) >>>> { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant >>>> TSC enabled.\n"); >>>> return 0; >>>> } >>>> U32_t tsc_hi, tsc_lo; >>>> register UL_t tsc; >>>> asm volatile >>>> ( "rdtscp\n\t" >>>> "mov %%edx, %0\n\t" >>>> "mov %%eax, %1\n\t" >>>> "mov %%ecx, %2\n\t" >>>> : "=m" (tsc_hi) , >>>> "=m" (tsc_lo) , >>>> "=m" (_ia64_tsc_user_cpu) : >>>> : "%eax","%ecx","%edx" >>>> ); >>>> tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo); >>>> return tsc; >>>> } >>>> >>>> __thread >>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL; >>>> >>>> static inline __attribute__((always_inline)) >>>> U64_t IA64_tsc_ticks_since_start() >>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL) >>>> { 
_ia64_first_tsc = IA64_tsc_now(); >>>> return 0; >>>> } >>>> return (IA64_tsc_now() - _ia64_first_tsc) ; >>>> } >>>> >>>> static inline __attribute__((always_inline)) >>>> void >>>> ia64_tsc_calc_mult_shift >>>> ( register U32_t *mult, >>>> register U32_t *shift >>>> ) >>>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() >>>> function: >>>> * calculates second + nanosecond mult + shift in same way linux >>>> does. >>>> * we want to be compatible with what linux returns in struct >>>> timespec ts after call to >>>> * clock_gettime(CLOCK_MONOTONIC_RAW, &ts). >>>> */ >>>> const U32_t scale=1000U; >>>> register U32_t from= IA64_tsc_khz(); >>>> register U32_t to = NSEC_PER_SEC / scale; >>>> register U64_t sec = ( ~0UL / from ) / scale; >>>> sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1); >>>> register U64_t maxsec = sec * scale; >>>> UL_t tmp; >>>> U32_t sft, sftacc=32; >>>> /* >>>> * Calculate the shift factor which is limiting the conversion >>>> * range: >>>> */ >>>> tmp = (maxsec * from) >> 32; >>>> while (tmp) >>>> { tmp >>=1; >>>> sftacc--; >>>> } >>>> /* >>>> * Find the conversion shift/mult pair which has the best >>>> * accuracy and fits the maxsec conversion range: >>>> */ >>>> for (sft = 32; sft > 0; sft--) >>>> { tmp = ((UL_t) to) << sft; >>>> tmp += from / 2; >>>> tmp = tmp / from; >>>> if ((tmp >> sftacc) == 0) >>>> break; >>>> } >>>> *mult = tmp; >>>> *shift = sft; >>>> } >>>> >>>> __thread >>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U; >>>> >>>> static inline __attribute__((always_inline)) >>>> U64_t IA64_s_ns_since_start() >>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) ) >>>> ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift); >>>> register U64_t cycles = IA64_tsc_ticks_since_start(); >>>> register U64_t ns = ((cycles >>>> *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift); >>>> return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns % >>>> NSEC_PER_SEC)&0x3fffffffUL) ); >>>> /* Yes, we are purposefully 
ignoring durations of more than 4.2 >>>> billion seconds here! */ >>>> } >>>> >>>> >>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values >>>> somehow, >>>> then user-space libraries could have more confidence in using 'rdtsc' >>>> or 'rdtscp' >>>> if Linux's current_clocksource is 'tsc'. >>>> >>>> Regards, >>>> Jason >>>> >>>> >>>> >>>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote: >>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote: >>>>> >>>>>> CPUID:15H is available in user-space, returning the integers : ( 7, >>>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so >>>>>> in detect_art() in tsc.c, >>>>> >>>>> By some definition of available. You can feed CPUID random leaf >>>>> numbers >>>>> and >>>>> it will return something, usually the value of the last valid CPUID >>>>> leaf, >>>>> which is 13 on your CPU. A similar CPU model has >>>>> >>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 >>>>> edx=0x00000000 >>>>> >>>>> i.e. 7, 832, 832, 0 >>>>> >>>>> Looks familiar, right? >>>>> >>>>> You can verify that with 'cpuid -1 -r' on your machine. >>>>> >>>>>> Linux does not think ART is enabled, and does not set the synthesized >>>>>> CPUID + >>>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not >>>>>> see this bit set . >>>>> >>>>> Rightfully so. This is a Haswell Core model. >>>>> >>>>>> if an e1000 NIC card had been installed, PTP would not be available. >>>>> >>>>> PTP is independent of the ART kernel feature . ART just provides >>>>> enhanced >>>>> PTP features. You are confusing things here. >>>>> >>>>> The ART feature as the kernel sees it is a hardware extension which >>>>> feeds >>>>> the ART clock to peripherals for timestamping and time correlation >>>>> purposes. The ratio between ART and TSC is described by CPUID leaf >>>>> 0x15 >>>>> so >>>>> the kernel can make use of that correlation, e.g. for enhanced PTP >>>>> accuracy. 
>>>>> >>>>> It's correct, that the NONSTOP_TSC feature depends on the availability >>>>> of >>>>> ART, but that has nothing to do with the feature bit, which solely >>>>> describes the ratio between TSC and the ART frequency which is exposed >>>>> to >>>>> peripherals. That frequency is not necessarily the real ART frequency. >>>>> >>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to >>>>>> be >>>>>> nowhere else in Linux, the code will always think X86_FEATURE_ART is >>>>>> 0 >>>>>> because the CPU will always get a fault reading the MSR since it has >>>>>> never been written. >>>>> >>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is >>>>> really >>>>> wrong. And writing it unconditionally to 0 is not going to happen. >>>>> 4.10 >>>>> has >>>>> new code which utilizes the TSC_ADJUST MSR. >>>>> >>>>>> It would be nice for user-space programs that want to use the TSC >>>>>> with >>>>>> rdtsc / rdtscp instructions, such as the demo program attached to the >>>>>> bug report, >>>>>> could have confidence that Linux is actually generating the results >>>>>> of >>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec) >>>>>> in a predictable way from the TSC by looking at the >>>>>> /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space >>>>>> use of TSC values, so that they can correlate TSC values with linux >>>>>> clock_gettime() values. >>>>> >>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values? >>>>> >>>>> Nothing at all, really. >>>>> >>>>> The kernel makes use of the proper information values already. >>>>> >>>>> The TSC frequency is determined from: >>>>> >>>>> 1) CPUID(0x16) if available >>>>> 2) MSRs if available >>>>> 3) By calibration against a known clock >>>>> >>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* >>>>> values >>>>> are >>>>> correct whether that machine has ART exposed to peripherals or not. 
>>>>> >>>>>> has tsc: 1 constant: 1 >>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1 >>>>> >>>>> And that voodoo math tells us what? That you found a way to correlate >>>>> CPUID(0xd) to the TSC frequency on that machine. >>>>> >>>>> Now I'm curious how you do that on this other machine which returns >>>>> for >>>>> cpuid(15): 1, 1, 1 >>>>> >>>>> You can't because all of this is completely wrong. >>>>> >>>>> Thanks, >>>>> >>>>> tglx >>>>> >>>> >>> >> > ^ permalink raw reply [flat|nested] 17+ messages in thread
>>>>> >>>>>> has tsc: 1 constant: 1 >>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1 >>>>> >>>>> And that voodoo math tells us what? That you found a way to correlate >>>>> CPUID(0xd) to the TSC frequency on that machine. >>>>> >>>>> Now I'm curious how you do that on this other machine which returns >>>>> for >>>>> cpuid(15): 1, 1, 1 >>>>> >>>>> You can't because all of this is completely wrong. >>>>> >>>>> Thanks, >>>>> >>>>> tglx >>>>> >>>> >>> >> > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 2017-02-22 20:15 ` Jason Vas Dias (?) @ 2017-02-22 20:26 ` Jason Vas Dias 2017-02-23 18:05 ` Jason Vas Dias -1 siblings, 1 reply; 17+ messages in thread From: Jason Vas Dias @ 2017-02-22 20:26 UTC (permalink / raw) To: Thomas Gleixner Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin, Prarit Bhargava, x86 [-- Attachment #1: Type: text/plain, Size: 21617 bytes --] OK, last post on this issue today - can anyone explain why, with standard 4.10.0 kernel & no new 'notsc_adjust' option, and the same maths being used, these two runs should display such a wide disparity between clock_gettime(CLOCK_MONOTONIC_RAW,&ts) values ? : $ J/pub/ttsc/ttsc1 max_extended_leaf: 80000008 has tsc: 1 constant: 1 Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.000000641 ns2: 0.000002850 ts3 - ts2: 175 ns1: 0.000000659 ts3 - ts2: 18 ns1: 0.000000643 ts3 - ts2: 18 ns1: 0.000000618 ts3 - ts2: 17 ns1: 0.000000620 ts3 - ts2: 17 ns1: 0.000000616 ts3 - ts2: 18 ns1: 0.000000641 ts3 - ts2: 18 ns1: 0.000000709 ts3 - ts2: 20 ns1: 0.000000763 ts3 - ts2: 20 ns1: 0.000000735 ts3 - ts2: 20 ns1: 0.000000761 t1 - t0: 78200 - ns2: 0.000080824 $ J/pub/ttsc/ttsc1 max_extended_leaf: 80000008 has tsc: 1 constant: 1 Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.000001294 ns2: 0.000005375 ts3 - ts2: 210 ns1: 0.000001418 ts3 - ts2: 23 ns1: 0.000001399 ts3 - ts2: 22 ns1: 0.000001445 ts3 - ts2: 25 ns1: 0.000001321 ts3 - ts2: 20 ns1: 0.000001428 ts3 - ts2: 25 ns1: 0.000001367 ts3 - ts2: 23 ns1: 0.000001425 ts3 - ts2: 23 ns1: 0.000001357 ts3 - ts2: 22 ns1: 0.000001487 ts3 - ts2: 25 ns1: 0.000001377 t1 - t0: 145753 - ns2: 0.000150781 (complete source of test program ttsc1 attached in ttsc.tar $ tar -xpf ttsc.tar $ cd ttsc $ make ). 
On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: > I actually tried adding a 'notsc_adjust' kernel option to disable any > setting or > access to the TSC_ADJUST MSR, but then I see the problems - a big > disparity > in values depending on which CPU the thread is scheduled - and no > improvement in clock_gettime() latency. So I don't think the new > TSC_ADJUST > code in tsc_sync.c itself is the issue - but something added @ 460ns > onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 . > As I don't think fixing the clock_gettime() latency issue is my problem or > even > possible with current clock architecture approach, it is a non-issue. > > But please, can anyone tell me if there are any plans to move the time > infrastructure out of the kernel and into glibc along the lines > outlined > in previous mail - if not, I am going to concentrate on this more radical > overhaul approach for my own systems . > > At least, I think mapping the clocksource information structure itself in > some > kind of sharable page makes sense . Processes could map that page > copy-on-write > so they could start off with all the timing parameters preloaded, then > keep > their copy updated using the rdtscp instruction , or msync() (read-only) > with the kernel's single copy to get the latest time any process has > requested. > All real-time parameters & adjustments could be stored in that page , > & eventually a single copy of the tzdata could be used by both kernel > & user-space. > That is what I am working towards. Any plans to make linux real-time tsc > clock user-friendly ? > > > > On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is >> read or written . It is probably because it genuinely does not >> support any cpuid > 13 , >> or the modern TSC_ADJUST interface . This is probably why my >> clock_gettime() >> latencies are so bad. 
Now I have to develop a patch to disable all access >> to >> TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 . >> I really have an unlucky CPU :-) . >> >> But really, I think this issue goes deeper into the fundamental limits of >> time measurement on Linux : it is never going to be possible to measure >> minimum times with clock_gettime() comparable with those returned by >> rdtscp instruction - the time taken to enter the kernel through the VDSO, >> queue an access to vsyscall_gtod_data via a workqueue, access it & do >> computations & copy value to user-space is NEVER going to be up to the >> job of measuring small real-time durations of the order of 10-20 TSC >> ticks >> . >> >> I think the best way to solve this problem going forward would be to >> store >> the entire vsyscall_gtod_data data structure representing the current >> clocksource >> in a shared page which is memory-mappable (read-only) by user-space . >> I think user-space programs should be able to do something like : >> int fd = >> open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY); >> size_t psz = getpagesize(); >> void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 ); >> msync(gtod,psz,MS_SYNC); >> >> Then they could all read the real-time clock values as they are updated >> in real-time by the kernel, and know exactly how to interpret them . >> >> I also think that all mktime() / gmtime() / localtime() timezone handling >> functionality should be >> moved to user-space, and that the kernel should actually load and link in >> some >> /lib/libtzdata.so >> library, provided by glibc / libc implementations, that is exactly the >> same library >> used by glibc() code to parse tzdata ; tzdata should be loaded at boot >> time >> by the kernel from the same places glibc loads it, and both the kernel >> and >> glibc should use identical mktime(), gmtime(), etc. 
functions to access >> it, >> and >> glibc using code would not need to enter the kernel at all for any >> time-handling >> code. This tzdata-library code be automatically loaded into process >> images >> the >> same way the vdso region is , and the whole system could access only one >> copy of it and the 'gtod.page' in memory. >> >> That's just my two-cents worth, and how I'd like to eventually get >> things working >> on my system. >> >> All the best, Regards, >> Jason >> >> >> >> >> >> >> >> >> >> >> >> >> >> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >>>> RE: >>>>>> 4.10 has new code which utilizes the TSC_ADJUST MSR. >>>> >>>> I just built an unpatched linux v4.10 with tglx's TSC improvements - >>>> much else improved in this kernel (like iwlwifi) - thanks! >>>> >>>> I have attached an updated version of the test program which >>>> doesn't print the bogus "Nominal TSC Frequency" (the previous >>>> version printed it, but equally ignored it). >>>> >>>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by >>>> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! : >>>> >>>> $ uname -r >>>> 4.10.0 >>>> $ ./ttsc1 >>>> max_extended_leaf: 80000008 >>>> has tsc: 1 constant: 1 >>>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz. >>>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599 >>>> ts3 - ts2: 178 ns1: 0.000000592 >>>> ts3 - ts2: 14 ns1: 0.000000577 >>>> ts3 - ts2: 14 ns1: 0.000000651 >>>> ts3 - ts2: 17 ns1: 0.000000625 >>>> ts3 - ts2: 17 ns1: 0.000000677 >>>> ts3 - ts2: 17 ns1: 0.000000626 >>>> ts3 - ts2: 17 ns1: 0.000000627 >>>> ts3 - ts2: 17 ns1: 0.000000627 >>>> ts3 - ts2: 18 ns1: 0.000000655 >>>> ts3 - ts2: 17 ns1: 0.000000631 >>>> t1 - t0: 89067 - ns2: 0.000091411 >>>> >>> >>> >>> Oops, going blind in my old age. These latencies are actually 3 times >>> greater than under 4.8 !! 
>>> >>> Under 4.8, the program printed latencies of @ 140ns for clock_gettime, >>> as >>> shown >>> in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:: >>> >>> ts3 - ts2: 24 ns1: 0.000000162 >>> ts3 - ts2: 17 ns1: 0.000000143 >>> ts3 - ts2: 17 ns1: 0.000000146 >>> ts3 - ts2: 17 ns1: 0.000000149 >>> ts3 - ts2: 17 ns1: 0.000000141 >>> ts3 - ts2: 16 ns1: 0.000000142 >>> >>> now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ >>> 600ns, @ 4 times more than under 4.8 . >>> But I'm glad the TSC_ADJUST problems are fixed. >>> >>> Will programs reading : >>> $ cat /sys/devices/msr/events/tsc >>> event=0x00 >>> read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on >>> the >>> TSC ? >>> >>>> I think this is because under Linux 4.8, the CPU got a fault every >>>> time it read the TSC_ADJUST MSR. >>> >>> maybe it still is! >>> >>> >>>> But user programs wanting to use the TSC and correlate its value to >>>> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above >>>> program still have to dig the TSC frequency value out of the kernel >>>> with objdump - this was really the point of the bug #194609. >>>> >>>> I would still like to investigate exporting 'tsc_khz' & 'mult' + >>>> 'shift' values via sysfs. >>>> >>>> Regards, >>>> Jason. 
>>>> >>>> >>>> >>>> >>>> >>>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >>>>> Thank You for enlightening me - >>>>> >>>>> I was just having a hard time believing that Intel would ship a chip >>>>> that features a monotonic, fixed frequency timestamp counter >>>>> without specifying in either documentation or on-chip or in ACPI what >>>>> precisely that hard-wired frequency is, but I now know that to >>>>> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU >>>>> assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is >>>>> difficult to reconcile with the statement in the SDM : >>>>> 17.16.4 Invariant Time-Keeping >>>>> The invariant TSC is based on the invariant timekeeping hardware >>>>> (called Always Running Timer or ART), that runs at the core >>>>> crystal >>>>> clock >>>>> frequency. The ratio defined by CPUID leaf 15H expresses the >>>>> frequency >>>>> relationship between the ART hardware and TSC. If >>>>> CPUID.15H:EBX[31:0] >>>>> != >>>>> 0 >>>>> and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity >>>>> relationship holds between TSC and the ART hardware: >>>>> TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] ) >>>>> / CPUID.15H:EAX[31:0] + K >>>>> Where 'K' is an offset that can be adjusted by a privileged >>>>> agent*2. >>>>> When ART hardware is reset, both invariant TSC and K are also >>>>> reset. >>>>> >>>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and >>>>> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) >>>>> that >>>>> the "Nominal TSC Frequency" formulae in the manul must apply to all >>>>> CPUs with InvariantTSC . >>>>> >>>>> Do I understand correctly , that since I do have InvariantTSC , the >>>>> TSC_Value is in fact calculated according to the above formula, but >>>>> with >>>>> a "hidden" ART Value, & Core Crystal Clock frequency & its ratio to >>>>> TSC frequency ? 
>>>>> It was obvious this nominal TSC Frequency had nothing to do with the >>>>> actual TSC frequency used by Linux, which is 'tsc_khz' . >>>>> I guess wishful thinking led me to believe CPUID:15h was actually >>>>> supported somehow , because I thought InvariantTSC meant it had ART >>>>> hardware . >>>>> >>>>> I do strongly suggest that Linux exports its calibrated TSC Khz >>>>> somewhere to user >>>>> space . >>>>> >>>>> I think the best long-term solution would be to allow programs to >>>>> somehow read the TSC without invoking >>>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), & >>>>> having to enter the kernel, which incurs an overhead of > 120ns on my >>>>> system >>>>> . >>>>> >>>>> >>>>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and >>>>> 'clocksource->shift' values to /sysfs somehow ? >>>>> >>>>> For instance , only if the 'current_clocksource' is 'tsc', then these >>>>> values could be exported as: >>>>> /sys/devices/system/clocksource/clocksource0/shift >>>>> /sys/devices/system/clocksource/clocksource0/mult >>>>> /sys/devices/system/clocksource/clocksource0/freq >>>>> >>>>> So user-space programs could know that the value returned by >>>>> clock_gettime(CLOCK_MONOTONIC_RAW) >>>>> would be >>>>> { .tv_sec = ( ( rdtsc() * mult ) >> shift ) >> 32, >>>>> , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) >> &~0U >>>>> } >>>>> and that represents ticks of period (1.0 / ( freq * 1000 )) S. >>>>> >>>>> That would save user-space programs from having to know 'tsc_khz' by >>>>> parsing the 'Refined TSC' frequency from log files or by examining the >>>>> running kernel with objdump to obtain this value & figure out 'mult' & >>>>> 'shift' themselves. >>>>> >>>>> And why not a >>>>> /sys/devices/system/clocksource/clocksource0/value >>>>> file that actually prints this ( ( rdtsc() * mult ) >> shift ) >>>>> expression as a long integer? 
>>>>> And perhaps a >>>>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds >>>>> file that actually prints out the number of real-time nano-seconds >>>>> since >>>>> the >>>>> contents of the existing >>>>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch} >>>>> files using the current TSC value? >>>>> To read the rtc0/{date,time} files is already faster than entering the >>>>> kernel to call >>>>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts. >>>>> >>>>> I will work on developing a patch to this effect if no-one else is. >>>>> >>>>> Also, am I right in assuming that the maximum granularity of the >>>>> real-time >>>>> clock >>>>> on my system is 1/64th of a second ? : >>>>> $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq >>>>> 64 >>>>> This is the maximum granularity that can be stored in CMOS , not >>>>> returned by TSC? Couldn't we have something similar that gave an >>>>> accurate idea of TSC frequency and the precise formula applied to TSC >>>>> value to get clock_gettime >>>>> (CLOCK_MONOTONIC_RAW) value ? 
>>>>> >>>>> Regards, >>>>> Jason >>>>> >>>>> >>>>> This code does produce good timestamps with a latency of @20ns >>>>> that correlate well with clock_gettIme(CLOCK_MONOTONIC_RAW,&ts) >>>>> values, but it depends on a global variable that is initialized to >>>>> the 'tsc_khz' value >>>>> computed by running kernel parsed from objdump /proc/kcore output : >>>>> >>>>> static inline __attribute__((always_inline)) >>>>> U64_t >>>>> IA64_tsc_now() >>>>> { if(!( _ia64_invariant_tsc_enabled >>>>> ||(( _cpu0id_fd == -1) && >>>>> IA64_invariant_tsc_is_enabled(NULL,NULL)) >>>>> ) >>>>> ) >>>>> { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant >>>>> TSC enabled.\n"); >>>>> return 0; >>>>> } >>>>> U32_t tsc_hi, tsc_lo; >>>>> register UL_t tsc; >>>>> asm volatile >>>>> ( "rdtscp\n\t" >>>>> "mov %%edx, %0\n\t" >>>>> "mov %%eax, %1\n\t" >>>>> "mov %%ecx, %2\n\t" >>>>> : "=m" (tsc_hi) , >>>>> "=m" (tsc_lo) , >>>>> "=m" (_ia64_tsc_user_cpu) : >>>>> : "%eax","%ecx","%edx" >>>>> ); >>>>> tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo); >>>>> return tsc; >>>>> } >>>>> >>>>> __thread >>>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL; >>>>> >>>>> static inline __attribute__((always_inline)) >>>>> U64_t IA64_tsc_ticks_since_start() >>>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL) >>>>> { _ia64_first_tsc = IA64_tsc_now(); >>>>> return 0; >>>>> } >>>>> return (IA64_tsc_now() - _ia64_first_tsc) ; >>>>> } >>>>> >>>>> static inline __attribute__((always_inline)) >>>>> void >>>>> ia64_tsc_calc_mult_shift >>>>> ( register U32_t *mult, >>>>> register U32_t *shift >>>>> ) >>>>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() >>>>> function: >>>>> * calculates second + nanosecond mult + shift in same way linux >>>>> does. >>>>> * we want to be compatible with what linux returns in struct >>>>> timespec ts after call to >>>>> * clock_gettime(CLOCK_MONOTONIC_RAW, &ts). 
>>>>> */ >>>>> const U32_t scale=1000U; >>>>> register U32_t from= IA64_tsc_khz(); >>>>> register U32_t to = NSEC_PER_SEC / scale; >>>>> register U64_t sec = ( ~0UL / from ) / scale; >>>>> sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1); >>>>> register U64_t maxsec = sec * scale; >>>>> UL_t tmp; >>>>> U32_t sft, sftacc=32; >>>>> /* >>>>> * Calculate the shift factor which is limiting the conversion >>>>> * range: >>>>> */ >>>>> tmp = (maxsec * from) >> 32; >>>>> while (tmp) >>>>> { tmp >>=1; >>>>> sftacc--; >>>>> } >>>>> /* >>>>> * Find the conversion shift/mult pair which has the best >>>>> * accuracy and fits the maxsec conversion range: >>>>> */ >>>>> for (sft = 32; sft > 0; sft--) >>>>> { tmp = ((UL_t) to) << sft; >>>>> tmp += from / 2; >>>>> tmp = tmp / from; >>>>> if ((tmp >> sftacc) == 0) >>>>> break; >>>>> } >>>>> *mult = tmp; >>>>> *shift = sft; >>>>> } >>>>> >>>>> __thread >>>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U; >>>>> >>>>> static inline __attribute__((always_inline)) >>>>> U64_t IA64_s_ns_since_start() >>>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) ) >>>>> ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift); >>>>> register U64_t cycles = IA64_tsc_ticks_since_start(); >>>>> register U64_t ns = ((cycles >>>>> *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift); >>>>> return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns % >>>>> NSEC_PER_SEC)&0x3fffffffUL) ); >>>>> /* Yes, we are purposefully ignoring durations of more than 4.2 >>>>> billion seconds here! */ >>>>> } >>>>> >>>>> >>>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values >>>>> somehow, >>>>> then user-space libraries could have more confidence in using 'rdtsc' >>>>> or 'rdtscp' >>>>> if Linux's current_clocksource is 'tsc'. 
>>>>> >>>>> Regards, >>>>> Jason >>>>> >>>>> >>>>> >>>>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote: >>>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote: >>>>>> >>>>>>> CPUID:15H is available in user-space, returning the integers : ( 7, >>>>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so >>>>>>> in detect_art() in tsc.c, >>>>>> >>>>>> By some definition of available. You can feed CPUID random leaf >>>>>> numbers >>>>>> and >>>>>> it will return something, usually the value of the last valid CPUID >>>>>> leaf, >>>>>> which is 13 on your CPU. A similar CPU model has >>>>>> >>>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 >>>>>> edx=0x00000000 >>>>>> >>>>>> i.e. 7, 832, 832, 0 >>>>>> >>>>>> Looks familiar, right? >>>>>> >>>>>> You can verify that with 'cpuid -1 -r' on your machine. >>>>>> >>>>>>> Linux does not think ART is enabled, and does not set the >>>>>>> synthesized >>>>>>> CPUID + >>>>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not >>>>>>> see this bit set . >>>>>> >>>>>> Rightfully so. This is a Haswell Core model. >>>>>> >>>>>>> if an e1000 NIC card had been installed, PTP would not be available. >>>>>> >>>>>> PTP is independent of the ART kernel feature . ART just provides >>>>>> enhanced >>>>>> PTP features. You are confusing things here. >>>>>> >>>>>> The ART feature as the kernel sees it is a hardware extension which >>>>>> feeds >>>>>> the ART clock to peripherals for timestamping and time correlation >>>>>> purposes. The ratio between ART and TSC is described by CPUID leaf >>>>>> 0x15 >>>>>> so >>>>>> the kernel can make use of that correlation, e.g. for enhanced PTP >>>>>> accuracy. >>>>>> >>>>>> It's correct, that the NONSTOP_TSC feature depends on the >>>>>> availability >>>>>> of >>>>>> ART, but that has nothing to do with the feature bit, which solely >>>>>> describes the ratio between TSC and the ART frequency which is >>>>>> exposed >>>>>> to >>>>>> peripherals. 
That frequency is not necessarily the real ART >>>>>> frequency. >>>>>> >>>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to >>>>>>> be >>>>>>> nowhere else in Linux, the code will always think X86_FEATURE_ART >>>>>>> is >>>>>>> 0 >>>>>>> because the CPU will always get a fault reading the MSR since it has >>>>>>> never been written. >>>>>> >>>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is >>>>>> really >>>>>> wrong. And writing it unconditionally to 0 is not going to happen. >>>>>> 4.10 >>>>>> has >>>>>> new code which utilizes the TSC_ADJUST MSR. >>>>>> >>>>>>> It would be nice for user-space programs that want to use the TSC >>>>>>> with >>>>>>> rdtsc / rdtscp instructions, such as the demo program attached to >>>>>>> the >>>>>>> bug report, >>>>>>> could have confidence that Linux is actually generating the results >>>>>>> of >>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, ×pec) >>>>>>> in a predictable way from the TSC by looking at the >>>>>>> /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space >>>>>>> use of TSC values, so that they can correlate TSC values with linux >>>>>>> clock_gettime() values. >>>>>> >>>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values? >>>>>> >>>>>> Nothing at all, really. >>>>>> >>>>>> The kernel makes use of the proper information values already. >>>>>> >>>>>> The TSC frequency is determined from: >>>>>> >>>>>> 1) CPUID(0x16) if available >>>>>> 2) MSRs if available >>>>>> 3) By calibration against a known clock >>>>>> >>>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* >>>>>> values >>>>>> are >>>>>> correct whether that machine has ART exposed to peripherals or not. >>>>>> >>>>>>> has tsc: 1 constant: 1 >>>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1 >>>>>> >>>>>> And that voodoo math tells us what? That you found a way to correlate >>>>>> CPUID(0xd) to the TSC frequency on that machine. 
>>>>>> >>>>>> Now I'm curious how you do that on this other machine which returns >>>>>> for >>>>>> cpuid(15): 1, 1, 1 >>>>>> >>>>>> You can't because all of this is completely wrong. >>>>>> >>>>>> Thanks, >>>>>> >>>>>> tglx >>>>>> >>>>> >>>> >>> >> > [-- Attachment #2: ttsc.tar --] [-- Type: application/x-tar, Size: 40960 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 2017-02-22 20:26 ` Jason Vas Dias @ 2017-02-23 18:05 ` Jason Vas Dias 0 siblings, 0 replies; 17+ messages in thread From: Jason Vas Dias @ 2017-02-23 18:05 UTC (permalink / raw) To: Thomas Gleixner Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin, Prarit Bhargava, x86 [-- Attachment #1: Type: text/plain, Size: 24167 bytes --] I have found a new source of weirdness with TSC using clock_gettime(CLOCK_MONOTONIC_RAW,&ts) : The vsyscall_gtod_data.mult field changes somewhat between calls to clock_gettime(CLOCK_MONOTONIC_RAW,&ts), so that sometimes an extra (2^24) nanoseconds are added or removed from the value derived from the TSC and stored in 'ts' . This is demonstrated by the output of the test program in the attached ttsc.tar file: $ ./tlgtd it worked! - GTOD: clock:1 mult:5798662 shift:24 synced - mult now: 5798661 What it is doing is finding the address of the 'vsyscall_gtod_data' structure from /proc/kallsyms, and mapping the virtual address to an ELF section offset within /proc/kcore, and reading just the 'vsyscall_gtod_data' structure into user-space memory . Really, this 'mult' value, which is used to return the seconds|nanoseconds value: ( tsc_cycles * mult ) >> shift (where shift is 24 ), should not change from the first time it is initialized . The TSC is meant to be FIXED FREQUENCY, right ? So how could / why should the conversion function from TSC ticks to nanoseconds change ? So now it is doubly difficult for user-space libraries to maintain their RDTSC derived seconds|nanoseconds values to correlate well those returned by the kernel, because they must regularly read the updated 'mult' value used by the kernel . I really don't think the kernel should randomly be deciding to increase / decrease the TSC tick period by 2^24 nanoseconds! Is this a bug or intentional ? 
I am searching for all places where a '[.>]mult.*=' occurs, but this returns rather alot of matches. Please could a future version of linux at least export the 'mult' and 'shift' values for the current clocksource ! Regards, Jason On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: > OK, last post on this issue today - > can anyone explain why, with standard 4.10.0 kernel & no new > 'notsc_adjust' option, and the same maths being used, these two runs > should display > such a wide disparity between clock_gettime(CLOCK_MONOTONIC_RAW,&ts) > values ? : > > $ J/pub/ttsc/ttsc1 > max_extended_leaf: 80000008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. > ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.000000641 ns2: 0.000002850 > ts3 - ts2: 175 ns1: 0.000000659 > ts3 - ts2: 18 ns1: 0.000000643 > ts3 - ts2: 18 ns1: 0.000000618 > ts3 - ts2: 17 ns1: 0.000000620 > ts3 - ts2: 17 ns1: 0.000000616 > ts3 - ts2: 18 ns1: 0.000000641 > ts3 - ts2: 18 ns1: 0.000000709 > ts3 - ts2: 20 ns1: 0.000000763 > ts3 - ts2: 20 ns1: 0.000000735 > ts3 - ts2: 20 ns1: 0.000000761 > t1 - t0: 78200 - ns2: 0.000080824 > $ J/pub/ttsc/ttsc1 > max_extended_leaf: 80000008 > has tsc: 1 constant: 1 > Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1. > ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.000001294 ns2: 0.000005375 > ts3 - ts2: 210 ns1: 0.000001418 > ts3 - ts2: 23 ns1: 0.000001399 > ts3 - ts2: 22 ns1: 0.000001445 > ts3 - ts2: 25 ns1: 0.000001321 > ts3 - ts2: 20 ns1: 0.000001428 > ts3 - ts2: 25 ns1: 0.000001367 > ts3 - ts2: 23 ns1: 0.000001425 > ts3 - ts2: 23 ns1: 0.000001357 > ts3 - ts2: 22 ns1: 0.000001487 > ts3 - ts2: 25 ns1: 0.000001377 > t1 - t0: 145753 - ns2: 0.000150781 > > (complete source of test program ttsc1 attached in ttsc.tar > $ tar -xpf ttsc.tar > $ cd ttsc > $ make > ). 
> > On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >> I actually tried adding a 'notsc_adjust' kernel option to disable any >> setting or >> access to the TSC_ADJUST MSR, but then I see the problems - a big >> disparity >> in values depending on which CPU the thread is scheduled - and no >> improvement in clock_gettime() latency. So I don't think the new >> TSC_ADJUST >> code in ts_sync.c itself is the issue - but something added @ 460ns >> onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 . >> As I don't think fixing the clock_gettime() latency issue is my problem >> or >> even >> possible with current clock architecture approach, it is a non-issue. >> >> But please, can anyone tell me if are there any plans to move the time >> infrastructure out of the kernel and into glibc along the lines >> outlined >> in previous mail - if not, I am going to concentrate on this more radical >> overhaul approach for my own systems . >> >> At least, I think mapping the clocksource information structure itself in >> some >> kind of sharable page makes sense . Processes could map that page >> copy-on-write >> so they could start off with all the timing parameters preloaded, then >> keep >> their copy updated using the rdtscp instruction , or msync() (read-only) >> with the kernel's single copy to get the latest time any process has >> requested. >> All real-time parameters & adjustments could be stored in that page , >> & eventually a single copy of the tzdata could be used by both kernel >> & user-space. >> That is what I am working towards. Any plans to make linux real-time tsc >> clock user-friendly ? >> >> >> >> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote: >>> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is >>> read or written . It is probably because it genuinuely does not >>> support any cpuid > 13 , >>> or the modern TSC_ADJUST interface . 
>>> This is probably why my clock_gettime() latencies are so bad. Now I have
>>> to develop a patch to disable all access to the TSC_ADJUST MSR if
>>> boot_cpu_data.cpuid_level <= 13 .
>>> I really have an unlucky CPU :-) .
>>>
>>> But really, I think this issue goes deeper into the fundamental limits of
>>> time measurement on Linux : it is never going to be possible to measure
>>> minimum times with clock_gettime() comparable with those returned by the
>>> rdtscp instruction - the time taken to enter the kernel through the VDSO,
>>> queue an access to vsyscall_gtod_data via a workqueue, access it & do
>>> computations & copy the value to user-space is NEVER going to be up to
>>> the job of measuring small real-time durations of the order of 10-20 TSC
>>> ticks .
>>>
>>> I think the best way to solve this problem going forward would be to
>>> store the entire vsyscall_gtod_data data structure representing the
>>> current clocksource in a shared page which is memory-mappable
>>> (read-only) by user-space .
>>> I think user-space programs should be able to do something like :
>>>   int fd = open("/sys/devices/system/clocksource/clocksource0/gtod.page", O_RDONLY);
>>>   size_t psz = getpagesize();
>>>   void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 );
>>>   msync(gtod, psz, MS_SYNC);
>>>
>>> Then they could all read the real-time clock values as they are updated
>>> in real-time by the kernel, and know exactly how to interpret them .
>>>
>>> I also think that all mktime() / gmtime() / localtime() timezone
>>> handling functionality should be moved to user-space, and that the
>>> kernel should actually load and link in some /lib/libtzdata.so library,
>>> provided by glibc / libc implementations, that is exactly the same
>>> library used by glibc code to parse tzdata ; tzdata should be loaded at
>>> boot time by the kernel from the same places glibc loads it, and both
>>> the kernel and glibc should use identical mktime(), gmtime(), etc.
>>> functions to access it, and glibc-using code would not need to enter the
>>> kernel at all for any time-handling code. This tzdata-library code could
>>> be automatically loaded into process images the same way the vdso region
>>> is, and the whole system could access only one copy of it and the
>>> 'gtod.page' in memory.
>>>
>>> That's just my two-cents worth, and how I'd like to eventually get
>>> things working on my system.
>>>
>>> All the best, Regards,
>>> Jason
>>>
>>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>>>> RE:
>>>>>>> 4.10 has new code which utilizes the TSC_ADJUST MSR.
>>>>>
>>>>> I just built an unpatched linux v4.10 with tglx's TSC improvements -
>>>>> much else improved in this kernel (like iwlwifi) - thanks!
>>>>>
>>>>> I have attached an updated version of the test program which doesn't
>>>>> print the bogus "Nominal TSC Frequency" (the previous version printed
>>>>> it, but equally ignored it).
>>>>>
>>>>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>>>>> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! :
>>>>>
>>>>> $ uname -r
>>>>> 4.10.0
>>>>> $ ./ttsc1
>>>>> max_extended_leaf: 80000008
>>>>> has tsc: 1 constant: 1
>>>>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>>>>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
>>>>> ts3 - ts2: 178 ns1: 0.000000592
>>>>> ts3 - ts2: 14 ns1: 0.000000577
>>>>> ts3 - ts2: 14 ns1: 0.000000651
>>>>> ts3 - ts2: 17 ns1: 0.000000625
>>>>> ts3 - ts2: 17 ns1: 0.000000677
>>>>> ts3 - ts2: 17 ns1: 0.000000626
>>>>> ts3 - ts2: 17 ns1: 0.000000627
>>>>> ts3 - ts2: 17 ns1: 0.000000627
>>>>> ts3 - ts2: 18 ns1: 0.000000655
>>>>> ts3 - ts2: 17 ns1: 0.000000631
>>>>> t1 - t0: 89067 - ns2: 0.000091411
>>>>>
>>>>
>>>> Oops, going blind in my old age. These latencies are actually about
>>>> 4 times greater than under 4.8 !!
>>>>
>>>> Under 4.8, the program printed latencies of @ 140ns for clock_gettime,
>>>> as shown in bug 194609 as the 'ns1' (timespec_b - timespec_a) value :
>>>>
>>>> ts3 - ts2: 24 ns1: 0.000000162
>>>> ts3 - ts2: 17 ns1: 0.000000143
>>>> ts3 - ts2: 17 ns1: 0.000000146
>>>> ts3 - ts2: 17 ns1: 0.000000149
>>>> ts3 - ts2: 17 ns1: 0.000000141
>>>> ts3 - ts2: 16 ns1: 0.000000142
>>>>
>>>> Now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ 600ns,
>>>> @ 4 times more than under 4.8 .
>>>> But I'm glad the TSC_ADJUST problems are fixed.
>>>>
>>>> Will programs reading :
>>>> $ cat /sys/devices/msr/events/tsc
>>>> event=0x00
>>>> read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on
>>>> the TSC ?
>>>>
>>>>> I think this is because under Linux 4.8, the CPU got a fault every
>>>>> time it read the TSC_ADJUST MSR.
>>>>
>>>> maybe it still is!
>>>>
>>>>> But user programs wanting to use the TSC and correlate its value to
>>>>> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
>>>>> program still have to dig the TSC frequency value out of the kernel
>>>>> with objdump - this was really the point of the bug #194609.
>>>>>
>>>>> I would still like to investigate exporting 'tsc_khz' & 'mult' +
>>>>> 'shift' values via sysfs.
>>>>>
>>>>> Regards,
>>>>> Jason.
>>>>>
>>>>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>>>>> Thank You for enlightening me -
>>>>>>
>>>>>> I was just having a hard time believing that Intel would ship a chip
>>>>>> that features a monotonic, fixed frequency timestamp counter without
>>>>>> specifying in either documentation or on-chip or in ACPI what
>>>>>> precisely that hard-wired frequency is, but I now know that to be
>>>>>> the case for the unfortunate i7-4910MQ - I mean, the CPU asserts
>>>>>> CPUID:80000007[8] ( InvariantTSC ), which is difficult to reconcile
>>>>>> with the statement in the SDM :
>>>>>>     17.16.4 Invariant Time-Keeping
>>>>>>     The invariant TSC is based on the invariant timekeeping hardware
>>>>>>     (called Always Running Timer or ART), that runs at the core
>>>>>>     crystal clock frequency. The ratio defined by CPUID leaf 15H
>>>>>>     expresses the frequency relationship between the ART hardware
>>>>>>     and TSC. If CPUID.15H:EBX[31:0] != 0 and
>>>>>>     CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
>>>>>>     relationship holds between TSC and the ART hardware:
>>>>>>         TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
>>>>>>     Where 'K' is an offset that can be adjusted by a privileged agent.
>>>>>>     When ART hardware is reset, both invariant TSC and K are also reset.
>>>>>>
>>>>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and
>>>>>> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) that
>>>>>> the "Nominal TSC Frequency" formulae in the manual must apply to all
>>>>>> CPUs with InvariantTSC .
>>>>>>
>>>>>> Do I understand correctly, that since I do have InvariantTSC, the
>>>>>> TSC_Value is in fact calculated according to the above formula, but
>>>>>> with a "hidden" ART value, & core crystal clock frequency & its
>>>>>> ratio to TSC frequency ?
>>>>>> It was obvious this nominal TSC Frequency had nothing to do with the
>>>>>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>>>>>> I guess wishful thinking led me to believe CPUID:15h was actually
>>>>>> supported somehow, because I thought InvariantTSC meant it had ART
>>>>>> hardware .
>>>>>>
>>>>>> I do strongly suggest that Linux exports its calibrated TSC KHz
>>>>>> somewhere to user-space .
>>>>>>
>>>>>> I think the best long-term solution would be to allow programs to
>>>>>> somehow read the TSC without invoking
>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts) & having to enter the kernel,
>>>>>> which incurs an overhead of > 120ns on my system .
>>>>>>
>>>>>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
>>>>>> 'clocksource->shift' values to /sysfs somehow ?
>>>>>>
>>>>>> For instance, only if the 'current_clocksource' is 'tsc', then these
>>>>>> values could be exported as:
>>>>>>     /sys/devices/system/clocksource/clocksource0/shift
>>>>>>     /sys/devices/system/clocksource/clocksource0/mult
>>>>>>     /sys/devices/system/clocksource/clocksource0/freq
>>>>>>
>>>>>> So user-space programs could know that the value returned by
>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW) would be
>>>>>>     { .tv_sec  = ( ( rdtsc() * mult ) >> shift ) >> 32,
>>>>>>       .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U
>>>>>>     }
>>>>>> and that represents ticks of period (1.0 / ( freq * 1000 )) S.
>>>>>>
>>>>>> That would save user-space programs from having to know 'tsc_khz' by
>>>>>> parsing the 'Refined TSC' frequency from log files or by examining
>>>>>> the running kernel with objdump to obtain this value & figure out
>>>>>> 'mult' & 'shift' themselves.
>>>>>>
>>>>>> And why not a
>>>>>>     /sys/devices/system/clocksource/clocksource0/value
>>>>>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>>>>>> expression as a long integer?
>>>>>> And perhaps a >>>>>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds >>>>>> file that actually prints out the number of real-time nano-seconds >>>>>> since >>>>>> the >>>>>> contents of the existing >>>>>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch} >>>>>> files using the current TSC value? >>>>>> To read the rtc0/{date,time} files is already faster than entering >>>>>> the >>>>>> kernel to call >>>>>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts. >>>>>> >>>>>> I will work on developing a patch to this effect if no-one else is. >>>>>> >>>>>> Also, am I right in assuming that the maximum granularity of the >>>>>> real-time >>>>>> clock >>>>>> on my system is 1/64th of a second ? : >>>>>> $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq >>>>>> 64 >>>>>> This is the maximum granularity that can be stored in CMOS , not >>>>>> returned by TSC? Couldn't we have something similar that gave an >>>>>> accurate idea of TSC frequency and the precise formula applied to TSC >>>>>> value to get clock_gettime >>>>>> (CLOCK_MONOTONIC_RAW) value ? 
>>>>>>
>>>>>> Regards,
>>>>>> Jason
>>>>>>
>>>>>> This code does produce good timestamps with a latency of @20ns that
>>>>>> correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts) values,
>>>>>> but it depends on a global variable that is initialized to the
>>>>>> 'tsc_khz' value computed by the running kernel, parsed from objdump
>>>>>> /proc/kcore output :
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> U64_t
>>>>>> IA64_tsc_now()
>>>>>> { if(!( _ia64_invariant_tsc_enabled
>>>>>>       ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL))
>>>>>>       )
>>>>>>     )
>>>>>>   { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant TSC enabled.\n",
>>>>>>             __LINE__, __func__);
>>>>>>     return 0;
>>>>>>   }
>>>>>>   U32_t tsc_hi, tsc_lo;
>>>>>>   register UL_t tsc;
>>>>>>   asm volatile
>>>>>>   ( "rdtscp\n\t"
>>>>>>     "mov %%edx, %0\n\t"
>>>>>>     "mov %%eax, %1\n\t"
>>>>>>     "mov %%ecx, %2\n\t"
>>>>>>     : "=m" (tsc_hi) ,
>>>>>>       "=m" (tsc_lo) ,
>>>>>>       "=m" (_ia64_tsc_user_cpu) :
>>>>>>     : "%eax","%ecx","%edx"
>>>>>>   );
>>>>>>   tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>>>>>>   return tsc;
>>>>>> }
>>>>>>
>>>>>> __thread
>>>>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> U64_t IA64_tsc_ticks_since_start()
>>>>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>>>>>   { _ia64_first_tsc = IA64_tsc_now();
>>>>>>     return 0;
>>>>>>   }
>>>>>>   return (IA64_tsc_now() - _ia64_first_tsc) ;
>>>>>> }
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> void
>>>>>> ia64_tsc_calc_mult_shift
>>>>>> ( register U32_t *mult,
>>>>>>   register U32_t *shift
>>>>>> )
>>>>>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() function:
>>>>>>    * calculates second + nanosecond mult + shift in same way linux does.
>>>>>>    * we want to be compatible with what linux returns in struct timespec ts
>>>>>>    * after call to clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>>>>>    */
>>>>>>   const U32_t scale=1000U;
>>>>>>   register U32_t from= IA64_tsc_khz();
>>>>>>   register U32_t to = NSEC_PER_SEC / scale;
>>>>>>   register U64_t sec = ( ~0UL / from ) / scale;
>>>>>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>>>>>   register U64_t maxsec = sec * scale;
>>>>>>   UL_t tmp;
>>>>>>   U32_t sft, sftacc=32;
>>>>>>   /*
>>>>>>    * Calculate the shift factor which is limiting the conversion
>>>>>>    * range:
>>>>>>    */
>>>>>>   tmp = (maxsec * from) >> 32;
>>>>>>   while (tmp)
>>>>>>   { tmp >>= 1;
>>>>>>     sftacc--;
>>>>>>   }
>>>>>>   /*
>>>>>>    * Find the conversion shift/mult pair which has the best
>>>>>>    * accuracy and fits the maxsec conversion range:
>>>>>>    */
>>>>>>   for (sft = 32; sft > 0; sft--)
>>>>>>   { tmp = ((UL_t) to) << sft;
>>>>>>     tmp += from / 2;
>>>>>>     tmp = tmp / from;
>>>>>>     if ((tmp >> sftacc) == 0)
>>>>>>       break;
>>>>>>   }
>>>>>>   *mult = tmp;
>>>>>>   *shift = sft;
>>>>>> }
>>>>>>
>>>>>> __thread
>>>>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> U64_t IA64_s_ns_since_start()
>>>>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>>>>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>>>>>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>>>>>   register U64_t ns = ((cycles *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
>>>>>>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns % NSEC_PER_SEC)&0x3fffffffUL) );
>>>>>>   /* Yes, we are purposefully ignoring durations of more than 4.2
>>>>>>      billion seconds here! */
>>>>>> }
>>>>>>
>>>>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>>>>>> somehow, then user-space libraries could have more confidence in
>>>>>> using 'rdtsc' or 'rdtscp' if Linux's current_clocksource is 'tsc'.
>>>>>>
>>>>>> Regards,
>>>>>> Jason
>>>>>>
>>>>>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>>>>>
>>>>>>>> CPUID:15H is available in user-space, returning the integers :
>>>>>>>> ( 7, 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is
>>>>>>>> 13 , so in detect_art() in tsc.c,
>>>>>>>
>>>>>>> By some definition of available. You can feed CPUID random leaf
>>>>>>> numbers and it will return something, usually the value of the last
>>>>>>> valid CPUID leaf, which is 13 on your CPU. A similar CPU model has
>>>>>>>
>>>>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000
>>>>>>>
>>>>>>> i.e. 7, 832, 832, 0
>>>>>>>
>>>>>>> Looks familiar, right?
>>>>>>>
>>>>>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>>>>>
>>>>>>>> Linux does not think ART is enabled, and does not set the
>>>>>>>> synthesized CPUID + ((3*32)+10) bit, so a program looking at
>>>>>>>> /dev/cpu/0/cpuid would not see this bit set .
>>>>>>>
>>>>>>> Rightfully so. This is a Haswell Core model.
>>>>>>>
>>>>>>>> if an e1000 NIC card had been installed, PTP would not be
>>>>>>>> available.
>>>>>>>
>>>>>>> PTP is independent of the ART kernel feature . ART just provides
>>>>>>> enhanced PTP features. You are confusing things here.
>>>>>>>
>>>>>>> The ART feature as the kernel sees it is a hardware extension which
>>>>>>> feeds the ART clock to peripherals for timestamping and time
>>>>>>> correlation purposes. The ratio between ART and TSC is described by
>>>>>>> CPUID leaf 0x15 so the kernel can make use of that correlation,
>>>>>>> e.g. for enhanced PTP accuracy.
>>>>>>>
>>>>>>> It's correct that the NONSTOP_TSC feature depends on the
>>>>>>> availability of ART, but that has nothing to do with the feature
>>>>>>> bit, which solely describes the ratio between TSC and the ART
>>>>>>> frequency which is exposed to peripherals. That frequency is not
>>>>>>> necessarily the real ART frequency.
>>>>>>>
>>>>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems
>>>>>>>> to be nowhere else in Linux, the code will always think
>>>>>>>> X86_FEATURE_ART is 0 because the CPU will always get a fault
>>>>>>>> reading the MSR since it has never been written.
>>>>>>>
>>>>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>>>>>> really wrong. And writing it unconditionally to 0 is not going to
>>>>>>> happen. 4.10 has new code which utilizes the TSC_ADJUST MSR.
>>>>>>>
>>>>>>>> It would be nice for user-space programs that want to use the TSC
>>>>>>>> with rdtsc / rdtscp instructions, such as the demo program
>>>>>>>> attached to the bug report, to have confidence that Linux is
>>>>>>>> actually generating the results of
>>>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec) in a predictable way
>>>>>>>> from the TSC by looking at the /dev/cpu/0/cpuid[bit((3*32)+10)]
>>>>>>>> value before enabling user-space use of TSC values, so that they
>>>>>>>> can correlate TSC values with linux clock_gettime() values.
>>>>>>>
>>>>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>>>>>
>>>>>>> Nothing at all, really.
>>>>>>>
>>>>>>> The kernel makes use of the proper information values already.
>>>>>>>
>>>>>>> The TSC frequency is determined from:
>>>>>>>
>>>>>>> 1) CPUID(0x16) if available
>>>>>>> 2) MSRs if available
>>>>>>> 3) By calibration against a known clock
>>>>>>>
>>>>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_*
>>>>>>> values are correct whether that machine has ART exposed to
>>>>>>> peripherals or not.
>>>>>>>
>>>>>>>> has tsc: 1 constant: 1
>>>>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>>>>>
>>>>>>> And that voodoo math tells us what? That you found a way to
>>>>>>> correlate CPUID(0xd) to the TSC frequency on that machine.
>>>>>>>
>>>>>>> Now I'm curious how you do that on this other machine which returns
>>>>>>> for cpuid(15): 1, 1, 1
>>>>>>>
>>>>>>> You can't because all of this is completely wrong.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> tglx

[-- Attachment #2: ttsc.tar --]
[-- Type: application/x-tar, Size: 40960 bytes --]