* [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
@ 2017-02-19  0:31 Jason Vas Dias
  2017-02-19 15:35 ` Jason Vas Dias
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-19  0:31 UTC (permalink / raw)
  To: kernel-janitors, linux-kernel, prarit

[-- Attachment #1: Type: text/plain, Size: 3499 bytes --]

I originally reported this issue on bugzilla.kernel.org as bug #194609:
https://bugzilla.kernel.org/show_bug.cgi?id=194609
but it was not posted to the list.

My CPU reports its 'model name' as
"Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz",
has 4 physical cores / 8 hyperthreads with a frequency scalable from 800000
to 3900000 KHz (/sys/devices/system/cpu/cpu0/cpufreq/scaling_{min,max}_freq),
and flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase
tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm
ida arat pln pts

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$

CPUID:15H is available in user-space, returning the integers (7,
832, 832) in EAX:EBX:ECX, yet boot_cpu_data.cpuid_level is 13, so
in detect_art() in tsc.c
Linux does not think ART is enabled, and does not set the synthesized
CPUID bit ((3*32)+10), so a program looking at /dev/cpu/0/cpuid would
not see this bit set.
If an e1000 NIC had been installed, PTP would not be available.
Also, if the TSC_ADJUST MSR has not yet been written (and it seems to be
written nowhere else in Linux), the code will always think X86_FEATURE_ART
is 0, because the CPU will always fault reading the MSR since it has
never been written.
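The max-basic-leaf check that detect_art() performs can be reproduced from
user-space. A minimal sketch (x86-only, GCC inline asm; function names are
mine, not from the kernel or the demo program) of distinguishing a genuinely
supported leaf 0x15 from the stale values an out-of-range leaf returns:

```c
#include <stdint.h>
#include <stdio.h>

/* Execute CPUID for the given leaf (subleaf 0).  x86-only. */
static void cpuid_leaf(uint32_t leaf, uint32_t *a, uint32_t *b,
                       uint32_t *c, uint32_t *d)
{
    __asm__ volatile("cpuid"
                     : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                     : "a"(leaf), "c"(0));
}

/* Return 1 and fill num/den if CPUID leaf 0x15 is architecturally valid
 * (max basic leaf >= 0x15) and enumerates a non-zero ratio; 0 otherwise.
 * On a CPU whose max basic leaf is below 0x15, asking for leaf 0x15 just
 * echoes the last valid leaf, which is the trap described above. */
static int art_tsc_ratio(uint32_t *num, uint32_t *den)
{
    uint32_t a, b, c, d;

    cpuid_leaf(0, &a, &b, &c, &d);   /* EAX = highest supported basic leaf */
    if (a < 0x15)
        return 0;
    cpuid_leaf(0x15, &a, &b, &c, &d);
    if (a == 0 || b == 0)            /* ratio not enumerated */
        return 0;
    *den = a;                        /* CPUID.15H:EAX - denominator */
    *num = b;                        /* CPUID.15H:EBX - numerator  */
    return 1;
}
```

On the i7-4910MQ above this returns 0, because the maximum basic leaf is
below 0x15, no matter what the raw instruction appears to return for 0x15.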

So the attached patch makes tsc.c set X86_FEATURE_ART correctly,
and set TSC_ADJUST to 0 if the rdmsr gets an error.
Please consider applying it to a future Linux version.

It would be nice if user-space programs that want to use the TSC with
the rdtsc / rdtscp instructions, such as the demo program attached to
the bug report, could have confidence that Linux is actually generating
the results of clock_gettime(CLOCK_MONOTONIC_RAW, &timespec) in a
predictable way from the TSC, by looking at the /dev/cpu/0/cpuid bit
((3*32)+10) before enabling user-space use of TSC values, so that they
can correlate TSC values with Linux clock_gettime() values.
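The synthesized X86_FEATURE_ART bit is also what appears as the "art" entry
on the flags line of /proc/cpuinfo (as in the flag listing above), so a
user-space check need not parse the raw cpuid device at all. A sketch (the
helper name is mine):

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if 'flag' appears on the flags line of /proc/cpuinfo,
 * 0 if not, -1 if /proc/cpuinfo cannot be read. */
static int cpu_has_flag(const char *flag)
{
    char line[4096], want[64];
    FILE *f = fopen("/proc/cpuinfo", "r");
    int found = 0;

    if (!f)
        return -1;
    snprintf(want, sizeof want, " %s", flag);   /* flags are space-separated */
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) != 0)
            continue;
        for (char *p = line; (p = strstr(p, want)) != NULL; p++) {
            char end = p[strlen(want)];         /* must end the token */
            if (end == ' ' || end == '\n' || end == '\0') {
                found = 1;
                break;
            }
        }
        break;                                  /* first flags line suffices */
    }
    fclose(f);
    return found;
}
```

cpu_has_flag("art") returns 1 only once the kernel has actually set
X86_FEATURE_ART, which is exactly what the patch below is trying to force.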

The patch applies to the Linux kernel v4.8 and v4.9.10 git tags; the
kernels build and run, and the demo program produces results like:
 $ ./ttsc1
has tsc: 1 constant: 1
832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
Hooray! TSC is enabled with KHz: 2893300
ts2 - ts1: 261 ts3 - ts2: 211 ns1: 0.000000146 ns2: 0.000001629
ts3 - ts2: 27 ns1: 0.000000168
ts3 - ts2: 20 ns1: 0.000000147
ts3 - ts2: 14 ns1: 0.000000152
ts3 - ts2: 15 ns1: 0.000000151
ts3 - ts2: 15 ns1: 0.000000153
ts3 - ts2: 15 ns1: 0.000000150
ts3 - ts2: 20 ns1: 0.000000148
ts3 - ts2: 19 ns1: 0.000000164
ts3 - ts2: 19 ns1: 0.000000164
ts3 - ts2: 19 ns1: 0.000000160
t1 - t0: 52901 - ns2: 0.000053951

The 'ts3 - ts2' value is the number of nanoseconds measured between
successive 'rdtscp' calls; the 'ns1' value is the number of nanoseconds
(shown as decimal seconds) measured by
  clock_gettime(CLOCK_MONOTONIC_RAW, &ts2) -
  clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)
when casting each {ts.tv_sec, ts.tv_nsec} to a 128-bit integer.
It shows a user-space program can read the TSC with a latency of about
20ns, but can only measure times >= about 140ns using Linux
clock_gettime() on this CPU.

[-- Attachment #2: x86_kernel_tsc-bz194609.patch --]
[-- Type: application/octet-stream, Size: 2993 bytes --]

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 46b2f41..f76cca8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1030,6 +1030,7 @@ core_initcall(cpufreq_register_tsc_scaling);
 #endif /* CONFIG_CPU_FREQ */
 
 #define ART_CPUID_LEAF (0x15)
+#define MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART (0x80000008)
 #define ART_MIN_DENOMINATOR (1)
 
 
@@ -1038,24 +1039,43 @@ core_initcall(cpufreq_register_tsc_scaling);
  */
 static void detect_art(void)
 {
-	unsigned int unused[2];
-
-	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
-		return;
-
-	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
-	      &art_to_tsc_numerator, unused, unused+1);
-
+	unsigned int v[2];
+
+	if(boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
+        {
+                if(boot_cpu_data.extended_cpuid_level >= MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART)
+                {
+                        pr_info("Would normally not use ART - cpuid_level:%d < %d - but extended_cpuid_level is: %x, so probing for ART support.\n",
+                        boot_cpu_data.cpuid_level, ART_CPUID_LEAF, boot_cpu_data.extended_cpuid_level);
+                }else
+                        return;
+        }                
+
+        cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
+              &art_to_tsc_numerator, v, v+1);
+        
 	/* Don't enable ART in a VM, non-stop TSC required */
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
-	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
-	    art_to_tsc_denominator < ART_MIN_DENOMINATOR)
-		return;
-
-	if (rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))
-		return;
-
+           !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+	   art_to_tsc_denominator < ART_MIN_DENOMINATOR)
+        {
+                pr_info("Not using Intel ART for TSC - HYPERVISOR:%d  NO NONSTOP_TSC:%d  bad TSC/Crystal ratio denominator: %d.", boot_cpu_has(X86_FEATURE_HYPERVISOR), !boot_cpu_has(X86_FEATURE_NONSTOP_TSC), art_to_tsc_denominator );
+                return;
+        }
+	if (  (v[0]=rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))!=0) /* will get fault on first read if nothing written yet */
+        {
+                if((v[1]=wrmsrl_safe(MSR_IA32_TSC_ADJUST, 0))!=0)
+                {
+                        pr_info("Not using Intel ART for TSC - failed to initialize TSC_ADJUST: %d %d.\n", v[0], v[1] );
+                        return;
+                }else
+                {
+                        art_to_tsc_offset = 0; /* perhaps initialize to -1 * current rdtsc value ? */
+                        pr_info("Using Intel ART for TSC - TSC_ADJUST initialized to %llu.\n",art_to_tsc_offset);
+                }
+        }
 	/* Make this sticky over multiple CPU init calls */
+        pr_info("Using Intel Always Running Timer (ART) feature %x for TSC on all CPUs - TSC/CCC: %d/%d offset: %llu.\n", X86_FEATURE_ART, art_to_tsc_numerator, art_to_tsc_denominator, art_to_tsc_offset );
 	setup_force_cpu_cap(X86_FEATURE_ART);
 }
 

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-19  0:31 [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 Jason Vas Dias
@ 2017-02-19 15:35 ` Jason Vas Dias
  2017-02-20 21:49     ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-19 15:35 UTC (permalink / raw)
  To: kernel-janitors, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Prarit Bhargava, x86

[-- Attachment #1: Type: text/plain, Size: 7467 bytes --]

Patch to make tsc.c set X86_FEATURE_ART and set up the TSC_ADJUST MSR
correctly on my "i7-4910MQ" CPU, which reports
( boot_cpu_data.cpuid_level == 13 &&
  boot_cpu_data.extended_cpuid_level == 0x80000008
), so the code didn't think it supported CPUID:15H, but it does.

Patch:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 46b2f41..f76cca8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1030,6 +1030,7 @@ core_initcall(cpufreq_register_tsc_scaling);
 #endif /* CONFIG_CPU_FREQ */

 #define ART_CPUID_LEAF (0x15)
+#define MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART (0x80000008)
 #define ART_MIN_DENOMINATOR (1)


@@ -1038,24 +1039,43 @@ core_initcall(cpufreq_register_tsc_scaling);
  */
 static void detect_art(void)
 {
-	unsigned int unused[2];
-
-	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
-		return;
-
-	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
-	      &art_to_tsc_numerator, unused, unused+1);
-
+	unsigned int v[2];
+
+	if(boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
+        {
+                if(boot_cpu_data.extended_cpuid_level >= MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART)
+                {
+                        pr_info("Would normally not use ART - cpuid_level:%d < %d - but extended_cpuid_level is: %x, so probing for ART support.\n",
+                        boot_cpu_data.cpuid_level, ART_CPUID_LEAF, boot_cpu_data.extended_cpuid_level);
+                }else
+                        return;
+        }
+
+        cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
+              &art_to_tsc_numerator, v, v+1);
+
 	/* Don't enable ART in a VM, non-stop TSC required */
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
-	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
-	    art_to_tsc_denominator < ART_MIN_DENOMINATOR)
-		return;
-
-	if (rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))
-		return;
-
+           !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+	   art_to_tsc_denominator < ART_MIN_DENOMINATOR)
+        {
+                pr_info("Not using Intel ART for TSC - HYPERVISOR:%d  NO NONSTOP_TSC:%d  bad TSC/Crystal ratio denominator: %d.", boot_cpu_has(X86_FEATURE_HYPERVISOR), !boot_cpu_has(X86_FEATURE_NONSTOP_TSC), art_to_tsc_denominator );
+                return;
+        }
+	if (  (v[0]=rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))!=0) /* will get fault on first read if nothing written yet */
+        {
+                if((v[1]=wrmsrl_safe(MSR_IA32_TSC_ADJUST, 0))!=0)
+                {
+                        pr_info("Not using Intel ART for TSC - failed to initialize TSC_ADJUST: %d %d.\n", v[0], v[1] );
+                        return;
+                }else
+                {
+                        art_to_tsc_offset = 0; /* perhaps initialize to -1 * current rdtsc value ? */
+                        pr_info("Using Intel ART for TSC - TSC_ADJUST initialized to %llu.\n",art_to_tsc_offset);
+                }
+        }
 	/* Make this sticky over multiple CPU init calls */
+        pr_info("Using Intel Always Running Timer (ART) feature %x for TSC on all CPUs - TSC/CCC: %d/%d offset: %llu.\n", X86_FEATURE_ART, art_to_tsc_numerator, art_to_tsc_denominator, art_to_tsc_offset );
 	setup_force_cpu_cap(X86_FEATURE_ART);
 }

I originally reported this issue on bugzilla.kernel.org as bug #194609:
https://bugzilla.kernel.org/show_bug.cgi?id=194609
but it was not posted to the list. I then posted it to the list, but
Julia Lawall <julia.lawall@lip6.fr> kindly suggested I re-post with the
patch inline and include extra recipients, including the last person to
modify tsc.c (Prarit), so I am doing so.

My CPU reports its 'model name' as
"Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz",
has 4 physical cores / 8 hyperthreads with a frequency scalable from 800000
to 3900000 KHz (/sys/devices/system/cpu/cpu0/cpufreq/scaling_{min,max}_freq),
and flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase
tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm
ida arat pln pts

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
$

CPUID:15H is available in user-space, returning the integers (7,
832, 832) in EAX:EBX:ECX, yet boot_cpu_data.cpuid_level is 13, so
in detect_art() in tsc.c
Linux does not think ART is enabled, and does not set the synthesized
CPUID bit ((3*32)+10), so a program looking at /dev/cpu/0/cpuid would
not see this bit set.
If an e1000 NIC had been installed, PTP would not be available.
Also, if the TSC_ADJUST MSR has not yet been written (and it seems to be
written nowhere else in Linux), the code will always think X86_FEATURE_ART
is 0, because the CPU will always fault reading the MSR since it has
never been written.

So the attached patch makes tsc.c set X86_FEATURE_ART correctly,
and set TSC_ADJUST to 0 if the rdmsr gets an error.
Please consider applying it to a future Linux version.

It would be nice if user-space programs that want to use the TSC with
the rdtsc / rdtscp instructions, such as the demo program attached to
the bug report, could have confidence that Linux is actually generating
the results of clock_gettime(CLOCK_MONOTONIC_RAW, &timespec) in a
predictable way from the TSC, by looking at the /dev/cpu/0/cpuid bit
((3*32)+10) before enabling user-space use of TSC values, so that they
can correlate TSC values with Linux clock_gettime() values.

The patch applies to the Linux kernel v4.8 and v4.9.10 git tags; the
kernels build and run, and the demo program produces results like:
 $ ./ttsc1
has tsc: 1 constant: 1
832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
Hooray! TSC is enabled with KHz: 2893300
ts2 - ts1: 261 ts3 - ts2: 211 ns1: 0.000000146 ns2: 0.000001629
ts3 - ts2: 27 ns1: 0.000000168
ts3 - ts2: 20 ns1: 0.000000147
ts3 - ts2: 14 ns1: 0.000000152
ts3 - ts2: 15 ns1: 0.000000151
ts3 - ts2: 15 ns1: 0.000000153
ts3 - ts2: 15 ns1: 0.000000150
ts3 - ts2: 20 ns1: 0.000000148
ts3 - ts2: 19 ns1: 0.000000164
ts3 - ts2: 19 ns1: 0.000000164
ts3 - ts2: 19 ns1: 0.000000160
t1 - t0: 52901 - ns2: 0.000053951

The 'ts3 - ts2' value is the number of nanoseconds measured between
successive 'rdtscp' calls; the 'ns1' value is the number of nanoseconds
(shown as decimal seconds) measured by
  clock_gettime(CLOCK_MONOTONIC_RAW, &ts2) -
  clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)
when casting each {ts.tv_sec, ts.tv_nsec} to a 128-bit integer.
It shows a user-space program can read the TSC with a latency of about
20ns, but can only measure times >= about 140ns using Linux
clock_gettime() on this CPU.

Please make Linux provide some help to programs that want to use the
TSC from user-space by applying the patch, so they can determine with
confidence that Linux is supplying TSC values with a predictable
conversion function in response to clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
system calls. It would also be nice if it actually exported the refined
TSC calibrated frequency (tsc_khz) in sysfs, but that's another story.

Best Regards,
Jason

[-- Attachment #2: x86_kernel_tsc-bz194609.patch --]
[-- Type: application/octet-stream, Size: 2993 bytes --]

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 46b2f41..f76cca8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1030,6 +1030,7 @@ core_initcall(cpufreq_register_tsc_scaling);
 #endif /* CONFIG_CPU_FREQ */
 
 #define ART_CPUID_LEAF (0x15)
+#define MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART (0x80000008)
 #define ART_MIN_DENOMINATOR (1)
 
 
@@ -1038,24 +1039,43 @@ core_initcall(cpufreq_register_tsc_scaling);
  */
 static void detect_art(void)
 {
-	unsigned int unused[2];
-
-	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
-		return;
-
-	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
-	      &art_to_tsc_numerator, unused, unused+1);
-
+	unsigned int v[2];
+
+	if(boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
+        {
+                if(boot_cpu_data.extended_cpuid_level >= MINIMUM_CPUID_EXTENDED_LEAF_THAT_MUST_HAVE_ART)
+                {
+                        pr_info("Would normally not use ART - cpuid_level:%d < %d - but extended_cpuid_level is: %x, so probing for ART support.\n",
+                        boot_cpu_data.cpuid_level, ART_CPUID_LEAF, boot_cpu_data.extended_cpuid_level);
+                }else
+                        return;
+        }                
+
+        cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
+              &art_to_tsc_numerator, v, v+1);
+        
 	/* Don't enable ART in a VM, non-stop TSC required */
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
-	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
-	    art_to_tsc_denominator < ART_MIN_DENOMINATOR)
-		return;
-
-	if (rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))
-		return;
-
+           !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+	   art_to_tsc_denominator < ART_MIN_DENOMINATOR)
+        {
+                pr_info("Not using Intel ART for TSC - HYPERVISOR:%d  NO NONSTOP_TSC:%d  bad TSC/Crystal ratio denominator: %d.", boot_cpu_has(X86_FEATURE_HYPERVISOR), !boot_cpu_has(X86_FEATURE_NONSTOP_TSC), art_to_tsc_denominator );
+                return;
+        }
+	if (  (v[0]=rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset))!=0) /* will get fault on first read if nothing written yet */
+        {
+                if((v[1]=wrmsrl_safe(MSR_IA32_TSC_ADJUST, 0))!=0)
+                {
+                        pr_info("Not using Intel ART for TSC - failed to initialize TSC_ADJUST: %d %d.\n", v[0], v[1] );
+                        return;
+                }else
+                {
+                        art_to_tsc_offset = 0; /* perhaps initialize to -1 * current rdtsc value ? */
+                        pr_info("Using Intel ART for TSC - TSC_ADJUST initialized to %llu.\n",art_to_tsc_offset);
+                }
+        }
 	/* Make this sticky over multiple CPU init calls */
+        pr_info("Using Intel Always Running Timer (ART) feature %x for TSC on all CPUs - TSC/CCC: %d/%d offset: %llu.\n", X86_FEATURE_ART, art_to_tsc_numerator, art_to_tsc_denominator, art_to_tsc_offset );
 	setup_force_cpu_cap(X86_FEATURE_ART);
 }
 


* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-19 15:35 ` Jason Vas Dias
@ 2017-02-20 21:49     ` Thomas Gleixner
  0 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2017-02-20 21:49 UTC (permalink / raw)
  To: Jason Vas Dias
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

On Sun, 19 Feb 2017, Jason Vas Dias wrote:

> CPUID:15H is available in user-space, returning the integers : ( 7,
> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
> in detect_art() in tsc.c,

By some definition of available. You can feed CPUID random leaf numbers and
it will return something, usually the value of the last valid CPUID leaf,
which is 13 on your CPU. A similar CPU model has

0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000

i.e. 7, 832, 832, 0

Looks familiar, right?

You can verify that with 'cpuid -1 -r' on your machine.

> Linux does not think ART is enabled, and does not set the synthesized CPUID +
> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
> see this bit set .

Rightfully so. This is a Haswell Core model.

> if an e1000 NIC card had been installed, PTP would not be available.

PTP is independent of the ART kernel feature. ART just provides enhanced
PTP features. You are confusing things here.

The ART feature as the kernel sees it is a hardware extension which feeds
the ART clock to peripherals for timestamping and time correlation
purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 so
the kernel can make use of that correlation, e.g. for enhanced PTP
accuracy.

It's correct that the NONSTOP_TSC feature depends on the availability of
ART, but that has nothing to do with the feature bit, which solely
describes the ratio between TSC and the ART frequency which is exposed to
peripherals. That frequency is not necessarily the real ART frequency.

> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be
> nowhere else in Linux,  the code will always think X86_FEATURE_ART is 0
> because the CPU will always get a fault reading the MSR since it has
> never been written.

Huch? If an access to the TSC ADJUST MSR faults, then something is really
wrong. And writing it unconditionally to 0 is not going to happen. 4.10 has
new code which utilizes the TSC_ADJUST MSR.

> It would be nice for user-space programs that want to use the TSC with
> rdtsc / rdtscp instructions, such as the demo program attached to the
> bug report,
> could have confidence that Linux is actually generating the results of
> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
> in a predictable way from the TSC by looking at the
>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
> use of TSC values, so that they can correlate TSC values with linux
> clock_gettime() values.

What has ART to do with correct CLOCK_MONOTONIC_RAW values?

Nothing at all, really.

The kernel makes use of the proper information values already.

The TSC frequency is determined from:

    1) CPUID(0x16) if available
    2) MSRs if available
    3) By calibration against a known clock

If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values are
correct whether that machine has ART exposed to peripherals or not.

> has tsc: 1 constant: 1
> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1

And that voodoo math tells us what? That you found a way to correlate
CPUID(0xd) to the TSC frequency on that machine.

Now I'm curious how you do that on this other machine which returns for
cpuid(15): 1, 1, 1

You can't because all of this is completely wrong.

Thanks,

	tglx



* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-20 21:49     ` Thomas Gleixner
@ 2017-02-21 23:39       ` Jason Vas Dias
  -1 siblings, 0 replies; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-21 23:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

Thank you for enlightening me.

I was just having a hard time believing that Intel would ship a chip
that features a monotonic, fixed-frequency timestamp counter without
specifying, in documentation, on-chip, or in ACPI, precisely what that
hard-wired frequency is, but I now know that to be the case for the
unfortunate i7-4910MQ. The CPU does assert CPUID:80000007[8]
(InvariantTSC), which is difficult to reconcile with this statement in
the SDM:
  17.16.4  Invariant Time-Keeping
    The invariant TSC is based on the invariant timekeeping hardware
    (called Always Running Timer or ART), that runs at the core crystal clock
    frequency. The ratio defined by CPUID leaf 15H expresses the frequency
    relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] != 0
    and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
    relationship holds between TSC and the ART hardware:
    TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )
                         / CPUID.15H:EAX[31:0] + K
    Where 'K' is an offset that can be adjusted by a privileged agent*2.
     When ART hardware is reset, both invariant TSC and K are also reset.
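The SDM relationship quoted above, written out as a helper (a sketch; the
leaf values and offset K are taken as plain inputs with illustrative
numbers, not values read from real hardware):

```c
#include <stdint.h>

/* TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
 * per SDM 17.16.4.  eax_den (the denominator) must be non-zero for the
 * leaf to be meaningful. */
static uint64_t art_to_tsc_value(uint64_t art, uint32_t eax_den,
                                 uint32_t ebx_num, int64_t k)
{
    return art * ebx_num / eax_den + (uint64_t)k;
}
```

For example, with an (illustrative) ratio of 832/7 and K = 0, an ART value
of 7 maps to a TSC value of 832.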

So I'm just trying to figure out what CPUID.15H:EBX[31:0] and
CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) that
the "Nominal TSC Frequency" formula in the manual must apply to all
CPUs with InvariantTSC.

Do I understand correctly that, since I do have InvariantTSC, the
TSC value is in fact calculated according to the above formula, but with
a "hidden" ART value, core crystal clock frequency, and ratio to the
TSC frequency? It was obvious this nominal TSC frequency had nothing to
do with the actual TSC frequency used by Linux, which is 'tsc_khz'.
I guess wishful thinking led me to believe CPUID:15H was actually
supported somehow, because I thought InvariantTSC meant the CPU had ART
hardware.

I do strongly suggest that Linux export its calibrated TSC KHz to
user-space somewhere.

I think the best long-term solution would be to allow programs to read
the TSC without invoking clock_gettime(CLOCK_MONOTONIC_RAW, &ts) and
having to enter the kernel, which incurs an overhead of > 120ns on my
system.


Couldn't Linux export its 'tsc_khz' and/or 'clocksource->mult' and
'clocksource->shift' values to sysfs somehow?

For instance, only if the 'current_clocksource' is 'tsc', these values
could be exported as:
/sys/devices/system/clocksource/clocksource0/shift
/sys/devices/system/clocksource/clocksource0/mult
/sys/devices/system/clocksource/clocksource0/freq

So user-space programs could know that, with
    ns = ( rdtsc() * mult ) >> shift,
the value returned by clock_gettime(CLOCK_MONOTONIC_RAW) would be
    { .tv_sec  = ns / NSEC_PER_SEC,
      .tv_nsec = ns % NSEC_PER_SEC
    }
and that it represents ticks of period (1.0 / ( freq * 1000 )) seconds.

That would save user-space programs from having to learn 'tsc_khz' by
parsing the 'Refined TSC' frequency from log files or by examining the
running kernel with objdump, and from figuring out 'mult' and 'shift'
themselves.
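If 'mult' and 'shift' were exported as proposed, the user-space side of the
conversion might look like this (a sketch; these sysfs files do not exist
today, and the mult/shift values used in testing it are illustrative):

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ull

/* Convert a raw cycle count to a timespec the same way the kernel's
 * clocksource math does: ns = (cycles * mult) >> shift, then split
 * into seconds and nanoseconds.  The 128-bit intermediate (a GCC
 * extension) avoids overflow for large cycle counts. */
static struct timespec cycles_to_timespec(uint64_t cycles,
                                          uint32_t mult, uint32_t shift)
{
    uint64_t ns = (uint64_t)(((unsigned __int128)cycles * mult) >> shift);
    struct timespec ts = {
        .tv_sec  = (time_t)(ns / NSEC_PER_SEC),
        .tv_nsec = (long)(ns % NSEC_PER_SEC),
    };
    return ts;
}
```

With mult = 2^shift the conversion is the identity on nanoseconds, so
1500000000 cycles comes out as { .tv_sec = 1, .tv_nsec = 500000000 }.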

And why not a
  /sys/devices/system/clocksource/clocksource0/value
file that actually prints this ( ( rdtsc() * mult ) >> shift )
expression as a long integer?
And perhaps a
  /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
file that actually prints out the number of real-time nanoseconds since
the contents of the existing
  /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
files, using the current TSC value?
Reading the rtc0/{date,time} files is already faster, for scripts, than
entering the kernel to call clock_gettime(CLOCK_REALTIME, &ts) and
converting to an integer.

I will work on developing a patch to this effect if no-one else is.

Also, am I right in assuming that the maximum granularity of the
real-time clock on my system is 1/64th of a second?
 $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
 64
Is this the maximum granularity that can be stored in CMOS, not
returned by the TSC? Couldn't we have something similar that gave an
accurate idea of the TSC frequency and the precise formula applied to
the TSC value to get the clock_gettime(CLOCK_MONOTONIC_RAW) value?

Regards,
Jason


This code does produce good timestamps with a latency of about 20ns
that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW, &ts)
values, but it depends on a global variable that is initialized to the
'tsc_khz' value computed by the running kernel, parsed from objdump
/proc/kcore output:

static inline __attribute__((always_inline))
U64_t
IA64_tsc_now()
{ if(!(    _ia64_invariant_tsc_enabled
      ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL))
      )
    )
  { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant TSC enabled.\n",
            __LINE__, __func__);
    return 0;
  }
  U32_t tsc_hi, tsc_lo;
  register UL_t tsc;
  asm volatile
  ( "rdtscp\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "mov %%ecx, %2\n\t"
  : "=m" (tsc_hi) ,
    "=m" (tsc_lo) ,
    "=m" (_ia64_tsc_user_cpu) :
  : "%eax","%ecx","%edx"
  );
  tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
  return tsc;
}

__thread
U64_t _ia64_first_tsc = 0xffffffffffffffffUL;

static inline __attribute__((always_inline))
U64_t IA64_tsc_ticks_since_start()
{ if(_ia64_first_tsc == 0xffffffffffffffffUL)
  { _ia64_first_tsc = IA64_tsc_now();
    return 0;
  }
  return (IA64_tsc_now() - _ia64_first_tsc) ;
}

static inline __attribute__((always_inline))
void
ia64_tsc_calc_mult_shift
( register U32_t *mult,
  register U32_t *shift
)
{ /* Paraphrases Linux clocksource.c's clocks_calc_mult_shift() function:
   * calculates second + nanosecond mult + shift the same way Linux does.
   * We want to be compatible with what Linux returns in struct timespec
   * after a call to clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
   */
  const U32_t scale=1000U;
  register U32_t from= IA64_tsc_khz();
  register U32_t to  = NSEC_PER_SEC / scale;
  register U64_t sec = ( ~0UL / from ) / scale;
  sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
  register U64_t maxsec = sec * scale;
  UL_t tmp;
  U32_t sft, sftacc=32;
  /*
   * Calculate the shift factor which is limiting the conversion
   * range:
   */
  tmp = (maxsec * from) >> 32;
  while (tmp)
  { tmp >>=1;
    sftacc--;
  }
  /*
   * Find the conversion shift/mult pair which has the best
   * accuracy and fits the maxsec conversion range:
   */
  for (sft = 32; sft > 0; sft--)
  { tmp = ((UL_t) to) << sft;
    tmp += from / 2;
    tmp = tmp / from;
    if ((tmp >> sftacc) == 0)
      break;
  }
  *mult = tmp;
  *shift = sft;
}
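
The routine above can be sanity-checked standalone; the sketch below restates the same calculation in plain C types (uint32_t / uint64_t instead of the U32_t / UL_t typedefs, assuming a 64-bit UL_t):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NSEC_PER_SEC 1000000000UL

/* Standalone re-statement of the mult/shift calculation above
 * (itself paraphrasing Linux's clocks_calc_mult_shift()):
 * ns = (cycles * mult) >> shift, for a counter ticking at 'from_khz'. */
static void calc_mult_shift(uint32_t *mult, uint32_t *shift, uint32_t from_khz)
{
    const uint32_t scale = 1000U;
    uint32_t to = NSEC_PER_SEC / scale;          /* ns per ms */
    uint64_t sec = (~0UL / from_khz) / scale;
    sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
    uint64_t maxsec = sec * scale;
    uint64_t tmp;
    uint32_t sft, sftacc = 32;

    /* Limit the shift so maxsec seconds of cycles cannot overflow: */
    for (tmp = (maxsec * from_khz) >> 32; tmp; tmp >>= 1)
        sftacc--;

    /* Largest shift (best accuracy) whose mult still fits the range: */
    for (sft = 32; sft > 0; sft--) {
        tmp = ((uint64_t)to << sft) + from_khz / 2;
        tmp /= from_khz;
        if ((tmp >> sftacc) == 0)
            break;
    }
    *mult = (uint32_t)tmp;
    *shift = sft;
}
```

For example, with tsc_khz = 2893299 (the 2.893 GHz reported later in this thread), the resulting pair converts one millisecond's worth of ticks to within a couple of nanoseconds of 1000000.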

__thread
U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;

static inline __attribute__((always_inline))
U64_t IA64_s_ns_since_start()
{ if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
    ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
  register U64_t cycles = IA64_tsc_ticks_since_start();
  register U64_t ns = ((cycles *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
  return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns %
NSEC_PER_SEC)&0x3fffffffUL) );
  /* Yes, we are purposefully ignoring durations of more than 4.2
billion seconds here! */
}
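
For clarity, the packed ( seconds << 32 | nanoseconds ) value returned above can be produced and split apart as follows - the helper names are hypothetical, not part of the original code:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000UL

/* Pack a total-nanoseconds duration the same way IA64_s_ns_since_start()
 * does: seconds in the high 32 bits, the sub-second remainder in the low
 * bits.  The remainder only needs 30 bits, since 10^9 < 2^30. */
static uint64_t pack_s_ns(uint64_t total_ns)
{
    return ((total_ns / NSEC_PER_SEC) << 32) | (total_ns % NSEC_PER_SEC);
}

static uint64_t unpack_sec(uint64_t packed) { return packed >> 32; }
static uint64_t unpack_ns(uint64_t packed)  { return packed & 0x3fffffffUL; }
```

So a 2.5 s duration packs to (2 << 32) | 500000000 and unpacks back to 2 s + 500000000 ns.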


I think Linux should export the 'tsc_khz', 'mult' and 'shift' values somehow;
then user-space libraries could have more confidence in using 'rdtsc'
or 'rdtscp' when Linux's current_clocksource is 'tsc'.

Regards,
Jason



On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>
>> CPUID:15H is available in user-space, returning the integers : ( 7,
>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>> in detect_art() in tsc.c,
>
> By some definition of available. You can feed CPUID random leaf numbers and
> it will return something, usually the value of the last valid CPUID leaf,
> which is 13 on your CPU. A similar CPU model has
>
> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
> edx=0x00000000
>
> i.e. 7, 832, 832, 0
>
> Looks familiar, right?
>
> You can verify that with 'cpuid -1 -r' on your machine.
>
>> Linux does not think ART is enabled, and does not set the synthesized
>> CPUID +
>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>> see this bit set .
>
> Rightfully so. This is a Haswell Core model.
>
>> if an e1000 NIC card had been installed, PTP would not be available.
>
> PTP is independent of the ART kernel feature . ART just provides enhanced
> PTP features. You are confusing things here.
>
> The ART feature as the kernel sees it is a hardware extension which feeds
> the ART clock to peripherals for timestamping and time correlation
> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 so
> the kernel can make use of that correlation, e.g. for enhanced PTP
> accuracy.
>
> It's correct, that the NONSTOP_TSC feature depends on the availability of
> ART, but that has nothing to do with the feature bit, which solely
> describes the ratio between TSC and the ART frequency which is exposed to
> peripherals. That frequency is not necessarily the real ART frequency.
>
>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be
>> nowhere else in Linux,  the code will always think X86_FEATURE_ART is 0
>> because the CPU will always get a fault reading the MSR since it has
>> never been written.
>
> Huch? If an access to the TSC ADJUST MSR faults, then something is really
> wrong. And writing it unconditionally to 0 is not going to happen. 4.10 has
> new code which utilizes the TSC_ADJUST MSR.
>
>> It would be nice for user-space programs that want to use the TSC with
>> rdtsc / rdtscp instructions, such as the demo program attached to the
>> bug report,
>> could have confidence that Linux is actually generating the results of
>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>> in a predictable way from the TSC by looking at the
>>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>> use of TSC values, so that they can correlate TSC values with linux
>> clock_gettime() values.
>
> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>
> Nothing at all, really.
>
> The kernel makes use of the proper information values already.
>
> The TSC frequency is determined from:
>
>     1) CPUID(0x16) if available
>     2) MSRs if available
>     3) By calibration against a known clock
>
> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values are
> correct whether that machine has ART exposed to peripherals or not.
>
>> has tsc: 1 constant: 1
>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>
> And that voodoo math tells us what? That you found a way to correlate
> CPUID(0xd) to the TSC frequency on that machine.
>
> Now I'm curious how you do that on this other machine which returns for
> cpuid(15): 1, 1, 1
>
> You can't because all of this is completely wrong.
>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
@ 2017-02-21 23:39       ` Jason Vas Dias
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-21 23:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

Thank You for enlightening me -

I was just having a hard time believing that Intel would ship a chip
that features a monotonic, fixed frequency timestamp counter
without specifying in either documentation or on-chip or in ACPI what
precisely that hard-wired frequency is, but I now know that to
be the case for the unfortunate i7-4910MQ - I mean, the CPU does
assert CPUID:80000007[8] ( InvariantTSC ), which is
difficult to reconcile with the statement in the SDM :
  17.16.4  Invariant Time-Keeping
    The invariant TSC is based on the invariant timekeeping hardware
    (called Always Running Timer or ART), that runs at the core crystal clock
    frequency. The ratio defined by CPUID leaf 15H expresses the frequency
    relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] != 0
    and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
    relationship holds between TSC and the ART hardware:
    TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )
                         / CPUID.15H:EAX[31:0] + K
    Where 'K' is an offset that can be adjusted by a privileged agent*2.
     When ART hardware is reset, both invariant TSC and K are also reset.
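
That relationship is easy to state in code; the sketch below is purely illustrative, with a made-up EBX/EAX ratio and K offset, since this CPU does not actually enumerate leaf 15H:

```c
#include <assert.h>
#include <stdint.h>

/* SDM 17.16.4: TSC_Value = (ART_Value * CPUID.15H:EBX) / CPUID.15H:EAX + K.
 * ebx/eax is the TSC:ART frequency ratio; k is the offset a privileged
 * agent may adjust.  All values here are hypothetical. */
static uint64_t tsc_from_art(uint64_t art, uint32_t ebx, uint32_t eax,
                             int64_t k)
{
    return (art * ebx) / eax + (uint64_t)k;
}
```

E.g. with a (made-up) ratio of 2/1 and K = 5, an ART value of 1000 would map to a TSC value of 2005.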

So I'm just trying to figure out what CPUID.15H:EBX[31:0]  and
CPUID.15H:EAX[31:0]  are for my hardware.  I assumed (incorrectly) that
the "Nominal TSC Frequency" formula in the manual must apply to all
CPUs with InvariantTSC .

Do I understand correctly , that since I do have InvariantTSC ,  the
TSC_Value is in fact calculated according to the above formula, but with
a "hidden" ART Value,  & Core Crystal Clock frequency & its ratio to
TSC frequency ?
It was obvious this nominal TSC Frequency had nothing to do with the
actual TSC frequency used by Linux, which is 'tsc_khz' .
I guess wishful thinking led me to believe CPUID:15h was actually
supported somehow , because I thought InvariantTSC meant it had ART
hardware .

I do strongly suggest that Linux export its calibrated TSC kHz
somewhere to user space .

I think the best long-term solution would be to allow programs to
somehow read the TSC without invoking
clock_gettime(CLOCK_MONOTONIC_RAW,&ts), &
having to enter the kernel, which incurs an overhead of > 120ns on my system .


Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
'clocksource->shift' values to /sysfs somehow ?

For instance , only  if the 'current_clocksource' is 'tsc', then these
values could be exported as:
/sys/devices/system/clocksource/clocksource0/shift
/sys/devices/system/clocksource/clocksource0/mult
/sys/devices/system/clocksource/clocksource0/freq

So user-space programs could  know that the value returned by
    clock_gettime(CLOCK_MONOTONIC_RAW)
  would be
    {   .tv_sec  = ( ( rdtsc() * mult ) >> shift ) >> 32,
        .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U
    }
  and that represents ticks of period (1.0 / ( freq * 1000 )) S.
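
Given such sysfs exports, a user-space conversion could be sketched as below. The tv_sec/tv_nsec split here follows what clock_gettime() actually returns (total nanoseconds divided into seconds and a remainder); the mult and shift values are placeholders a real reader would fetch from the proposed files, and this ignores the cycle_last base the kernel accumulates from:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000UL

/* Hypothetical user-space conversion, assuming the kernel exported the
 * clocksource 'mult' and 'shift' via sysfs as proposed above:
 * ns = (cycles * mult) >> shift, then split into a struct timespec. */
static struct timespec tsc_to_timespec(uint64_t cycles, uint32_t mult,
                                       uint32_t shift)
{
    uint64_t ns = (cycles * mult) >> shift;
    struct timespec ts = {
        .tv_sec  = (time_t)(ns / NSEC_PER_SEC),
        .tv_nsec = (long)(ns % NSEC_PER_SEC),
    };
    return ts;
}
```

With the identity pair mult = 8, shift = 3, 2.5e9 cycles convert to 2 s + 500000000 ns.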

That would save user-space programs from having to know 'tsc_khz' by
parsing the 'Refined TSC' frequency from log files or by examining the
running kernel with objdump to obtain this value & figure out 'mult' &
'shift' themselves.

And why not a
  /sys/devices/system/clocksource/clocksource0/value
file that actually prints this ( ( rdtsc() * mult ) >> shift )
expression as a long integer?
And perhaps a
  /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
file that actually prints out the number of real-time nano-seconds since the
contents of the existing
  /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
files using the current TSC value?
To read the rtc0/{date,time} files is already faster than entering the
kernel to call
clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts.

I will work on developing a patch to this effect if no-one else is already doing so.

Also, am I right in assuming that the maximum granularity of the real-time clock
on my system is 1/64th of a second ? :
 $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
 64
This is the maximum granularity that can be stored in CMOS , not
returned by the TSC? Couldn't we have something similar that gave an
accurate idea of the TSC frequency and the precise formula applied to
the TSC value to get the clock_gettime(CLOCK_MONOTONIC_RAW) value ?

Regards,
Jason


This code does produce good timestamps with a latency of @20ns
that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
values, but it depends on a global variable that is initialized to
the 'tsc_khz' value computed by the running kernel, parsed from
objdump output of /proc/kcore :

static inline __attribute__((always_inline))
U64_t
IA64_tsc_now()
{ if(!(    _ia64_invariant_tsc_enabled
      ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL))
      )
    )
  { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant"
            " TSC enabled.\n", __LINE__, __func__);
    return 0;
  }
  U32_t tsc_hi, tsc_lo;
  register UL_t tsc;
  asm volatile
  ( "rdtscp\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "mov %%ecx, %2\n\t"
  : "=m" (tsc_hi) ,
    "=m" (tsc_lo) ,
    "=m" (_ia64_tsc_user_cpu) :
  : "%eax","%ecx","%edx"
  );
  tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
  return tsc;
}

__thread
U64_t _ia64_first_tsc = 0xffffffffffffffffUL;

static inline __attribute__((always_inline))
U64_t IA64_tsc_ticks_since_start()
{ if(_ia64_first_tsc == 0xffffffffffffffffUL)
  { _ia64_first_tsc = IA64_tsc_now();
    return 0;
  }
  return (IA64_tsc_now() - _ia64_first_tsc) ;
}

static inline __attribute__((always_inline))
void
ia64_tsc_calc_mult_shift
( register U32_t *mult,
  register U32_t *shift
)
{ /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() function:
   * calculates second + nanosecond mult + shift in same way linux does.
   * we want to be compatible with what linux returns in struct
timespec ts after call to
   * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
   */
  const U32_t scale=1000U;
  register U32_t from= IA64_tsc_khz();
  register U32_t to  = NSEC_PER_SEC / scale;
  register U64_t sec = ( ~0UL / from ) / scale;
  sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
  register U64_t maxsec = sec * scale;
  UL_t tmp;
  U32_t sft, sftacc=32;
  /*
   * Calculate the shift factor which is limiting the conversion
   * range:
   */
  tmp = (maxsec * from) >> 32;
  while (tmp)
  { tmp >>=1;
    sftacc--;
  }
  /*
   * Find the conversion shift/mult pair which has the best
   * accuracy and fits the maxsec conversion range:
   */
  for (sft = 32; sft > 0; sft--)
  { tmp = ((UL_t) to) << sft;
    tmp += from / 2;
    tmp = tmp / from;
    if ((tmp >> sftacc) == 0)
      break;
  }
  *mult = tmp;
  *shift = sft;
}

__thread
U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;

static inline __attribute__((always_inline))
U64_t IA64_s_ns_since_start()
{ if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
    ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
  register U64_t cycles = IA64_tsc_ticks_since_start();
  register U64_t ns = ((cycles *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
  return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns %
NSEC_PER_SEC)&0x3fffffffUL) );
  /* Yes, we are purposefully ignoring durations of more than 4.2
billion seconds here! */
}


I think Linux should export the 'tsc_khz', 'mult' and 'shift' values somehow;
then user-space libraries could have more confidence in using 'rdtsc'
or 'rdtscp' when Linux's current_clocksource is 'tsc'.

Regards,
Jason



On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>
>> CPUID:15H is available in user-space, returning the integers : ( 7,
>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>> in detect_art() in tsc.c,
>
> By some definition of available. You can feed CPUID random leaf numbers and
> it will return something, usually the value of the last valid CPUID leaf,
> which is 13 on your CPU. A similar CPU model has
>
> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
> edx=0x00000000
>
> i.e. 7, 832, 832, 0
>
> Looks familiar, right?
>
> You can verify that with 'cpuid -1 -r' on your machine.
>
>> Linux does not think ART is enabled, and does not set the synthesized
>> CPUID +
>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>> see this bit set .
>
> Rightfully so. This is a Haswell Core model.
>
>> if an e1000 NIC card had been installed, PTP would not be available.
>
> PTP is independent of the ART kernel feature . ART just provides enhanced
> PTP features. You are confusing things here.
>
> The ART feature as the kernel sees it is a hardware extension which feeds
> the ART clock to peripherals for timestamping and time correlation
> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 so
> the kernel can make use of that correlation, e.g. for enhanced PTP
> accuracy.
>
> It's correct, that the NONSTOP_TSC feature depends on the availability of
> ART, but that has nothing to do with the feature bit, which solely
> describes the ratio between TSC and the ART frequency which is exposed to
> peripherals. That frequency is not necessarily the real ART frequency.
>
>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be
>> nowhere else in Linux,  the code will always think X86_FEATURE_ART is 0
>> because the CPU will always get a fault reading the MSR since it has
>> never been written.
>
> Huch? If an access to the TSC ADJUST MSR faults, then something is really
> wrong. And writing it unconditionally to 0 is not going to happen. 4.10 has
> new code which utilizes the TSC_ADJUST MSR.
>
>> It would be nice for user-space programs that want to use the TSC with
>> rdtsc / rdtscp instructions, such as the demo program attached to the
>> bug report,
>> could have confidence that Linux is actually generating the results of
>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>> in a predictable way from the TSC by looking at the
>>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>> use of TSC values, so that they can correlate TSC values with linux
>> clock_gettime() values.
>
> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>
> Nothing at all, really.
>
> The kernel makes use of the proper information values already.
>
> The TSC frequency is determined from:
>
>     1) CPUID(0x16) if available
>     2) MSRs if available
>     3) By calibration against a known clock
>
> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values are
> correct whether that machine has ART exposed to peripherals or not.
>
>> has tsc: 1 constant: 1
>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>
> And that voodoo math tells us what? That you found a way to correlate
> CPUID(0xd) to the TSC frequency on that machine.
>
> Now I'm curious how you do that on this other machine which returns for
> cpuid(15): 1, 1, 1
>
> You can't because all of this is completely wrong.
>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-21 23:39       ` Jason Vas Dias
  (?)
@ 2017-02-22 16:07       ` Jason Vas Dias
  2017-02-22 16:18           ` Jason Vas Dias
  -1 siblings, 1 reply; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-22 16:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

[-- Attachment #1: Type: text/plain, Size: 13010 bytes --]

RE:
>> 4.10 has  new code which utilizes the TSC_ADJUST MSR.

I just built an unpatched linux v4.10 with tglx's TSC improvements -
much else improved in this kernel (like iwlwifi) - thanks!

I have attached an updated version of the test program which
doesn't print the bogus "Nominal TSC Frequency" (the previous
version printed it, but equally ignored it).

The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
a factor of 2 - it used to be @140ns and is now @ 70ns  ! Wow!  :

$ uname -r
4.10.0
$ ./ttsc1
max_extended_leaf: 80000008
has tsc: 1 constant: 1
Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
ts3 - ts2: 178 ns1: 0.000000592
ts3 - ts2: 14 ns1: 0.000000577
ts3 - ts2: 14 ns1: 0.000000651
ts3 - ts2: 17 ns1: 0.000000625
ts3 - ts2: 17 ns1: 0.000000677
ts3 - ts2: 17 ns1: 0.000000626
ts3 - ts2: 17 ns1: 0.000000627
ts3 - ts2: 17 ns1: 0.000000627
ts3 - ts2: 18 ns1: 0.000000655
ts3 - ts2: 17 ns1: 0.000000631
t1 - t0: 89067 - ns2: 0.000091411

I think this is because under Linux 4.8, the CPU got a fault every
time it read the TSC_ADJUST MSR.

But user programs wanting to use the TSC  and correlate its value to
clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
program still have to  dig the TSC frequency value out of the kernel
with objdump  - this was really the point of the bug #194609.

I would still like to investigate exporting 'tsc_khz' & 'mult' +
'shift' values via sysfs.

Regards,
Jason.





On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> Thank You for enlightening me -
>
> I was just having a hard time believing that Intel would ship a chip
> that features a monotonic, fixed frequency timestamp counter
> without specifying in either documentation or on-chip or in ACPI what
> precisely that hard-wired frequency is, but I now know that to
> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU
> assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is
> difficult to reconcile with the statement in the SDM :
>   17.16.4  Invariant Time-Keeping
>     The invariant TSC is based on the invariant timekeeping hardware
>     (called Always Running Timer or ART), that runs at the core crystal
> clock
>     frequency. The ratio defined by CPUID leaf 15H expresses the frequency
>     relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] !=
> 0
>     and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
>     relationship holds between TSC and the ART hardware:
>     TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )
>                          / CPUID.15H:EAX[31:0] + K
>     Where 'K' is an offset that can be adjusted by a privileged agent*2.
>      When ART hardware is reset, both invariant TSC and K are also reset.
>
> So I'm just trying to figure out what CPUID.15H:EBX[31:0]  and
> CPUID.15H:EAX[31:0]  are for my hardware.  I assumed (incorrectly)
> that
>> the "Nominal TSC Frequency" formula in the manual must apply to all
> CPUs with InvariantTSC .
>
> Do I understand correctly , that since I do have InvariantTSC ,  the
> TSC_Value is in fact calculated according to the above formula, but with
> a "hidden" ART Value,  & Core Crystal Clock frequency & its ratio to
> TSC frequency ?
> It was obvious this nominal TSC Frequency had nothing to do with the
> actual TSC frequency used by Linux, which is 'tsc_khz' .
> I guess wishful thinking led me to believe CPUID:15h was actually
> supported somehow , because I thought InvariantTSC meant it had ART
> hardware .
>
> I do strongly suggest that Linux exports its calibrated TSC Khz
> somewhere to user
> space .
>
> I think the best long-term solution would be to allow programs to
> somehow read the TSC without invoking
> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), &
> having to enter the kernel, which incurs an overhead of > 120ns on my system
> .
>
>
> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
> 'clocksource->shift' values to /sysfs somehow ?
>
> For instance , only  if the 'current_clocksource' is 'tsc', then these
> values could be exported as:
> /sys/devices/system/clocksource/clocksource0/shift
> /sys/devices/system/clocksource/clocksource0/mult
> /sys/devices/system/clocksource/clocksource0/freq
>
> So user-space programs could  know that the value returned by
>     clock_gettime(CLOCK_MONOTONIC_RAW)
>   would be
>     {   .tv_sec  = ( ( rdtsc() * mult ) >> shift ) >> 32,
>         .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U
>     }
>   and that represents ticks of period (1.0 / ( freq * 1000 )) S.
>
> That would save user-space programs from having to know 'tsc_khz' by
> parsing the 'Refined TSC' frequency from log files or by examining the
> running kernel with objdump to obtain this value & figure out 'mult' &
> 'shift' themselves.
>
> And why not a
>   /sys/devices/system/clocksource/clocksource0/value
> file that actually prints this ( ( rdtsc() * mult ) >> shift )
> expression as a long integer?
> And perhaps a
>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
> file that actually prints out the number of real-time nano-seconds since
> the
> contents of the existing
>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
> files using the current TSC value?
> To read the rtc0/{date,time} files is already faster than entering the
> kernel to call
> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts.
>
> I will work on developing a patch to this effect if no-one else is.
>
> Also, am I right in assuming that the maximum granularity of the real-time
> clock
> on my system is 1/64th of a second ? :
>  $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>  64
> This is the maximum granularity that can be stored in CMOS , not
> returned by TSC? Couldn't we have something similar that gave an
> accurate idea of TSC frequency and the precise formula applied to TSC
> value to get clock_gettime
> (CLOCK_MONOTONIC_RAW) value ?
>
> Regards,
> Jason
>
>
> This code does produce good timestamps with a latency of @20ns
> that correlate well with clock_gettIme(CLOCK_MONOTONIC_RAW,&ts)
> values, but it depends on a global variable that  is initialized to
> the 'tsc_khz' value
> computed by running kernel parsed from objdump /proc/kcore output :
>
> static inline __attribute__((always_inline))
> U64_t
> IA64_tsc_now()
> { if(!(    _ia64_invariant_tsc_enabled
>       ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL))
>       )
>     )
>   { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant
> TSC enabled.\n");
>     return 0;
>   }
>   U32_t tsc_hi, tsc_lo;
>   register UL_t tsc;
>   asm volatile
>   ( "rdtscp\n\t"
>     "mov %%edx, %0\n\t"
>     "mov %%eax, %1\n\t"
>     "mov %%ecx, %2\n\t"
>   : "=m" (tsc_hi) ,
>     "=m" (tsc_lo) ,
>     "=m" (_ia64_tsc_user_cpu) :
>   : "%eax","%ecx","%edx"
>   );
>   tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>   return tsc;
> }
>
> __thread
> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>
> static inline __attribute__((always_inline))
> U64_t IA64_tsc_ticks_since_start()
> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>   { _ia64_first_tsc = IA64_tsc_now();
>     return 0;
>   }
>   return (IA64_tsc_now() - _ia64_first_tsc) ;
> }
>
> static inline __attribute__((always_inline))
> void
> ia64_tsc_calc_mult_shift
> ( register U32_t *mult,
>   register U32_t *shift
> )
> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() function:
>    * calculates second + nanosecond mult + shift in same way linux does.
>    * we want to be compatible with what linux returns in struct
> timespec ts after call to
>    * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>    */
>   const U32_t scale=1000U;
>   register U32_t from= IA64_tsc_khz();
>   register U32_t to  = NSEC_PER_SEC / scale;
>   register U64_t sec = ( ~0UL / from ) / scale;
>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>   register U64_t maxsec = sec * scale;
>   UL_t tmp;
>   U32_t sft, sftacc=32;
>   /*
>    * Calculate the shift factor which is limiting the conversion
>    * range:
>    */
>   tmp = (maxsec * from) >> 32;
>   while (tmp)
>   { tmp >>=1;
>     sftacc--;
>   }
>   /*
>    * Find the conversion shift/mult pair which has the best
>    * accuracy and fits the maxsec conversion range:
>    */
>   for (sft = 32; sft > 0; sft--)
>   { tmp = ((UL_t) to) << sft;
>     tmp += from / 2;
>     tmp = tmp / from;
>     if ((tmp >> sftacc) == 0)
>       break;
>   }
>   *mult = tmp;
>   *shift = sft;
> }
>
> __thread
> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>
> static inline __attribute__((always_inline))
> U64_t IA64_s_ns_since_start()
> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>   register U64_t cycles = IA64_tsc_ticks_since_start();
>   register U64_t ns = ((cycles *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns %
> NSEC_PER_SEC)&0x3fffffffUL) );
>   /* Yes, we are purposefully ignoring durations of more than 4.2
> billion seconds here! */
> }
>
>
> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
> somehow,
> then user-space libraries could have more confidence in using 'rdtsc'
> or 'rdtscp'
> if Linux's current_clocksource is 'tsc'.
>
> Regards,
> Jason
>
>
>
> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>
>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>>> in detect_art() in tsc.c,
>>
>> By some definition of available. You can feed CPUID random leaf numbers
>> and
>> it will return something, usually the value of the last valid CPUID leaf,
>> which is 13 on your CPU. A similar CPU model has
>>
>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
>> edx=0x00000000
>>
>> i.e. 7, 832, 832, 0
>>
>> Looks familiar, right?
>>
>> You can verify that with 'cpuid -1 -r' on your machine.
>>
>>> Linux does not think ART is enabled, and does not set the synthesized
>>> CPUID +
>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>>> see this bit set .
>>
>> Rightfully so. This is a Haswell Core model.
>>
>>> if an e1000 NIC card had been installed, PTP would not be available.
>>
>> PTP is independent of the ART kernel feature . ART just provides enhanced
>> PTP features. You are confusing things here.
>>
>> The ART feature as the kernel sees it is a hardware extension which feeds
>> the ART clock to peripherals for timestamping and time correlation
>> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15
>> so
>> the kernel can make use of that correlation, e.g. for enhanced PTP
>> accuracy.
>>
>> It's correct, that the NONSTOP_TSC feature depends on the availability of
>> ART, but that has nothing to do with the feature bit, which solely
>> describes the ratio between TSC and the ART frequency which is exposed to
>> peripherals. That frequency is not necessarily the real ART frequency.
>>
>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be
>>> nowhere else in Linux,  the code will always think X86_FEATURE_ART is 0
>>> because the CPU will always get a fault reading the MSR since it has
>>> never been written.
>>
>> Huch? If an access to the TSC ADJUST MSR faults, then something is really
>> wrong. And writing it unconditionally to 0 is not going to happen. 4.10
>> has
>> new code which utilizes the TSC_ADJUST MSR.
>>
>>> It would be nice for user-space programs that want to use the TSC with
>>> rdtsc / rdtscp instructions, such as the demo program attached to the
>>> bug report,
>>> could have confidence that Linux is actually generating the results of
>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>> in a predictable way from the TSC by looking at the
>>>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>>> use of TSC values, so that they can correlate TSC values with linux
>>> clock_gettime() values.
>>
>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>
>> Nothing at all, really.
>>
>> The kernel makes use of the proper information values already.
>>
>> The TSC frequency is determined from:
>>
>>     1) CPUID(0x16) if available
>>     2) MSRs if available
>>     3) By calibration against a known clock
>>
>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values
>> are
>> correct whether that machine has ART exposed to peripherals or not.
>>
>>> has tsc: 1 constant: 1
>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>
>> And that voodoo math tells us what? That you found a way to correlate
>> CPUID(0xd) to the TSC frequency on that machine.
>>
>> Now I'm curious how you do that on this other machine which returns for
>> cpuid(15): 1, 1, 1
>>
>> You can't because all of this is completely wrong.
>>
>> Thanks,
>>
>> 	tglx
>>
>

[-- Attachment #2: ttsc.tar --]
[-- Type: application/x-tar, Size: 30720 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-22 16:07       ` Jason Vas Dias
@ 2017-02-22 16:18           ` Jason Vas Dias
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-22 16:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> RE:
>>> 4.10 has  new code which utilizes the TSC_ADJUST MSR.
>
> I just built an unpatched linux v4.10 with tglx's TSC improvements -
> much else improved in this kernel (like iwlwifi) - thanks!
>
> I have attached an updated version of the test program which
> doesn't print the bogus "Nominal TSC Frequency" (the previous
> version printed it, but equally ignored it).
>
> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
> a factor of 2 - it used to be @140ns and is now @ 70ns  ! Wow!  :
>
> $ uname -r
> 4.10.0
> $ ./ttsc1
> max_extended_leaf: 80000008
> has tsc: 1 constant: 1
> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
> ts3 - ts2: 178 ns1: 0.000000592
> ts3 - ts2: 14 ns1: 0.000000577
> ts3 - ts2: 14 ns1: 0.000000651
> ts3 - ts2: 17 ns1: 0.000000625
> ts3 - ts2: 17 ns1: 0.000000677
> ts3 - ts2: 17 ns1: 0.000000626
> ts3 - ts2: 17 ns1: 0.000000627
> ts3 - ts2: 17 ns1: 0.000000627
> ts3 - ts2: 18 ns1: 0.000000655
> ts3 - ts2: 17 ns1: 0.000000631
> t1 - t0: 89067 - ns2: 0.000091411
>


Oops, going blind in my old age - these latencies are actually about
4 times greater than under 4.8 !!

Under 4.8, the program printed latencies of @140ns for clock_gettime, as shown
in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:

ts3 - ts2: 24 ns1: 0.000000162
ts3 - ts2: 17 ns1: 0.000000143
ts3 - ts2: 17 ns1: 0.000000146
ts3 - ts2: 17 ns1: 0.000000149
ts3 - ts2: 17 ns1: 0.000000141
ts3 - ts2: 16 ns1: 0.000000142

Now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ 600ns,
about 4 times more than under 4.8.
But I'm glad the TSC_ADJUST problems are fixed.

Will programs reading:
 $ cat /sys/devices/msr/events/tsc
 event=0x00
see a new event for each write to the TSC_ADJUST MSR, or for a wrmsr on the
TSC ?

> I think this is because under Linux 4.8, the CPU got a fault every
> time it read the TSC_ADJUST MSR.

Maybe it still does!


> But user programs wanting to use the TSC  and correlate its value to
> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
> program still have to  dig the TSC frequency value out of the kernel
> with objdump  - this was really the point of the bug #194609.
>
> I would still like to investigate exporting 'tsc_khz' & 'mult' +
> 'shift' values via sysfs.
>
> Regards,
> Jason.
>
>
>
>
>
> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>> Thank You for enlightening me -
>>
>> I was just having a hard time believing that Intel would ship a chip
>> that features a monotonic, fixed frequency timestamp counter
>> without specifying in either documentation or on-chip or in ACPI what
>> precisely that hard-wired frequency is, but I now know that to
>> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU
>> assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is
>> difficult to reconcile with the statement in the SDM :
>>   17.16.4  Invariant Time-Keeping
>>     The invariant TSC is based on the invariant timekeeping hardware
>>     (called Always Running Timer or ART), that runs at the core crystal
>>     clock frequency. The ratio defined by CPUID leaf 15H expresses the
>>     frequency relationship between the ART hardware and TSC.
>>     If CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1,
>>     the following linearity relationship holds between TSC and the ART
>>     hardware:
>>     TSC_Value = ( ART_Value * CPUID.15H:EBX[31:0] ) / CPUID.15H:EAX[31:0] + K
>>     Where 'K' is an offset that can be adjusted by a privileged agent.
>>     When ART hardware is reset, both invariant TSC and K are also reset.
>>
>> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and
>> CPUID.15H:EAX[31:0] are for my hardware.  I assumed (incorrectly) that
>> the "Nominal TSC Frequency" formula in the manual must apply to all
>> CPUs with InvariantTSC .
>>
>> Do I understand correctly , that since I do have InvariantTSC ,  the
>> TSC_Value is in fact calculated according to the above formula, but with
>> a "hidden" ART Value,  & Core Crystal Clock frequency & its ratio to
>> TSC frequency ?
>> It was obvious this nominal TSC Frequency had nothing to do with the
>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>> I guess wishful thinking led me to believe CPUID:15h was actually
>> supported somehow , because I thought InvariantTSC meant it had ART
>> hardware .
>>
>> I do strongly suggest that Linux export its calibrated TSC kHz
>> somewhere to user space .
>>
>> I think the best long-term solution would be to allow programs to
>> somehow read the TSC without invoking
>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), &
>> having to enter the kernel, which incurs an overhead of > 120ns on my
>> system
>> .
>>
>>
>> Couldn't Linux export its 'tsc_khz' and / or 'clocksource->mult' and
>> 'clocksource->shift' values via sysfs somehow ?
>>
>> For instance, only when the 'current_clocksource' is 'tsc', these
>> values could be exported as:
>> /sys/devices/system/clocksource/clocksource0/shift
>> /sys/devices/system/clocksource/clocksource0/mult
>> /sys/devices/system/clocksource/clocksource0/freq
>>
>> So user-space programs could know that the value returned by
>>     clock_gettime(CLOCK_MONOTONIC_RAW)
>>   would be computed from ns = ( ( rdtsc() * mult ) >> shift ) as
>>     {   .tv_sec  = ns / 1000000000 ,
>>         .tv_nsec = ns % 1000000000
>>     }
>>   and that each TSC tick has period (1.0 / ( freq * 1000 )) s.
>>
>> That would save user-space programs from having to know 'tsc_khz' by
>> parsing the 'Refined TSC' frequency from log files or by examining the
>> running kernel with objdump to obtain this value & figure out 'mult' &
>> 'shift' themselves.
>>
>> And why not a
>>   /sys/devices/system/clocksource/clocksource0/value
>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>> expression as a long integer?
>> And perhaps a
>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>> file that actually prints out the number of real-time nano-seconds since
>> the
>> contents of the existing
>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>> files using the current TSC value?
>> Reading the rtc0/{date,time} files is already faster for scripts than
>> entering the kernel to call clock_gettime(CLOCK_REALTIME, &ts) and
>> converting to an integer.
>>
>> I will work on developing a patch to this effect if no-one else is.
>>
>> Also, am I right in assuming that the maximum granularity of the
>> real-time clock on my system is 1/64th of a second ? :
>>  $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>>  64
>> This is the maximum granularity that can be stored in CMOS , not
>> returned by the TSC ? Couldn't we have something similar that gave an
>> accurate idea of the TSC frequency and the precise formula applied to
>> the TSC value to get the clock_gettime(CLOCK_MONOTONIC_RAW) value ?
>>
>> Regards,
>> Jason
>>
>>
>> This code does produce good timestamps with a latency of @20ns
>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>> values, but it depends on a global variable that is initialized to
>> the 'tsc_khz' value computed by the running kernel, parsed from
>> objdump /proc/kcore output :
>>
>> static inline __attribute__((always_inline))
>> U64_t
>> IA64_tsc_now()
>> { if(!(    _ia64_invariant_tsc_enabled
>>       ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL))
>>       )
>>     )
>>   { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant"
>>             " TSC enabled.\n", __LINE__, __func__);
>>     return 0;
>>   }
>>   U32_t tsc_hi, tsc_lo;
>>   register UL_t tsc;
>>   asm volatile
>>   ( "rdtscp\n\t"
>>     "mov %%edx, %0\n\t"
>>     "mov %%eax, %1\n\t"
>>     "mov %%ecx, %2\n\t"
>>   : "=m" (tsc_hi) ,
>>     "=m" (tsc_lo) ,
>>     "=m" (_ia64_tsc_user_cpu)
>>   :
>>   : "%eax","%ecx","%edx"
>>   );
>>   tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>>   return tsc;
>> }
>>
>> __thread
>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>
>> static inline __attribute__((always_inline))
>> U64_t IA64_tsc_ticks_since_start()
>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>   { _ia64_first_tsc = IA64_tsc_now();
>>     return 0;
>>   }
>>   return (IA64_tsc_now() - _ia64_first_tsc) ;
>> }
>>
>> static inline __attribute__((always_inline))
>> void
>> ia64_tsc_calc_mult_shift
>> ( register U32_t *mult,
>>   register U32_t *shift
>> )
>> { /* Paraphrases Linux clocksource.c's clocks_calc_mult_shift() function:
>>    * calculates the second + nanosecond mult + shift the same way Linux
>>    * does, so that we are compatible with what Linux returns in the
>>    * struct timespec ts after a call to
>>    * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>    */
>>   const U32_t scale=1000U;
>>   register U32_t from= IA64_tsc_khz();
>>   register U32_t to  = NSEC_PER_SEC / scale;
>>   register U64_t sec = ( ~0UL / from ) / scale;
>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>   register U64_t maxsec = sec * scale;
>>   UL_t tmp;
>>   U32_t sft, sftacc=32;
>>   /*
>>    * Calculate the shift factor which is limiting the conversion
>>    * range:
>>    */
>>   tmp = (maxsec * from) >> 32;
>>   while (tmp)
>>   { tmp >>=1;
>>     sftacc--;
>>   }
>>   /*
>>    * Find the conversion shift/mult pair which has the best
>>    * accuracy and fits the maxsec conversion range:
>>    */
>>   for (sft = 32; sft > 0; sft--)
>>   { tmp = ((UL_t) to) << sft;
>>     tmp += from / 2;
>>     tmp = tmp / from;
>>     if ((tmp >> sftacc) == 0)
>>       break;
>>   }
>>   *mult = tmp;
>>   *shift = sft;
>> }
>>
>> __thread
>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>
>> static inline __attribute__((always_inline))
>> U64_t IA64_s_ns_since_start()
>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>   register U64_t ns = ((cycles * ((UL_t)_ia64_tsc_mult)) >> _ia64_tsc_shift);
>>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32)
>>         | ((ns % NSEC_PER_SEC)&0x3fffffffUL) );
>>   /* Yes, we are purposefully ignoring durations of more than 4.2
>>      billion seconds here! */
>> }
>>
>>
>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>> somehow,
>> then user-space libraries could have more confidence in using 'rdtsc'
>> or 'rdtscp'
>> if Linux's current_clocksource is 'tsc'.
>>
>> Regards,
>> Jason
>>
>>
>>
>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>
>>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>>>> in detect_art() in tsc.c,
>>>
>>> By some definition of available. You can feed CPUID random leaf numbers
>>> and
>>> it will return something, usually the value of the last valid CPUID
>>> leaf,
>>> which is 13 on your CPU. A similar CPU model has
>>>
>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
>>> edx=0x00000000
>>>
>>> i.e. 7, 832, 832, 0
>>>
>>> Looks familiar, right?
>>>
>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>
>>>> Linux does not think ART is enabled, and does not set the synthesized
>>>> CPUID +
>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>>>> see this bit set .
>>>
>>> Rightfully so. This is a Haswell Core model.
>>>
>>>> if an e1000 NIC card had been installed, PTP would not be available.
>>>
>>> PTP is independent of the ART kernel feature . ART just provides
>>> enhanced
>>> PTP features. You are confusing things here.
>>>
>>> The ART feature as the kernel sees it is a hardware extension which
>>> feeds
>>> the ART clock to peripherals for timestamping and time correlation
>>> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15
>>> so
>>> the kernel can make use of that correlation, e.g. for enhanced PTP
>>> accuracy.
>>>
>>> It's correct, that the NONSTOP_TSC feature depends on the availability
>>> of
>>> ART, but that has nothing to do with the feature bit, which solely
>>> describes the ratio between TSC and the ART frequency which is exposed
>>> to
>>> peripherals. That frequency is not necessarily the real ART frequency.
>>>
>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be
>>>> nowhere else in Linux,  the code will always think X86_FEATURE_ART is 0
>>>> because the CPU will always get a fault reading the MSR since it has
>>>> never been written.
>>>
>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>> really
>>> wrong. And writing it unconditionally to 0 is not going to happen. 4.10
>>> has
>>> new code which utilizes the TSC_ADJUST MSR.
>>>
>>>> It would be nice for user-space programs that want to use the TSC with
>>>> rdtsc / rdtscp instructions, such as the demo program attached to the
>>>> bug report,
>>>> could have confidence that Linux is actually generating the results of
>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>> in a predictable way from the TSC by looking at the
>>>>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>>>> use of TSC values, so that they can correlate TSC values with linux
>>>> clock_gettime() values.
>>>
>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>
>>> Nothing at all, really.
>>>
>>> The kernel makes use of the proper information values already.
>>>
>>> The TSC frequency is determined from:
>>>
>>>     1) CPUID(0x16) if available
>>>     2) MSRs if available
>>>     3) By calibration against a known clock
>>>
>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values
>>> are
>>> correct whether that machine has ART exposed to peripherals or not.
>>>
>>>> has tsc: 1 constant: 1
>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>
>>> And that voodoo math tells us what? That you found a way to correlate
>>> CPUID(0xd) to the TSC frequency on that machine.
>>>
>>> Now I'm curious how you do that on this other machine which returns for
>>> cpuid(15): 1, 1, 1
>>>
>>> You can't because all of this is completely wrong.
>>>
>>> Thanks,
>>>
>>> 	tglx
>>>
>>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread


* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-22 16:18           ` Jason Vas Dias
@ 2017-02-22 17:27             ` Jason Vas Dias
  -1 siblings, 0 replies; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-22 17:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
read or written. That is probably because it genuinely does not support
any CPUID leaf > 13, nor the modern TSC_ADJUST interface, and is probably
why my clock_gettime() latencies are so bad. Now I have to develop a
patch to disable all access to the TSC_ADJUST MSR if
boot_cpu_data.cpuid_level <= 13 .
I really have an unlucky CPU :-) .

But really, I think this issue goes deeper, into the fundamental limits of
time measurement on Linux : it is never going to be possible to measure
minimum times with clock_gettime() comparable to those returned by the
rdtscp instruction - the time taken to enter the kernel through the VDSO,
queue an access to vsyscall_gtod_data via a workqueue, access it, do the
computations & copy the value to user-space is never going to be up to the
job of measuring small real-time durations of the order of 10-20 TSC ticks .

I think the best way to solve this problem going forward would be to store
the entire vsyscall_gtod_data data structure representing the current
clocksource in a shared page which is memory-mappable (read-only) by
user-space. I think user-space programs should be able to do something like :
    int fd = open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY);
    size_t psz = getpagesize();
    void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 );
    msync(gtod,psz,MS_SYNC);

Then they could all read the real-time clock values as they are updated
in real-time by the kernel, and know exactly how to interpret them .

I also think that all mktime() / gmtime() / localtime() timezone handling
functionality should be moved to user-space, and that the kernel should
actually load and link in some /lib/libtzdata.so library, provided by
glibc / libc implementations, that is exactly the same library used by
glibc code to parse tzdata ; tzdata should be loaded at boot time by the
kernel from the same places glibc loads it, and both the kernel and glibc
should use identical mktime(), gmtime(), etc. functions to access it, so
that glibc-using code would not need to enter the kernel at all for any
time handling. This tzdata-library code could be automatically loaded into
process images the same way the vdso region is, and the whole system could
access only one copy of it and of the 'gtod.page' in memory.

That's just my two-cents worth, and how I'd like to eventually get
things working
on my system.

All the best, Regards,
Jason

On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>> RE:
>>>> 4.10 has  new code which utilizes the TSC_ADJUST MSR.
>>
>> I just built an unpatched linux v4.10 with tglx's TSC improvements -
>> much else improved in this kernel (like iwlwifi) - thanks!
>>
>> I have attached an updated version of the test program which
>> doesn't print the bogus "Nominal TSC Frequency" (the previous
>> version printed it, but equally ignored it).
>>
>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>> a factor of 2 - it used to be @140ns and is now @ 70ns  ! Wow!  :
>>
>> $ uname -r
>> 4.10.0
>> $ ./ttsc1
>> max_extended_leaf: 80000008
>> has tsc: 1 constant: 1
>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
>> ts3 - ts2: 178 ns1: 0.000000592
>> ts3 - ts2: 14 ns1: 0.000000577
>> ts3 - ts2: 14 ns1: 0.000000651
>> ts3 - ts2: 17 ns1: 0.000000625
>> ts3 - ts2: 17 ns1: 0.000000677
>> ts3 - ts2: 17 ns1: 0.000000626
>> ts3 - ts2: 17 ns1: 0.000000627
>> ts3 - ts2: 17 ns1: 0.000000627
>> ts3 - ts2: 18 ns1: 0.000000655
>> ts3 - ts2: 17 ns1: 0.000000631
>> t1 - t0: 89067 - ns2: 0.000091411
>>
>
>
> Oops, going blind in my old age - these latencies are actually about 4
> times greater than under 4.8 !
>
> Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as
> shown in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:
>
> ts3 - ts2: 24 ns1: 0.000000162
> ts3 - ts2: 17 ns1: 0.000000143
> ts3 - ts2: 17 ns1: 0.000000146
> ts3 - ts2: 17 ns1: 0.000000149
> ts3 - ts2: 17 ns1: 0.000000141
> ts3 - ts2: 16 ns1: 0.000000142
>
> now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @
> 600ns, @ 4 times more than under 4.8 .
> But I'm glad the TSC_ADJUST problems are fixed.
>
> Will programs reading :
>  $ cat /sys/devices/msr/events/tsc
>  event=0x00
> read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the
> TSC ?
>
>> I think this is because under Linux 4.8, the CPU got a fault every
>> time it read the TSC_ADJUST MSR.
>
> maybe it still is!
>
>
>> But user programs wanting to use the TSC  and correlate its value to
>> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
>> program still have to  dig the TSC frequency value out of the kernel
>> with objdump  - this was really the point of the bug #194609.
>>
>> I would still like to investigate exporting 'tsc_khz' & 'mult' +
>> 'shift' values via sysfs.
>>
>> Regards,
>> Jason.
>>
>>
>>
>>
>>
>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>> Thank You for enlightening me -
>>>
>>> I was just having a hard time believing that Intel would ship a chip
>>> that features a monotonic, fixed frequency timestamp counter
>>> without specifying in either documentation or on-chip or in ACPI what
>>> precisely that hard-wired frequency is, but I now know that to
>>> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU
>>> assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is
>>> difficult to reconcile with the statement in the SDM :
>>>   17.16.4  Invariant Time-Keeping
>>>     The invariant TSC is based on the invariant timekeeping hardware
>>>     (called Always Running Timer or ART), that runs at the core crystal
>>> clock
>>>     frequency. The ratio defined by CPUID leaf 15H expresses the
>>> frequency
>>>     relationship between the ART hardware and TSC. If
>>> CPUID.15H:EBX[31:0]
>>> !=
>>> 0
>>>     and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
>>>     relationship holds between TSC and the ART hardware:
>>>     TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )
>>>                          / CPUID.15H:EAX[31:0] + K
>>>     Where 'K' is an offset that can be adjusted by a privileged agent*2.
>>>      When ART hardware is reset, both invariant TSC and K are also
>>> reset.
>>>
>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0]  and
>>> CPUID.15H:EAX[31:0]  are for my hardware.  I assumed (incorrectly)
>>> that
>>> the "Nominal TSC Frequency" formulae in the manual must apply to all
>>> CPUs with InvariantTSC .
>>>
>>> Do I understand correctly , that since I do have InvariantTSC ,  the
>>> TSC_Value is in fact calculated according to the above formula, but with
>>> a "hidden" ART Value,  & Core Crystal Clock frequency & its ratio to
>>> TSC frequency ?
>>> It was obvious this nominal TSC Frequency had nothing to do with the
>>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>>> I guess wishful thinking led me to believe CPUID:15h was actually
>>> supported somehow , because I thought InvariantTSC meant it had ART
>>> hardware .
>>>
>>> I do strongly suggest that Linux exports its calibrated TSC Khz
>>> somewhere to user
>>> space .
>>>
>>> I think the best long-term solution would be to allow programs to
>>> somehow read the TSC without invoking
>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), &
>>> having to enter the kernel, which incurs an overhead of > 120ns on my
>>> system
>>> .
>>>
>>>
>>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
>>> 'clocksource->shift' values to /sysfs somehow ?
>>>
>>> For instance , only  if the 'current_clocksource' is 'tsc', then these
>>> values could be exported as:
>>> /sys/devices/system/clocksource/clocksource0/shift
>>> /sys/devices/system/clocksource/clocksource0/mult
>>> /sys/devices/system/clocksource/clocksource0/freq
>>>
>>> So user-space programs could  know that the value returned by
>>>     clock_gettime(CLOCK_MONOTONIC_RAW)
>>>   would be
>>>     {    .tv_sec =  ( ( rdtsc() * mult ) >> shift ) >> 32,
>>>       , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U
>>>     }
>>>   and that represents ticks of period (1.0 / ( freq * 1000 )) S.
>>>
>>> That would save user-space programs from having to know 'tsc_khz' by
>>> parsing the 'Refined TSC' frequency from log files or by examining the
>>> running kernel with objdump to obtain this value & figure out 'mult' &
>>> 'shift' themselves.
>>>
>>> And why not a
>>>   /sys/devices/system/clocksource/clocksource0/value
>>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>>> expression as a long integer?
>>> And perhaps a
>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>>> file that actually prints out the number of real-time nano-seconds since
>>> the
>>> contents of the existing
>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>>> files using the current TSC value?
>>> To read the rtc0/{date,time} files is already faster than entering the
>>> kernel to call
>>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts.
>>>
>>> I will work on developing a patch to this effect if no-one else is.
>>>
>>> Also, am I right in assuming that the maximum granularity of the
>>> real-time
>>> clock
>>> on my system is 1/64th of a second ? :
>>>  $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>>>  64
>>> This is the maximum granularity that can be stored in CMOS , not
>>> returned by TSC? Couldn't we have something similar that gave an
>>> accurate idea of TSC frequency and the precise formula applied to TSC
>>> value to get clock_gettime
>>> (CLOCK_MONOTONIC_RAW) value ?
>>>
>>> Regards,
>>> Jason
>>>
>>>
>>> This code does produce good timestamps with a latency of @20ns
>>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>>> values, but it depends on a global variable that is initialized to
>>> the 'tsc_khz' value computed by the running kernel, parsed from
>>> objdump /proc/kcore output :
>>>
>>> static inline __attribute__((always_inline))
>>> U64_t
>>> IA64_tsc_now()
>>> { if(!(    _ia64_invariant_tsc_enabled
>>>       ||(( _cpu0id_fd == -1) &&
>>> IA64_invariant_tsc_is_enabled(NULL,NULL))
>>>       )
>>>     )
>>>   { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant
>>> TSC enabled.\n", __LINE__, __func__);
>>>     return 0;
>>>   }
>>>   U32_t tsc_hi, tsc_lo;
>>>   register UL_t tsc;
>>>   asm volatile
>>>   ( "rdtscp\n\t"
>>>     "mov %%edx, %0\n\t"
>>>     "mov %%eax, %1\n\t"
>>>     "mov %%ecx, %2\n\t"
>>>   : "=m" (tsc_hi) ,
>>>     "=m" (tsc_lo) ,
>>>     "=m" (_ia64_tsc_user_cpu) :
>>>   : "%eax","%ecx","%edx"
>>>   );
>>>   tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>>>   return tsc;
>>> }
>>>
>>> __thread
>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>>
>>> static inline __attribute__((always_inline))
>>> U64_t IA64_tsc_ticks_since_start()
>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>>   { _ia64_first_tsc = IA64_tsc_now();
>>>     return 0;
>>>   }
>>>   return (IA64_tsc_now() - _ia64_first_tsc) ;
>>> }
>>>
>>> static inline __attribute__((always_inline))
>>> void
>>> ia64_tsc_calc_mult_shift
>>> ( register U32_t *mult,
>>>   register U32_t *shift
>>> )
>>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift()
>>> function:
>>>    * calculates second + nanosecond mult + shift in same way linux does.
>>>    * we want to be compatible with what linux returns in struct
>>> timespec ts after call to
>>>    * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>>    */
>>>   const U32_t scale=1000U;
>>>   register U32_t from= IA64_tsc_khz();
>>>   register U32_t to  = NSEC_PER_SEC / scale;
>>>   register U64_t sec = ( ~0UL / from ) / scale;
>>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>>   register U64_t maxsec = sec * scale;
>>>   UL_t tmp;
>>>   U32_t sft, sftacc=32;
>>>   /*
>>>    * Calculate the shift factor which is limiting the conversion
>>>    * range:
>>>    */
>>>   tmp = (maxsec * from) >> 32;
>>>   while (tmp)
>>>   { tmp >>=1;
>>>     sftacc--;
>>>   }
>>>   /*
>>>    * Find the conversion shift/mult pair which has the best
>>>    * accuracy and fits the maxsec conversion range:
>>>    */
>>>   for (sft = 32; sft > 0; sft--)
>>>   { tmp = ((UL_t) to) << sft;
>>>     tmp += from / 2;
>>>     tmp = tmp / from;
>>>     if ((tmp >> sftacc) == 0)
>>>       break;
>>>   }
>>>   *mult = tmp;
>>>   *shift = sft;
>>> }
>>>
>>> __thread
>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>>
>>> static inline __attribute__((always_inline))
>>> U64_t IA64_s_ns_since_start()
>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>>   register U64_t ns = ((cycles
>>> *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
>>>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns %
>>> NSEC_PER_SEC)&0x3fffffffUL) );
>>>   /* Yes, we are purposefully ignoring durations of more than 4.2
>>> billion seconds here! */
>>> }
>>>
>>>
>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>>> somehow,
>>> then user-space libraries could have more confidence in using 'rdtsc'
>>> or 'rdtscp'
>>> if Linux's current_clocksource is 'tsc'.
>>>
>>> Regards,
>>> Jason
>>>
>>>
>>>
>>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>>
>>>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>>>>> in detect_art() in tsc.c,
>>>>
>>>> By some definition of available. You can feed CPUID random leaf numbers
>>>> and
>>>> it will return something, usually the value of the last valid CPUID
>>>> leaf,
>>>> which is 13 on your CPU. A similar CPU model has
>>>>
>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
>>>> edx=0x00000000
>>>>
>>>> i.e. 7, 832, 832, 0
>>>>
>>>> Looks familiar, right?
>>>>
>>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>>
>>>>> Linux does not think ART is enabled, and does not set the synthesized
>>>>> CPUID +
>>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>>>>> see this bit set .
>>>>
>>>> Rightfully so. This is a Haswell Core model.
>>>>
>>>>> if an e1000 NIC card had been installed, PTP would not be available.
>>>>
>>>> PTP is independent of the ART kernel feature . ART just provides
>>>> enhanced
>>>> PTP features. You are confusing things here.
>>>>
>>>> The ART feature as the kernel sees it is a hardware extension which
>>>> feeds
>>>> the ART clock to peripherals for timestamping and time correlation
>>>> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15
>>>> so
>>>> the kernel can make use of that correlation, e.g. for enhanced PTP
>>>> accuracy.
>>>>
>>>> It's correct, that the NONSTOP_TSC feature depends on the availability
>>>> of
>>>> ART, but that has nothing to do with the feature bit, which solely
>>>> describes the ratio between TSC and the ART frequency which is exposed
>>>> to
>>>> peripherals. That frequency is not necessarily the real ART frequency.
>>>>
>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to
>>>>> be
>>>>> nowhere else in Linux,  the code will always think X86_FEATURE_ART is
>>>>> 0
>>>>> because the CPU will always get a fault reading the MSR since it has
>>>>> never been written.
>>>>
>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>>> really
>>>> wrong. And writing it unconditionally to 0 is not going to happen. 4.10
>>>> has
>>>> new code which utilizes the TSC_ADJUST MSR.
>>>>
>>>>> It would be nice for user-space programs that want to use the TSC with
>>>>> rdtsc / rdtscp instructions, such as the demo program attached to the
>>>>> bug report,
>>>>> could have confidence that Linux is actually generating the results of
>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>>> in a predictable way from the TSC by looking at the
>>>>>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>>>>> use of TSC values, so that they can correlate TSC values with linux
>>>>> clock_gettime() values.
>>>>
>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>>
>>>> Nothing at all, really.
>>>>
>>>> The kernel makes use of the proper information values already.
>>>>
>>>> The TSC frequency is determined from:
>>>>
>>>>     1) CPUID(0x16) if available
>>>>     2) MSRs if available
>>>>     3) By calibration against a known clock
>>>>
>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values
>>>> are
>>>> correct whether that machine has ART exposed to peripherals or not.
>>>>
>>>>> has tsc: 1 constant: 1
>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>>
>>>> And that voodoo math tells us what? That you found a way to correlate
>>>> CPUID(0xd) to the TSC frequency on that machine.
>>>>
>>>> Now I'm curious how you do that on this other machine which returns for
>>>> cpuid(15): 1, 1, 1
>>>>
>>>> You can't because all of this is completely wrong.
>>>>
>>>> Thanks,
>>>>
>>>> 	tglx
>>>>
>>>
>>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
@ 2017-02-22 17:27             ` Jason Vas Dias
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-22 17:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
read or written. It is probably because it genuinely does not
support any cpuid > 13 ,
or the modern TSC_ADJUST interface . This is probably why my clock_gettime()
latencies are so bad. Now I have to develop a patch to disable all access to
TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 .
I really have an unlucky CPU :-) .

But really, I think this issue goes deeper into the fundamental limits of
time measurement on Linux : it is never going to be possible to measure
minimum times with clock_gettime() comparable with those returned by
rdtscp instruction - the time taken to enter the kernel through the VDSO,
queue an access to vsyscall_gtod_data via a workqueue, access it & do
computations & copy value to user-space is NEVER going to be up to the
job of measuring small real-time durations of the order of 10-20 TSC ticks .

I think the best way to solve this problem going forward would be to store
the entire vsyscall_gtod_data  data structure representing the current
clocksource
in a shared page which is memory-mappable (read-only) by user-space .
I think user-space programs should be able to do something like :
    int fd = open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY);
    size_t psz = getpagesize();
    void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 );
    msync(gtod,psz,MS_SYNC);

Then they could all read the real-time clock values as they are updated
in real-time by the kernel, and know exactly how to interpret them .

I also think that all mktime() / gmtime() / localtime() timezone handling
functionality should be
moved to user-space, and that the kernel should actually load and link in some
/lib/libtzdata.so
library, provided by glibc / libc implementations, that is exactly the
same library
used by glibc code to parse tzdata; tzdata should be loaded at boot time
by the kernel from the same places glibc loads it, and both the kernel and
glibc should use identical mktime(), gmtime(), etc. functions to access it, and
glibc using code would not need to enter the kernel at all for any time-handling
code. This tzdata-library code could be automatically loaded into process images the
same way the vdso region is, and the whole system could access only one
copy of it and the 'gtod.page' in memory.

That's just my two-cents worth, and how I'd like to eventually get
things working
on my system.

All the best, Regards,
Jason


On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>> RE:
>>>> 4.10 has  new code which utilizes the TSC_ADJUST MSR.
>>
>> I just built an unpatched linux v4.10 with tglx's TSC improvements -
>> much else improved in this kernel (like iwlwifi) - thanks!
>>
>> I have attached an updated version of the test program which
>> doesn't print the bogus "Nominal TSC Frequency" (the previous
>> version printed it, but equally ignored it).
>>
>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>> a factor of 2 - it used to be @140ns and is now @ 70ns  ! Wow!  :
>>
>> $ uname -r
>> 4.10.0
>> $ ./ttsc1
>> max_extended_leaf: 80000008
>> has tsc: 1 constant: 1
>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
>> ts3 - ts2: 178 ns1: 0.000000592
>> ts3 - ts2: 14 ns1: 0.000000577
>> ts3 - ts2: 14 ns1: 0.000000651
>> ts3 - ts2: 17 ns1: 0.000000625
>> ts3 - ts2: 17 ns1: 0.000000677
>> ts3 - ts2: 17 ns1: 0.000000626
>> ts3 - ts2: 17 ns1: 0.000000627
>> ts3 - ts2: 17 ns1: 0.000000627
>> ts3 - ts2: 18 ns1: 0.000000655
>> ts3 - ts2: 17 ns1: 0.000000631
>> t1 - t0: 89067 - ns2: 0.000091411
>>
>
>
> Oops, going blind in my old age. These latencies are actually 3 times
> greater than under 4.8 !!
>
> Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as shown
> in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:
>
> ts3 - ts2: 24 ns1: 0.000000162
> ts3 - ts2: 17 ns1: 0.000000143
> ts3 - ts2: 17 ns1: 0.000000146
> ts3 - ts2: 17 ns1: 0.000000149
> ts3 - ts2: 17 ns1: 0.000000141
> ts3 - ts2: 16 ns1: 0.000000142
>
> now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @
> 600ns, @ 4 times more than under 4.8 .
> But I'm glad the TSC_ADJUST problems are fixed.
>
> Will programs reading :
>  $ cat /sys/devices/msr/events/tsc
>  event=0x00
> read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the
> TSC ?
>
>> I think this is because under Linux 4.8, the CPU got a fault every
>> time it read the TSC_ADJUST MSR.
>
> maybe it still is!
>
>
>> But user programs wanting to use the TSC  and correlate its value to
>> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
>> program still have to  dig the TSC frequency value out of the kernel
>> with objdump  - this was really the point of the bug #194609.
>>
>> I would still like to investigate exporting 'tsc_khz' & 'mult' +
>> 'shift' values via sysfs.
>>
>> Regards,
>> Jason.
>>
>>
>>
>>
>>
>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>> Thank You for enlightening me -
>>>
>>> I was just having a hard time believing that Intel would ship a chip
>>> that features a monotonic, fixed frequency timestamp counter
>>> without specifying in either documentation or on-chip or in ACPI what
>>> precisely that hard-wired frequency is, but I now know that to
>>> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU
>>> assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is
>>> difficult to reconcile with the statement in the SDM :
>>>   17.16.4  Invariant Time-Keeping
>>>     The invariant TSC is based on the invariant timekeeping hardware
>>>     (called Always Running Timer or ART), that runs at the core crystal
>>> clock
>>>     frequency. The ratio defined by CPUID leaf 15H expresses the
>>> frequency
>>>     relationship between the ART hardware and TSC. If
>>> CPUID.15H:EBX[31:0]
>>> != 0
>>>     and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
>>>     relationship holds between TSC and the ART hardware:
>>>     TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )
>>>                          / CPUID.15H:EAX[31:0] + K
>>>     Where 'K' is an offset that can be adjusted by a privileged agent*2.
>>>      When ART hardware is reset, both invariant TSC and K are also
>>> reset.
>>>
>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0]  and
>>> CPUID.15H:EAX[31:0]  are for my hardware.  I assumed (incorrectly)
>>> that
>>> the "Nominal TSC Frequency" formulae in the manual must apply to all
>>> CPUs with InvariantTSC .
>>>
>>> Do I understand correctly , that since I do have InvariantTSC ,  the
>>> TSC_Value is in fact calculated according to the above formula, but with
>>> a "hidden" ART Value,  & Core Crystal Clock frequency & its ratio to
>>> TSC frequency ?
>>> It was obvious this nominal TSC Frequency had nothing to do with the
>>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>>> I guess wishful thinking led me to believe CPUID:15h was actually
>>> supported somehow , because I thought InvariantTSC meant it had ART
>>> hardware .
>>>
>>> I do strongly suggest that Linux exports its calibrated TSC Khz
>>> somewhere to user
>>> space .
>>>
>>> I think the best long-term solution would be to allow programs to
>>> somehow read the TSC without invoking
>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), &
>>> having to enter the kernel, which incurs an overhead of > 120ns on my
>>> system
>>> .
>>>
>>>
>>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
>>> 'clocksource->shift' values to /sysfs somehow ?
>>>
>>> For instance , only  if the 'current_clocksource' is 'tsc', then these
>>> values could be exported as:
>>> /sys/devices/system/clocksource/clocksource0/shift
>>> /sys/devices/system/clocksource/clocksource0/mult
>>> /sys/devices/system/clocksource/clocksource0/freq
>>>
>>> So user-space programs could  know that the value returned by
>>>     clock_gettime(CLOCK_MONOTONIC_RAW)
>>>   would be
>>>     {    .tv_sec =  ( ( rdtsc() * mult ) >> shift ) >> 32,
>>>       , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U
>>>     }
>>>   and that represents ticks of period (1.0 / ( freq * 1000 )) S.
>>>
>>> That would save user-space programs from having to know 'tsc_khz' by
>>> parsing the 'Refined TSC' frequency from log files or by examining the
>>> running kernel with objdump to obtain this value & figure out 'mult' &
>>> 'shift' themselves.
>>>
>>> And why not a
>>>   /sys/devices/system/clocksource/clocksource0/value
>>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>>> expression as a long integer?
>>> And perhaps a
>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>>> file that actually prints out the number of real-time nano-seconds since
>>> the
>>> contents of the existing
>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>>> files using the current TSC value?
>>> To read the rtc0/{date,time} files is already faster than entering the
>>> kernel to call
>>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts.
>>>
>>> I will work on developing a patch to this effect if no-one else is.
>>>
>>> Also, am I right in assuming that the maximum granularity of the
>>> real-time
>>> clock
>>> on my system is 1/64th of a second ? :
>>>  $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>>>  64
>>> This is the maximum granularity that can be stored in CMOS , not
>>> returned by TSC? Couldn't we have something similar that gave an
>>> accurate idea of TSC frequency and the precise formula applied to TSC
>>> value to get clock_gettime
>>> (CLOCK_MONOTONIC_RAW) value ?
>>>
>>> Regards,
>>> Jason
>>>
>>>
>>> This code does produce good timestamps with a latency of @20ns
>>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>>> values, but it depends on a global variable that is initialized to
>>> the 'tsc_khz' value computed by the running kernel, parsed from
>>> objdump /proc/kcore output :
>>>
>>> static inline __attribute__((always_inline))
>>> U64_t
>>> IA64_tsc_now()
>>> { if(!(    _ia64_invariant_tsc_enabled
>>>       ||(( _cpu0id_fd == -1) &&
>>> IA64_invariant_tsc_is_enabled(NULL,NULL))
>>>       )
>>>     )
>>>   { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant
>>> TSC enabled.\n", __LINE__, __func__);
>>>     return 0;
>>>   }
>>>   U32_t tsc_hi, tsc_lo;
>>>   register UL_t tsc;
>>>   asm volatile
>>>   ( "rdtscp\n\t"
>>>     "mov %%edx, %0\n\t"
>>>     "mov %%eax, %1\n\t"
>>>     "mov %%ecx, %2\n\t"
>>>   : "=m" (tsc_hi) ,
>>>     "=m" (tsc_lo) ,
>>>     "=m" (_ia64_tsc_user_cpu) :
>>>   : "%eax","%ecx","%edx"
>>>   );
>>>   tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>>>   return tsc;
>>> }
>>>
>>> __thread
>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>>
>>> static inline __attribute__((always_inline))
>>> U64_t IA64_tsc_ticks_since_start()
>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>>   { _ia64_first_tsc = IA64_tsc_now();
>>>     return 0;
>>>   }
>>>   return (IA64_tsc_now() - _ia64_first_tsc) ;
>>> }
>>>
>>> static inline __attribute__((always_inline))
>>> void
>>> ia64_tsc_calc_mult_shift
>>> ( register U32_t *mult,
>>>   register U32_t *shift
>>> )
>>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift()
>>> function:
>>>    * calculates second + nanosecond mult + shift in same way linux does.
>>>    * we want to be compatible with what linux returns in struct
>>> timespec ts after call to
>>>    * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>>    */
>>>   const U32_t scale=1000U;
>>>   register U32_t from= IA64_tsc_khz();
>>>   register U32_t to  = NSEC_PER_SEC / scale;
>>>   register U64_t sec = ( ~0UL / from ) / scale;
>>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>>   register U64_t maxsec = sec * scale;
>>>   UL_t tmp;
>>>   U32_t sft, sftacc=32;
>>>   /*
>>>    * Calculate the shift factor which is limiting the conversion
>>>    * range:
>>>    */
>>>   tmp = (maxsec * from) >> 32;
>>>   while (tmp)
>>>   { tmp >>=1;
>>>     sftacc--;
>>>   }
>>>   /*
>>>    * Find the conversion shift/mult pair which has the best
>>>    * accuracy and fits the maxsec conversion range:
>>>    */
>>>   for (sft = 32; sft > 0; sft--)
>>>   { tmp = ((UL_t) to) << sft;
>>>     tmp += from / 2;
>>>     tmp = tmp / from;
>>>     if ((tmp >> sftacc) == 0)
>>>       break;
>>>   }
>>>   *mult = tmp;
>>>   *shift = sft;
>>> }
>>>
>>> __thread
>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>>
>>> static inline __attribute__((always_inline))
>>> U64_t IA64_s_ns_since_start()
>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>>   register U64_t ns = ((cycles
>>> *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
>>>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns %
>>> NSEC_PER_SEC)&0x3fffffffUL) );
>>>   /* Yes, we are purposefully ignoring durations of more than 4.2
>>> billion seconds here! */
>>> }
>>>
>>>
>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>>> somehow,
>>> then user-space libraries could have more confidence in using 'rdtsc'
>>> or 'rdtscp'
>>> if Linux's current_clocksource is 'tsc'.
>>>
>>> Regards,
>>> Jason
>>>
>>>
>>>
>>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>>
>>>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>>>>> in detect_art() in tsc.c,
>>>>
>>>> By some definition of available. You can feed CPUID random leaf numbers
>>>> and
>>>> it will return something, usually the value of the last valid CPUID
>>>> leaf,
>>>> which is 13 on your CPU. A similar CPU model has
>>>>
>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
>>>> edx=0x00000000
>>>>
>>>> i.e. 7, 832, 832, 0
>>>>
>>>> Looks familiar, right?
>>>>
>>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>>
>>>>> Linux does not think ART is enabled, and does not set the synthesized
>>>>> CPUID +
>>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>>>>> see this bit set .
>>>>
>>>> Rightfully so. This is a Haswell Core model.
>>>>
>>>>> if an e1000 NIC card had been installed, PTP would not be available.
>>>>
>>>> PTP is independent of the ART kernel feature . ART just provides
>>>> enhanced
>>>> PTP features. You are confusing things here.
>>>>
>>>> The ART feature as the kernel sees it is a hardware extension which
>>>> feeds
>>>> the ART clock to peripherals for timestamping and time correlation
>>>> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15
>>>> so
>>>> the kernel can make use of that correlation, e.g. for enhanced PTP
>>>> accuracy.
>>>>
>>>> It's correct, that the NONSTOP_TSC feature depends on the availability
>>>> of
>>>> ART, but that has nothing to do with the feature bit, which solely
>>>> describes the ratio between TSC and the ART frequency which is exposed
>>>> to
>>>> peripherals. That frequency is not necessarily the real ART frequency.
>>>>
>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to
>>>>> be
>>>>> nowhere else in Linux,  the code will always think X86_FEATURE_ART is
>>>>> 0
>>>>> because the CPU will always get a fault reading the MSR since it has
>>>>> never been written.
>>>>
>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>>> really
>>>> wrong. And writing it unconditionally to 0 is not going to happen. 4.10
>>>> has
>>>> new code which utilizes the TSC_ADJUST MSR.
>>>>
>>>>> It would be nice for user-space programs that want to use the TSC with
>>>>> rdtsc / rdtscp instructions, such as the demo program attached to the
>>>>> bug report,
>>>>> could have confidence that Linux is actually generating the results of
>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>>> in a predictable way from the TSC by looking at the
>>>>>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>>>>> use of TSC values, so that they can correlate TSC values with linux
>>>>> clock_gettime() values.
>>>>
>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>>
>>>> Nothing at all, really.
>>>>
>>>> The kernel makes use of the proper information values already.
>>>>
>>>> The TSC frequency is determined from:
>>>>
>>>>     1) CPUID(0x16) if available
>>>>     2) MSRs if available
>>>>     3) By calibration against a known clock
>>>>
>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values
>>>> are
>>>> correct whether that machine has ART exposed to peripherals or not.
>>>>
>>>>> has tsc: 1 constant: 1
>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>>
>>>> And that voodoo math tells us what? That you found a way to correlate
>>>> CPUID(0xd) to the TSC frequency on that machine.
>>>>
>>>> Now I'm curious how you do that on this other machine which returns for
>>>> cpuid(15): 1, 1, 1
>>>>
>>>> You can't because all of this is completely wrong.
>>>>
>>>> Thanks,
>>>>
>>>> 	tglx
>>>>
>>>
>>
>


* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-22 17:27             ` Jason Vas Dias
@ 2017-02-22 19:53               ` Thomas Gleixner
  -1 siblings, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2017-02-22 19:53 UTC (permalink / raw)
  To: Jason Vas Dias
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

On Wed, 22 Feb 2017, Jason Vas Dias wrote:

> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
> read or written. It is probably because it genuinely does not support
> any cpuid > 13 , or the modern TSC_ADJUST interface.

Err no. TSC_ADJUST is available when CPUID(7) EBX has bit 1 set.

Please provide the output of:

# cpuid -1 -r

for that machine

> This is probably why my clock_gettime() latencies are so bad. Now I have
> to develop a patch to disable all access to TSC_ADJUST MSR if
> boot_cpu_data.cpuid_level <= 13 .  I really have an unlucky CPU :-) .

Can you just try to boot Linux 4.10 on that machine and report whether it
works? It will touch the TSC_ADJUST MSR when the feature bit is set.

> But really, I think this issue goes deeper into the fundamental limits of
> time measurement on Linux : it is never going to be possible to measure
> minimum times with clock_gettime() comparable with those returned by
> rdtscp instruction - the time taken to enter the kernel through the VDSO,
> queue an access to vsyscall_gtod_data via a workqueue, access it & do
> computations & copy value to user-space

Sorry, that's not how the VDSO works. It does not involve workqueues, copy
to user space and whatever. The VDSO is mapped into user space and only goes
into the kernel when TSC is not working, the VDSO access is disabled, or you
want to access a CLOCKID which is not supported in the VDSO.

> is NEVER going to be up to the job of measuring small real-time durations
> of the order of 10-20 TSC ticks .

clock_gettime(CLOCK_MONOTONIC) via VDSO takes ~20ns on my Haswell laptop.

> I think the best way to solve this problem going forward would be to store
> the entire vsyscall_gtod_data  data structure representing the current
> clocksource
> in a shared page which is memory-mappable (read-only) by user-space .

This is what VDSO does. It provides the data R/O to user space and it also
provides the accessor functions.

CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_MONOTONIC_COARSE and
CLOCK_REALTIME_COARSE are handled in the VDSO (user space) and never enter
the kernel.

I really have a hard time understanding what you are trying to solve.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-22 17:27             ` Jason Vas Dias
@ 2017-02-22 20:15               ` Jason Vas Dias
  -1 siblings, 0 replies; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-22 20:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

I actually tried adding a 'notsc_adjust' kernel option to disable any setting or
access to the TSC_ADJUST MSR, but then I see the problems - a big disparity
in values depending on which CPU the thread is scheduled on - and no
improvement in clock_gettime() latency. So I don't think the new TSC_ADJUST
code in tsc_sync.c itself is the issue - but something has added @ 460ns
onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 .
As I don't think fixing the clock_gettime() latency issue is my problem, or even
possible with the current clock architecture approach, it is a non-issue.

But please, can anyone tell me whether there are any plans to move the time
infrastructure out of the kernel and into glibc along the lines outlined
in my previous mail - if not, I am going to concentrate on this more radical
overhaul approach for my own systems .

At least, I think mapping the clocksource information structure itself in some
kind of sharable page makes sense . Processes could map that page copy-on-write
so they could start off with all the timing parameters preloaded,  then keep
their copy updated using the rdtscp instruction , or msync() (read-only)
with the kernel's single copy to get the latest time any process has requested.
All real-time parameters & adjustments could be stored in that page ,
& eventually a single copy of the tzdata could be used by both kernel
& user-space.
That is what I am working towards. Are there any plans to make the Linux
real-time TSC clock user-friendly ?



On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
> read or written . It is probably because it genuinely does not
> support any cpuid > 13 ,
> or the modern TSC_ADJUST interface . This is probably why my
> clock_gettime()
> latencies are so bad. Now I have to develop a patch to disable all access
> to
> TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 .
> I really have an unlucky CPU :-) .
>
> But really, I think this issue goes deeper into the fundamental limits of
> time measurement on Linux : it is never going to be possible to measure
> minimum times with clock_gettime() comparable with those returned by
> rdtscp instruction - the time taken to enter the kernel through the VDSO,
> queue an access to vsyscall_gtod_data via a workqueue, access it & do
> computations & copy value to user-space is NEVER going to be up to the
> job of measuring small real-time durations of the order of 10-20 TSC ticks
> .
>
> I think the best way to solve this problem going forward would be to store
> the entire vsyscall_gtod_data  data structure representing the current
> clocksource
> in a shared page which is memory-mappable (read-only) by user-space .
> I think user-space programs should be able to do something like :
>     int fd =
> open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY);
>     size_t psz = getpagesize();
>     void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 );
>     msync(gtod,psz,MS_SYNC);
>
> Then they could all read the real-time clock values as they are updated
> in real-time by the kernel, and know exactly how to interpret them .
>
> I also think that all mktime() / gmtime() / localtime() timezone handling
> functionality should be
> moved to user-space, and that the kernel should actually load and link in
> some
> /lib/libtzdata.so
> library, provided by glibc / libc implementations, that is exactly the
> same library
> used by glibc() code to parse tzdata ; tzdata should be loaded at boot time
> by the kernel from the same places glibc loads it, and both the kernel and
> glibc should use identical mktime(), gmtime(), etc. functions to access it,
> and
> glibc-using code would not need to enter the kernel at all for any
> time-handling
> code. This tzdata-library code could be automatically loaded into process images
> the
> same way the vdso region is , and the whole system could access only one
> copy of it and the 'gtod.page' in memory.
>
> That's just my two-cents worth, and how I'd like to eventually get
> things working
> on my system.
>
> All the best, Regards,
> Jason
>
>
>
>
>
>
>
>
>
>
>
>
>
> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>> RE:
>>>>> 4.10 has  new code which utilizes the TSC_ADJUST MSR.
>>>
>>> I just built an unpatched linux v4.10 with tglx's TSC improvements -
>>> much else improved in this kernel (like iwlwifi) - thanks!
>>>
>>> I have attached an updated version of the test program which
>>> doesn't print the bogus "Nominal TSC Frequency" (the previous
>>> version printed it, but equally ignored it).
>>>
>>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>>> a factor of 2 - it used to be @140ns and is now @ 70ns  ! Wow!  :
>>>
>>> $ uname -r
>>> 4.10.0
>>> $ ./ttsc1
>>> max_extended_leaf: 80000008
>>> has tsc: 1 constant: 1
>>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
>>> ts3 - ts2: 178 ns1: 0.000000592
>>> ts3 - ts2: 14 ns1: 0.000000577
>>> ts3 - ts2: 14 ns1: 0.000000651
>>> ts3 - ts2: 17 ns1: 0.000000625
>>> ts3 - ts2: 17 ns1: 0.000000677
>>> ts3 - ts2: 17 ns1: 0.000000626
>>> ts3 - ts2: 17 ns1: 0.000000627
>>> ts3 - ts2: 17 ns1: 0.000000627
>>> ts3 - ts2: 18 ns1: 0.000000655
>>> ts3 - ts2: 17 ns1: 0.000000631
>>> t1 - t0: 89067 - ns2: 0.000091411
>>>
>>
>>
>> Oops, going blind in my old age. These latencies are actually 3 times
>> greater than under 4.8 !!
>>
>> Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as
>> shown
>> in bug 194609 as the 'ns1' (timespec_b - timespec_a) value::
>>
>> ts3 - ts2: 24 ns1: 0.000000162
>> ts3 - ts2: 17 ns1: 0.000000143
>> ts3 - ts2: 17 ns1: 0.000000146
>> ts3 - ts2: 17 ns1: 0.000000149
>> ts3 - ts2: 17 ns1: 0.000000141
>> ts3 - ts2: 16 ns1: 0.000000142
>>
>> now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @
>> 600ns, @ 4 times more than under 4.8 .
>> But I'm glad the TSC_ADJUST problems are fixed.
>>
>> Will programs reading :
>>  $ cat /sys/devices/msr/events/tsc
>>  event=0x00
>> read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the
>> TSC ?
>>
>>> I think this is because under Linux 4.8, the CPU got a fault every
>>> time it read the TSC_ADJUST MSR.
>>
>> maybe it still is!
>>
>>
>>> But user programs wanting to use the TSC  and correlate its value to
>>> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
>>> program still have to  dig the TSC frequency value out of the kernel
>>> with objdump  - this was really the point of the bug #194609.
>>>
>>> I would still like to investigate exporting 'tsc_khz' & 'mult' +
>>> 'shift' values via sysfs.
>>>
>>> Regards,
>>> Jason.
>>>
>>>
>>>
>>>
>>>
>>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>>> Thank You for enlightening me -
>>>>
>>>> I was just having a hard time believing that Intel would ship a chip
>>>> that features a monotonic, fixed frequency timestamp counter
>>>> without specifying in either documentation or on-chip or in ACPI what
>>>> precisely that hard-wired frequency is, but I now know that to
>>>> be the case for the unfortunate i7-4910MQ - I mean, the CPU does
>>>> assert CPUID:80000007[8] ( InvariantTSC ), which is
>>>> difficult to reconcile with the statement in the SDM :
>>>>   17.16.4  Invariant Time-Keeping
>>>>     The invariant TSC is based on the invariant timekeeping hardware
>>>>     (called Always Running Timer or ART), that runs at the core crystal
>>>> clock
>>>>     frequency. The ratio defined by CPUID leaf 15H expresses the
>>>> frequency
>>>>     relationship between the ART hardware and TSC. If
>>>> CPUID.15H:EBX[31:0]
>>>> !=
>>>> 0
>>>>     and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
>>>>     relationship holds between TSC and the ART hardware:
>>>>     TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )
>>>>                          / CPUID.15H:EAX[31:0] + K
>>>>     Where 'K' is an offset that can be adjusted by a privileged
>>>> agent*2.
>>>>      When ART hardware is reset, both invariant TSC and K are also
>>>> reset.
>>>>
>>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0]  and
>>>> CPUID.15H:EAX[31:0]  are for my hardware.  I assumed (incorrectly)
>>>> that
>>>> the "Nominal TSC Frequency" formulae in the manual must apply to all
>>>> CPUs with InvariantTSC .
>>>>
>>>> Do I understand correctly , that since I do have InvariantTSC ,  the
>>>> TSC_Value is in fact calculated according to the above formula, but
>>>> with
>>>> a "hidden" ART Value,  & Core Crystal Clock frequency & its ratio to
>>>> TSC frequency ?
>>>> It was obvious this nominal TSC Frequency had nothing to do with the
>>>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>>>> I guess wishful thinking led me to believe CPUID:15h was actually
>>>> supported somehow , because I thought InvariantTSC meant it had ART
>>>> hardware .
>>>>
>>>> I do strongly suggest that Linux exports its calibrated TSC Khz
>>>> somewhere to user
>>>> space .
>>>>
>>>> I think the best long-term solution would be to allow programs to
>>>> somehow read the TSC without invoking
>>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), &
>>>> having to enter the kernel, which incurs an overhead of > 120ns on my
>>>> system
>>>> .
>>>>
>>>>
>>>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
>>>> 'clocksource->shift' values to /sysfs somehow ?
>>>>
>>>> For instance , only  if the 'current_clocksource' is 'tsc', then these
>>>> values could be exported as:
>>>> /sys/devices/system/clocksource/clocksource0/shift
>>>> /sys/devices/system/clocksource/clocksource0/mult
>>>> /sys/devices/system/clocksource/clocksource0/freq
>>>>
>>>> So user-space programs could  know that the value returned by
>>>>     clock_gettime(CLOCK_MONOTONIC_RAW)
>>>>   would be
>>>>     {    .tv_sec =  ( ( rdtsc() * mult ) >> shift ) >> 32,
>>>>       , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U
>>>>     }
>>>>   and that represents ticks of period (1.0 / ( freq * 1000 )) S.
>>>>
>>>> That would save user-space programs from having to know 'tsc_khz' by
>>>> parsing the 'Refined TSC' frequency from log files or by examining the
>>>> running kernel with objdump to obtain this value & figure out 'mult' &
>>>> 'shift' themselves.
>>>>
>>>> And why not a
>>>>   /sys/devices/system/clocksource/clocksource0/value
>>>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>>>> expression as a long integer?
>>>> And perhaps a
>>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>>>> file that actually prints out the number of real-time nano-seconds
>>>> since
>>>> the
>>>> contents of the existing
>>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>>>> files using the current TSC value?
>>>> To read the rtc0/{date,time} files is already faster than entering the
>>>> kernel to call
>>>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts.
>>>>
>>>> I will work on developing a patch to this effect if no-one else is.
>>>>
>>>> Also, am I right in assuming that the maximum granularity of the
>>>> real-time
>>>> clock
>>>> on my system is 1/64th of a second ? :
>>>>  $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>>>>  64
>>>> This is the maximum granularity that can be stored in CMOS , not
>>>> returned by TSC? Couldn't we have something similar that gave an
>>>> accurate idea of TSC frequency and the precise formula applied to TSC
>>>> value to get clock_gettime
>>>> (CLOCK_MONOTONIC_RAW) value ?
>>>>
>>>> Regards,
>>>> Jason
>>>>
>>>>
>>>> This code does produce good timestamps with a latency of @20ns
>>>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>>>> values, but it depends on a global variable that  is initialized to
>>>> the 'tsc_khz' value
>>>> computed by running kernel parsed from objdump /proc/kcore output :
>>>>
>>>> static inline __attribute__((always_inline))
>>>> U64_t
>>>> IA64_tsc_now()
>>>> { if(!(    _ia64_invariant_tsc_enabled
>>>>       ||(( _cpu0id_fd == -1) &&
>>>> IA64_invariant_tsc_is_enabled(NULL,NULL))
>>>>       )
>>>>     )
>>>>   { fprintf(stderr, __FILE__ ":%d:(%s): must be called with invariant
>>>> TSC enabled.\n", __LINE__, __func__);
>>>>     return 0;
>>>>   }
>>>>   U32_t tsc_hi, tsc_lo;
>>>>   register UL_t tsc;
>>>>   asm volatile
>>>>   ( "rdtscp\n\t"
>>>>     "mov %%edx, %0\n\t"
>>>>     "mov %%eax, %1\n\t"
>>>>     "mov %%ecx, %2\n\t"
>>>>   : "=m" (tsc_hi) ,
>>>>     "=m" (tsc_lo) ,
>>>>     "=m" (_ia64_tsc_user_cpu) :
>>>>   : "%eax","%ecx","%edx"
>>>>   );
>>>>   tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>>>>   return tsc;
>>>> }
>>>>
>>>> __thread
>>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>>>
>>>> static inline __attribute__((always_inline))
>>>> U64_t IA64_tsc_ticks_since_start()
>>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>>>   { _ia64_first_tsc = IA64_tsc_now();
>>>>     return 0;
>>>>   }
>>>>   return (IA64_tsc_now() - _ia64_first_tsc) ;
>>>> }
>>>>
>>>> static inline __attribute__((always_inline))
>>>> void
>>>> ia64_tsc_calc_mult_shift
>>>> ( register U32_t *mult,
>>>>   register U32_t *shift
>>>> )
>>>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift()
>>>> function:
>>>>    * calculates second + nanosecond mult + shift in same way linux
>>>> does.
>>>>    * we want to be compatible with what linux returns in struct
>>>> timespec ts after call to
>>>>    * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>>>    */
>>>>   const U32_t scale=1000U;
>>>>   register U32_t from= IA64_tsc_khz();
>>>>   register U32_t to  = NSEC_PER_SEC / scale;
>>>>   register U64_t sec = ( ~0UL / from ) / scale;
>>>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>>>   register U64_t maxsec = sec * scale;
>>>>   UL_t tmp;
>>>>   U32_t sft, sftacc=32;
>>>>   /*
>>>>    * Calculate the shift factor which is limiting the conversion
>>>>    * range:
>>>>    */
>>>>   tmp = (maxsec * from) >> 32;
>>>>   while (tmp)
>>>>   { tmp >>=1;
>>>>     sftacc--;
>>>>   }
>>>>   /*
>>>>    * Find the conversion shift/mult pair which has the best
>>>>    * accuracy and fits the maxsec conversion range:
>>>>    */
>>>>   for (sft = 32; sft > 0; sft--)
>>>>   { tmp = ((UL_t) to) << sft;
>>>>     tmp += from / 2;
>>>>     tmp = tmp / from;
>>>>     if ((tmp >> sftacc) == 0)
>>>>       break;
>>>>   }
>>>>   *mult = tmp;
>>>>   *shift = sft;
>>>> }
>>>>
>>>> __thread
>>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>>>
>>>> static inline __attribute__((always_inline))
>>>> U64_t IA64_s_ns_since_start()
>>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>>>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>>>   register U64_t ns = ((cycles
>>>> *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
>>>>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns %
>>>> NSEC_PER_SEC)&0x3fffffffUL) );
>>>>   /* Yes, we are purposefully ignoring durations of more than 4.2
>>>> billion seconds here! */
>>>> }
>>>>
>>>>
>>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>>>> somehow,
>>>> then user-space libraries could have more confidence in using 'rdtsc'
>>>> or 'rdtscp'
>>>> if Linux's current_clocksource is 'tsc'.
>>>>
>>>> Regards,
>>>> Jason
>>>>
>>>>
>>>>
>>>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>>>
>>>>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>>>>>> in detect_art() in tsc.c,
>>>>>
>>>>> By some definition of available. You can feed CPUID random leaf
>>>>> numbers
>>>>> and
>>>>> it will return something, usually the value of the last valid CPUID
>>>>> leaf,
>>>>> which is 13 on your CPU. A similar CPU model has
>>>>>
>>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
>>>>> edx=0x00000000
>>>>>
>>>>> i.e. 7, 832, 832, 0
>>>>>
>>>>> Looks familiar, right?
>>>>>
>>>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>>>
>>>>>> Linux does not think ART is enabled, and does not set the synthesized
>>>>>> CPUID +
>>>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>>>>>> see this bit set .
>>>>>
>>>>> Rightfully so. This is a Haswell Core model.
>>>>>
>>>>>> if an e1000 NIC card had been installed, PTP would not be available.
>>>>>
>>>>> PTP is independent of the ART kernel feature . ART just provides
>>>>> enhanced
>>>>> PTP features. You are confusing things here.
>>>>>
>>>>> The ART feature as the kernel sees it is a hardware extension which
>>>>> feeds
>>>>> the ART clock to peripherals for timestamping and time correlation
>>>>> purposes. The ratio between ART and TSC is described by CPUID leaf
>>>>> 0x15
>>>>> so
>>>>> the kernel can make use of that correlation, e.g. for enhanced PTP
>>>>> accuracy.
>>>>>
>>>>> It's correct, that the NONSTOP_TSC feature depends on the availability
>>>>> of
>>>>> ART, but that has nothing to do with the feature bit, which solely
>>>>> describes the ratio between TSC and the ART frequency which is exposed
>>>>> to
>>>>> peripherals. That frequency is not necessarily the real ART frequency.
>>>>>
>>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to
>>>>>> be
>>>>>> nowhere else in Linux,  the code will always think X86_FEATURE_ART is
>>>>>> 0
>>>>>> because the CPU will always get a fault reading the MSR since it has
>>>>>> never been written.
>>>>>
>>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>>>> really
>>>>> wrong. And writing it unconditionally to 0 is not going to happen.
>>>>> 4.10
>>>>> has
>>>>> new code which utilizes the TSC_ADJUST MSR.
>>>>>
>>>>>> It would be nice for user-space programs that want to use the TSC
>>>>>> with
>>>>>> rdtsc / rdtscp instructions, such as the demo program attached to the
>>>>>> bug report,
>>>>>> could have confidence that Linux is actually generating the results
>>>>>> of
>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>>>> in a predictable way from the TSC by looking at the
>>>>>>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>>>>>> use of TSC values, so that they can correlate TSC values with linux
>>>>>> clock_gettime() values.
>>>>>
>>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>>>
>>>>> Nothing at all, really.
>>>>>
>>>>> The kernel makes use of the proper information values already.
>>>>>
>>>>> The TSC frequency is determined from:
>>>>>
>>>>>     1) CPUID(0x16) if available
>>>>>     2) MSRs if available
>>>>>     3) By calibration against a known clock
>>>>>
>>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_*
>>>>> values
>>>>> are
>>>>> correct whether that machine has ART exposed to peripherals or not.
>>>>>
>>>>>> has tsc: 1 constant: 1
>>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>>>
>>>>> And that voodoo math tells us what? That you found a way to correlate
>>>>> CPUID(0xd) to the TSC frequency on that machine.
>>>>>
>>>>> Now I'm curious how you do that on this other machine which returns
>>>>> for
>>>>> cpuid(15): 1, 1, 1
>>>>>
>>>>> You can't because all of this is completely wrong.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> 	tglx
>>>>>
>>>>
>>>
>>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
@ 2017-02-22 20:15               ` Jason Vas Dias
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-22 20:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

I actually tried adding a 'notsc_adjust' kernel option to disable any setting or
access to the TSC_ADJUST MSR, but then I see the problems  - a big disparity
in values depending on which CPU the thread is scheduled -  and no
improvement in clock_gettime() latency.  So I don't think the new
TSC_ADJUST
code in ts_sync.c itself is the issue - but something added @ 460ns
onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 .
As I don't think fixing the clock_gettime() latency issue is my problem or even
possible with current clock architecture approach, it is a non-issue.

But please, can anyone tell me if are there any plans to move the time
infrastructure  out of the kernel and into glibc along the lines
outlined
in previous mail - if not, I am going to concentrate on this more radical
overhaul approach for my own systems .

At least, I think mapping the clocksource information structure itself in some
kind of sharable page makes sense . Processes could map that page copy-on-write
so they could start off with all the timing parameters preloaded,  then keep
their copy updated using the rdtscp instruction , or msync() (read-only)
with the kernel's single copy to get the latest time any process has requested.
All real-time parameters & adjustments could be stored in that page ,
& eventually a single copy of the tzdata could be used by both kernel
& user-space.
That is what I am working towards. Any plans to make linux real-time tsc
clock user-friendly ?



On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
> read or written . It is probably because it genuinuely does not
> support any cpuid > 13 ,
> or the modern TSC_ADJUST interface . This is probably why my
> clock_gettime()
> latencies are so bad. Now I have to develop a patch to disable all access
> to
> TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 .
> I really have an unlucky CPU :-) .
>
> But really, I think this issue goes deeper into the fundamental limits of
> time measurement on Linux : it is never going to be possible to measure
> minimum times with clock_gettime() comparable with those returned by
> rdtscp instruction - the time taken to enter the kernel through the VDSO,
> queue an access to vsyscall_gtod_data via a workqueue, access it & do
> computations & copy value to user-space is NEVER going to be up to the
> job of measuring small real-time durations of the order of 10-20 TSC ticks
> .
>
> I think the best way to solve this problem going forward would be to store
> the entire vsyscall_gtod_data  data structure representing the current
> clocksource
> in a shared page which is memory-mappable (read-only) by user-space .
> I think sser-space programs should be able to do something like :
>     int fd > open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY);
>     size_t psz = getpagesize();
>     void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 );
>     msync(gtod,psz,MS_SYNC);
>
> Then they could all read the real-time clock values as they are updated
> in real-time by the kernel, and know exactly how to interpret them .
>
> I also think that all mktime() / gmtime() / localtime() timezone handling
> functionality should be
> moved to user-space, and that the kernel should actually load and link in
> some
> /lib/libtzdata.so
> library, provided by glibc / libc implementations, that is exactly the
> same library
> used by glibc() code to parse tzdata ; tzdata should be loaded at boot time
> by the kernel from the same places glibc loads it, and both the kernel and
> glibc should use identical mktime(), gmtime(), etc. functions to access it,
> and
> glibc using code would not need to enter the kernel at all for any
> time-handling
> code. This tzdata-library code be automatically loaded into process images
> the
> same way the vdso region is , and the whole system could access only one
> copy of it and the 'gtod.page' in memory.
>
> That's just my two-cents worth, and how I'd like to eventually get
> things working
> on my system.
>
> All the best, Regards,
> Jason
>
>
>
>
>
>
>
>
>
>
>
>
>
> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>> RE:
>>>>> 4.10 has  new code which utilizes the TSC_ADJUST MSR.
>>>
>>> I just built an unpatched linux v4.10 with tglx's TSC improvements -
>>> much else improved in this kernel (like iwlwifi) - thanks!
>>>
>>> I have attached an updated version of the test program which
>>> doesn't print the bogus "Nominal TSC Frequency" (the previous
>>> version printed it, but equally ignored it).
>>>
>>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>>> a factor of 2 - it used to be @140ns and is now @ 70ns  ! Wow!  :
>>>
>>> $ uname -r
>>> 4.10.0
>>> $ ./ttsc1
>>> max_extended_leaf: 80000008
>>> has tsc: 1 constant: 1
>>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
>>> ts3 - ts2: 178 ns1: 0.000000592
>>> ts3 - ts2: 14 ns1: 0.000000577
>>> ts3 - ts2: 14 ns1: 0.000000651
>>> ts3 - ts2: 17 ns1: 0.000000625
>>> ts3 - ts2: 17 ns1: 0.000000677
>>> ts3 - ts2: 17 ns1: 0.000000626
>>> ts3 - ts2: 17 ns1: 0.000000627
>>> ts3 - ts2: 17 ns1: 0.000000627
>>> ts3 - ts2: 18 ns1: 0.000000655
>>> ts3 - ts2: 17 ns1: 0.000000631
>>> t1 - t0: 89067 - ns2: 0.000091411
>>>
>>
>>
>> Oops, going blind in my old age. These latencies are actually 3 times
>> greater than under 4.8 !!
>>
>> Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as
>> shown
>> in bug 194609 as the 'ns1' (timespec_b - timespec_a) value::
>>
>> ts3 - ts2: 24 ns1: 0.000000162
>> ts3 - ts2: 17 ns1: 0.000000143
>> ts3 - ts2: 17 ns1: 0.000000146
>> ts3 - ts2: 17 ns1: 0.000000149
>> ts3 - ts2: 17 ns1: 0.000000141
>> ts3 - ts2: 16 ns1: 0.000000142
>>
>> now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @
>> 600ns, @ 4 times more than under 4.8 .
>> But I'm glad the TSC_ADJUST problems are fixed.
>>
>> Will programs reading :
>>  $ cat /sys/devices/msr/events/tsc
>>  event=0x00
>> read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the
>> TSC ?
>>
>>> I think this is because under Linux 4.8, the CPU got a fault every
>>> time it read the TSC_ADJUST MSR.
>>
>> maybe it still is!
>>
>>
>>> But user programs wanting to use the TSC  and correlate its value to
>>> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
>>> program still have to  dig the TSC frequency value out of the kernel
>>> with objdump  - this was really the point of the bug #194609.
>>>
>>> I would still like to investigate exporting 'tsc_khz' & 'mult' +
>>> 'shift' values via sysfs.
>>>
>>> Regards,
>>> Jason.
>>>
>>>
>>>
>>>
>>>
>>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>>> Thank You for enlightening me -
>>>>
>>>> I was just having a hard time believing that Intel would ship a chip
>>>> that features a monotonic, fixed frequency timestamp counter
>>>> without specifying in either documentation or on-chip or in ACPI what
>>>> precisely that hard-wired frequency is, but I now know that to
>>>> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU
>>>> assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is
>>>> difficult to reconcile with the statement in the SDM :
>>>>   17.16.4  Invariant Time-Keeping
>>>>     The invariant TSC is based on the invariant timekeeping hardware
>>>>     (called Always Running Timer or ART), that runs at the core crystal
>>>> clock
>>>>     frequency. The ratio defined by CPUID leaf 15H expresses the
>>>> frequency
>>>>     relationship between the ART hardware and TSC. If
>>>> CPUID.15H:EBX[31:0]
>>>> !>>>> 0
>>>>     and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
>>>>     relationship holds between TSC and the ART hardware:
>>>>     TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )
>>>>                          / CPUID.15H:EAX[31:0] + K
>>>>     Where 'K' is an offset that can be adjusted by a privileged
>>>> agent*2.
>>>>      When ART hardware is reset, both invariant TSC and K are also
>>>> reset.
>>>>
>>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0]  and
>>>> CPUID.15H:EAX[31:0]  are for my hardware.  I assumed (incorrectly)
>>>> that
>>>> the "Nominal TSC Frequency" formulae in the manual must apply to all
>>>> CPUs with InvariantTSC .
>>>>
>>>> Do I understand correctly , that since I do have InvariantTSC ,  the
>>>> TSC_Value is in fact calculated according to the above formula, but
>>>> with
>>>> a "hidden" ART Value,  & Core Crystal Clock frequency & its ratio to
>>>> TSC frequency ?
>>>> It was obvious this nominal TSC Frequency had nothing to do with the
>>>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>>>> I guess wishful thinking led me to believe CPUID:15h was actually
>>>> supported somehow , because I thought InvariantTSC meant it had ART
>>>> hardware .
>>>>
>>>> I do strongly suggest that Linux exports its calibrated TSC Khz
>>>> somewhere to user
>>>> space .
>>>>
>>>> I think the best long-term solution would be to allow programs to
>>>> somehow read the TSC without invoking
>>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), &
>>>> having to enter the kernel, which incurs an overhead of > 120ns on my
>>>> system
>>>> .
>>>>
>>>>
>>>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
>>>> 'clocksource->shift' values to /sysfs somehow ?
>>>>
>>>> For instance , only  if the 'current_clocksource' is 'tsc', then these
>>>> values could be exported as:
>>>> /sys/devices/system/clocksource/clocksource0/shift
>>>> /sys/devices/system/clocksource/clocksource0/mult
>>>> /sys/devices/system/clocksource/clocksource0/freq
>>>>
>>>> So user-space programs could  know that the value returned by
>>>>     clock_gettime(CLOCK_MONOTONIC_RAW)
>>>>   would be
>>>>     {    .tv_sec =  ( ( rdtsc() * mult ) >> shift ) >> 32,
>>>>       , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U
>>>>     }
>>>>   and that represents ticks of period (1.0 / ( freq * 1000 )) S.
>>>>
>>>> That would save user-space programs from having to know 'tsc_khz' by
>>>> parsing the 'Refined TSC' frequency from log files or by examining the
>>>> running kernel with objdump to obtain this value & figure out 'mult' &
>>>> 'shift' themselves.
>>>>
>>>> And why not a
>>>>   /sys/devices/system/clocksource/clocksource0/value
>>>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>>>> expression as a long integer?
>>>> And perhaps a
>>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>>>> file that actually prints out the number of real-time nano-seconds
>>>> since
>>>> the
>>>> contents of the existing
>>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>>>> files using the current TSC value?
>>>> To read the rtc0/{date,time} files is already faster than entering the
>>>> kernel to call
>>>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts.
>>>>
>>>> I will work on developing a patch to this effect if no-one else is.
>>>>
>>>> Also, am I right in assuming that the maximum granularity of the
>>>> real-time
>>>> clock
>>>> on my system is 1/64th of a second ? :
>>>>  $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>>>>  64
>>>> This is the maximum granularity that can be stored in CMOS , not
>>>> returned by TSC? Couldn't we have something similar that gave an
>>>> accurate idea of TSC frequency and the precise formula applied to TSC
>>>> value to get clock_gettime
>>>> (CLOCK_MONOTONIC_RAW) value ?
>>>>
>>>> Regards,
>>>> Jason
>>>>
>>>>
>>>> This code does produce good timestamps with a latency of @20ns
>>>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>>>> values, but it depends on a global variable that  is initialized to
>>>> the 'tsc_khz' value
>>>> computed by running kernel parsed from objdump /proc/kcore output :
>>>>
>>>> static inline __attribute__((always_inline))
>>>> U64_t
>>>> IA64_tsc_now()
>>>> { if(!(    _ia64_invariant_tsc_enabled
>>>>       ||(( _cpu0id_fd == -1) &&
>>>> IA64_invariant_tsc_is_enabled(NULL,NULL))
>>>>       )
>>>>     )
>>>>   { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant
>>>> TSC enabled.\n", __LINE__, __func__);
>>>>     return 0;
>>>>   }
>>>>   U32_t tsc_hi, tsc_lo;
>>>>   register UL_t tsc;
>>>>   asm volatile
>>>>   ( "rdtscp\n\t"
>>>>     "mov %%edx, %0\n\t"
>>>>     "mov %%eax, %1\n\t"
>>>>     "mov %%ecx, %2\n\t"
>>>>   : "=m" (tsc_hi) ,
>>>>     "=m" (tsc_lo) ,
>>>>     "=m" (_ia64_tsc_user_cpu) :
>>>>   : "%eax","%ecx","%edx"
>>>>   );
>>>>   tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>>>>   return tsc;
>>>> }
>>>>
>>>> __thread
>>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>>>
>>>> static inline __attribute__((always_inline))
>>>> U64_t IA64_tsc_ticks_since_start()
>>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>>>   { _ia64_first_tsc = IA64_tsc_now();
>>>>     return 0;
>>>>   }
>>>>   return (IA64_tsc_now() - _ia64_first_tsc) ;
>>>> }
>>>>
>>>> static inline __attribute__((always_inline))
>>>> void
>>>> ia64_tsc_calc_mult_shift
>>>> ( register U32_t *mult,
>>>>   register U32_t *shift
>>>> )
>>>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift()
>>>> function:
>>>>    * calculates second + nanosecond mult + shift in same way linux
>>>> does.
>>>>    * we want to be compatible with what linux returns in struct
>>>> timespec ts after call to
>>>>    * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>>>    */
>>>>   const U32_t scale=1000U;
>>>>   register U32_t from= IA64_tsc_khz();
>>>>   register U32_t to  = NSEC_PER_SEC / scale;
>>>>   register U64_t sec = ( ~0UL / from ) / scale;
>>>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>>>   register U64_t maxsec = sec * scale;
>>>>   UL_t tmp;
>>>>   U32_t sft, sftacc=32;
>>>>   /*
>>>>    * Calculate the shift factor which is limiting the conversion
>>>>    * range:
>>>>    */
>>>>   tmp = (maxsec * from) >> 32;
>>>>   while (tmp)
>>>>   { tmp >>=1;
>>>>     sftacc--;
>>>>   }
>>>>   /*
>>>>    * Find the conversion shift/mult pair which has the best
>>>>    * accuracy and fits the maxsec conversion range:
>>>>    */
>>>>   for (sft = 32; sft > 0; sft--)
>>>>   { tmp = ((UL_t) to) << sft;
>>>>     tmp += from / 2;
>>>>     tmp = tmp / from;
>>>>     if ((tmp >> sftacc) == 0)
>>>>       break;
>>>>   }
>>>>   *mult = tmp;
>>>>   *shift = sft;
>>>> }
>>>>
>>>> __thread
>>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>>>
>>>> static inline __attribute__((always_inline))
>>>> U64_t IA64_s_ns_since_start()
>>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>>>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>>>   register U64_t ns = ((cycles
>>>> *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
>>>>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns %
>>>> NSEC_PER_SEC)&0x3fffffffUL) );
>>>>   /* Yes, we are purposefully ignoring durations of more than 4.2
>>>> billion seconds here! */
>>>> }
>>>>
>>>>
>>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>>>> somehow,
>>>> then user-space libraries could have more confidence in using 'rdtsc'
>>>> or 'rdtscp'
>>>> if Linux's current_clocksource is 'tsc'.
>>>>
>>>> Regards,
>>>> Jason
>>>>
>>>>
>>>>
>>>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>>>
>>>>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>>>>>> in detect_art() in tsc.c,
>>>>>
>>>>> By some definition of available. You can feed CPUID random leaf
>>>>> numbers
>>>>> and
>>>>> it will return something, usually the value of the last valid CPUID
>>>>> leaf,
>>>>> which is 13 on your CPU. A similar CPU model has
>>>>>
>>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
>>>>> edx=0x00000000
>>>>>
>>>>> i.e. 7, 832, 832, 0
>>>>>
>>>>> Looks familiar, right?
>>>>>
>>>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>>>
>>>>>> Linux does not think ART is enabled, and does not set the synthesized
>>>>>> CPUID +
>>>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>>>>>> see this bit set .
>>>>>
>>>>> Rightfully so. This is a Haswell Core model.
>>>>>
>>>>>> if an e1000 NIC card had been installed, PTP would not be available.
>>>>>
>>>>> PTP is independent of the ART kernel feature . ART just provides
>>>>> enhanced
>>>>> PTP features. You are confusing things here.
>>>>>
>>>>> The ART feature as the kernel sees it is a hardware extension which
>>>>> feeds
>>>>> the ART clock to peripherals for timestamping and time correlation
>>>>> purposes. The ratio between ART and TSC is described by CPUID leaf
>>>>> 0x15
>>>>> so
>>>>> the kernel can make use of that correlation, e.g. for enhanced PTP
>>>>> accuracy.
>>>>>
>>>>> It's correct, that the NONSTOP_TSC feature depends on the availability
>>>>> of
>>>>> ART, but that has nothing to do with the feature bit, which solely
>>>>> describes the ratio between TSC and the ART frequency which is exposed
>>>>> to
>>>>> peripherals. That frequency is not necessarily the real ART frequency.
>>>>>
>>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to
>>>>>> be
>>>>>> nowhere else in Linux,  the code will always think X86_FEATURE_ART is
>>>>>> 0
>>>>>> because the CPU will always get a fault reading the MSR since it has
>>>>>> never been written.
>>>>>
>>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>>>> really
>>>>> wrong. And writing it unconditionally to 0 is not going to happen.
>>>>> 4.10
>>>>> has
>>>>> new code which utilizes the TSC_ADJUST MSR.
>>>>>
>>>>>> It would be nice for user-space programs that want to use the TSC
>>>>>> with
>>>>>> rdtsc / rdtscp instructions, such as the demo program attached to the
>>>>>> bug report,
>>>>>> could have confidence that Linux is actually generating the results
>>>>>> of
>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>>>> in a predictable way from the TSC by looking at the
>>>>>>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>>>>>> use of TSC values, so that they can correlate TSC values with linux
>>>>>> clock_gettime() values.
>>>>>
>>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>>>
>>>>> Nothing at all, really.
>>>>>
>>>>> The kernel makes use of the proper information values already.
>>>>>
>>>>> The TSC frequency is determined from:
>>>>>
>>>>>     1) CPUID(0x16) if available
>>>>>     2) MSRs if available
>>>>>     3) By calibration against a known clock
>>>>>
>>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_*
>>>>> values
>>>>> are
>>>>> correct whether that machine has ART exposed to peripherals or not.
>>>>>
>>>>>> has tsc: 1 constant: 1
>>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>>>
>>>>> And that voodoo math tells us what? That you found a way to correlate
>>>>> CPUID(0xd) to the TSC frequency on that machine.
>>>>>
>>>>> Now I'm curious how you do that on this other machine which returns
>>>>> for
>>>>> cpuid(15): 1, 1, 1
>>>>>
>>>>> You can't because all of this is completely wrong.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> 	tglx
>>>>>
>>>>
>>>
>>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-22 20:15               ` Jason Vas Dias
  (?)
@ 2017-02-22 20:26               ` Jason Vas Dias
  2017-02-23 18:05                 ` Jason Vas Dias
  -1 siblings, 1 reply; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-22 20:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

[-- Attachment #1: Type: text/plain, Size: 21617 bytes --]

OK, last post on this issue today -
can anyone explain why, with a standard 4.10.0 kernel (no new
'notsc_adjust' option) and the same maths being used, these two runs
should display such a wide disparity between
clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latencies ? :

$ J/pub/ttsc/ttsc1
max_extended_leaf: 80000008
has tsc: 1 constant: 1
Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1.
ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.000000641 ns2: 0.000002850
ts3 - ts2: 175 ns1: 0.000000659
ts3 - ts2: 18 ns1: 0.000000643
ts3 - ts2: 18 ns1: 0.000000618
ts3 - ts2: 17 ns1: 0.000000620
ts3 - ts2: 17 ns1: 0.000000616
ts3 - ts2: 18 ns1: 0.000000641
ts3 - ts2: 18 ns1: 0.000000709
ts3 - ts2: 20 ns1: 0.000000763
ts3 - ts2: 20 ns1: 0.000000735
ts3 - ts2: 20 ns1: 0.000000761
t1 - t0: 78200 - ns2: 0.000080824
$ J/pub/ttsc/ttsc1
max_extended_leaf: 80000008
has tsc: 1 constant: 1
Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1.
ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.000001294 ns2: 0.000005375
ts3 - ts2: 210 ns1: 0.000001418
ts3 - ts2: 23 ns1: 0.000001399
ts3 - ts2: 22 ns1: 0.000001445
ts3 - ts2: 25 ns1: 0.000001321
ts3 - ts2: 20 ns1: 0.000001428
ts3 - ts2: 25 ns1: 0.000001367
ts3 - ts2: 23 ns1: 0.000001425
ts3 - ts2: 23 ns1: 0.000001357
ts3 - ts2: 22 ns1: 0.000001487
ts3 - ts2: 25 ns1: 0.000001377
t1 - t0: 145753 - ns2: 0.000150781

(complete source of test program ttsc1 attached in ttsc.tar
 $ tar -xpf ttsc.tar
 $ cd ttsc
 $ make
).

On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> I actually tried adding a 'notsc_adjust' kernel option to disable any
> setting or
> access to the TSC_ADJUST MSR, but then I see the problems  - a big
> disparity
> in values depending on which CPU the thread is scheduled on -  and no
> improvement in clock_gettime() latency.  So I don't think the new
> TSC_ADJUST
> code in ts_sync.c itself is the issue - but something added @ 460ns
> onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 .
> As I don't think fixing the clock_gettime() latency issue is my problem or
> even
> possible with current clock architecture approach, it is a non-issue.
>
> But please, can anyone tell me if are there any plans to move the time
> infrastructure  out of the kernel and into glibc along the lines
> outlined
> in previous mail - if not, I am going to concentrate on this more radical
> overhaul approach for my own systems .
>
> At least, I think mapping the clocksource information structure itself in
> some
> kind of sharable page makes sense . Processes could map that page
> copy-on-write
> so they could start off with all the timing parameters preloaded,  then
> keep
> their copy updated using the rdtscp instruction , or msync() (read-only)
> with the kernel's single copy to get the latest time any process has
> requested.
> All real-time parameters & adjustments could be stored in that page ,
> & eventually a single copy of the tzdata could be used by both kernel
> & user-space.
> That is what I am working towards. Any plans to make linux real-time tsc
> clock user-friendly ?
>
>
>
> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
>> read or written . It is probably because it genuinely does not
>> support any cpuid > 13 ,
>> or the modern TSC_ADJUST interface . This is probably why my
>> clock_gettime()
>> latencies are so bad. Now I have to develop a patch to disable all access
>> to
>> TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 .
>> I really have an unlucky CPU :-) .
>>
>> But really, I think this issue goes deeper into the fundamental limits of
>> time measurement on Linux : it is never going to be possible to measure
>> minimum times with clock_gettime() comparable with those returned by
>> rdtscp instruction - the time taken to enter the kernel through the VDSO,
>> queue an access to vsyscall_gtod_data via a workqueue, access it & do
>> computations & copy value to user-space is NEVER going to be up to the
>> job of measuring small real-time durations of the order of 10-20 TSC
>> ticks
>> .
>>
>> I think the best way to solve this problem going forward would be to
>> store
>> the entire vsyscall_gtod_data  data structure representing the current
>> clocksource
>> in a shared page which is memory-mappable (read-only) by user-space .
>> I think user-space programs should be able to do something like :
>>     int fd =
>> open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY);
>>     size_t psz = getpagesize();
>>     void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 );
>>     msync(gtod,psz,MS_SYNC);
>>
>> Then they could all read the real-time clock values as they are updated
>> in real-time by the kernel, and know exactly how to interpret them .
>>
>> I also think that all mktime() / gmtime() / localtime() timezone handling
>> functionality should be
>> moved to user-space, and that the kernel should actually load and link in
>> some
>> /lib/libtzdata.so
>> library, provided by glibc / libc implementations, that is exactly the
>> same library
>> used by glibc code to parse tzdata ; tzdata should be loaded at boot
>> time
>> by the kernel from the same places glibc loads it, and both the kernel
>> and
>> glibc should use identical mktime(), gmtime(), etc. functions to access
>> it,
>> and
>> glibc using code would not need to enter the kernel at all for any
>> time-handling
>> code. This tzdata-library code would be automatically loaded into process
>> images
>> the
>> same way the vdso region is , and the whole system could access only one
>> copy of it and the 'gtod.page' in memory.
>>
>> That's just my two-cents worth, and how I'd like to eventually get
>> things working
>> on my system.
>>
>> All the best, Regards,
>> Jason
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>>> RE:
>>>>>> 4.10 has  new code which utilizes the TSC_ADJUST MSR.
>>>>
>>>> I just built an unpatched linux v4.10 with tglx's TSC improvements -
>>>> much else improved in this kernel (like iwlwifi) - thanks!
>>>>
>>>> I have attached an updated version of the test program which
>>>> doesn't print the bogus "Nominal TSC Frequency" (the previous
>>>> version printed it, but equally ignored it).
>>>>
>>>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>>>> a factor of 2 - it used to be @140ns and is now @ 70ns  ! Wow!  :
>>>>
>>>> $ uname -r
>>>> 4.10.0
>>>> $ ./ttsc1
>>>> max_extended_leaf: 80000008
>>>> has tsc: 1 constant: 1
>>>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>>>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
>>>> ts3 - ts2: 178 ns1: 0.000000592
>>>> ts3 - ts2: 14 ns1: 0.000000577
>>>> ts3 - ts2: 14 ns1: 0.000000651
>>>> ts3 - ts2: 17 ns1: 0.000000625
>>>> ts3 - ts2: 17 ns1: 0.000000677
>>>> ts3 - ts2: 17 ns1: 0.000000626
>>>> ts3 - ts2: 17 ns1: 0.000000627
>>>> ts3 - ts2: 17 ns1: 0.000000627
>>>> ts3 - ts2: 18 ns1: 0.000000655
>>>> ts3 - ts2: 17 ns1: 0.000000631
>>>> t1 - t0: 89067 - ns2: 0.000091411
>>>>
>>>
>>>
>>> Oops, going blind in my old age. These latencies are actually 3 times
>>> greater than under 4.8 !!
>>>
>>> Under 4.8, the program printed latencies of @ 140ns for clock_gettime,
>>> as
>>> shown
>>> in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:
>>>
>>> ts3 - ts2: 24 ns1: 0.000000162
>>> ts3 - ts2: 17 ns1: 0.000000143
>>> ts3 - ts2: 17 ns1: 0.000000146
>>> ts3 - ts2: 17 ns1: 0.000000149
>>> ts3 - ts2: 17 ns1: 0.000000141
>>> ts3 - ts2: 16 ns1: 0.000000142
>>>
>>> now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @
>>> 600ns, @ 4 times more than under 4.8 .
>>> But I'm glad the TSC_ADJUST problems are fixed.
>>>
>>> Will programs reading :
>>>  $ cat /sys/devices/msr/events/tsc
>>>  event=0x00
>>> read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on
>>> the
>>> TSC ?
>>>
>>>> I think this is because under Linux 4.8, the CPU got a fault every
>>>> time it read the TSC_ADJUST MSR.
>>>
>>> maybe it still is!
>>>
>>>
>>>> But user programs wanting to use the TSC  and correlate its value to
>>>> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
>>>> program still have to  dig the TSC frequency value out of the kernel
>>>> with objdump  - this was really the point of the bug #194609.
>>>>
>>>> I would still like to investigate exporting 'tsc_khz' & 'mult' +
>>>> 'shift' values via sysfs.
>>>>
>>>> Regards,
>>>> Jason.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>>>> Thank You for enlightening me -
>>>>>
>>>>> I was just having a hard time believing that Intel would ship a chip
>>>>> that features a monotonic, fixed frequency timestamp counter
>>>>> without specifying in either documentation or on-chip or in ACPI what
>>>>> precisely that hard-wired frequency is, but I now know that to
>>>>> be the case for the unfortunate i7-4910MQ - I mean, the CPU does
>>>>> assert CPUID:80000007[8] ( InvariantTSC ), which is
>>>>> difficult to reconcile with the statement in the SDM :
>>>>>   17.16.4  Invariant Time-Keeping
>>>>>     The invariant TSC is based on the invariant timekeeping hardware
>>>>>     (called Always Running Timer or ART), that runs at the core
>>>>> crystal
>>>>> clock
>>>>>     frequency. The ratio defined by CPUID leaf 15H expresses the
>>>>> frequency
>>>>>     relationship between the ART hardware and TSC. If
>>>>> CPUID.15H:EBX[31:0]
>>>>> !=
>>>>> 0
>>>>>     and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
>>>>>     relationship holds between TSC and the ART hardware:
>>>>>     TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )
>>>>>                          / CPUID.15H:EAX[31:0] + K
>>>>>     Where 'K' is an offset that can be adjusted by a privileged
>>>>> agent*2.
>>>>>      When ART hardware is reset, both invariant TSC and K are also
>>>>> reset.
>>>>>
>>>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0]  and
>>>>> CPUID.15H:EAX[31:0]  are for my hardware.  I assumed (incorrectly)
>>>>> that
>>>>> the "Nominal TSC Frequency" formulae in the manual must apply to all
>>>>> CPUs with InvariantTSC .
>>>>>
>>>>> Do I understand correctly , that since I do have InvariantTSC ,  the
>>>>> TSC_Value is in fact calculated according to the above formula, but
>>>>> with
>>>>> a "hidden" ART Value,  & Core Crystal Clock frequency & its ratio to
>>>>> TSC frequency ?
>>>>> It was obvious this nominal TSC Frequency had nothing to do with the
>>>>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>>>>> I guess wishful thinking led me to believe CPUID:15h was actually
>>>>> supported somehow , because I thought InvariantTSC meant it had ART
>>>>> hardware .
>>>>>
>>>>> I do strongly suggest that Linux exports its calibrated TSC Khz
>>>>> somewhere to user
>>>>> space .
>>>>>
>>>>> I think the best long-term solution would be to allow programs to
>>>>> somehow read the TSC without invoking
>>>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), &
>>>>> having to enter the kernel, which incurs an overhead of > 120ns on my
>>>>> system
>>>>> .
>>>>>
>>>>>
>>>>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
>>>>> 'clocksource->shift' values to /sysfs somehow ?
>>>>>
>>>>> For instance , only  if the 'current_clocksource' is 'tsc', then these
>>>>> values could be exported as:
>>>>> /sys/devices/system/clocksource/clocksource0/shift
>>>>> /sys/devices/system/clocksource/clocksource0/mult
>>>>> /sys/devices/system/clocksource/clocksource0/freq
>>>>>
>>>>> So user-space programs could  know that the value returned by
>>>>>     clock_gettime(CLOCK_MONOTONIC_RAW)
>>>>>   would be
>>>>>     {    .tv_sec =  ( ( rdtsc() * mult ) >> shift ) >> 32,
>>>>>       , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) & ~0U
>>>>>     }
>>>>>   and that represents ticks of period (1.0 / ( freq * 1000 )) S.
>>>>>
>>>>> That would save user-space programs from having to know 'tsc_khz' by
>>>>> parsing the 'Refined TSC' frequency from log files or by examining the
>>>>> running kernel with objdump to obtain this value & figure out 'mult' &
>>>>> 'shift' themselves.
>>>>>
>>>>> And why not a
>>>>>   /sys/devices/system/clocksource/clocksource0/value
>>>>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>>>>> expression as a long integer?
>>>>> And perhaps a
>>>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>>>>> file that actually prints out the number of real-time nano-seconds
>>>>> since
>>>>> the
>>>>> contents of the existing
>>>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>>>>> files using the current TSC value?
>>>>> To read the rtc0/{date,time} files is already faster than entering the
>>>>> kernel to call
>>>>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts.
>>>>>
>>>>> I will work on developing a patch to this effect if no-one else is.
>>>>>
>>>>> Also, am I right in assuming that the maximum granularity of the
>>>>> real-time
>>>>> clock
>>>>> on my system is 1/64th of a second ? :
>>>>>  $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>>>>>  64
>>>>> This is the maximum granularity that can be stored in CMOS , not
>>>>> returned by TSC? Couldn't we have something similar that gave an
>>>>> accurate idea of TSC frequency and the precise formula applied to TSC
>>>>> value to get clock_gettime
>>>>> (CLOCK_MONOTONIC_RAW) value ?
>>>>>
>>>>> Regards,
>>>>> Jason
>>>>>
>>>>>
>>>>> This code does produce good timestamps with a latency of @20ns
>>>>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>>>>> values, but it depends on a global variable that  is initialized to
>>>>> the 'tsc_khz' value
>>>>> computed by running kernel parsed from objdump /proc/kcore output :
>>>>>
>>>>> static inline __attribute__((always_inline))
>>>>> U64_t
>>>>> IA64_tsc_now()
>>>>> { if(!(    _ia64_invariant_tsc_enabled
>>>>>       ||(( _cpu0id_fd == -1) &&
>>>>> IA64_invariant_tsc_is_enabled(NULL,NULL))
>>>>>       )
>>>>>     )
>>>>>   { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant
>>>>> TSC enabled.\n", __LINE__, __func__);
>>>>>     return 0;
>>>>>   }
>>>>>   U32_t tsc_hi, tsc_lo;
>>>>>   register UL_t tsc;
>>>>>   asm volatile
>>>>>   ( "rdtscp\n\t"
>>>>>     "mov %%edx, %0\n\t"
>>>>>     "mov %%eax, %1\n\t"
>>>>>     "mov %%ecx, %2\n\t"
>>>>>   : "=m" (tsc_hi) ,
>>>>>     "=m" (tsc_lo) ,
>>>>>     "=m" (_ia64_tsc_user_cpu) :
>>>>>   : "%eax","%ecx","%edx"
>>>>>   );
>>>>>   tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>>>>>   return tsc;
>>>>> }
>>>>>
>>>>> __thread
>>>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>>>>
>>>>> static inline __attribute__((always_inline))
>>>>> U64_t IA64_tsc_ticks_since_start()
>>>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>>>>   { _ia64_first_tsc = IA64_tsc_now();
>>>>>     return 0;
>>>>>   }
>>>>>   return (IA64_tsc_now() - _ia64_first_tsc) ;
>>>>> }
>>>>>
>>>>> static inline __attribute__((always_inline))
>>>>> void
>>>>> ia64_tsc_calc_mult_shift
>>>>> ( register U32_t *mult,
>>>>>   register U32_t *shift
>>>>> )
>>>>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift()
>>>>> function:
>>>>>    * calculates second + nanosecond mult + shift in same way linux
>>>>> does.
>>>>>    * we want to be compatible with what linux returns in struct
>>>>> timespec ts after call to
>>>>>    * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>>>>    */
>>>>>   const U32_t scale=1000U;
>>>>>   register U32_t from= IA64_tsc_khz();
>>>>>   register U32_t to  = NSEC_PER_SEC / scale;
>>>>>   register U64_t sec = ( ~0UL / from ) / scale;
>>>>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>>>>   register U64_t maxsec = sec * scale;
>>>>>   UL_t tmp;
>>>>>   U32_t sft, sftacc=32;
>>>>>   /*
>>>>>    * Calculate the shift factor which is limiting the conversion
>>>>>    * range:
>>>>>    */
>>>>>   tmp = (maxsec * from) >> 32;
>>>>>   while (tmp)
>>>>>   { tmp >>=1;
>>>>>     sftacc--;
>>>>>   }
>>>>>   /*
>>>>>    * Find the conversion shift/mult pair which has the best
>>>>>    * accuracy and fits the maxsec conversion range:
>>>>>    */
>>>>>   for (sft = 32; sft > 0; sft--)
>>>>>   { tmp = ((UL_t) to) << sft;
>>>>>     tmp += from / 2;
>>>>>     tmp = tmp / from;
>>>>>     if ((tmp >> sftacc) == 0)
>>>>>       break;
>>>>>   }
>>>>>   *mult = tmp;
>>>>>   *shift = sft;
>>>>> }
>>>>>
>>>>> __thread
>>>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>>>>
>>>>> static inline __attribute__((always_inline))
>>>>> U64_t IA64_s_ns_since_start()
>>>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>>>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>>>>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>>>>   register U64_t ns = ((cycles
>>>>> *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
>>>>>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns %
>>>>> NSEC_PER_SEC)&0x3fffffffUL) );
>>>>>   /* Yes, we are purposefully ignoring durations of more than 4.2
>>>>> billion seconds here! */
>>>>> }
>>>>>
>>>>>
>>>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>>>>> somehow,
>>>>> then user-space libraries could have more confidence in using 'rdtsc'
>>>>> or 'rdtscp'
>>>>> if Linux's current_clocksource is 'tsc'.
>>>>>
>>>>> Regards,
>>>>> Jason
>>>>>
>>>>>
>>>>>
>>>>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>>>>
>>>>>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>>>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>>>>>>> in detect_art() in tsc.c,
>>>>>>
>>>>>> By some definition of available. You can feed CPUID random leaf
>>>>>> numbers
>>>>>> and
>>>>>> it will return something, usually the value of the last valid CPUID
>>>>>> leaf,
>>>>>> which is 13 on your CPU. A similar CPU model has
>>>>>>
>>>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
>>>>>> edx=0x00000000
>>>>>>
>>>>>> i.e. 7, 832, 832, 0
>>>>>>
>>>>>> Looks familiar, right?
>>>>>>
>>>>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>>>>
>>>>>>> Linux does not think ART is enabled, and does not set the
>>>>>>> synthesized
>>>>>>> CPUID +
>>>>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>>>>>>> see this bit set .
>>>>>>
>>>>>> Rightfully so. This is a Haswell Core model.
>>>>>>
>>>>>>> if an e1000 NIC card had been installed, PTP would not be available.
>>>>>>
>>>>>> PTP is independent of the ART kernel feature . ART just provides
>>>>>> enhanced
>>>>>> PTP features. You are confusing things here.
>>>>>>
>>>>>> The ART feature as the kernel sees it is a hardware extension which
>>>>>> feeds
>>>>>> the ART clock to peripherals for timestamping and time correlation
>>>>>> purposes. The ratio between ART and TSC is described by CPUID leaf
>>>>>> 0x15
>>>>>> so
>>>>>> the kernel can make use of that correlation, e.g. for enhanced PTP
>>>>>> accuracy.
>>>>>>
>>>>>> It's correct, that the NONSTOP_TSC feature depends on the
>>>>>> availability
>>>>>> of
>>>>>> ART, but that has nothing to do with the feature bit, which solely
>>>>>> describes the ratio between TSC and the ART frequency which is
>>>>>> exposed
>>>>>> to
>>>>>> peripherals. That frequency is not necessarily the real ART
>>>>>> frequency.
>>>>>>
>>>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to
>>>>>>> be
>>>>>>> nowhere else in Linux,  the code will always think X86_FEATURE_ART
>>>>>>> is
>>>>>>> 0
>>>>>>> because the CPU will always get a fault reading the MSR since it has
>>>>>>> never been written.
>>>>>>
>>>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>>>>> really
>>>>>> wrong. And writing it unconditionally to 0 is not going to happen.
>>>>>> 4.10
>>>>>> has
>>>>>> new code which utilizes the TSC_ADJUST MSR.
>>>>>>
>>>>>>> It would be nice if user-space programs that want to use the TSC
>>>>>>> with
>>>>>>> rdtsc / rdtscp instructions, such as the demo program attached to
>>>>>>> the
>>>>>>> bug report,
>>>>>>> could have confidence that Linux is actually generating the results
>>>>>>> of
>>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>>>>> in a predictable way from the TSC by looking at the
>>>>>>>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>>>>>>> use of TSC values, so that they can correlate TSC values with linux
>>>>>>> clock_gettime() values.
>>>>>>
>>>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>>>>
>>>>>> Nothing at all, really.
>>>>>>
>>>>>> The kernel makes use of the proper information values already.
>>>>>>
>>>>>> The TSC frequency is determined from:
>>>>>>
>>>>>>     1) CPUID(0x16) if available
>>>>>>     2) MSRs if available
>>>>>>     3) By calibration against a known clock
>>>>>>
>>>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_*
>>>>>> values
>>>>>> are
>>>>>> correct whether that machine has ART exposed to peripherals or not.
>>>>>>
>>>>>>> has tsc: 1 constant: 1
>>>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>>>>
>>>>>> And that voodoo math tells us what? That you found a way to correlate
>>>>>> CPUID(0xd) to the TSC frequency on that machine.
>>>>>>
>>>>>> Now I'm curious how you do that on this other machine which returns
>>>>>> for
>>>>>> cpuid(15): 1, 1, 1
>>>>>>
>>>>>> You can't because all of this is completely wrong.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> 	tglx
>>>>>>
>>>>>
>>>>
>>>
>>
>

[-- Attachment #2: ttsc.tar --]
[-- Type: application/x-tar, Size: 40960 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609
  2017-02-22 20:26               ` Jason Vas Dias
@ 2017-02-23 18:05                 ` Jason Vas Dias
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Vas Dias @ 2017-02-23 18:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: kernel-janitors, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Prarit Bhargava, x86

[-- Attachment #1: Type: text/plain, Size: 24167 bytes --]

I have found a new source of weirdness with  TSC  using
clock_gettime(CLOCK_MONOTONIC_RAW,&ts) :

The vsyscall_gtod_data.mult field changes by one between
calls to clock_gettime(CLOCK_MONOTONIC_RAW,&ts),
so that the value derived from the TSC and stored in 'ts' shifts
by tsc_cycles / 2^24 nanoseconds from one call to the next.

This is demonstrated by the output of the test program in the
attached ttsc.tar  file:
$ ./tlgtd
it worked! - GTOD: clock:1 mult:5798662 shift:24
synced - mult now: 5798661

What it does is find the address of the 'vsyscall_gtod_data' structure
in /proc/kallsyms, map that virtual address to an ELF segment (program
header) offset within /proc/kcore, and read just the
'vsyscall_gtod_data' structure into user-space memory.
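That lookup step can be sketched in shell; the sample entry and its address below are invented stand-ins for a real /proc/kallsyms line (reading the live file needs root, or kernel.kptr_restrict=0):

```shell
# Sketch of the symbol lookup tlgtd performs.
lookup() {
    awk -v sym="$1" '$3 == sym { print $1 }'
}

sample='ffffffff81c05000 D vsyscall_gtod_data'
addr=$(printf '%s\n' "$sample" | lookup vsyscall_gtod_data)
echo "vsyscall_gtod_data @ 0x$addr"

# Against a live kernel:  lookup vsyscall_gtod_data < /proc/kallsyms
# tlgtd then turns this virtual address into a /proc/kcore file offset
# via the ELF program headers and reads the struct from that offset.
```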

Really, this 'mult' value, which is used to return the
seconds|nanoseconds value:
    ( tsc_cycles * mult ) >> shift
(where shift is 24 ), should not change from the first time it is initialized .

The TSC is meant to be FIXED FREQUENCY, right ?
So how could  /  why should the conversion function from TSC ticks to
nanoseconds change ?

So now it is doubly difficult for user-space libraries to keep their
RDTSC-derived seconds|nanoseconds values correlated with those returned
by the kernel, because they must regularly re-read the updated 'mult'
value used by the kernel.

I really don't think the kernel should randomly be deciding to
increase / decrease the effective TSC tick period, shifting every
derived timestamp by tsc_cycles / 2^24 nanoseconds!

Is this a bug or intentional? I am searching the sources for all places
where '[.>]mult.*=' occurs, but this returns rather a lot of matches.

Please could a future version of Linux at least export the 'mult' and
'shift' values for the current clocksource!

Regards,
Jason








On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> OK, last post on this issue today -
> can anyone explain why, with standard 4.10.0 kernel & no new
> 'notsc_adjust' option, and the same maths being used, these two runs
> should display
> such a wide disparity between clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
> values ? :
>
> $ J/pub/ttsc/ttsc1
> max_extended_leaf: 80000008
> has tsc: 1 constant: 1
> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1.
> ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.000000641 ns2: 0.000002850
> ts3 - ts2: 175 ns1: 0.000000659
> ts3 - ts2: 18 ns1: 0.000000643
> ts3 - ts2: 18 ns1: 0.000000618
> ts3 - ts2: 17 ns1: 0.000000620
> ts3 - ts2: 17 ns1: 0.000000616
> ts3 - ts2: 18 ns1: 0.000000641
> ts3 - ts2: 18 ns1: 0.000000709
> ts3 - ts2: 20 ns1: 0.000000763
> ts3 - ts2: 20 ns1: 0.000000735
> ts3 - ts2: 20 ns1: 0.000000761
> t1 - t0: 78200 - ns2: 0.000080824
> $ J/pub/ttsc/ttsc1
> max_extended_leaf: 80000008
> has tsc: 1 constant: 1
> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1.
> ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.000001294 ns2: 0.000005375
> ts3 - ts2: 210 ns1: 0.000001418
> ts3 - ts2: 23 ns1: 0.000001399
> ts3 - ts2: 22 ns1: 0.000001445
> ts3 - ts2: 25 ns1: 0.000001321
> ts3 - ts2: 20 ns1: 0.000001428
> ts3 - ts2: 25 ns1: 0.000001367
> ts3 - ts2: 23 ns1: 0.000001425
> ts3 - ts2: 23 ns1: 0.000001357
> ts3 - ts2: 22 ns1: 0.000001487
> ts3 - ts2: 25 ns1: 0.000001377
> t1 - t0: 145753 - ns2: 0.000150781
>
> (complete source of test program ttsc1 attached in ttsc.tar
>  $ tar -xpf ttsc.tar
>  $ cd ttsc
>  $ make
> ).
>
> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>> I actually tried adding a 'notsc_adjust' kernel option to disable any
>> setting or
>> access to the TSC_ADJUST MSR, but then I see the problems  - a big
>> disparity
>> in values depending on which CPU the thread is scheduled -  and no
>> improvement in clock_gettime() latency.  So I don't think the new
>> TSC_ADJUST
>> code in ts_sync.c itself is the issue - but something added @ 460ns
>> onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0 .
>> As I don't think fixing the clock_gettime() latency issue is my problem
>> or
>> even
>> possible with current clock architecture approach, it is a non-issue.
>>
>> But please, can anyone tell me if are there any plans to move the time
>> infrastructure  out of the kernel and into glibc along the lines
>> outlined
>> in previous mail - if not, I am going to concentrate on this more radical
>> overhaul approach for my own systems .
>>
>> At least, I think mapping the clocksource information structure itself in
>> some
>> kind of sharable page makes sense . Processes could map that page
>> copy-on-write
>> so they could start off with all the timing parameters preloaded,  then
>> keep
>> their copy updated using the rdtscp instruction , or msync() (read-only)
>> with the kernel's single copy to get the latest time any process has
>> requested.
>> All real-time parameters & adjustments could be stored in that page ,
>> & eventually a single copy of the tzdata could be used by both kernel
>> & user-space.
>> That is what I am working towards. Any plans to make linux real-time tsc
>> clock user-friendly ?
>>
>>
>>
>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
>>> read or written. It is probably because it genuinely does not
>>> support any CPUID leaf > 13, or the modern TSC_ADJUST interface.
>>> This is probably why my
>>> clock_gettime()
>>> latencies are so bad. Now I have to develop a patch to disable all
>>> access
>>> to
>>> TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13 .
>>> I really have an unlucky CPU :-) .
>>>
>>> But really, I think this issue goes deeper into the fundamental limits
>>> of
>>> time measurement on Linux : it is never going to be possible to measure
>>> minimum times with clock_gettime() comparable with those returned by
>>> rdtscp instruction - the time taken to enter the kernel through the
>>> VDSO,
>>> queue an access to vsyscall_gtod_data via a workqueue, access it & do
>>> computations & copy value to user-space is NEVER going to be up to the
>>> job of measuring small real-time durations of the order of 10-20 TSC
>>> ticks
>>> .
>>>
>>> I think the best way to solve this problem going forward would be to
>>> store
>>> the entire vsyscall_gtod_data  data structure representing the current
>>> clocksource
>>> in a shared page which is memory-mappable (read-only) by user-space .
>>> I think user-space programs should be able to do something like :
>>>     int fd =
>>> open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY);
>>>     size_t psz = getpagesize();
>>>     void *gtod = mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 );
>>>     msync(gtod,psz,MS_SYNC);
>>>
>>> Then they could all read the real-time clock values as they are updated
>>> in real-time by the kernel, and know exactly how to interpret them .
>>>
>>> I also think that all mktime() / gmtime() / localtime() timezone
>>> handling
>>> functionality should be
>>> moved to user-space, and that the kernel should actually load and link
>>> in
>>> some
>>> /lib/libtzdata.so
>>> library, provided by glibc / libc implementations, that is exactly the
>>> same library
>>> used by glibc() code to parse tzdata ; tzdata should be loaded at boot
>>> time
>>> by the kernel from the same places glibc loads it, and both the kernel
>>> and
>>> glibc should use identical mktime(), gmtime(), etc. functions to access
>>> it,
>>> and
>>> glibc using code would not need to enter the kernel at all for any
>>> time-handling
>>> code. This tzdata-library code could be automatically loaded into process
>>> images
>>> the
>>> same way the vdso region is , and the whole system could access only one
>>> copy of it and the 'gtod.page' in memory.
>>>
>>> That's just my two-cents worth, and how I'd like to eventually get
>>> things working
>>> on my system.
>>>
>>> All the best, Regards,
>>> Jason
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>>>> RE:
>>>>>>> 4.10 has  new code which utilizes the TSC_ADJUST MSR.
>>>>>
>>>>> I just built an unpatched linux v4.10 with tglx's TSC improvements -
>>>>> much else improved in this kernel (like iwlwifi) - thanks!
>>>>>
>>>>> I have attached an updated version of the test program which
>>>>> doesn't print the bogus "Nominal TSC Frequency" (the previous
>>>>> version printed it, but equally ignored it).
>>>>>
>>>>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>>>>> a factor of 2 - it used to be @140ns and is now @ 70ns  ! Wow!  :
>>>>>
>>>>> $ uname -r
>>>>> 4.10.0
>>>>> $ ./ttsc1
>>>>> max_extended_leaf: 80000008
>>>>> has tsc: 1 constant: 1
>>>>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>>>>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
>>>>> ts3 - ts2: 178 ns1: 0.000000592
>>>>> ts3 - ts2: 14 ns1: 0.000000577
>>>>> ts3 - ts2: 14 ns1: 0.000000651
>>>>> ts3 - ts2: 17 ns1: 0.000000625
>>>>> ts3 - ts2: 17 ns1: 0.000000677
>>>>> ts3 - ts2: 17 ns1: 0.000000626
>>>>> ts3 - ts2: 17 ns1: 0.000000627
>>>>> ts3 - ts2: 17 ns1: 0.000000627
>>>>> ts3 - ts2: 18 ns1: 0.000000655
>>>>> ts3 - ts2: 17 ns1: 0.000000631
>>>>> t1 - t0: 89067 - ns2: 0.000091411
>>>>>
>>>>
>>>>
>>>> Oops, going blind in my old age. These latencies are actually about
>>>> 4 times greater than under 4.8 !!
>>>>
>>>> Under 4.8, the program printed latencies of @ 140ns for clock_gettime,
>>>> as
>>>> shown
>>>> in bug 194609 as the 'ns1' (timespec_b - timespec_a) value::
>>>>
>>>> ts3 - ts2: 24 ns1: 0.000000162
>>>> ts3 - ts2: 17 ns1: 0.000000143
>>>> ts3 - ts2: 17 ns1: 0.000000146
>>>> ts3 - ts2: 17 ns1: 0.000000149
>>>> ts3 - ts2: 17 ns1: 0.000000141
>>>> ts3 - ts2: 16 ns1: 0.000000142
>>>>
>>>> now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @
>>>> 600ns, @ 4 times more than under 4.8 .
>>>> But I'm glad the TSC_ADJUST problems are fixed.
>>>>
>>>> Will programs reading :
>>>>  $ cat /sys/devices/msr/events/tsc
>>>>  event=0x00
>>>> read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on
>>>> the
>>>> TSC ?
>>>>
>>>>> I think this is because under Linux 4.8, the CPU got a fault every
>>>>> time it read the TSC_ADJUST MSR.
>>>>
>>>> maybe it still is!
>>>>
>>>>
>>>>> But user programs wanting to use the TSC  and correlate its value to
>>>>> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
>>>>> program still have to  dig the TSC frequency value out of the kernel
>>>>> with objdump  - this was really the point of the bug #194609.
>>>>>
>>>>> I would still like to investigate exporting 'tsc_khz' & 'mult' +
>>>>> 'shift' values via sysfs.
>>>>>
>>>>> Regards,
>>>>> Jason.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
>>>>>> Thank You for enlightening me -
>>>>>>
>>>>>> I was just having a hard time believing that Intel would ship a chip
>>>>>> that features a monotonic, fixed frequency timestamp counter
>>>>>> without specifying in either documentation or on-chip or in ACPI what
>>>>>> precisely that hard-wired frequency is, but I now know that to
>>>>>> be the case for the unfortunate i7-4910MQ. I mean, the CPU does
>>>>>> assert CPUID:80000007[8] ( InvariantTSC ), which is
>>>>>> difficult to reconcile with the statement in the SDM :
>>>>>>   17.16.4  Invariant Time-Keeping
>>>>>>     The invariant TSC is based on the invariant timekeeping hardware
>>>>>>     (called Always Running Timer or ART), that runs at the core
>>>>>> crystal
>>>>>> clock
>>>>>>     frequency. The ratio defined by CPUID leaf 15H expresses the
>>>>>> frequency
>>>>>>     relationship between the ART hardware and TSC. If
>>>>>> CPUID.15H:EBX[31:0]
>>>>>> !=
>>>>>> 0
>>>>>>     and CPUID.80000007H:EDX[InvariantTSC] = 1, the following
>>>>>> linearity
>>>>>>     relationship holds between TSC and the ART hardware:
>>>>>>     TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )
>>>>>>                          / CPUID.15H:EAX[31:0] + K
>>>>>>     Where 'K' is an offset that can be adjusted by a privileged
>>>>>> agent*2.
>>>>>>      When ART hardware is reset, both invariant TSC and K are also
>>>>>> reset.
>>>>>>
>>>>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0]  and
>>>>>> CPUID.15H:EAX[31:0]  are for my hardware.  I assumed (incorrectly)
>>>>>> that
>>>>>> the "Nominal TSC Frequency" formulae in the manual must apply to all
>>>>>> CPUs with InvariantTSC .
>>>>>>
>>>>>> Do I understand correctly , that since I do have InvariantTSC ,  the
>>>>>> TSC_Value is in fact calculated according to the above formula, but
>>>>>> with
>>>>>> a "hidden" ART Value,  & Core Crystal Clock frequency & its ratio to
>>>>>> TSC frequency ?
>>>>>> It was obvious this nominal TSC Frequency had nothing to do with the
>>>>>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>>>>>> I guess wishful thinking led me to believe CPUID:15h was actually
>>>>>> supported somehow , because I thought InvariantTSC meant it had ART
>>>>>> hardware .
>>>>>>
>>>>>> I do strongly suggest that Linux exports its calibrated TSC Khz
>>>>>> somewhere to user
>>>>>> space .
>>>>>>
>>>>>> I think the best long-term solution would be to allow programs to
>>>>>> somehow read the TSC without invoking
>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), &
>>>>>> having to enter the kernel, which incurs an overhead of > 120ns on my
>>>>>> system
>>>>>> .
>>>>>>
>>>>>>
>>>>>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
>>>>>> 'clocksource->shift' values to /sysfs somehow ?
>>>>>>
>>>>>> For instance , only  if the 'current_clocksource' is 'tsc', then
>>>>>> these
>>>>>> values could be exported as:
>>>>>> /sys/devices/system/clocksource/clocksource0/shift
>>>>>> /sys/devices/system/clocksource/clocksource0/mult
>>>>>> /sys/devices/system/clocksource/clocksource0/freq
>>>>>>
>>>>>> So user-space programs could  know that the value returned by
>>>>>>     clock_gettime(CLOCK_MONOTONIC_RAW)
>>>>>>   would be
>>>>>>     {   .tv_sec  = ( ( rdtsc() * mult ) >> shift ) / NSEC_PER_SEC
>>>>>>       , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) % NSEC_PER_SEC
>>>>>>     }
>>>>>>   and that represents ticks of period (1.0 / ( freq * 1000 )) S.
>>>>>>
>>>>>> That would save user-space programs from having to know 'tsc_khz' by
>>>>>> parsing the 'Refined TSC' frequency from log files or by examining
>>>>>> the
>>>>>> running kernel with objdump to obtain this value & figure out 'mult'
>>>>>> &
>>>>>> 'shift' themselves.
>>>>>>
>>>>>> And why not a
>>>>>>   /sys/devices/system/clocksource/clocksource0/value
>>>>>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>>>>>> expression as a long integer?
>>>>>> And perhaps a
>>>>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>>>>>> file that actually prints out the number of real-time nano-seconds
>>>>>> since
>>>>>> the
>>>>>> contents of the existing
>>>>>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>>>>>> files using the current TSC value?
>>>>>> To read the rtc0/{date,time} files is already faster than entering
>>>>>> the
>>>>>> kernel to call
>>>>>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts.
>>>>>>
>>>>>> I will work on developing a patch to this effect if no-one else is.
>>>>>>
>>>>>> Also, am I right in assuming that the maximum granularity of the
>>>>>> real-time
>>>>>> clock
>>>>>> on my system is 1/64th of a second ? :
>>>>>>  $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>>>>>>  64
>>>>>> This is the maximum granularity that can be stored in CMOS , not
>>>>>> returned by TSC? Couldn't we have something similar that gave an
>>>>>> accurate idea of TSC frequency and the precise formula applied to TSC
>>>>>> value to get clock_gettime
>>>>>> (CLOCK_MONOTONIC_RAW) value ?
>>>>>>
>>>>>> Regards,
>>>>>> Jason
>>>>>>
>>>>>>
>>>>>> This code does produce good timestamps with a latency of @20ns
>>>>>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>>>>>> values, but it depends on a global variable that  is initialized to
>>>>>> the 'tsc_khz' value
>>>>>> computed by running kernel parsed from objdump /proc/kcore output :
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> U64_t
>>>>>> IA64_tsc_now()
>>>>>> { if(!(    _ia64_invariant_tsc_enabled
>>>>>>       ||(( _cpu0id_fd == -1) &&
>>>>>> IA64_invariant_tsc_is_enabled(NULL,NULL))
>>>>>>       )
>>>>>>     )
>>>>>>   { fprintf(stderr, __FILE__ ":%d:(%s): must be called with "
>>>>>>             "invariant TSC enabled.\n", __LINE__, __func__);
>>>>>>     return 0;
>>>>>>   }
>>>>>>   U32_t tsc_hi, tsc_lo;
>>>>>>   register UL_t tsc;
>>>>>>   asm volatile
>>>>>>   ( "rdtscp\n\t"
>>>>>>     "mov %%edx, %0\n\t"
>>>>>>     "mov %%eax, %1\n\t"
>>>>>>     "mov %%ecx, %2\n\t"
>>>>>>   : "=m" (tsc_hi) ,
>>>>>>     "=m" (tsc_lo) ,
>>>>>>     "=m" (_ia64_tsc_user_cpu) :
>>>>>>   : "%eax","%ecx","%edx"
>>>>>>   );
>>>>>>   tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>>>>>>   return tsc;
>>>>>> }
>>>>>>
>>>>>> __thread
>>>>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> U64_t IA64_tsc_ticks_since_start()
>>>>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>>>>>   { _ia64_first_tsc = IA64_tsc_now();
>>>>>>     return 0;
>>>>>>   }
>>>>>>   return (IA64_tsc_now() - _ia64_first_tsc) ;
>>>>>> }
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> void
>>>>>> ia64_tsc_calc_mult_shift
>>>>>> ( register U32_t *mult,
>>>>>>   register U32_t *shift
>>>>>> )
>>>>>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift()
>>>>>> function:
>>>>>>    * calculates second + nanosecond mult + shift in same way linux
>>>>>> does.
>>>>>>    * we want to be compatible with what linux returns in struct
>>>>>> timespec ts after call to
>>>>>>    * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>>>>>    */
>>>>>>   const U32_t scale=1000U;
>>>>>>   register U32_t from= IA64_tsc_khz();
>>>>>>   register U32_t to  = NSEC_PER_SEC / scale;
>>>>>>   register U64_t sec = ( ~0UL / from ) / scale;
>>>>>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>>>>>   register U64_t maxsec = sec * scale;
>>>>>>   UL_t tmp;
>>>>>>   U32_t sft, sftacc=32;
>>>>>>   /*
>>>>>>    * Calculate the shift factor which is limiting the conversion
>>>>>>    * range:
>>>>>>    */
>>>>>>   tmp = (maxsec * from) >> 32;
>>>>>>   while (tmp)
>>>>>>   { tmp >>=1;
>>>>>>     sftacc--;
>>>>>>   }
>>>>>>   /*
>>>>>>    * Find the conversion shift/mult pair which has the best
>>>>>>    * accuracy and fits the maxsec conversion range:
>>>>>>    */
>>>>>>   for (sft = 32; sft > 0; sft--)
>>>>>>   { tmp = ((UL_t) to) << sft;
>>>>>>     tmp += from / 2;
>>>>>>     tmp = tmp / from;
>>>>>>     if ((tmp >> sftacc) == 0)
>>>>>>       break;
>>>>>>   }
>>>>>>   *mult = tmp;
>>>>>>   *shift = sft;
>>>>>> }
>>>>>>
>>>>>> __thread
>>>>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> U64_t IA64_s_ns_since_start()
>>>>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>>>>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>>>>>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>>>>>   register U64_t ns = ((cycles
>>>>>> *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
>>>>>>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns %
>>>>>> NSEC_PER_SEC)&0x3fffffffUL) );
>>>>>>   /* Yes, we are purposefully ignoring durations of more than 4.2
>>>>>> billion seconds here! */
>>>>>> }
>>>>>>
>>>>>>
>>>>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>>>>>> somehow,
>>>>>> then user-space libraries could have more confidence in using 'rdtsc'
>>>>>> or 'rdtscp'
>>>>>> if Linux's current_clocksource is 'tsc'.
>>>>>>
>>>>>> Regards,
>>>>>> Jason
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 20/02/2017, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>>>>>
>>>>>>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>>>>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 ,
>>>>>>>> so
>>>>>>>> in detect_art() in tsc.c,
>>>>>>>
>>>>>>> By some definition of available. You can feed CPUID random leaf
>>>>>>> numbers
>>>>>>> and
>>>>>>> it will return something, usually the value of the last valid CPUID
>>>>>>> leaf,
>>>>>>> which is 13 on your CPU. A similar CPU model has
>>>>>>>
>>>>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
>>>>>>> edx=0x00000000
>>>>>>>
>>>>>>> i.e. 7, 832, 832, 0
>>>>>>>
>>>>>>> Looks familiar, right?
>>>>>>>
>>>>>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>>>>>
>>>>>>>> Linux does not think ART is enabled, and does not set the
>>>>>>>> synthesized
>>>>>>>> CPUID +
>>>>>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>>>>>>>> see this bit set .
>>>>>>>
>>>>>>> Rightfully so. This is a Haswell Core model.
>>>>>>>
>>>>>>>> if an e1000 NIC card had been installed, PTP would not be
>>>>>>>> available.
>>>>>>>
>>>>>>> PTP is independent of the ART kernel feature . ART just provides
>>>>>>> enhanced
>>>>>>> PTP features. You are confusing things here.
>>>>>>>
>>>>>>> The ART feature as the kernel sees it is a hardware extension which
>>>>>>> feeds
>>>>>>> the ART clock to peripherals for timestamping and time correlation
>>>>>>> purposes. The ratio between ART and TSC is described by CPUID leaf
>>>>>>> 0x15
>>>>>>> so
>>>>>>> the kernel can make use of that correlation, e.g. for enhanced PTP
>>>>>>> accuracy.
>>>>>>>
>>>>>>> It's correct, that the NONSTOP_TSC feature depends on the
>>>>>>> availability
>>>>>>> of
>>>>>>> ART, but that has nothing to do with the feature bit, which solely
>>>>>>> describes the ratio between TSC and the ART frequency which is
>>>>>>> exposed
>>>>>>> to
>>>>>>> peripherals. That frequency is not necessarily the real ART
>>>>>>> frequency.
>>>>>>>
>>>>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems
>>>>>>>> to
>>>>>>>> be
>>>>>>>> nowhere else in Linux,  the code will always think X86_FEATURE_ART
>>>>>>>> is
>>>>>>>> 0
>>>>>>>> because the CPU will always get a fault reading the MSR since it
>>>>>>>> has
>>>>>>>> never been written.
>>>>>>>
>>>>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>>>>>> really
>>>>>>> wrong. And writing it unconditionally to 0 is not going to happen.
>>>>>>> 4.10
>>>>>>> has
>>>>>>> new code which utilizes the TSC_ADJUST MSR.
>>>>>>>
>>>>>>>> It would be nice if user-space programs that want to use the TSC
>>>>>>>> with
>>>>>>>> rdtsc / rdtscp instructions, such as the demo program attached to
>>>>>>>> the
>>>>>>>> bug report,
>>>>>>>> could have confidence that Linux is actually generating the results
>>>>>>>> of
>>>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>>>>>> in a predictable way from the TSC by looking at the
>>>>>>>>  /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>>>>>>>> use of TSC values, so that they can correlate TSC values with linux
>>>>>>>> clock_gettime() values.
>>>>>>>
>>>>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>>>>>
>>>>>>> Nothing at all, really.
>>>>>>>
>>>>>>> The kernel makes use of the proper information values already.
>>>>>>>
>>>>>>> The TSC frequency is determined from:
>>>>>>>
>>>>>>>     1) CPUID(0x16) if available
>>>>>>>     2) MSRs if available
>>>>>>>     3) By calibration against a known clock
>>>>>>>
>>>>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_*
>>>>>>> values
>>>>>>> are
>>>>>>> correct whether that machine has ART exposed to peripherals or not.
>>>>>>>
>>>>>>>> has tsc: 1 constant: 1
>>>>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>>>>>
>>>>>>> And that voodoo math tells us what? That you found a way to
>>>>>>> correlate
>>>>>>> CPUID(0xd) to the TSC frequency on that machine.
>>>>>>>
>>>>>>> Now I'm curious how you do that on this other machine which returns
>>>>>>> for
>>>>>>> cpuid(15): 1, 1, 1
>>>>>>>
>>>>>>> You can't because all of this is completely wrong.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> 	tglx
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

[-- Attachment #2: ttsc.tar --]
[-- Type: application/x-tar, Size: 40960 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-02-23 18:05 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-19  0:31 [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609 Jason Vas Dias
2017-02-19 15:35 ` Jason Vas Dias
2017-02-20 21:49   ` Thomas Gleixner
2017-02-20 21:49     ` Thomas Gleixner
2017-02-21 23:39     ` Jason Vas Dias
2017-02-21 23:39       ` Jason Vas Dias
2017-02-22 16:07       ` Jason Vas Dias
2017-02-22 16:18         ` Jason Vas Dias
2017-02-22 16:18           ` Jason Vas Dias
2017-02-22 17:27           ` Jason Vas Dias
2017-02-22 17:27             ` Jason Vas Dias
2017-02-22 19:53             ` Thomas Gleixner
2017-02-22 19:53               ` Thomas Gleixner
2017-02-22 20:15             ` Jason Vas Dias
2017-02-22 20:15               ` Jason Vas Dias
2017-02-22 20:26               ` Jason Vas Dias
2017-02-23 18:05                 ` Jason Vas Dias
