linux-kernel.vger.kernel.org archive mirror
* [RFC patch 00/15] Tracer Timestamping
@ 2008-10-16 23:27 Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 01/15] get_cycles() : kconfig HAVE_GET_CYCLES Mathieu Desnoyers
                   ` (15 more replies)
  0 siblings, 16 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller

Hi,

Starting with the bottom of my LTTng patchset
(git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git)
I am posting as RFC the timestamping infrastructure I have been using for a
while in the tracer. It integrates the get_cycles() standardization I did more
recently, following David Miller's comments.

It also deals with the 32 -> 64 bits timestamp counter extension, using an
RCU-style algorithm, which is especially useful on the MIPS and SuperH
architectures.

There is also a TSC synchronization test within this patchset to detect
unsynchronized TSCs. See the comments in that specific patch for how it
differs from the current x86 tsc_sync.c.

It also provides an architecture-agnostic fallback in case there is no
timestamp counter available : basically, it's
(jiffies << 13) | atomically_incremented_counter (if there are more than 8192
events per jiffy, time will still be monotonic, but will increment faster than
the actual system frequency).
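
To illustrate, the fallback boils down to something like the following rough
sketch (names are made up; the real implementation, in the timestamping patch
of this series, uses an atomic cmpxchg loop for the tick update instead of
this racy atomic_set()):

#include <asm/atomic.h>

#define FALLBACK_CLOCK_SHIFT	13		/* 8192 events per jiffy */

static atomic_t fallback_clock;

/* Timer tick : advance the jiffies part, clear the low event bits. */
static void fallback_clock_tick(unsigned long ticks)
{
	atomic_set(&fallback_clock,
		   (atomic_read(&fallback_clock)
		    + (ticks << FALLBACK_CLOCK_SHIFT))
		   & ~((1 << FALLBACK_CLOCK_SHIFT) - 1));
}

/* Per-event read : monotonic even without any hardware cycle counter. */
static u32 fallback_clock_read(void)
{
	return atomic_add_return(1, &fallback_clock);
}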

Comments are welcome. Note that this is only the beginning of the patchset. I
plan to submit the event ID allocation/portable event typing (aimed at
exporting the data to userspace) and the buffering mechanism as soon as I
integrate a core version of the LTTV userspace tools into the kernel build
tree. Other than that, I currently have a tracer which fulfills most of the
requirements expressed earlier. I just fear that if I release only the kernel
part, without a foolproof binary-to-ASCII trace decoder within the kernel,
people might be a bit reluctant to fetch a separate userspace package.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 01/15] get_cycles() : kconfig HAVE_GET_CYCLES
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 02/15] get_cycles() : x86 HAVE_GET_CYCLES Mathieu Desnoyers
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ingo Molnar

[-- Attachment #1: get-cycles-kconfig-have-get-cycles.patch --]
[-- Type: text/plain, Size: 2003 bytes --]

Create a new "HAVE_GET_CYCLES" architecture option to specify which
architectures provide 64-bits TSC counters readable with get_cycles(). It's
principally useful to only enable high-precision tracing code only on such
architectures and don't even bother building it on architectures which lack such
support.

It also requires architectures to provide get_cycles_barrier() and
get_cycles_rate().

I mainly use it for the "priority-sifting rwlock" latency tracing code, which
traces the worst-case latency induced by the locking. It also provides the
basic changes needed for the LTTng timestamping infrastructure. A sketch of
the expected architecture interface follows.
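
For reference, here is a rough sketch of what an architecture selecting
HAVE_GET_CYCLES is expected to provide in its asm/timex.h (illustration only,
modeled on the x86 implementation in the next patch; my_arch_tsc_rate is a
made-up placeholder):

static inline cycles_t get_cycles_rate(void)
{
	/*
	 * Rate of the cycle counter. Must return 0 when the TSCs are not
	 * synchronized across CPUs or when cpufreq may change the rate.
	 */
	return my_arch_tsc_rate;	/* hypothetical per-arch variable */
}

static inline void get_cycles_barrier(void)
{
	/*
	 * Instruction synchronization barrier around the counter read, if
	 * the architecture needs one (x86 uses rdtsc_barrier() here).
	 */
}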

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
 init/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux-2.6-lttng/init/Kconfig
===================================================================
--- linux-2.6-lttng.orig/init/Kconfig	2008-10-16 11:25:07.000000000 -0400
+++ linux-2.6-lttng/init/Kconfig	2008-10-16 11:32:26.000000000 -0400
@@ -323,6 +323,16 @@ config CPUSETS
 config HAVE_UNSTABLE_SCHED_CLOCK
 	bool
 
+#
+# Architectures with a 64-bit get_cycles() should select this.
+# They should also define
+# get_cycles_barrier() : instruction synchronization barrier if required
+# get_cycles_rate() : cycle counter rate, in HZ. If 0, TSCs are not synchronized
+# across CPUs or their frequency may vary due to frequency scaling.
+#
+config HAVE_GET_CYCLES
+	def_bool n
+
 config GROUP_SCHED
 	bool "Group CPU scheduler"
 	depends on EXPERIMENTAL

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 02/15] get_cycles() : x86 HAVE_GET_CYCLES
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 01/15] get_cycles() : kconfig HAVE_GET_CYCLES Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 03/15] get_cycles() : sparc64 HAVE_GET_CYCLES Mathieu Desnoyers
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ingo Molnar

[-- Attachment #1: get-cycles-x86-have-get-cycles.patch --]
[-- Type: text/plain, Size: 1904 bytes --]

This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
get_cycles_rate() are implemented.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
 arch/x86/Kconfig      |    1 +
 include/asm-x86/tsc.h |   12 ++++++++++++
 2 files changed, 13 insertions(+)

Index: linux-2.6-lttng/arch/x86/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig	2008-10-16 13:07:04.000000000 -0400
+++ linux-2.6-lttng/arch/x86/Kconfig	2008-10-16 13:07:12.000000000 -0400
@@ -19,6 +19,7 @@ config X86_64
 config X86
 	def_bool y
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_GET_CYCLES
 	select HAVE_IDE
 	select HAVE_OPROFILE
 	select HAVE_IOREMAP_PROT
Index: linux-2.6-lttng/include/asm-x86/tsc.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/tsc.h	2008-10-16 13:07:04.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/tsc.h	2008-10-16 13:07:24.000000000 -0400
@@ -50,6 +50,18 @@ extern void mark_tsc_unstable(char *reas
 extern int unsynchronized_tsc(void);
 int check_tsc_unstable(void);
 
+static inline cycles_t get_cycles_rate(void)
+{
+	if (check_tsc_unstable())
+		return 0;
+	return tsc_khz;
+}
+
+static inline void get_cycles_barrier(void)
+{
+	rdtsc_barrier();
+}
+
 /*
  * Boot-time check whether the TSCs are synchronized across
  * all CPUs/cores:

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 03/15] get_cycles() : sparc64 HAVE_GET_CYCLES
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 01/15] get_cycles() : kconfig HAVE_GET_CYCLES Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 02/15] get_cycles() : x86 HAVE_GET_CYCLES Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-17  2:48   ` [RFC patch 03/15] get_cycles() : sparc64 HAVE_GET_CYCLES (update) Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES Mathieu Desnoyers
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ingo Molnar

[-- Attachment #1: get-cycles-sparc64-have-get-cycles.patch --]
[-- Type: text/plain, Size: 2084 bytes --]

This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
get_cycles_rate() are implemented.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
 arch/sparc/include/asm/timex_64.h |   17 ++++++++++++++++-
 arch/sparc64/Kconfig              |    1 +
 2 files changed, 17 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/arch/sparc64/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/sparc64/Kconfig	2008-10-16 13:07:04.000000000 -0400
+++ linux-2.6-lttng/arch/sparc64/Kconfig	2008-10-16 18:14:45.000000000 -0400
@@ -14,6 +14,7 @@ config SPARC64
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_FTRACE
 	select HAVE_IDE
+	select HAVE_GET_CYCLES
 	select HAVE_LMB
 	select HAVE_ARCH_KGDB
 	select USE_GENERIC_SMP_HELPERS if SMP
Index: linux-2.6-lttng/arch/sparc/include/asm/timex_64.h
===================================================================
--- linux-2.6-lttng.orig/arch/sparc/include/asm/timex_64.h	2008-10-16 13:07:04.000000000 -0400
+++ linux-2.6-lttng/arch/sparc/include/asm/timex_64.h	2008-10-16 18:14:25.000000000 -0400
@@ -12,7 +12,22 @@
 
 /* Getting on the cycle counter on sparc64. */
 typedef unsigned long cycles_t;
-#define get_cycles()	tick_ops->get_tick()
+
+static inline cycles_t get_cycles(void)
+{
+	return tick_ops->get_tick();
+}
+
+/* get_cycles instruction is synchronized on sparc64 */
+static inline void get_cycles_barrier(void)
+{
+	return;
+}
+
+static inline cycles_t get_cycles_rate(void)
+{
+	return CLOCK_TICK_RATE;
+}
 
 #define ARCH_HAS_READ_CURRENT_TIMER
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 03/15] get_cycles() : sparc64 HAVE_GET_CYCLES Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-17  0:26   ` Paul Mackerras
  2008-10-16 23:27 ` [RFC patch 05/15] get_cycles() : MIPS HAVE_GET_CYCLES_32 Mathieu Desnoyers
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, benh, paulus, Ingo Molnar

[-- Attachment #1: get-cycles-powerpc-have-get-cycles.patch --]
[-- Type: text/plain, Size: 1994 bytes --]

This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
get_cycles_rate() are implemented.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: benh@kernel.crashing.org
CC: paulus@samba.org
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
 arch/powerpc/Kconfig             |    1 +
 arch/powerpc/include/asm/timex.h |   13 +++++++++++++
 2 files changed, 14 insertions(+)

Index: linux-2.6-lttng/arch/powerpc/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/Kconfig	2008-10-16 12:25:48.000000000 -0400
+++ linux-2.6-lttng/arch/powerpc/Kconfig	2008-10-16 12:33:31.000000000 -0400
@@ -122,6 +122,7 @@ config PPC
 	select HAVE_DMA_ATTRS if PPC64
 	select USE_GENERIC_SMP_HELPERS if SMP
 	select HAVE_OPROFILE
+	select HAVE_GET_CYCLES if PPC64
 
 config EARLY_PRINTK
 	bool
Index: linux-2.6-lttng/arch/powerpc/include/asm/timex.h
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/include/asm/timex.h	2008-10-16 12:25:48.000000000 -0400
+++ linux-2.6-lttng/arch/powerpc/include/asm/timex.h	2008-10-16 12:33:58.000000000 -0400
@@ -46,5 +46,18 @@ static inline cycles_t get_cycles(void)
 #endif
 }
 
+static inline cycles_t get_cycles_rate(void)
+{
+	return CLOCK_TICK_RATE;
+}
+
+/*
+ * To check : assuming mftb requires isync to synchronize instruction execution.
+ */
+static inline void get_cycles_barrier(void)
+{
+	isync();
+}
+
 #endif	/* __KERNEL__ */
 #endif	/* _ASM_POWERPC_TIMEX_H */

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 05/15] get_cycles() : MIPS HAVE_GET_CYCLES_32
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-26 10:18   ` Ralf Baechle
  2008-10-16 23:27 ` [RFC patch 06/15] LTTng build Mathieu Desnoyers
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ralf Baechle, Ingo Molnar

[-- Attachment #1: get-cycles-mips-have-get-cycles.patch --]
[-- Type: text/plain, Size: 2827 bytes --]

Partly reverts commit efb9ca08b5a2374b29938cdcab417ce4feb14b54. Selects
HAVE_GET_CYCLES_32 only on CPUs where it is safe to use it.

Currently considers the "_WORKAROUNDS" cases for R4000 and R4400 to be unsafe;
other sub-architectures should probably be added to the blacklist.

Does not define HAVE_GET_CYCLES because MIPS does not provide a 64-bit TSC
(only 32 bits).

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
 arch/mips/Kconfig        |    4 ++++
 include/asm-mips/timex.h |   25 +++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

Index: linux-2.6-lttng/include/asm-mips/timex.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-mips/timex.h	2008-10-16 12:25:47.000000000 -0400
+++ linux-2.6-lttng/include/asm-mips/timex.h	2008-10-16 12:34:18.000000000 -0400
@@ -29,14 +29,39 @@
  * which isn't an evil thing.
  *
  * We know that all SMP capable CPUs have cycle counters.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ * HAVE_GET_CYCLES makes sure that this case is handled properly :
+ *
+ * Ralf Baechle <ralf@linux-mips.org> :
+ * This avoids us executing an mfc0 c0_count instruction on processors which
+ * don't have one, but also on certain R4000 and R4400 versions where reading from
+ * the count register just in the very moment when its value equals c0_compare
+ * will result in the timer interrupt getting lost.
  */
 
 typedef unsigned int cycles_t;
 
+#ifdef HAVE_GET_CYCLES_32
+static inline cycles_t get_cycles(void)
+{
+	return read_c0_count();
+}
+
+static inline void get_cycles_barrier(void)
+{
+}
+
+static inline cycles_t get_cycles_rate(void)
+{
+	return CLOCK_TICK_RATE;
+}
+#else
 static inline cycles_t get_cycles(void)
 {
 	return 0;
 }
+#endif
 
 #endif /* __KERNEL__ */
 
Index: linux-2.6-lttng/arch/mips/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/mips/Kconfig	2008-10-16 12:25:47.000000000 -0400
+++ linux-2.6-lttng/arch/mips/Kconfig	2008-10-16 12:34:01.000000000 -0400
@@ -1581,6 +1581,10 @@ config CPU_R4000_WORKAROUNDS
 config CPU_R4400_WORKAROUNDS
 	bool
 
+config HAVE_GET_CYCLES_32
+	def_bool y
+	depends on !CPU_R4400_WORKAROUNDS
+
 #
 # Use the generic interrupt handling code in kernel/irq/:
 #

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 06/15] LTTng build
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 05/15] get_cycles() : MIPS HAVE_GET_CYCLES_32 Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-17  8:10   ` KOSAKI Motohiro
  2008-10-16 23:27 ` [RFC patch 07/15] LTTng timestamp Mathieu Desnoyers
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ingo Molnar, William L. Irwin

[-- Attachment #1: lttng-build-instrumentation-menu.patch --]
[-- Type: text/plain, Size: 2138 bytes --]

Adds the basic LTTng config options.

sparc32 needs to have ltt/ added to core-y for some reason (seems broken).

ltt/Kconfig is sourced from init/Kconfig (see the LTTng timestamp patch later
in this series).

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: William L. Irwin <wli@holomorphy.com>
---
 Makefile            |    2 +-
 arch/sparc/Makefile |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6-lttng/arch/sparc/Makefile
===================================================================
--- linux-2.6-lttng.orig/arch/sparc/Makefile	2008-10-16 18:16:33.000000000 -0400
+++ linux-2.6-lttng/arch/sparc/Makefile	2008-10-16 18:16:38.000000000 -0400
@@ -37,7 +37,7 @@ drivers-$(CONFIG_OPROFILE)	+= arch/sparc
 # Renaming is done to avoid confusing pattern matching rules in 2.5.45 (multy-)
 INIT_Y		:= $(patsubst %/, %/built-in.o, $(init-y))
 CORE_Y		:= $(core-y)
-CORE_Y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+CORE_Y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ltt/
 CORE_Y		:= $(patsubst %/, %/built-in.o, $(CORE_Y))
 DRIVERS_Y	:= $(patsubst %/, %/built-in.o, $(drivers-y))
 NET_Y		:= $(patsubst %/, %/built-in.o, $(net-y))
Index: linux-2.6-lttng/Makefile
===================================================================
--- linux-2.6-lttng.orig/Makefile	2008-10-16 18:16:33.000000000 -0400
+++ linux-2.6-lttng/Makefile	2008-10-16 18:16:38.000000000 -0400
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ltt/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 07/15] LTTng timestamp
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 06/15] LTTng build Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-17  8:15   ` KOSAKI Motohiro
  2008-10-16 23:27 ` [RFC patch 08/15] LTTng - Timestamping Mathieu Desnoyers
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ralf Baechle, benh, paulus, Ingo Molnar

[-- Attachment #1: lttng-timestamp-core.patch --]
[-- Type: text/plain, Size: 10339 bytes --]

LTTng synthetic TSC code for timestamping. Extracts a 64-bit TSC from a
[0..32]-bit counter, kept up to date by a periodic timer interrupt. Lockless.

> do you actually use the RCU internals? or do you just reimplement an RCU
> algorithm?
> 

Nope, I don't use RCU internals in this code. Preempt disable seemed
like the best way to handle this utterly short code path and I wanted
the write side to be fast enough to be called periodically. What I do is:

- Disable preemption at the read-side :
  it makes sure the pointer I get will point to a data structure that
  will never change while I am in the preempt disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible
  (only need to disable preemption, does not race against other CPUs and
  won't generate cache line bouncing). It also allows dealing with
  unsynchronized TSCs if needed.
- Periodic write side : it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt off region) to be shorter than the
interval between IPI updates, so we can guarantee the data structure it uses
won't be modified underneath it. Since the IPI update is launched every second
or so (depending on the frequency of the counter we are trying to extend),
it's more than OK.

Changelog:

- Support [0..32]-bit -> 64-bit extension.

I voluntarily limit the code to use at most 32 bits of the hardware clock for
performance considerations. If this is a problem it could be changed. Also, the
algorithm is aimed at 32-bit architectures; the code becomes much simpler on
a 64-bit arch, since we can do the updates atomically.
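
For comparison, here is roughly what the 64-bit case could look like (a
hypothetical sketch, not part of this patch, reusing the HW_BITS/HW_BITMASK
definitions from the code below): an aligned 64-bit store is atomic on a
64-bit architecture, so the two-buffer index scheme disappears.

static DEFINE_PER_CPU(u64, synthetic_tsc64);

/* Per-CPU update, called from the periodic IPI. */
static void update_synthetic_tsc64(void)
{
	u64 prev = __get_cpu_var(synthetic_tsc64);
	u32 tsc = ltt_get_timestamp32();	/* hardware clocksource read */
	u64 new = (prev & ~HW_BITMASK) | tsc;

	if (tsc < (u32)(prev & HW_BITMASK))	/* 32-bit wraparound */
		new += 1ULL << HW_BITS;
	__get_cpu_var(synthetic_tsc64) = new;	/* atomic 64-bit store */
}

The read side would then simply load the per-CPU u64 atomically and apply the
same wraparound check against a fresh hardware counter read.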

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
CC: benh@kernel.crashing.org
CC: paulus@samba.org
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
 init/Kconfig        |    2 
 ltt/Kconfig         |   17 ++++
 ltt/Makefile        |    1 
 ltt/ltt-timestamp.c |  212 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 232 insertions(+)

Index: linux-2.6-lttng/ltt/ltt-timestamp.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/ltt/ltt-timestamp.c	2008-10-16 18:30:25.000000000 -0400
@@ -0,0 +1,212 @@
+/*
+ * (C) Copyright	2006,2007 -
+ * 		Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
+ *
+ * Notes : the ltt-timestamp timer-based clock cannot be used for early tracing
+ * in the boot process, as it depends on timer interrupts.
+ *
+ * The timer is only on one CPU to support hotplug.
+ * We have the choice between schedule_delayed_work_on and an IPI to get each
+ * CPU to write the heartbeat. IPI has been chosen because it is considered
+ * faster than passing through the timer to get the work scheduled on all the
+ * CPUs.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/delay.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+#include <linux/cpu.h>
+#include <linux/timex.h>
+#include <linux/bitops.h>
+#include <linux/ltt.h>
+#include <linux/smp.h>
+#include <linux/sched.h> /* FIX for m68k local_irq_enable in on_each_cpu */
+
+/*
+ * Number of hardware clock bits. The higher order bits are expected to be 0.
+ * If the hardware clock source has more than 32 bits, the bits higher than the
+ * 32nd will be truncated by a cast to a 32-bit unsigned. Range : 1 - 32.
+ * (too few bits would be unrealistic though, since we depend on the timer to
+ * detect the overflows).
+ */
+#define HW_BITS				32
+
+#define HW_BITMASK			((1ULL << HW_BITS) - 1)
+#define HW_LSB(hw)			((hw) & HW_BITMASK)
+#define SW_MSB(sw)			((sw) & ~HW_BITMASK)
+
+/* Expected maximum interrupt latency in ms : 15ms, *2 for security */
+#define EXPECTED_INTERRUPT_LATENCY	30
+
+atomic_t lttng_generic_clock;
+EXPORT_SYMBOL(lttng_generic_clock);
+
+static struct timer_list stsc_timer;
+static unsigned int precalc_expire;
+
+struct synthetic_tsc_struct {
+	union {
+		u64 val;
+		struct {
+#ifdef __BIG_ENDIAN
+			u32 msb;
+			u32 lsb;
+#else
+			u32 lsb;
+			u32 msb;
+#endif
+		} sel;
+	} tsc[2];
+	unsigned int index;	/* Index of the current synth. tsc. */
+};
+
+static DEFINE_PER_CPU(struct synthetic_tsc_struct, synthetic_tsc);
+
+/* Called from IPI : either in interrupt or process context */
+static void ltt_update_synthetic_tsc(void)
+{
+	struct synthetic_tsc_struct *cpu_synth;
+	u32 tsc;
+
+	preempt_disable();
+	cpu_synth = &per_cpu(synthetic_tsc, smp_processor_id());
+	tsc = ltt_get_timestamp32();		/* Hardware clocksource read */
+
+	if (tsc < HW_LSB(cpu_synth->tsc[cpu_synth->index].sel.lsb)) {
+		unsigned int new_index = 1 - cpu_synth->index; /* 0 <-> 1 */
+		/*
+		 * Overflow
+		 * Non atomic update of the non current synthetic TSC, followed
+		 * by an atomic index change. There is no write concurrency,
+		 * so the index read/write does not need to be atomic.
+		 */
+		cpu_synth->tsc[new_index].val =
+			(SW_MSB(cpu_synth->tsc[cpu_synth->index].val)
+				| (u64)tsc) + (1ULL << HW_BITS);
+		cpu_synth->index = new_index;	/* atomic change of index */
+	} else {
+		/*
+		 * No overflow : We know that the only bits changed are
+		 * contained in the 32 LSBs, which can be written to atomically.
+		 */
+		cpu_synth->tsc[cpu_synth->index].sel.lsb =
+			SW_MSB(cpu_synth->tsc[cpu_synth->index].sel.lsb) | tsc;
+	}
+	preempt_enable();
+}
+
+/* Called from buffer switch : in _any_ context (even NMI) */
+u64 notrace ltt_read_synthetic_tsc(void)
+{
+	struct synthetic_tsc_struct *cpu_synth;
+	u64 ret;
+	unsigned int index;
+	u32 tsc;
+
+	preempt_disable_notrace();
+	cpu_synth = &per_cpu(synthetic_tsc, smp_processor_id());
+	index = cpu_synth->index;		/* atomic read */
+	tsc = ltt_get_timestamp32();		/* Hardware clocksource read */
+
+	/* Overflow detection */
+	if (unlikely(tsc < HW_LSB(cpu_synth->tsc[index].sel.lsb)))
+		ret = (SW_MSB(cpu_synth->tsc[index].val) | (u64)tsc)
+			+ (1ULL << HW_BITS);
+	else
+		ret = SW_MSB(cpu_synth->tsc[index].val) | (u64)tsc;
+	preempt_enable_notrace();
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ltt_read_synthetic_tsc);
+
+static void synthetic_tsc_ipi(void *info)
+{
+	ltt_update_synthetic_tsc();
+}
+
+/* We need to be in process context to do an IPI */
+static void synthetic_tsc_work(struct work_struct *work)
+{
+	on_each_cpu(synthetic_tsc_ipi, NULL, 1);
+}
+static DECLARE_WORK(stsc_work, synthetic_tsc_work);
+
+/*
+ * stsc_timer : - Timer function synchronizing synthetic TSC.
+ * @data: unused
+ *
+ * Guarantees at least 1 execution before low word of TSC wraps.
+ */
+static void stsc_timer_fct(unsigned long data)
+{
+	PREPARE_WORK(&stsc_work, synthetic_tsc_work);
+	schedule_work(&stsc_work);
+
+	mod_timer(&stsc_timer, jiffies + precalc_expire);
+}
+
+/*
+ * precalc_stsc_interval: - Precalculates the interval between the clock
+ * wraparounds.
+ */
+static int __init precalc_stsc_interval(void)
+{
+	precalc_expire =
+		(HW_BITMASK / ((ltt_frequency() / HZ * ltt_freq_scale()) << 1)
+			- 1 - (EXPECTED_INTERRUPT_LATENCY * HZ / 1000)) >> 1;
+	WARN_ON(precalc_expire == 0);
+	printk(KERN_DEBUG "Synthetic TSC timer will fire each %u jiffies.\n",
+		precalc_expire);
+	return 0;
+}
+
+/*
+ * 	hotcpu_callback - CPU hotplug callback
+ * 	@nb: notifier block
+ * 	@action: hotplug action to take
+ * 	@hcpu: CPU number
+ *
+ *	Sets the new CPU's current synthetic TSC to the same value as the
+ *	currently running CPU.
+ *
+ * 	Returns the success/failure of the operation. (NOTIFY_OK, NOTIFY_BAD)
+ */
+static int __cpuinit hotcpu_callback(struct notifier_block *nb,
+				unsigned long action,
+				void *hcpu)
+{
+	unsigned int hotcpu = (unsigned long)hcpu;
+	struct synthetic_tsc_struct *cpu_synth;
+	u64 local_count;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+		cpu_synth = &per_cpu(synthetic_tsc, hotcpu);
+		local_count = ltt_read_synthetic_tsc();
+		cpu_synth->tsc[0].val = local_count;
+		cpu_synth->index = 0;
+		smp_wmb();	/* Writing in data of CPU about to come up */
+		break;
+	case CPU_ONLINE:
+		/* As we are preemptible, make sure it runs on the right cpu */
+		smp_call_function_single(hotcpu, synthetic_tsc_ipi, NULL, 0);
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+/* Called from one CPU, before any tracing starts, to init each structure */
+static int __init ltt_init_synthetic_tsc(void)
+{
+	hotcpu_notifier(hotcpu_callback, 3);
+	precalc_stsc_interval();
+	init_timer(&stsc_timer);
+	stsc_timer.function = stsc_timer_fct;
+	stsc_timer.expires = jiffies + precalc_expire;
+	add_timer(&stsc_timer);
+	return 0;
+}
+
+__initcall(ltt_init_synthetic_tsc);
Index: linux-2.6-lttng/ltt/Kconfig
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/ltt/Kconfig	2008-10-16 18:30:41.000000000 -0400
@@ -0,0 +1,17 @@
+menu "Linux Trace Toolkit"
+
+config LTT_TIMESTAMP
+	bool "LTTng fine-grained timestamping"
+	default y
+	help
+	  Allow fine-grained timestamps to be taken from tracing applications.
+
+config HAVE_LTT_CLOCK
+	def_bool n
+
+config HAVE_LTT_SYNTHETIC_TSC
+	bool
+	default y if (!HAVE_LTT_CLOCK)
+	default n if HAVE_LTT_CLOCK
+
+endmenu
Index: linux-2.6-lttng/ltt/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/ltt/Makefile	2008-10-16 18:30:41.000000000 -0400
@@ -0,0 +1 @@
+obj-$(CONFIG_HAVE_LTT_SYNTHETIC_TSC)	+= ltt-timestamp.o
Index: linux-2.6-lttng/init/Kconfig
===================================================================
--- linux-2.6-lttng.orig/init/Kconfig	2008-10-16 18:18:38.000000000 -0400
+++ linux-2.6-lttng/init/Kconfig	2008-10-16 18:30:39.000000000 -0400
@@ -787,6 +787,8 @@ config MARKERS
 	  Place an empty function call at each marker site. Can be
 	  dynamically changed for a probe function.
 
+source "ltt/Kconfig"
+
 source "arch/Kconfig"
 
 config PROC_PAGE_MONITOR

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 08/15] LTTng - Timestamping
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 07/15] LTTng timestamp Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 09/15] LTTng mips export hpt frequency Mathieu Desnoyers
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ralf Baechle, benh, paulus, Ingo Molnar

[-- Attachment #1: lttng-timestamp-generic.patch --]
[-- Type: text/plain, Size: 4662 bytes --]

Wrapper to use the lower-level clock sources available on the system. Falls
back on jiffies or'd with a logical clock for architectures lacking CPU
timestamp counters, and falls back on a mixed TSC-logical clock on
architectures lacking synchronized TSCs on SMP.

A generic fallback based on a logical clock and the timer interrupt is
available.

generic - Uses jiffies or'd with a logical clock extended to 64 bits by
          ltt-timestamp.c.
i386 - Uses TSC. If it detects non-synchronized TSCs, uses a mixed TSC-logical
       clock.
mips - Uses TSC extended atomically from 32 to 64 bits by ltt-timestamp.c.
powerpc - Uses TSC or the generic LTT clock.
x86_64 - Uses TSC. If it detects non-synchronized TSCs, uses a mixed
         TSC-logical clock.
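
For reference, a hypothetical caller showing how tracer code is meant to use
the wrapper (sketch only; probe_event() is made up):

#include <linux/ltt.h>

/* Timestamp an event with the best clock the architecture offers. */
static void probe_event(void)
{
	u64 ts = ltt_get_timestamp64();	/* arch clock or synthetic TSC */

	/* ... store ts along with the event in the trace buffer ... */
}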

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
CC: benh@kernel.crashing.org
CC: paulus@samba.org
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
 include/asm-generic/ltt.h |   53 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/ltt.h       |   27 +++++++++++++++++++++++
 kernel/timer.c            |    2 +
 3 files changed, 82 insertions(+)

Index: linux-2.6-lttng/kernel/timer.c
===================================================================
--- linux-2.6-lttng.orig/kernel/timer.c	2008-09-16 14:50:15.000000000 -0400
+++ linux-2.6-lttng/kernel/timer.c	2008-09-16 14:59:17.000000000 -0400
@@ -37,6 +37,7 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include <linux/ltt.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -1066,6 +1067,7 @@ void do_timer(unsigned long ticks)
 {
 	jiffies_64 += ticks;
 	update_times(ticks);
+	ltt_add_timestamp(ticks);
 }
 
 #ifdef __ARCH_WANT_SYS_ALARM
Index: linux-2.6-lttng/include/linux/ltt.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/ltt.h	2008-09-16 14:58:47.000000000 -0400
@@ -0,0 +1,27 @@
+#ifndef _LINUX_LTT_H
+#define _LINUX_LTT_H
+
+/*
+ * Generic LTT clock.
+ *
+ * Chooses between an architecture specific clock or an atomic logical clock.
+ *
+ * Copyright (C) 2007 Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
+ */
+
+#ifdef CONFIG_LTT_TIMESTAMP
+#ifdef CONFIG_HAVE_LTT_CLOCK
+#include <asm/ltt.h>
+#else
+#include <asm-generic/ltt.h>
+
+#define ltt_get_timestamp32	ltt_get_timestamp32_generic
+#define ltt_get_timestamp64	ltt_get_timestamp64_generic
+#define ltt_add_timestamp	ltt_add_timestamp_generic
+#define ltt_frequency		ltt_frequency_generic
+#define ltt_freq_scale		ltt_freq_scale_generic
+#endif /* CONFIG_HAVE_LTT_CLOCK */
+#else
+#define ltt_add_timestamp(ticks)
+#endif /* CONFIG_LTT_TIMESTAMP */
+#endif /* _LINUX_LTT_H */
Index: linux-2.6-lttng/include/asm-generic/ltt.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-generic/ltt.h	2008-09-16 14:58:47.000000000 -0400
@@ -0,0 +1,53 @@
+#ifndef _ASM_GENERIC_LTT_H
+#define _ASM_GENERIC_LTT_H
+
+/*
+ * linux/include/asm-generic/ltt.h
+ *
+ * Copyright (C) 2007 - Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
+ *
+ * Generic definitions for LTT
+ * Architectures without TSC
+ */
+
+#include <linux/param.h>	/* For HZ */
+#include <asm/atomic.h>
+
+#define LTT_GENERIC_CLOCK_SHIFT 13
+
+u64 ltt_read_synthetic_tsc(void);
+
+extern atomic_t lttng_generic_clock;
+
+static inline u32 ltt_get_timestamp32_generic(void)
+{
+	return atomic_add_return(1, &lttng_generic_clock);
+}
+
+static inline u64 ltt_get_timestamp64_generic(void)
+{
+	return ltt_read_synthetic_tsc();
+}
+
+static inline void ltt_add_timestamp_generic(unsigned long ticks)
+{
+	int old_clock, new_clock;
+
+	do {
+		old_clock = atomic_read(&lttng_generic_clock);
+		new_clock = (old_clock + (ticks << LTT_GENERIC_CLOCK_SHIFT))
+			& (~((1 << LTT_GENERIC_CLOCK_SHIFT) - 1));
+	} while (atomic_cmpxchg(&lttng_generic_clock, old_clock, new_clock)
+			!= old_clock);
+}
+
+static inline unsigned int ltt_frequency_generic(void)
+{
+	return HZ << LTT_GENERIC_CLOCK_SHIFT;
+}
+
+static inline u32 ltt_freq_scale_generic(void)
+{
+	return 1;
+}
+#endif /* _ASM_GENERIC_LTT_H */

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 09/15] LTTng mips export hpt frequency
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (7 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 08/15] LTTng - Timestamping Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 10/15] LTTng timestamp mips Mathieu Desnoyers
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ralf Baechle

[-- Attachment #1: lttng-mips-export-hpt-frequency.patch --]
[-- Type: text/plain, Size: 1346 bytes --]

LTTng needs it from modules.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
---
 arch/mips/kernel/time.c  |    1 +
 include/asm-mips/timex.h |    2 ++
 2 files changed, 3 insertions(+)

Index: linux-2.6-lttng/include/asm-mips/timex.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-mips/timex.h	2008-10-16 18:16:37.000000000 -0400
+++ linux-2.6-lttng/include/asm-mips/timex.h	2008-10-16 18:16:44.000000000 -0400
@@ -63,6 +63,8 @@ static inline cycles_t get_cycles(void)
 }
 #endif
 
+extern unsigned int mips_hpt_frequency;
+
 #endif /* __KERNEL__ */
 
 #endif /*  _ASM_TIMEX_H */
Index: linux-2.6-lttng/arch/mips/kernel/time.c
===================================================================
--- linux-2.6-lttng.orig/arch/mips/kernel/time.c	2008-10-16 18:16:32.000000000 -0400
+++ linux-2.6-lttng/arch/mips/kernel/time.c	2008-10-16 18:16:44.000000000 -0400
@@ -70,6 +70,7 @@ EXPORT_SYMBOL(perf_irq);
  */
 
 unsigned int mips_hpt_frequency;
+EXPORT_SYMBOL(mips_hpt_frequency);
 
 void __init clocksource_set_clock(struct clocksource *cs, unsigned int clock)
 {

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 10/15] LTTng timestamp mips
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (8 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 09/15] LTTng mips export hpt frequency Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 11/15] LTTng timestamp powerpc Mathieu Desnoyers
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ralf Baechle

[-- Attachment #1: lttng-timestamp-mips.patch --]
[-- Type: text/plain, Size: 2402 bytes --]

MIPS get_cycles() only returns a 32-bit TSC (see timex.h). The assumption
there is that the reschedule is done every 8 seconds or so. Given that tracing
needs to detect delays longer than 8 seconds, we need a full 64-bit TSC, which
is provided by the LTTng synthetic TSC.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ralf Baechle <ralf@linux-mips.org>
---
 arch/mips/Kconfig      |    2 ++
 include/asm-mips/ltt.h |   44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

Index: linux-2.6-lttng/include/asm-mips/ltt.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-mips/ltt.h	2008-10-16 19:18:48.000000000 -0400
@@ -0,0 +1,44 @@
+/*
+ * Copyright (C) 2005, Mathieu Desnoyers
+ *
+ * MIPS definitions for tracing system
+ */
+
+#ifndef _ASM_MIPS_LTT_H
+#define _ASM_MIPS_LTT_H
+
+#include <linux/timex.h>
+#include <asm/processor.h>
+
+extern u64 ltt_read_synthetic_tsc(void);
+
+/*
+ * MIPS get_cycles() only returns a 32-bit TSC (see timex.h). The assumption
+ * there is that the reschedule is done every 8 seconds or so. Given that
+ * tracing needs to detect delays longer than 8 seconds, we need a full 64-bit
+ * TSC, which is provided by the LTTng synthetic TSC.
+ */
+static inline u32 ltt_get_timestamp32(void)
+{
+	return get_cycles();
+}
+
+static inline u64 ltt_get_timestamp64(void)
+{
+	return ltt_read_synthetic_tsc();
+}
+
+static inline void ltt_add_timestamp(unsigned long ticks)
+{ }
+
+static inline unsigned int ltt_frequency(void)
+{
+	return mips_hpt_frequency;
+}
+
+static inline u32 ltt_freq_scale(void)
+{
+	return 1;
+}
+
+#endif /* _ASM_MIPS_LTT_H */
Index: linux-2.6-lttng/arch/mips/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/mips/Kconfig	2008-10-16 19:14:08.000000000 -0400
+++ linux-2.6-lttng/arch/mips/Kconfig	2008-10-16 19:16:49.000000000 -0400
@@ -1584,6 +1584,8 @@ config CPU_R4400_WORKAROUNDS
 config HAVE_GET_CYCLES_32
 	def_bool y
 	depends on !CPU_R4400_WORKAROUNDS
+	select HAVE_LTT_CLOCK
+	select HAVE_LTT_SYNTHETIC_TSC
 
 #
 # Use the generic interrupt handling code in kernel/irq/:

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 11/15] LTTng timestamp powerpc
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (9 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 10/15] LTTng timestamp mips Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 12/15] LTTng timestamp sparc64 Mathieu Desnoyers
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, benh, paulus

[-- Attachment #1: lttng-timestamp-powerpc.patch --]
[-- Type: text/plain, Size: 1844 bytes --]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: benh@kernel.crashing.org
CC: paulus@samba.org
CC: linux-arch@vger.kernel.org
---
 arch/powerpc/Kconfig      |    1 +
 include/asm-powerpc/ltt.h |   37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

Index: linux-2.6-lttng/include/asm-powerpc/ltt.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-powerpc/ltt.h	2008-10-16 19:18:51.000000000 -0400
@@ -0,0 +1,37 @@
+/*
+ * Copyright (C) 2005, Mathieu Desnoyers
+ *
+ * POWERPC definitions for tracing system
+ */
+
+#ifndef _ASM_POWERPC_LTT_H
+#define _ASM_POWERPC_LTT_H
+
+#include <linux/timex.h>
+#include <asm/time.h>
+#include <asm/processor.h>
+
+static inline u32 ltt_get_timestamp32(void)
+{
+	return get_tbl();
+}
+
+static inline u64 ltt_get_timestamp64(void)
+{
+	return get_tb();
+}
+
+static inline void ltt_add_timestamp(unsigned long ticks)
+{ }
+
+static inline unsigned int ltt_frequency(void)
+{
+	return tb_ticks_per_sec;
+}
+
+static inline u32 ltt_freq_scale(void)
+{
+	return 1;
+}
+
+#endif /* _ASM_POWERPC_LTT_H */
Index: linux-2.6-lttng/arch/powerpc/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/Kconfig	2008-10-16 19:14:07.000000000 -0400
+++ linux-2.6-lttng/arch/powerpc/Kconfig	2008-10-16 19:20:22.000000000 -0400
@@ -115,6 +115,7 @@ config PPC
 	select HAVE_IOREMAP_PROT
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select HAVE_KPROBES
+	select HAVE_LTT_CLOCK
 	select HAVE_ARCH_KGDB
 	select HAVE_KRETPROBES
 	select HAVE_ARCH_TRACEHOOK

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 12/15] LTTng timestamp sparc64
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (10 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 11/15] LTTng timestamp powerpc Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 13/15] LTTng timestamp sh Mathieu Desnoyers
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers

[-- Attachment #1: lttng-timestamp-sparc64.patch --]
[-- Type: text/plain, Size: 1791 bytes --]

Implement LTTng timestamping for sparc64

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: David Miller <davem@davemloft.net>
CC: linux-arch@vger.kernel.org
---
 arch/sparc/include/asm/ltt.h |   35 +++++++++++++++++++++++++++++++++++
 arch/sparc64/Kconfig         |    1 +
 2 files changed, 36 insertions(+)

Index: linux-2.6-lttng/arch/sparc/include/asm/ltt.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/sparc/include/asm/ltt.h	2008-10-16 19:04:23.000000000 -0400
@@ -0,0 +1,35 @@
+/*
+ * Copyright (C) 2008, Mathieu Desnoyers
+ *
+ * SPARC64 definitions for tracing system
+ */
+
+#ifndef _ASM_SPARC_LTT_H
+#define _ASM_SPARC_LTT_H
+
+#include <linux/timex.h>
+
+static inline u32 ltt_get_timestamp32(void)
+{
+	return get_cycles();
+}
+
+static inline u64 ltt_get_timestamp64(void)
+{
+	return get_cycles();
+}
+
+static inline void ltt_add_timestamp(unsigned long ticks)
+{ }
+
+static inline unsigned int ltt_frequency(void)
+{
+	return CLOCK_TICK_RATE;
+}
+
+static inline u32 ltt_freq_scale(void)
+{
+	return 1;
+}
+
+#endif /* _ASM_SPARC_LTT_H */
Index: linux-2.6-lttng/arch/sparc64/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/sparc64/Kconfig	2008-10-16 18:57:15.000000000 -0400
+++ linux-2.6-lttng/arch/sparc64/Kconfig	2008-10-16 19:02:55.000000000 -0400
@@ -19,6 +19,7 @@ config SPARC64
 	select HAVE_ARCH_KGDB
 	select USE_GENERIC_SMP_HELPERS if SMP
 	select HAVE_ARCH_TRACEHOOK
+	select HAVE_LTT_CLOCK
 
 config GENERIC_TIME
 	bool

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 13/15] LTTng timestamp sh
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (11 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 12/15] LTTng timestamp sparc64 Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 14/15] LTTng - TSC synchronicity test Mathieu Desnoyers
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Giuseppe Cavallaro, Mathieu Desnoyers, Paul Mundt,
	linux-sh

[-- Attachment #1: lttng-timestamp-sh.patch --]
[-- Type: text/plain, Size: 3472 bytes --]

This patch adds the timestamping mechanism to the ltt.h arch header file.
The new timestamp functions use TMU channel 1.

This code only works if TMU channel 1 is initialized during the kernel boot.

Note from Mathieu :
This patch seems to assume TMU channel 1 is set up at boot. Is it always true on
all SuperH boards ? Is there some Kconfig selection that should be done here ?
We could probably extract part of this patch into get_cycles() and remove the
CONFIG_LTT dependency.

From: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Paul Mundt <lethal@linux-sh.org>
CC: linux-sh@vger.kernel.org
---
 arch/sh/Kconfig             |    2 +
 arch/sh/include/asm/ltt.h   |   45 ++++++++++++++++++++++++++++++++++++++++++++
 arch/sh/include/asm/timex.h |    8 ++++++-
 3 files changed, 54 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/arch/sh/include/asm/ltt.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/sh/include/asm/ltt.h	2008-10-16 18:54:57.000000000 -0400
@@ -0,0 +1,45 @@
+/*
+ * Copyright (C) 2007, Giuseppe Cavallaro <peppe.cavallaro@st.com>
+ *                     Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * SuperH definitions for tracing system
+ */
+
+#ifndef _ASM_SH_LTT_H
+#define _ASM_SH_LTT_H
+
+#include <linux/timer.h>
+#include <asm/clock.h>
+
+extern u64 ltt_read_synthetic_tsc(void);
+
+static inline u32 ltt_get_timestamp32(void)
+{
+	return get_cycles();
+}
+
+static inline u64 ltt_get_timestamp64(void)
+{
+	return ltt_read_synthetic_tsc();
+}
+
+static inline void ltt_add_timestamp(unsigned long ticks)
+{ }
+
+static inline unsigned int ltt_frequency(void)
+{
+	unsigned long rate;
+	struct clk *tmu1_clk;
+
+	tmu1_clk = clk_get(NULL, "tmu1_clk");
+	rate = (clk_get_rate(tmu1_clk));
+
+	return (unsigned int)(rate);
+}
+
+static inline u32 ltt_freq_scale(void)
+{
+	return 1;
+}
+
+#endif /* _ASM_SH_LTT_H */
Index: linux-2.6-lttng/arch/sh/include/asm/timex.h
===================================================================
--- linux-2.6-lttng.orig/arch/sh/include/asm/timex.h	2008-10-16 18:53:59.000000000 -0400
+++ linux-2.6-lttng/arch/sh/include/asm/timex.h	2008-10-16 18:54:51.000000000 -0400
@@ -6,12 +6,18 @@
 #ifndef __ASM_SH_TIMEX_H
 #define __ASM_SH_TIMEX_H
 
-#define CLOCK_TICK_RATE		(CONFIG_SH_PCLK_FREQ / 4) /* Underlying HZ */
+#include <linux/io.h>
+#include <asm/cpu/timer.h>
+
+#define CLOCK_TICK_RATE		(HZ * 100000UL)
 
 typedef unsigned long long cycles_t;
 
 static __inline__ cycles_t get_cycles (void)
 {
+#ifdef CONFIG_LTT
+	return 0xffffffff - ctrl_inl(TMU1_TCNT);
+#endif
 	return 0;
 }
 
Index: linux-2.6-lttng/arch/sh/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/sh/Kconfig	2008-10-16 18:53:59.000000000 -0400
+++ linux-2.6-lttng/arch/sh/Kconfig	2008-10-16 18:54:51.000000000 -0400
@@ -11,6 +11,8 @@ config SUPERH
 	select HAVE_CLK
 	select HAVE_IDE
 	select HAVE_OPROFILE
+	select HAVE_LTT_CLOCK
+	select HAVE_LTT_SYNTHETIC_TSC
 	select HAVE_GENERIC_DMA_COHERENT
 	help
 	  The SuperH is a RISC processor targeted for use in embedded systems

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* [RFC patch 14/15] LTTng - TSC synchronicity test
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (12 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 13/15] LTTng timestamp sh Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-16 23:27 ` [RFC patch 15/15] LTTng timestamp x86 Mathieu Desnoyers
  2008-10-17  7:59 ` [RFC patch 00/15] Tracer Timestamping Peter Zijlstra
  15 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ingo Molnar, Jan Kiszka

[-- Attachment #1: lttng-test-tsc.patch --]
[-- Type: text/plain, Size: 6474 bytes --]

Test TSC synchronization across CPUs. Architecture independent, and can
therefore be used on various architectures. Aims at testing TSC
synchronization on a running system (not only at early boot), with minimal
impact on interrupt latency.

I wrote this code before x86 tsc_sync.c existed and, given that it worked well
for my needs, I never switched to tsc_sync.c. Although it has the same goal, it
does it a bit differently :

tsc_sync reads the cycle counters on two CPUs in a loop to see whether one
goes backward compared to the other. The LTTng code synchronizes both cores
with a counter used as a memory barrier and then reads the two TSCs at a
delta equal to the cache line exchange. Instruction and data caches are primed.
The test is repeated in loops to ensure we deal with MCEs and NMIs which could
skew the results.

The problem I see with tsc_sync.c is that if one of the two CPUs is delayed by
an interrupt handler (for way too long) while the other CPU is doing its
check_tsc_warp() execution, and if the CPU with the lowest TSC values runs
first, it will fail to detect unsynchronized CPUs.

The LTTng TSC sync test code does not have this problem.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Ingo Molnar <mingo@redhat.com>
CC: Jan Kiszka <jan.kiszka@siemens.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
---
 ltt/Kconfig        |    3 +
 ltt/Makefile       |    1 
 ltt/ltt-test-tsc.c |  132 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 136 insertions(+)

Index: linux-2.6-lttng/ltt/ltt-test-tsc.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/ltt/ltt-test-tsc.c	2008-10-16 18:53:07.000000000 -0400
@@ -0,0 +1,132 @@
+/*
+ * ltt-test-tsc.c
+ *
+ * Test TSC synchronization
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+#include <linux/module.h>
+#include <linux/timer.h>
+#include <linux/timex.h>
+#include <linux/jiffies.h>
+#include <linux/cpu.h>
+#include <linux/kthread.h>
+#include <linux/mutex.h>
+
+#define MAX_CYCLES_DELTA 1000ULL
+
+static DEFINE_PER_CPU(cycles_t, tsc_count);
+static DEFINE_MUTEX(tscsync_mutex);
+
+static DEFINE_PER_CPU(int, wait_sync);
+static DEFINE_PER_CPU(int, wait_end_sync);
+
+int ltt_tsc_is_sync = 1;
+EXPORT_SYMBOL(ltt_tsc_is_sync);
+
+cycles_t ltt_last_tsc;
+EXPORT_SYMBOL(ltt_last_tsc);
+
+/*
+ * Mark it noinline so we make sure it is not unrolled.
+ * Wait until value is reached.
+ */
+static noinline void tsc_barrier(long wait_cpu, int value)
+{
+	sync_core();
+	per_cpu(wait_sync, smp_processor_id())--;
+	do {
+		smp_mb();
+	} while (unlikely(per_cpu(wait_sync, wait_cpu) > value));
+	rdtsc_barrier();
+	__get_cpu_var(tsc_count) = get_cycles();
+	rdtsc_barrier();
+}
+
+/*
+ * Worker thread called on each CPU.
+ * First wait with interrupts enabled, then wait with interrupt disabled,
+ * for precision. We are already bound to one CPU.
+ */
+static void test_sync(void *arg)
+{
+	long wait_cpu = (long)arg;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/* Make sure the instructions are in I-CACHE */
+	tsc_barrier(wait_cpu, 1);
+	tsc_barrier(wait_cpu, 0);
+	per_cpu(wait_end_sync, smp_processor_id())--;
+	do {
+		smp_mb();
+	} while (unlikely(per_cpu(wait_end_sync, wait_cpu) > 1));
+	per_cpu(wait_end_sync, smp_processor_id())--;
+	local_irq_restore(flags);
+}
+
+/*
+ * Do loops (making sure no unexpected event changes the timing), keep the
+ * best one. The result of each loop is the highest tsc delta between the
+ * master CPU and the slaves.
+ */
+static int test_tsc_synchronization(void)
+{
+	long cpu, master;
+	cycles_t max_diff = 0, diff, best_loop, worse_loop = 0;
+	int i;
+
+	mutex_lock(&tscsync_mutex);
+	preempt_disable();
+	master = smp_processor_id();
+	for_each_online_cpu(cpu) {
+		if (master == cpu)
+			continue;
+		best_loop = ULLONG_MAX;
+		for (i = 0; i < 10; i++) {
+			/*
+			 * Each CPU (master and slave) must decrement the
+			 * wait_sync value twice (once for priming the cache).
+			 */
+			per_cpu(wait_sync, master) = 2;
+			per_cpu(wait_sync, cpu) = 2;
+			per_cpu(wait_end_sync, master) = 2;
+			per_cpu(wait_end_sync, cpu) = 2;
+			smp_call_function_single(cpu, test_sync,
+						(void *)master, 0);
+			test_sync((void *)cpu);
+			/*
+			 * Wait until slave is done so that we don't overwrite
+			 * wait_end_sync prematurely.
+			 */
+			while (unlikely(per_cpu(wait_end_sync, cpu) > 0))
+				cpu_relax();
+
+			diff = abs(per_cpu(tsc_count, cpu)
+				- per_cpu(tsc_count, master));
+			best_loop = min(best_loop, diff);
+			worse_loop = max(worse_loop, diff);
+		}
+		max_diff = max(best_loop, max_diff);
+	}
+	preempt_enable();
+	if (max_diff >= MAX_CYCLES_DELTA) {
+		printk(KERN_WARNING
+			"LTTng : Your timestamp counter is not reliable.\n"
+			"See LTTng documentation to find the "
+			"appropriate solution for your architecture.\n");
+		printk(KERN_WARNING "TSC unsynchronized : %llu cycles delta is over "
+			"threshold %llu\n", max_diff, MAX_CYCLES_DELTA);
+	}
+	mutex_unlock(&tscsync_mutex);
+	return max_diff < MAX_CYCLES_DELTA;
+}
+EXPORT_SYMBOL_GPL(test_tsc_synchronization);
+
+static int __init tsc_test_init(void)
+{
+	ltt_tsc_is_sync = test_tsc_synchronization();
+	return 0;
+}
+
+__initcall(tsc_test_init);
Index: linux-2.6-lttng/ltt/Makefile
===================================================================
--- linux-2.6-lttng.orig/ltt/Makefile	2008-10-16 18:50:32.000000000 -0400
+++ linux-2.6-lttng/ltt/Makefile	2008-10-16 18:52:39.000000000 -0400
@@ -1 +1,2 @@
 obj-$(CONFIG_HAVE_LTT_SYNTHETIC_TSC)	+= ltt-timestamp.o
+obj-$(CONFIG_HAVE_LTT_UNSTABLE_TSC)	+= ltt-test-tsc.o
Index: linux-2.6-lttng/ltt/Kconfig
===================================================================
--- linux-2.6-lttng.orig/ltt/Kconfig	2008-10-16 18:50:32.000000000 -0400
+++ linux-2.6-lttng/ltt/Kconfig	2008-10-16 18:52:39.000000000 -0400
@@ -6,6 +6,9 @@ config LTT_TIMESTAMP
 	help
 	  Allow fine-grained timestamps to be taken from tracing applications.
 
+config HAVE_LTT_UNSTABLE_TSC
+	def_bool n
+
 config HAVE_LTT_CLOCK
 	def_bool n
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC patch 15/15] LTTng timestamp x86
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (13 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 14/15] LTTng - TSC synchronicity test Mathieu Desnoyers
@ 2008-10-16 23:27 ` Mathieu Desnoyers
  2008-10-17  0:08   ` Linus Torvalds
  2008-10-17  7:59 ` [RFC patch 00/15] Tracer Timestamping Peter Zijlstra
  15 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-16 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Mathieu Desnoyers, Ingo Molnar, H. Peter Anvin

[-- Attachment #1: lttng-timestamp-x86.patch --]
[-- Type: text/plain, Size: 4543 bytes --]

X86 LTTng timestamping. Depends on the ltt-test-tsc module to detect whether
timestamp counters are synchronized on the machine.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/Kconfig      |    2 
 include/asm-x86/ltt.h |  123 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+)

Index: linux-2.6-lttng/include/asm-x86/ltt.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-x86/ltt.h	2008-10-16 18:53:27.000000000 -0400
@@ -0,0 +1,123 @@
+#ifndef _ASM_X86_LTT_H
+#define _ASM_X86_LTT_H
+/*
+ * linux/include/asm-x86/ltt.h
+ *
+ * Copyright (C) 2005,2006 - Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
+ *
+ * x86 time and TSC definitions for ltt
+ */
+
+#include <linux/timex.h>
+#include <asm/system.h>
+#include <asm/processor.h>
+#include <asm/atomic.h>
+
+/* Minimum duration of a probe, in cycles */
+#define LTT_MIN_PROBE_DURATION 200
+
+#ifdef CONFIG_HAVE_LTT_SYNTHETIC_TSC
+/* Only for testing. Never needed on x86. */
+u64 ltt_read_synthetic_tsc(void);
+#endif
+
+#ifdef CONFIG_HAVE_LTT_UNSTABLE_TSC
+extern cycles_t ltt_last_tsc;
+extern int ltt_tsc_is_sync;
+
+/*
+ * Support for architectures with non-sync TSCs.
+ * When the local TSC is discovered to lag behind the highest TSC counter, we
+ * increment the TSC count by an amount that should ideally be lower than the
+ * execution time of this routine, in cycles : this is the granularity we look
+ * for, since we must be able to order the events.
+ */
+static inline cycles_t ltt_async_tsc_read(void)
+{
+	cycles_t new_tsc;
+	cycles_t last_tsc;
+
+	rdtsc_barrier();
+	new_tsc = get_cycles();
+	rdtsc_barrier();
+	do {
+		last_tsc = ltt_last_tsc;
+		if (new_tsc < last_tsc)
+			new_tsc = last_tsc + LTT_MIN_PROBE_DURATION;
+		/*
+		 * If cmpxchg fails with a value higher than the new_tsc, don't
+		 * retry : the value has been incremented and the events
+		 * happened almost at the same time.
+		 * We must retry if cmpxchg fails with a lower value :
+		 * it means that we are the CPU with highest frequency and
+		 * therefore MUST update the value.
+		 */
+	} while (cmpxchg64(&ltt_last_tsc, last_tsc, new_tsc) < new_tsc);
+	return new_tsc;
+}
+
+static inline u32 ltt_get_timestamp32(void)
+{
+	u32 cycles;
+
+	if (ltt_tsc_is_sync) {
+		rdtsc_barrier();
+		cycles = (u32)get_cycles(); /* only need the 32 LSB */
+		rdtsc_barrier();
+	} else
+		cycles = (u32)ltt_async_tsc_read();
+	return cycles;
+}
+
+static inline u64 ltt_get_timestamp64(void)
+{
+	u64 cycles;
+
+	if (ltt_tsc_is_sync) {
+		rdtsc_barrier();
+		cycles = get_cycles();
+		rdtsc_barrier();
+	} else
+		cycles = ltt_async_tsc_read();
+	return cycles;
+}
+#else /* CONFIG_HAVE_LTT_UNSTABLE_TSC */
+static inline u32 ltt_get_timestamp32(void)
+{
+	u32 cycles;
+
+	rdtsc_barrier();
+	cycles = (u32)get_cycles(); /* only need the 32 LSB */
+	rdtsc_barrier();
+	return cycles;
+}
+
+static inline u64 ltt_get_timestamp64(void)
+{
+	u64 cycles;
+
+	rdtsc_barrier();
+	cycles = get_cycles();
+	rdtsc_barrier();
+	return cycles;
+}
+#endif /* CONFIG_HAVE_LTT_UNSTABLE_TSC */
+
+/*
+ * Periodic IPI to have an upper bound on TSC inaccuracy.
+ * TODO: should implement this in ltt-test-tsc.ko.
+ */
+static inline void ltt_add_timestamp(unsigned long ticks)
+{ }
+
+static inline unsigned int ltt_frequency(void)
+{
+	return cpu_khz;
+}
+
+static inline u32 ltt_freq_scale(void)
+{
+	return 1000;
+}
+
+#endif /* _ASM_X86_LTT_H */
Index: linux-2.6-lttng/arch/x86/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig	2008-10-16 18:50:29.000000000 -0400
+++ linux-2.6-lttng/arch/x86/Kconfig	2008-10-16 18:53:09.000000000 -0400
@@ -26,6 +26,8 @@ config X86
 	select HAVE_KPROBES
 	select ARCH_WANT_OPTIONAL_GPIOLIB
 	select HAVE_KRETPROBES
+	select HAVE_LTT_CLOCK
+	select HAVE_LTT_UNSTABLE_TSC
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_FTRACE
 	select HAVE_KVM if ((X86_32 && !X86_VOYAGER && !X86_VISWS && !X86_NUMAQ) || X86_64)

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-16 23:27 ` [RFC patch 15/15] LTTng timestamp x86 Mathieu Desnoyers
@ 2008-10-17  0:08   ` Linus Torvalds
  2008-10-17  0:12     ` Linus Torvalds
  2008-10-17  1:28     ` Mathieu Desnoyers
  0 siblings, 2 replies; 66+ messages in thread
From: Linus Torvalds @ 2008-10-17  0:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Ingo Molnar, linux-kernel, linux-arch,
	Steven Rostedt, Peter Zijlstra, Thomas Gleixner, David Miller,
	Ingo Molnar, H. Peter Anvin



On Thu, 16 Oct 2008, Mathieu Desnoyers wrote:
>
> +static inline cycles_t ltt_async_tsc_read(void)

(a) this shouldn't be inline

> +	rdtsc_barrier();
> +	new_tsc = get_cycles();
> +	rdtsc_barrier();
> +	do {
> +		last_tsc = ltt_last_tsc;
> +		if (new_tsc < last_tsc)
> +			new_tsc = last_tsc + LTT_MIN_PROBE_DURATION;
> +		/*
> +		 * If cmpxchg fails with a value higher than the new_tsc, don't
> +		 * retry : the value has been incremented and the events
> +		 * happened almost at the same time.
> +		 * We must retry if cmpxchg fails with a lower value :
> +		 * it means that we are the CPU with highest frequency and
> +		 * therefore MUST update the value.
> +		 */
> +	} while (cmpxchg64(&ltt_last_tsc, last_tsc, new_tsc) < new_tsc);

(b) This is really quite expensive.

Why do things like this? Make the timestamps be per-cpu. If you do things 
like the above, then just getting the timestamp means that every single 
trace event will cause a cacheline bounce, and if you do that, you might 
as well just not have per-cpu tracing at all.

It really boils down to two cases:

 - you do per-CPU traces

   If so, you need to ONLY EVER touch per-cpu data when tracing, and the 
   above is a fundamental BUG. Dirtying shared cachelines makes the whole 
   per-cpu thing pointless.

 - you do global traces

   Sure, then the above works, but why bother? You'll get the ordering 
   from the global trace, you might as well do time stamps with local 
   counts.

So in neither case does it make any sense to try to do that global 
ltt_last_tsc.

Perhaps more importantly - if the TSC really are out of whack, that just 
means that now all your timestamps are worthless, because the value you 
calculate ends up having NOTHING to do with the timestamp. So you cannot 
even use it to see how long something took, because it may be that you're 
running on the CPU that runs behind, and all you ever see is the value of 
LTT_MIN_PROBE_DURATION.

			Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17  0:08   ` Linus Torvalds
@ 2008-10-17  0:12     ` Linus Torvalds
  2008-10-17  1:28     ` Mathieu Desnoyers
  1 sibling, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2008-10-17  0:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Ingo Molnar, linux-kernel, linux-arch,
	Steven Rostedt, Peter Zijlstra, Thomas Gleixner, David Miller,
	Ingo Molnar, H. Peter Anvin



On Thu, 16 Oct 2008, Linus Torvalds wrote:
> 
> Perhaps more importantly - if the TSC really are out of whack, that just 
> means that now all your timestamps are worthless, because the value you 
> calculate ends up having NOTHING to do with the timestamp. So you cannot 
> even use it to see how long something took, because it may be that you're 
> running on the CPU that runs behind, and all you ever see is the value of 
> LTT_MIN_PROBE_DURATION.

If it isn't clear: the alternative is to just always use local timestamps.

At least that way the timestamps mean _something_. You can get the 
difference between two events when they happen on the same CPU, and it is 
about as meaningful as it can be.

Don't even _try_ to make a global clock.

Yes, to be able to compare across CPU's you'd need to have extra 
synchronization information (eg offset and frequency things), but quite 
frankly, the "global TSC" thing is already worse than even a totally 
non-synchronized TSC for the above reasons. 
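
(Purely as an illustration - a sketch of the kind of per-cpu correction I
mean, with invented "offset"/"mult" calibration fields that some separate
calibration pass would have to fill in:

	#include <linux/percpu.h>

	struct tsc_calib {
		u64 offset;	/* local TSC -> global time base offset */
		u32 mult;	/* frequency correction, fixed point << 10 */
	};
	static DEFINE_PER_CPU(struct tsc_calib, tsc_calib);

	/* caller must have preemption disabled */
	static u64 local_to_global(u64 tsc)
	{
		struct tsc_calib *c = &__get_cpu_var(tsc_calib);

		/* the fast path touches only local, read-mostly data */
		return ((tsc * c->mult) >> 10) + c->offset;
	}

The expensive cross-cpu work happens once, at calibration time, not on
every single event.)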

			Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES
  2008-10-16 23:27 ` [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES Mathieu Desnoyers
@ 2008-10-17  0:26   ` Paul Mackerras
  2008-10-17  0:43     ` [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES (update) Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Paul Mackerras @ 2008-10-17  0:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner,
	David Miller, benh, Ingo Molnar

Mathieu Desnoyers writes:

> This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
> get_cycles_rate() are implemented.

[snip]

> +static inline cycles_t get_cycles_rate(void)
> +{
> +	return CLOCK_TICK_RATE;
> +}

CLOCK_TICK_RATE is certainly wrong.  You want ppc_tb_freq (declared in
asm/time.h).  Or tb_ticks_per_sec, since we seem to have two variables
for exactly the same thing, for some reason. :)

Paul.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES (update)
  2008-10-17  0:26   ` Paul Mackerras
@ 2008-10-17  0:43     ` Mathieu Desnoyers
  2008-10-17  0:54       ` Paul Mackerras
  2008-10-17  1:42       ` David Miller
  0 siblings, 2 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-17  0:43 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner,
	David Miller, benh, Ingo Molnar

* Paul Mackerras (paulus@samba.org) wrote:
> Mathieu Desnoyers writes:
> 
> > This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
> > get_cycles_rate() are implemented.
> 
> [snip]
> 
> > +static inline cycles_t get_cycles_rate(void)
> > +{
> > +	return CLOCK_TICK_RATE;
> > +}
> 
> CLOCK_TICK_RATE is certainly wrong.  You want ppc_tb_freq (declared in
> asm/time.h).  Or tb_ticks_per_sec, since we seem to have two variables
> for exactly the same thing, for some reason. :)
> 
> Paul.

Ok, this should work better. Thanks !

Do you know if mftb implies an instruction synchronization (isync) ? I
think that if it does not, the new get_cycles_barrier() might have to be
used at some locations in the kernel code if more precise timestamp
order is required.

Mathieu


get_cycles() : powerpc64 HAVE_GET_CYCLES

This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
get_cycles_rate() are implemented.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: benh@kernel.crashing.org
CC: paulus@samba.org
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
 arch/powerpc/Kconfig             |    1 +
 arch/powerpc/include/asm/timex.h |   14 ++++++++++++++
 2 files changed, 15 insertions(+)

Index: linux-2.6-lttng/arch/powerpc/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/Kconfig	2008-10-16 20:31:33.000000000 -0400
+++ linux-2.6-lttng/arch/powerpc/Kconfig	2008-10-16 20:31:36.000000000 -0400
@@ -122,6 +122,7 @@ config PPC
 	select HAVE_DMA_ATTRS if PPC64
 	select USE_GENERIC_SMP_HELPERS if SMP
 	select HAVE_OPROFILE
+	select HAVE_GET_CYCLES if PPC64
 
 config EARLY_PRINTK
 	bool
Index: linux-2.6-lttng/arch/powerpc/include/asm/timex.h
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/include/asm/timex.h	2008-10-16 20:31:33.000000000 -0400
+++ linux-2.6-lttng/arch/powerpc/include/asm/timex.h	2008-10-16 20:32:02.000000000 -0400
@@ -9,6 +9,7 @@
 
 #include <asm/cputable.h>
 #include <asm/reg.h>
+#include <asm/time.h>
 
 #define CLOCK_TICK_RATE	1024000 /* Underlying HZ */
 
@@ -46,5 +47,18 @@ static inline cycles_t get_cycles(void)
 #endif
 }
 
+static inline cycles_t get_cycles_rate(void)
+{
+	return tb_ticks_per_sec;
+}
+
+/*
+ * To check : assuming mftb requires isync to synchronize instruction execution.
+ */
+static inline void get_cycles_barrier(void)
+{
+	isync();
+}
+
 #endif	/* __KERNEL__ */
 #endif	/* _ASM_POWERPC_TIMEX_H */

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES (update)
  2008-10-17  0:43     ` [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES (update) Mathieu Desnoyers
@ 2008-10-17  0:54       ` Paul Mackerras
  2008-10-17  1:42       ` David Miller
  1 sibling, 0 replies; 66+ messages in thread
From: Paul Mackerras @ 2008-10-17  0:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner,
	David Miller, benh, Ingo Molnar

Mathieu Desnoyers writes:

> Do you know if mftb implies an instruction synchronization (isync) ? I

It doesn't.

> think that if it does not, the new get_cycles_barrier() might have to be
> used at some locations in the kernel code if more precise timestamp
> order is required.

OK.  I'll let you figure out where. :)

Paul.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17  0:08   ` Linus Torvalds
  2008-10-17  0:12     ` Linus Torvalds
@ 2008-10-17  1:28     ` Mathieu Desnoyers
  2008-10-17  2:19       ` Luck, Tony
  1 sibling, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-17  1:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Ingo Molnar, linux-kernel, linux-arch,
	Steven Rostedt, Peter Zijlstra, Thomas Gleixner, David Miller,
	Ingo Molnar, H. Peter Anvin

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Thu, 16 Oct 2008, Mathieu Desnoyers wrote:
> >
> > +static inline cycles_t ltt_async_tsc_read(void)
> 
> (a) this shouldn't be inline
> 

Ok, will fix. I will put this in a new arch/x86/kernel/ltt.c.

> > +	rdtsc_barrier();
> > +	new_tsc = get_cycles();
> > +	rdtsc_barrier();
> > +	do {
> > +		last_tsc = ltt_last_tsc;
> > +		if (new_tsc < last_tsc)
> > +			new_tsc = last_tsc + LTT_MIN_PROBE_DURATION;
> > +		/*
> > +		 * If cmpxchg fails with a value higher than the new_tsc, don't
> > +		 * retry : the value has been incremented and the events
> > +		 * happened almost at the same time.
> > +		 * We must retry if cmpxchg fails with a lower value :
> > +		 * it means that we are the CPU with highest frequency and
> > +		 * therefore MUST update the value.
> > +		 */
> > +	} while (cmpxchg64(&ltt_last_tsc, last_tsc, new_tsc) < new_tsc);
> 
> (b) This is really quite expensive.
> 

Ok, let's try to figure out what the use-cases are, because we are
really facing an architectural mess (thanks to Intel and AMD). I don't
think there is a single perfect solution for all, but I'll try to
explain why I accept the cache-line bouncing behavior when
unsynchronized TSCs are detected by LTTng.

First, the most important thing in LTTng is to provide the event flow
in the correct order across CPUs. Secondary to that, getting the precise
execution time is a nice-to-have when the architecture supports it, but
the time granularity itself is not crucially important, as long as we
have a way to determine which of two events close in time happens first.
The principal use-case where I have seen such a tracer in action is when
one has to understand why one or more processes are slower than
expected. The root cause can easily sit on another CPU, be a locking
delay in a particular race condition, or just a process waiting for
other processes waiting for a timeout.


> Why do things like this? Make the timestamps be per-cpu. If you do things 
> like the above, then just getting the timestamp means that every single 
> trace event will cause a cacheline bounce, and if you do that, you might 
> as well just not have per-cpu tracing at all.
> 

This cache-line bouncing global clock is a best-effort to provide
correct event order in the trace on architectures with unsync tsc. It's
actually better than a global tracing buffer because it limits the
number of cache line transfers required to one per event. Global tracing
buffers may require transferring many cache lines across CPUs when events
span cache line boundaries or are larger than a cache line.

> It really boils down to two cases:
> 
>  - you do per-CPU traces
> 
>    If so, you need to ONLY EVER touch per-cpu data when tracing, and the 
>    above is a fundamental BUG. Dirtying shared cachelines makes the whole 
>    per-cpu thing pointless.

Sharing only a single cache-line is not completely pointless, as
explained above, but yes, there is a big performance hit involved.

I agree that we should maybe add a degree of flexibility in this time
infrastructure to let users select the type of time source they want :

- Global clock, potentially slow on unsynchronized CPUs.
- Local clock, fast, possibly unsynchronized across CPUs.

> 
>  - you do global traces
> 
>    Sure, then the above works, but why bother? You'll get the ordering 
>    from the global trace, you might as well do time stamps with local 
>    counts.
> 

I simply don't like the global traces because of the extra cache-line
bouncing experienced by events written on multiple cache-lines.

> So in neither case does it make any sense to try to do that global 
> ltt_last_tsc.
> 
> Perhaps more importantly - if the TSC really are out of whack, that just 
> means that now all your timestamps are worthless, because the value you 
> calculate ends up having NOTHING to do with the timestamp. So you cannot 
> even use it to see how long something took, because it may be that you're 
> running on the CPU that runs behind, and all you ever see is the value of 
> LTT_MIN_PROBE_DURATION.
> 

I thought about this one. There is actually a FIXME in the code which
plans to add an IPI, sent at each timer interrupt, to do a "read tsc" on
each CPU. This would put an HZ upper bound on the timestamp imprecision,
giving a trace with events ordered across CPUs and execution times
accurate to HZ precision.
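
Roughly what I have in mind (untested sketch, the per-cpu variable and
function names are invented) :

	#include <linux/percpu.h>
	#include <linux/smp.h>
	#include <linux/timex.h>

	/* per-cpu reference point, re-read periodically */
	static DEFINE_PER_CPU(cycles_t, ltt_tsc_ref);

	static void ltt_tsc_resync_ipi(void *info)
	{
		__get_cpu_var(ltt_tsc_ref) = get_cycles();
	}

	/*
	 * Runs periodically from process context (e.g. a workqueue) and
	 * bounds the cross-CPU timestamp error to roughly one period.
	 */
	static void ltt_tsc_resync(void)
	{
		on_each_cpu(ltt_tsc_resync_ipi, NULL, 1);
	}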

So given that global buffers are less efficient than just synchronizing
a single cache-line and that some people are willing to pay the price to
get events synchronized across CPUs and others are not, what do you
think of leaving the choice to the user about globally/locally
synchronized timestamps ?

Thanks for the feedback,

Mathieu

> 			Linus

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES (update)
  2008-10-17  0:43     ` [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES (update) Mathieu Desnoyers
  2008-10-17  0:54       ` Paul Mackerras
@ 2008-10-17  1:42       ` David Miller
  2008-10-17  2:08         ` Mathieu Desnoyers
  1 sibling, 1 reply; 66+ messages in thread
From: David Miller @ 2008-10-17  1:42 UTC (permalink / raw)
  To: mathieu.desnoyers
  Cc: paulus, torvalds, akpm, mingo, linux-kernel, linux-arch, rostedt,
	a.p.zijlstra, tglx, benh, mingo

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date: Thu, 16 Oct 2008 20:43:28 -0400

> * Paul Mackerras (paulus@samba.org) wrote:
> > Mathieu Desnoyers writes:
> > 
> > > This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
> > > get_cycles_rate() are implemented.
> > 
> > [snip]
> > 
> > > +static inline cycles_t get_cycles_rate(void)
> > > +{
> > > +	return CLOCK_TICK_RATE;
> > > +}
> > 
> > CLOCK_TICK_RATE is certainly wrong.  You want ppc_tb_freq (declared in
> > asm/time.h).  Or tb_ticks_per_sec, since we seem to have two variables
> > for exactly the same thing, for some reason. :)
> > 
> > Paul.
> 
> Ok, this should work better. Thanks !
> 
> Do you know if mftb implies an instruction synchronization (isync) ? I
> think that if it does not, the new get_cycles_barrier() might have to be
> used at some locations in the kernel code if more precise timestamp
> order is required.

You'll need to make a similar fix on sparc64.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES (update)
  2008-10-17  1:42       ` David Miller
@ 2008-10-17  2:08         ` Mathieu Desnoyers
  2008-10-17  2:33           ` David Miller
  0 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-17  2:08 UTC (permalink / raw)
  To: David Miller
  Cc: paulus, torvalds, akpm, mingo, linux-kernel, linux-arch, rostedt,
	a.p.zijlstra, tglx, benh, mingo

* David Miller (davem@davemloft.net) wrote:
> From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Date: Thu, 16 Oct 2008 20:43:28 -0400
> 
> > * Paul Mackerras (paulus@samba.org) wrote:
> > > Mathieu Desnoyers writes:
> > > 
> > > > This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
> > > > get_cycles_rate() are implemented.
> > > 
> > > [snip]
> > > 
> > > > +static inline cycles_t get_cycles_rate(void)
> > > > +{
> > > > +	return CLOCK_TICK_RATE;
> > > > +}
> > > 
> > > CLOCK_TICK_RATE is certainly wrong.  You want ppc_tb_freq (declared in
> > > asm/time.h).  Or tb_ticks_per_sec, since we seem to have two variables
> > > for exactly the same thing, for some reason. :)
> > > 
> > > Paul.
> > 
> > Ok, this should work better. Thanks !
> > 
> > Do you know if mftb implies an instruction synchronization (isync) ? I
> > think that if it does not, the new get_cycles_barrier() might have to be
> > used at some locations in the kernel code if more precise timestamp
> > order is required.
> 
> You'll need to make a similar fix on sparc64.

I guess you are talking about using sparc64_get_clock_tick rather than
CLOCK_TICK_RATE ? I assume sparc64_get_clock_tick() done on any CPU will
return the same rate ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17  1:28     ` Mathieu Desnoyers
@ 2008-10-17  2:19       ` Luck, Tony
  2008-10-17 17:25         ` Steven Rostedt
  2008-10-17 19:36         ` Christoph Lameter
  0 siblings, 2 replies; 66+ messages in thread
From: Luck, Tony @ 2008-10-17  2:19 UTC (permalink / raw)
  To: Mathieu Desnoyers, Linus Torvalds
  Cc: Andrew Morton, Ingo Molnar, linux-kernel, linux-arch,
	Steven Rostedt, Peter Zijlstra, Thomas Gleixner, David Miller,
	Ingo Molnar, H. Peter Anvin

> This cache-line bouncing global clock is a best-effort to provide
> correct event order in the trace on architectures with unsync tsc. It's
> actually better than a global tracing buffer because it limits the
> number of cache line transfers required to one per event.

Even one line bouncing between cpus can be a performance disaster.
You'll probably hit a serious wall somewhere between 8 and 16
cpus (ia64 has code that looks a lot like this in the gettimeofday()
path because it does not synchronize cpu cycle counters ... some
applications that are overly fond of timestamping internal
events using gettimeofday() end up spending significant time
doing so on large systems ... even with only a few thousand
calls per second).

-Tony

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES (update)
  2008-10-17  2:08         ` Mathieu Desnoyers
@ 2008-10-17  2:33           ` David Miller
  0 siblings, 0 replies; 66+ messages in thread
From: David Miller @ 2008-10-17  2:33 UTC (permalink / raw)
  To: mathieu.desnoyers
  Cc: paulus, torvalds, akpm, mingo, linux-kernel, linux-arch, rostedt,
	a.p.zijlstra, tglx, benh, mingo

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date: Thu, 16 Oct 2008 22:08:13 -0400

> * David Miller (davem@davemloft.net) wrote:
> > Date: Thu, 16 Oct 2008 20:43:28 -0400
> > 
> > You'll need to make a similar fix on sparc64.
> 
> I guess you are talking about using sparc64_get_clock_tick rather than
> CLOCK_TICK_RATE ? I assume sparc64_get_clock_tick() done on any CPU will
> return the same rate ?

You'll need to use tb_ticks_per_usec or similar.

The ->get_tick() thing you are using uses a TICK source which is
synchronized across the entire system and advances at a non-changing
rate.

sparc64_get_clock_tick() returns something different.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 03/15] get_cycles() : sparc64 HAVE_GET_CYCLES (update)
  2008-10-16 23:27 ` [RFC patch 03/15] get_cycles() : sparc64 HAVE_GET_CYCLES Mathieu Desnoyers
@ 2008-10-17  2:48   ` Mathieu Desnoyers
  2008-10-17  2:57     ` David Miller
  0 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-17  2:48 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner
  Cc: David Miller, Ingo Molnar

This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
get_cycles_rate() are implemented.

Changelog :
- Use tb_ticks_per_usec * 1000000 in get_cycles_rate().

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: David Miller <davem@davemloft.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: linux-arch@vger.kernel.org
---
 arch/sparc/include/asm/timex_64.h |   19 ++++++++++++++++++-
 arch/sparc64/Kconfig              |    1 +
 arch/sparc64/kernel/time.c        |    3 ++-
 3 files changed, 21 insertions(+), 2 deletions(-)

Index: linux-2.6-lttng/arch/sparc64/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/sparc64/Kconfig	2008-10-16 22:29:43.000000000 -0400
+++ linux-2.6-lttng/arch/sparc64/Kconfig	2008-10-16 22:29:53.000000000 -0400
@@ -14,6 +14,7 @@ config SPARC64
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_FTRACE
 	select HAVE_IDE
+	select HAVE_GET_CYCLES
 	select HAVE_LMB
 	select HAVE_ARCH_KGDB
 	select USE_GENERIC_SMP_HELPERS if SMP
Index: linux-2.6-lttng/arch/sparc/include/asm/timex_64.h
===================================================================
--- linux-2.6-lttng.orig/arch/sparc/include/asm/timex_64.h	2008-10-16 22:29:43.000000000 -0400
+++ linux-2.6-lttng/arch/sparc/include/asm/timex_64.h	2008-10-16 22:33:48.000000000 -0400
@@ -12,7 +12,24 @@
 
 /* Getting on the cycle counter on sparc64. */
 typedef unsigned long cycles_t;
-#define get_cycles()	tick_ops->get_tick()
+
+static inline cycles_t get_cycles(void)
+{
+	return tick_ops->get_tick();
+}
+
+/* get_cycles instruction is synchronized on sparc64 */
+static inline void get_cycles_barrier(void)
+{
+	return;
+}
+
+extern unsigned long tb_ticks_per_usec;
+
+static inline cycles_t get_cycles_rate(void)
+{
+	return tb_ticks_per_usec * 1000000UL;
+}
 
 #define ARCH_HAS_READ_CURRENT_TIMER
 
Index: linux-2.6-lttng/arch/sparc64/kernel/time.c
===================================================================
--- linux-2.6-lttng.orig/arch/sparc64/kernel/time.c	2008-10-16 22:30:36.000000000 -0400
+++ linux-2.6-lttng/arch/sparc64/kernel/time.c	2008-10-16 22:30:58.000000000 -0400
@@ -1008,7 +1008,8 @@ static void __init setup_clockevent_mult
 	sparc64_clockevent.mult = mult;
 }
 
-static unsigned long tb_ticks_per_usec __read_mostly;
+unsigned long tb_ticks_per_usec __read_mostly;
+EXPORT_SYMBOL_GPL(tb_ticks_per_usec);
 
 void __delay(unsigned long loops)
 {

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 03/15] get_cycles() : sparc64 HAVE_GET_CYCLES (update)
  2008-10-17  2:48   ` [RFC patch 03/15] get_cycles() : sparc64 HAVE_GET_CYCLES (update) Mathieu Desnoyers
@ 2008-10-17  2:57     ` David Miller
  0 siblings, 0 replies; 66+ messages in thread
From: David Miller @ 2008-10-17  2:57 UTC (permalink / raw)
  To: mathieu.desnoyers
  Cc: torvalds, akpm, mingo, linux-kernel, linux-arch, rostedt,
	a.p.zijlstra, tglx, mingo

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date: Thu, 16 Oct 2008 22:48:24 -0400

> This patch selects HAVE_GET_CYCLES and makes sure get_cycles_barrier() and
> get_cycles_rate() are implemented.
> 
> Changelog :
> - Use tb_ticks_per_usec * 1000000 in get_cycles_rate().
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

Yep, that'll work:

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 00/15] Tracer Timestamping
  2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
                   ` (14 preceding siblings ...)
  2008-10-16 23:27 ` [RFC patch 15/15] LTTng timestamp x86 Mathieu Desnoyers
@ 2008-10-17  7:59 ` Peter Zijlstra
  2008-10-20 20:25   ` Mathieu Desnoyers
  15 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2008-10-17  7:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Thomas Gleixner, David Miller

On Thu, 2008-10-16 at 19:27 -0400, Mathieu Desnoyers wrote:
> Hi,
> 
> Starting with the bottom of my LTTng patchset
> (git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git)
> I post as RFC the timestamping infrastructure I have been using for a while in
> the tracer. It integrates get_cycles() standardization following David Miller's
> comments I did more recently.
> 
> It also deals with 32 -> 64 bits timestamp counter extension with a RCU-style
> algorithm, which is especially useful on MIPS and SuperH architectures.

Have you looked at the existing 32->63 extension code in
include/linux/cnt32_to_63.h and considered unifying it?

> There is also a TSC synchronization test within this patchset to detect
> unsynchronized TSCs. 

We already have such code, no? Does this code replace that one or just
add a second test?

> See comments in this specific patch to figure out the
> difference between the current x86 tsc_sync.c and the one I propose in this
> patch.

Right so you don't unify, that's a missed opportunity, no?

> It also provides an architecture-agnostic fallback in case there is no
> timestamp counter available : basically, it's
> (jiffies << 13) | atomically_incremented_counter (if there are more than 8192
> events per jiffy, time will still be monotonic, but will increment faster than
> the actual system frequency).
> 
> Comments are welcome. Note that this is only the beginning of the patchset. I
> plan to submit the event ID allocation/portable event typing aimed at exporting
> the data to userspace and buffering mechanism as soon as I integrate a core
> version of the LTTV userspace tools to the kernel build tree. Other than that, I
> currently have a tracer which fulfills most of the requirements expressed
> earlier. I just fear that if I release only the kernel part without foolproof
> binary-to-ascii trace decoder within the kernel, people might be a bit reluctant
> to fetch a separate userspace package.

It might be good to drop all the ltt naming and pick more generic names,
esp. as ftrace could use a lot of this infrastructure as well.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 06/15] LTTng build
  2008-10-16 23:27 ` [RFC patch 06/15] LTTng build Mathieu Desnoyers
@ 2008-10-17  8:10   ` KOSAKI Motohiro
  2008-10-17 16:18     ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: KOSAKI Motohiro @ 2008-10-17  8:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Steven Rostedt, Peter Zijlstra,
	Thomas Gleixner, David Miller, Ingo Molnar, William L. Irwin

Hi

just a nit

> Index: linux-2.6-lttng/Makefile
> ===================================================================
> --- linux-2.6-lttng.orig/Makefile	2008-10-16 18:16:33.000000000 -0400
> +++ linux-2.6-lttng/Makefile	2008-10-16 18:16:38.000000000 -0400
> @@ -619,7 +619,7 @@ export mod_strip_cmd
>  
>  
>  ifeq ($(KBUILD_EXTMOD),)
> -core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
> +core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ltt/
>  
>  vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
>  		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \

This patch depends on 07/15; the patch order seems wrong.
It then causes the following build error.


~/linux/linux-2.6.27-lttng% make
make
make[1]: `include/asm-ia64/nr-irqs.h' is up to date.
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  CALL    scripts/checksyscalls.sh
  CHK     include/linux/compile.h
scripts/Makefile.build:41: /mnt/sdb1/kosaki/linux/linux-2.6.27-lttng/ltt/Makefile: No such file or directory
make[1]: *** No rule to make target `/mnt/sdb1/kosaki/linux/linux-2.6.27-lttng/ltt/Makefile'.  Stop.
make: *** [ltt] Error 2





^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 07/15] LTTng timestamp
  2008-10-16 23:27 ` [RFC patch 07/15] LTTng timestamp Mathieu Desnoyers
@ 2008-10-17  8:15   ` KOSAKI Motohiro
  2008-10-17 16:23     ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: KOSAKI Motohiro @ 2008-10-17  8:15 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Steven Rostedt, Peter Zijlstra,
	Thomas Gleixner, David Miller, Ralf Baechle, benh, paulus,
	Ingo Molnar


one more nit.

> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/delay.h>
> +#include <linux/timer.h>
> +#include <linux/workqueue.h>
> +#include <linux/cpu.h>
> +#include <linux/timex.h>
> +#include <linux/bitops.h>
> +#include <linux/ltt.h>

ltt.h is created by 08/15, so the following build error happens.

make[1]: `include/asm-ia64/nr-irqs.h' is up to date.
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  CALL    scripts/checksyscalls.sh
  CHK     include/linux/compile.h
  GZIP    kernel/config_data.gz
  IKCFG   kernel/config_data.h
  CC      kernel/configs.o
  LD      kernel/built-in.o
  CC      ltt/ltt-timestamp.o
ltt/ltt-timestamp.c:23:23: error: linux/ltt.h: No such file or directory
ltt/ltt-timestamp.c: In function 'ltt_update_synthetic_tsc':
ltt/ltt-timestamp.c:75: error: implicit declaration of function 'ltt_get_timestamp32'
ltt/ltt-timestamp.c: In function 'precalc_stsc_interval':
ltt/ltt-timestamp.c:157: error: implicit declaration of function 'ltt_frequency'
ltt/ltt-timestamp.c:157: error: implicit declaration of function 'ltt_freq_scale'
make[1]: *** [ltt/ltt-timestamp.o] Error 1
make: *** [ltt] Error 2


> +#include <linux/smp.h>
> +#include <linux/sched.h> /* FIX for m68k local_irq_enable in on_each_cpu */





^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 06/15] LTTng build
  2008-10-17  8:10   ` KOSAKI Motohiro
@ 2008-10-17 16:18     ` Mathieu Desnoyers
  0 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-17 16:18 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, William L. Irwin

* KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
> Hi
> 
> just nit
> 
> > Index: linux-2.6-lttng/Makefile
> > ===================================================================
> > --- linux-2.6-lttng.orig/Makefile	2008-10-16 18:16:33.000000000 -0400
> > +++ linux-2.6-lttng/Makefile	2008-10-16 18:16:38.000000000 -0400
> > @@ -619,7 +619,7 @@ export mod_strip_cmd
> >  
> >  
> >  ifeq ($(KBUILD_EXTMOD),)
> > -core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
> > +core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ltt/
> >  
> >  vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
> >  		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
> 
> This patch depends on 07/15; the patch order seems wrong.
> It then causes the following build error.
> 

Will fix in the next release, thanks !

Mathieu

> 
> ~/linux/linux-2.6.27-lttng% make
> make
> make[1]: `include/asm-ia64/nr-irqs.h' is up to date.
>   CHK     include/linux/version.h
>   CHK     include/linux/utsrelease.h
>   CALL    scripts/checksyscalls.sh
>   CHK     include/linux/compile.h
> scripts/Makefile.build:41: /mnt/sdb1/kosaki/linux/linux-2.6.27-lttng/ltt/Makefile: No such file or directory
> make[1]: *** No rule to make target `/mnt/sdb1/kosaki/linux/linux-2.6.27-lttng/ltt/Makefile'.  Stop.
> make: *** [ltt] Error 2
> 
> 
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 07/15] LTTng timestamp
  2008-10-17  8:15   ` KOSAKI Motohiro
@ 2008-10-17 16:23     ` Mathieu Desnoyers
  0 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-17 16:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ralf Baechle, benh, paulus, Ingo Molnar

* KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
> 
> more nit.
> 
> > +#include <linux/module.h>
> > +#include <linux/init.h>
> > +#include <linux/delay.h>
> > +#include <linux/timer.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/cpu.h>
> > +#include <linux/timex.h>
> > +#include <linux/bitops.h>
> > +#include <linux/ltt.h>
> 
> ltt.h is created by 08/15, so the following build error happens.
> 
> make[1]: `include/asm-ia64/nr-irqs.h' is up to date.
>   CHK     include/linux/version.h
>   CHK     include/linux/utsrelease.h
>   CALL    scripts/checksyscalls.sh
>   CHK     include/linux/compile.h
>   GZIP    kernel/config_data.gz
>   IKCFG   kernel/config_data.h
>   CC      kernel/configs.o
>   LD      kernel/built-in.o
>   CC      ltt/ltt-timestamp.o
> ltt/ltt-timestamp.c:23:23: error: linux/ltt.h: No such file or directory
> ltt/ltt-timestamp.c: In function 'ltt_update_synthetic_tsc':
> ltt/ltt-timestamp.c:75: error: implicit declaration of function 'ltt_get_timestamp32'
> ltt/ltt-timestamp.c: In function 'precalc_stsc_interval':
> ltt/ltt-timestamp.c:157: error: implicit declaration of function 'ltt_frequency'
> ltt/ltt-timestamp.c:157: error: implicit declaration of function 'ltt_freq_scale'
> make[1]: *** [ltt/ltt-timestamp.o] Error 1
> make: *** [ltt] Error 2
> 

Thanks for pointing this out. Will fix in the next release with the
following patchset order :

get-cycles-kconfig-have-get-cycles.patch
get-cycles-x86-have-get-cycles.patch
get-cycles-sparc64-have-get-cycles.patch
get-cycles-powerpc-have-get-cycles.patch
get-cycles-mips-have-get-cycles.patch

#LTTng timestamping
#
lttng-timestamp-core.patch
lttng-timestamp-generic.patch
lttng-build-instrumentation-menu.patch
lttng-build-instrumentation-menu-timestamp.patch
lttng-mips-export-hpt-frequency.patch
lttng-timestamp-mips.patch
lttng-timestamp-powerpc.patch
lttng-timestamp-sparc64.patch
lttng-timestamp-sh.patch
lttng-test-tsc.patch
lttng-timestamp-x86.patch

Mathieu

> 
> > +#include <linux/smp.h>
> > +#include <linux/sched.h> /* FIX for m68k local_irq_enable in on_each_cpu */
> 
> 
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17  2:19       ` Luck, Tony
@ 2008-10-17 17:25         ` Steven Rostedt
  2008-10-17 18:08           ` Luck, Tony
  2008-10-17 19:36         ` Christoph Lameter
  1 sibling, 1 reply; 66+ messages in thread
From: Steven Rostedt @ 2008-10-17 17:25 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin

On Thu, Oct 16, 2008 at 07:19:48PM -0700, Luck, Tony wrote:
> > This cache-line bouncing global clock is a best-effort to provide
> > correct event order in the trace on architectures with unsync tsc. It's
> > actually better than a global tracing buffer because it limits the
> > number of cache line transfers required to one per event.
> 
> Even one line bouncing between cpus can be a performance disaster.
> You'll probably hit a serious wall somewhere between 8 and 16
> cpus (ia64 has code that looks a lot like this in the gettimeofday()
> path because it does not synchronize cpu cycle counters ... some
> applications that are overly fond of timestamping internal
> events using gettimeofday() end up spending significant time
> doing so on large systems ... even with only a few thousand
> calls per second).
> 

I agree that one cache line bouncer is devastating to performance. But
as Mathieu said, it is better than a global tracer with lots of bouncing
going on. My logdev tracer (something similar to ftrace, but used only
for debugging) used to have a single buffer. By moving it to per-cpu
buffers and using an atomic counter to sort the events, the speed
increased by a few orders of magnitude.
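
The ordering part of it is dead simple, something like this (sketch from
memory, not the actual logdev code) :

	#include <asm/atomic.h>

	static atomic_t logdev_seq = ATOMIC_INIT(0);

	/*
	 * Tag each event written into a per-cpu buffer with a global
	 * sequence number. Post-processing can then merge the buffers
	 * in exact order. This counter is the single bouncing cache
	 * line.
	 */
	static unsigned int logdev_next_seq(void)
	{
		return (unsigned int)atomic_inc_return(&logdev_seq);
	}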

ftrace does not have a global counter, but on some boxes with out of
sync TSCs, it could not find race conditions. I had to pull in logdev,
which found the race right away, because of this atomic counter.

logdev adds a bit of performance degradation, but for debugging, I don't
care, and it has helped me quite a bit.

ftrace can help in debugging most of the time, but on some boxes with
wacky time stamps, it is useless for finding race problems between CPUs. But
ftrace is for production, and cannot afford the performance penalty of
a global counter.

-- Steve


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17 17:25         ` Steven Rostedt
@ 2008-10-17 18:08           ` Luck, Tony
  2008-10-17 18:42             ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Luck, Tony @ 2008-10-17 18:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin

> I agree that one cache line bouncer is devastating to performance. But
> as Mathieu said, it is better than a global tracer with lots of bouncing
> going on.

Scale up enough, and it becomes more than just a performance problem.
When SGI first tried to boot on 512 cpus they found the kernel hung
completely because of a single global atomic counter for how many
interrupts there were.  With HZ=1024 and 512 cpus the ensuing cache
line bouncing storm from each interrupt took longer to resolve than
the interval between interrupts.

With higher event rates (1KHz seems relatively low) this wall will
be a problem for smaller systems too.

> ftrace does not have a global counter, but on some boxes with out of
> sync TSCs, it could not find race conditions. I had to pull in logdev,
> which found the race right away, because of this atomic counter.

Perhaps this needs to be optional (and run-time switchable).  Some
users (tracking performance issues) will want the tracer to have
the minimum possible effect on the system.  Others (chasing race
conditions) will want the best possible ordering of events between
cpus[*].

-Tony

[*] I'd still be concerned that a heavyweight strict ordering might
perturb the system enough to make the race disappear when tracing
is enabled.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17 18:08           ` Luck, Tony
@ 2008-10-17 18:42             ` Mathieu Desnoyers
  2008-10-17 18:58               ` Luck, Tony
                                 ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-17 18:42 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Steven Rostedt, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin

* Luck, Tony (tony.luck@intel.com) wrote:
> > I agree that one cache line bouncer is devastating to performance. But
> > as Mathieu said, it is better than a global tracer with lots of bouncing
> > going on.
> 
> Scale up enough, and it becomes more than just a performance problem.
> When SGI first tried to boot on 512 cpus they found the kernel hung
> completely because of a single global atomic counter for how many
> interrupts there were.  With HZ=1024 and 512 cpus the ensuing cache
> line bouncing storm from each interrupt took longer to resolve than
> the interval between interrupts.
> 
> With higher event rates (1KHz seems relatively low) this wall will
> be a problem for smaller systems too.
> 

Hrm, on such systems
- a *large* number of cpus
- no synchronized TSCs

What would be the best approach to order events ? Do you think we should
consider using HPET, even though it's painfully slow ? Would it be
faster than cache-line bouncing on such large boxes ? With a frequency
around 10MHz, that would give a 100ns precision, which should be enough
to order events. However, HPET is known for its poor performance, and
I doubt it will do better than the cache-line bouncing alternative.

> > ftrace does not have a global counter, but on some boxes with out of
> > sync TSCs, it could not find race conditions. I had to pull in logdev,
> > which found the race right away, because of this atomic counter.
> 
> Perhaps this needs to be optional (and run-time switchable).  Some
> users (tracking performance issues) will want the tracer to have
> the minimum possible effect on the system.  Others (chasing race
> conditions) will want the best possible ordering of events between
> cpus[*].
> 

Yup, I think this solution would work. The user could specify the time
source for a specific set of buffers (a trace) through debugfs files.
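
Something as simple as this should do it (sketch only, the file name and
flag are invented) :

	#include <linux/debugfs.h>

	/* 0 = local TSC (fast), 1 = global clock (ordered) */
	static u32 ltt_global_clock;

	static int __init ltt_clock_debugfs_init(void)
	{
		/* picked up at trace start time */
		if (!debugfs_create_bool("global_clock", 0644, NULL,
					 &ltt_global_clock))
			return -ENOMEM;
		return 0;
	}

	__initcall(ltt_clock_debugfs_init);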

> -Tony
> 
> [*] I'd still be concerned that a heavyweight strict ordering might
> perturb the system enough to make the race disappear when tracing
> is enabled.
> 

Yes, it's true that it may make the race disappear, but what has been
seen in the field (Steven could confirm that) is that it usually makes
the race more likely to appear due to an enlarged race window. But I
guess it all depends on where the activated instrumentation is.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17 18:42             ` Mathieu Desnoyers
@ 2008-10-17 18:58               ` Luck, Tony
  2008-10-17 20:23                 ` Mathieu Desnoyers
  2008-10-17 19:17               ` Steven Rostedt
  2008-10-20 20:10               ` Linus Torvalds
  2 siblings, 1 reply; 66+ messages in thread
From: Luck, Tony @ 2008-10-17 18:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin

> Hrm, on such systems
> - a *large* number of cpus
> - no synchronized TSCs
>
> What would be the best approach to order events ?

There isn't a perfect solution for this.  My feeling is
that your best hope is with per-cpu buffers logged with
the local TSC ... together with some fancy heuristics to
post-process the logs to come up with the best approximation
to the actual ordering.

If you have a tight upper bound estimate for the
errors in converting from "per-cpu" TSC values to "global
system time" then the post processing tool will be able
to identify events for which the order is uncertain.
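
In post-processing terms that boils down to something like this (names
invented, "err" being the per-cpu conversion error bound) :

	/* userspace post-processing tool */
	#include <stdlib.h>

	struct event {
		long long ts;	/* estimated global time */
		long long err;	/* conversion error bound */
	};

	/* if the two uncertainty windows overlap, the true order of
	 * the two events is simply unknown */
	static int order_uncertain(const struct event *a,
				   const struct event *b)
	{
		return llabs(a->ts - b->ts) <= a->err + b->err;
	}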

> Do you think we should consider using HPET, even though it's
> painfully slow ? Would it be faster than cache-line bouncing
> on such large boxes ? With a frequency around 10MHz, that
> would give a 100ns precision, which should be enough
> to order events.

This sounds like a poor choice.  Makes all traces very
slow.  100ns precision isn't all that good ... we can
probably do almost as well estimating the delta between
TSC on different cpus.

-Tony

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17 18:42             ` Mathieu Desnoyers
  2008-10-17 18:58               ` Luck, Tony
@ 2008-10-17 19:17               ` Steven Rostedt
  2008-10-20 20:10               ` Linus Torvalds
  2 siblings, 0 replies; 66+ messages in thread
From: Steven Rostedt @ 2008-10-17 19:17 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Luck, Tony, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin




On Fri, 17 Oct 2008, Mathieu Desnoyers wrote:

> * Luck, Tony (tony.luck@intel.com) wrote:
> > > I agree that one cache line bouncer is devastating to performance. But
> > > as Mathieu said, it is better than a global tracer with lots of bouncing
> > > going on.
> > 
> > Scale up enough, and it becomes more than just a performance problem.
> > When SGI first tried to boot on 512 cpus they found the kernel hung
> > completely because of a single global atomic counter for how many
> > interrupts there were.  With HZ=1024 and 512 cpus the ensuing cache
> > line bouncing storm from each interrupt took longer to resolve than
> > the interval between interrupts.
> > 
> > With higher event rates (1KHz seems relatively low) this wall will
> > be a problem for smaller systems too.
> > 
> 
> Hrm, on such systems
> - a *large* number of cpus
> - no synchronized TSCs

What about selective counting? Or having counters per node? If you are
dealing with a race, in most cases the race is not happening between CPUs
that do not share a node. Those not sharing a node will try hard never to
use the same cache lines.
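
Something like this (sketch, untested) :

	#include <asm/atomic.h>
	#include <linux/topology.h>

	/* one counter per node keeps the bouncing cache line on-node */
	static atomic_t node_seq[MAX_NUMNODES];

	static unsigned int event_seq(void)
	{
		return (unsigned int)atomic_inc_return(
				&node_seq[numa_node_id()]);
	}

You only get exact ordering within a node that way, but as said above,
that is usually where the interesting races are.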


> 
> What would be the best approach to order events ? Do you think we should
> consider using HPET, even though it's painfully slow ? Would it be
> faster than cache-line bouncing on such large boxes ? With a frequency
> around 10MHz, that would give a 100ns precision, which should be enough
> to order events. However, HPET is known for its poor performance, and
> I doubt it will do better than the cache-line bouncing alternative.
> 
> > > ftrace does not have a global counter, but on some boxes with out of
> > > sync TSCs, it could not find race conditions. I had to pull in logdev,
> > > which found the race right away, because of this atomic counter.
> > 
> > Perhaps this needs to be optional (and run-time switchable).  Some
> > users (tracking performance issues) will want the tracer to have
> > the minimum possible effect on the system.  Others (chasing race
> > conditions) will want the best possible ordering of events between
> > cpus[*].
> > 
> 
> Yup, I think this solution would work. The user could specify the time
> source for a specific set of buffers (a trace) through debugfs files.
> 
> > -Tony
> > 
> > [*] I'd still be concerned that a heavyweight strict ordering might
> > perturb the system enough to make the race disappear when tracing
> > is enabled.
> > 
> 
> Yes, it's true that it may make the race disappear, but what has been
> seen in the field (Steven could confirm that) is that it usually makes
> the race more likely to appear due to an enlarged race window. But I
> guess it all depends on where the activated instrumentation is.

I've seen both. 9 out of 10 times, the tracer helps induce the race. But 
I've had that 1 out of 10 where it makes the race go away.

Actually, what happens is that I'll start adding trace markers (printk 
like traces), and the race will happen quicker. Then I'll add a few more 
markers and the race goes away. Those are the worst ;-)

-- Steve


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17  2:19       ` Luck, Tony
  2008-10-17 17:25         ` Steven Rostedt
@ 2008-10-17 19:36         ` Christoph Lameter
  1 sibling, 0 replies; 66+ messages in thread
From: Christoph Lameter @ 2008-10-17 19:36 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Mathieu Desnoyers, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Steven Rostedt, Peter Zijlstra,
	Thomas Gleixner, David Miller, Ingo Molnar, H. Peter Anvin

Luck, Tony wrote:
> Even one line bouncing between cpus can be a performance disaster.
> You'll probably hit a serious wall somewhere between 8 and 16
> cpus (ia64 has code that looks a lot like this in the gettimeofday()
> path because it does not synchronize cpu cycle counters ... some


The code exists by necessity because some systems do not have synchronized ITCs
and one would not want time to go backward. The cmpxchg there is usually switched
off. It's horrible in terms of scaling to large numbers of processors and also
horrible in terms of clock accuracy.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17 18:58               ` Luck, Tony
@ 2008-10-17 20:23                 ` Mathieu Desnoyers
  2008-10-17 23:52                   ` Luck, Tony
  0 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-17 20:23 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Steven Rostedt, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin, ltt-dev

* Luck, Tony (tony.luck@intel.com) wrote:
> > Hrm, on such systems
> > - *large* amount of cpus
> > - no synchronized TSCs
> >
> > What would be the best approach to order events ?
> 
> There isn't a perfect solution for this.  My feeling is
> that your best hope is with per-cpu buffers logged with
> the local TSC ... together with some fancy heuristics to
> post-process the logs to come up with the best approximation
> to the actual ordering.
> 
> If you have a tight upper bound estimate for the
> errors in converting from "per-cpu" TSC values to "global
> system time" then the post processing tool will be able
> to identify events for which the order is uncertain.
>

The only problem I see with "fancy heuristics" regarding the time base
is that when we detect that something is going wrong in the kernel or in
a userspace program, the *very last* thing we want to do is to doubt
the reliability of the time source. When a problematic situation
is detected, it makes a huge difference whether this information can be
trusted or not. I've seen much simpler algorithms in the past (I'm
referring to the original LTT heartbeat here) which were said to be
plain simple but ended up being buggy and unreliable in rare
corner-cases (it did not take interrupt latency into account). After
fixing the main problems, I decided to start all over from scratch,
because unreliable timestamps mean unreliable traces, and this is not
something I am willing to provide.


> > Do you think we should consider using HPET, event though it's
> > painfully slow ? Would it be faster than cache-line bouncing
> > on such large boxes ? With a frequency around 10MHz, that
> > would give a 100ns precision, which should be enough
> > to order events.
> 
> This sounds like a poor choice.  Makes all traces very
> slow.  100ns precision isn't all that good ... we can
> probably do almost as well estimating the delta between
> TSC on different cpus.
> 
> -Tony
> 

100ns is not bad at all actually, especially given we don't plan to
require a memory barrier to be issued around the timestamp counter
reads. Memory reads/writes can easily be reordered so that they cause
timing skew in the order of 100ns. Also, the TSC frequency drift, and the
imprecision of the TSC synchronization even when the TSCs are synchronized
(typically one cache line transfer delay when they are not synchronized
by the BIOS/mobo), are also in the order of 100ns. So sorry, I
disagree and think 100ns is actually the kind of precision we can expect
even from TSC reads.

Having read a lot about the subtle timestamp counter bugs one can find
in Intel and AMD boxes (gross summary of my findings here : 
http://ltt.polymtl.ca/svn/trunk/lttv/doc/developer/tsc.txt), I think
there is no reliable way to give an upper bound on the timing
inaccuracy, even with heroic measures trying to map the specific bugs of
each of those CPUs, when you have stuff like the southbridge temperature
throttling slowing down your CPU clock without notifying the kernel. And
as I said above, the timestamping code should be _very_, _very_ simple,
given that the first thing a kernel developer will point his finger at
when a tracer discovers a bug in his code is the tracer itself. So let's
save everyone precious time and make this code easy to review. :-)

So we are talking about the performance impact of time base reads. Let's
look at some interesting numbers :

On my x86_64 box
model name  : Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
stepping  : 6
cpu MHz   : 2000.060
(dual quad-core, *not* NUMA)

Cycles to read (from 10000 loops, on a single CPU) :

get_cycles :                                                         60
cpuid + get_cycles + cpuid (with rdtsc_barrier) :                    77
ltt_get_timestamp_64() with synchronized TSC :                       79
ltt_get_timestamp_64() with non-synchronized TSC (cache-hot) :      163
ltt_get_timestamp_64() with non-synchronized TSC (after clflush) :  310
   (just doing the clflush takes 68 cycles, which has been subtracted
   from the previous result)
HPET :                                                              945

So if we have 512 processors doing timestamp reads like crazy, we can
suppose the execution to be serialized by cacheline transfer operations
from cpu to cpu. Therefore, assuming a worst-case scenario where all the
timestamp reads cause a cache line transfer, the 310 cycles (upper
bound) it takes to do the cmpxchg, with CPUs running at 2.0GHz, means
that we can do 6 451 612 timestamp reads per second on the overall
system. On 512 nodes, that means we can do 12 600 timestamp
reads/second/cpu.

Compared to this, HPET would offer a slower time base read (945 cycles
per read is fairly slow, which gives 2 116 402 timestamp reads per
second), but if this mmio read is done in parallel across CPUs (I
don't see any reason why it should hold the bus exclusively for a simple
read.. ?), then it would scale much better so we could expect about 2.1M
timestamp reads/second/cpu.

I guess the cache-line bouncing approach would get much worse with NUMA
systems, and in that case HPET could become increasingly interesting.

To give an order of magnitude, I expect some worst-case scenarios to be
around 8MB/s/cpu when tracing stuff like lockdep in circular per-cpu
memory buffers. With an average of, say, 16 bytes per event, including
the event header and payload, that would mean 524 288 events/second/cpu.
In that case, using the cache-line bouncing approach on a 512 nodes box
would simply kill the system, but if the HPET reads do not imply
serialization, time base reads would add a 25% performance loss in this
utterly extreme case, which is acceptable given this is a best-effort.
Compared to this, the TSC-based solution (given we have synchronized
TSCs) would add a 2% performance hit.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17 20:23                 ` Mathieu Desnoyers
@ 2008-10-17 23:52                   ` Luck, Tony
  2008-10-18 17:01                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Luck, Tony @ 2008-10-17 23:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin, ltt-dev

Complexity of dealing with all the random issues that have
plagued TSC in different cpus over the years definitely
seems to be a problem.

I have one more idea on how we might be able to use
TSC locally and still have confidence that we can
merge local cpu buffers into a consistent stream.

What if we read the HPET occasionally (once per
second?) and add a record to our per-cpu buffer
with the value of the HPET.  That would give us
a periodic cross-check of each cpu's TSC against
real time so that a "smart" post-processor can
sanity check the log entries at regular intervals.

It doesn't deal with the truly insane TSC behaviours
(like stopping completely in certain C states, or
varying frequency) ... but it would at least be able
to reliably detect these forms of insanity.

We need periodic entries added to the buffer anyway
to make sure we can detect rollover since we don't
want to waste space in log records with a full-width
TSC value.

-Tony

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17 23:52                   ` Luck, Tony
@ 2008-10-18 17:01                     ` Mathieu Desnoyers
  2008-10-18 17:35                       ` Linus Torvalds
  2008-10-20 18:07                       ` Luck, Tony
  0 siblings, 2 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-18 17:01 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Steven Rostedt, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin, ltt-dev

* Luck, Tony (tony.luck@intel.com) wrote:
> Complexity of dealing with all the random issues that have
> plagued TSC in different cpus over the years definitely
> seems to be a problem.
> 

Yes :(

> I have one more idea on how we might be able to use
> TSC locally and still have confidence that we can
> merge local cpu buffers into a consistent stream.
> 
> What if we read the HPET occasionally (once per
> second?) and add a record to our per-cpu buffer
> with the value of the HPET.  That would give us
> a periodic cross-check of each cpu's TSC against
> real time so that a "smart" post-processor can
> sanity check the log entries at regular intervals.
> 

Hrm, that would make the timestamps much more sensitive to tracing
hiccups :

- if interrupts are disabled for a long time on the system (kernel bug
  or at early boot), we cannot assume those HPET events will be logged
  at the expected interval.
- if we are in buffer full condition (buffers are too small to handle
  the load and we drop events on buffer full condition), we will not
  only have missing events : given we depend on those HPET events to
  have a consistent time-base, all the trace time-base must be
  considered untrustworthy.
- we would also have to get this HPET timer value at each subbuffer
  boundary (at each page in Steven's implementation). This is required
  so we can make sense of the time-base of buffers when we only gather
  the last subbuffers written, given the previous ones have been
  overwritten in flight-recorder mode. However, with a relatively large
  load and small subbuffers (e.g. 4kB), we would have to get this HPET
  value 2048 times/second/cpu. On a 512 nodes machine, it may become a
  problem. See my analysis of poor HPET scalability below.
  
> It doesn't deal with the truly insane TSC behaviours
> (like stopping completely in certain C states, or
> varying frequency) ... but it would at least be able
> to reliably detect these forms of insanity.
> 

I also like the one done by AMD where the cycle counter goes backward
on a single CPU. :) Hrm, I thought the behaviors you call truly insane
are exactly the ones we are trying to deal with ?

And what do we say when we detect this ? "sorry, please upgrade your
hardware to get a reliable trace" ? ;)

> We need periodic entries added to the buffer anyway
> to make sure we can detect rollover since we don't
> want to waste space in log records with a full-width
> TSC value.
> 

Nope, this is not required. I removed the heartbeat event from LTTng two
weeks ago, implementing detection of the delta from the last timestamp
written into the trace. If we detect that the new timestamp is too far
from the previous one, we write the full 64 bits TSC in an extended
event header. Therefore, we have no dependency on interrupt latency to
get a sane time-base.
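
As a rough sketch, the detection looks like this (the names and the
27-bit compact field width here are illustrative, not the actual LTTng
code):

	#define TSC_BITS	27
	#define TSC_MASK	((1ULL << TSC_BITS) - 1)

	/*
	 * Returns non-zero when the delta no longer fits in the compact
	 * header field, so a full 64-bit TSC extended header is needed.
	 */
	static int need_extended_header(u64 *last_tsc, u64 now)
	{
		u64 delta = now - *last_tsc;

		*last_tsc = now;
		return delta > TSC_MASK;
	}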


> -Tony
> 

Here are some numbers showing the scalability of synchronized TSC vs
cache-line bouncing vs HPET read under tracing load. I use LTTng to take
a trace only in circular per-cpu memory buffers while tbench is running.
I look at the resulting tbench speed. This kind of load generates a lot
of tracing data, especially because tbench does a lot of small
reads/writes which generate a lot of system call events. Side-note:
LTTng is currently fully dynamic and parses the format string like
printk, and this accounts for a large part of the performance
degradation. LTTng however supports overriding this probe with
"specialized" probes which know exactly which types to record. I just
did not create any yet. So let's focus on timestamping :


model name	: Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
stepping	: 6
cpu MHz		: 2000.073

tbench, x86_64 dual quad-core, 2.0GHz, 16GB ram      Speed          Slowdown

(8 cores up)
  No tracing :                                      1910.50 MB/sec
  Flight recorder tracing (per-cpu memory buffers)
    synchronized TSC, get_cycles with cpuid :        940.20 MB/sec (50%)
    unsync TSC, get_cycles + cmpxchg :               716.96 MB/sec (62%)
    unsync TSC, HPET read :                          586.53 MB/sec (69%)

(2 cores up)
  No tracing :                                       488.15 MB/sec
  Flight recorder tracing (per-cpu memory buffers)
    synchronized TSC, get_cycles with cpuid :        241.34 MB/sec (50%)
    unsync TSC, get_cycles + cmpxchg :               202.30 MB/sec (58%)
    unsync TSC, HPET read :                          187.04 MB/sec (61%)

(1 core up)
  No tracing :                                       270.67 MB/sec
  Flight recorder tracing (per-cpu memory buffers)
    synchronized TSC, get_cycles with cpuid :        126.82 MB/sec (53.1%)
    unsync TSC, get_cycles + cmpxchg :               124.54 MB/sec (53.9%)
    unsync TSC, HPET read :                           98.75 MB/sec (63.5%)

So, the conclusion it brings about scalability of those time sources
regarding tracing is :
- local TSC read scales very well when the number of CPUs increases
  (constant 50% overhead)
- Comparing the added overhead of both get_cycles+cmpxchg and HPET to
  the local sync TSC :

  cores    get_cycles+cmpxchg    HPET
      1                  0.8%     10%
      2                  8  %     11%
      8                 12  %     19%

So, is it me, or does HPET scale even more poorly than a cache-line
bouncing cmpxchg ? I find it a bit surprising.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-18 17:01                     ` Mathieu Desnoyers
@ 2008-10-18 17:35                       ` Linus Torvalds
  2008-10-18 17:50                         ` Ingo Molnar
  2008-10-22 15:53                         ` Mathieu Desnoyers
  2008-10-20 18:07                       ` Luck, Tony
  1 sibling, 2 replies; 66+ messages in thread
From: Linus Torvalds @ 2008-10-18 17:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Luck, Tony, Steven Rostedt, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin, ltt-dev



On Sat, 18 Oct 2008, Mathieu Desnoyers wrote:
> 
> So, the conclusion it brings about scalability of those time sources
> regarding tracing is :
> - local TSC read scales very well when the number of CPUs increases
>   (constant 50% overhead)

You should basically expect it to scale perfectly. Of course the tracing 
itself adds overhead, and at some point the trace data generation may add 
so much cache/memory traffic that you start getting worse scaling because 
of _that_, but just a local TSC access itself will be perfect on any sane 
setup.

> - Comparing the added overhead of both get_cycles+cmpxchg and HPET to
>   the local sync TSC :
> 
>   cores    get_cycles+cmpxchg    HPET
>       1                  0.8%     10%
>       2                  8  %     11%
>       8                 12  %     19%
> 
> So, is it me, or does HPET scale even more poorly than a cache-line
> bouncing cmpxchg ? I find it a bit surprising.

I don't think that's strictly true.

The cacheline is going to generally be faster than HPET ever will, since 
caches are really important. But as you can see, the _degradation_ is 
actually worse for the cacheline, since the cacheline works perfectly in 
the UP case (not surprising) and starts degrading a lot more when you 
start getting bouncing.

And I'm not sure what the behaviour would be for many-core, but I would 
not be surprised if the cmpxchg actually ends up losing at some point. The 
HPET is never fast (you can think of it as "uncached access"), and it's 
going to degrade too (contention at the IO hub level), but it's actually 
possible that the contention at some point becomes less than wild 
bouncing.

Many cacheline bouncing issues end up being almost exponential. When you 
*really* get bouncing, things degrade in a major way. I don't think you've 
seen the worst of it with 8 cores ;)

And that's why I'd really like to see the "only local TSC" access, even if 
I admit that the code is going to be much more subtle, and I will also 
admit that especially in the presence of frequency changes *and* hw with 
unsynchronized TSC's you may be in the situation where you never get 
exactly what you want.

But while you may not like some of the "purely local TSC" issues, I would 
like to point out that

 - In _practice_, it's going to be essentially perfect on a lot of 
   machines, and under a lot of loads. 

   For example, yes it's true that frequency changes will make TSC things 
   less reliable on a number of machines, but people already end up 
   disabling dynamic cpufreq when doing various benchmark runs, simply 
   because people want more consistent numbers for benchmarking across 
   different kernels etc.

   So it's entirely possible (and I'd say "likely") that most people are 
   simply willing to do the same thing for tracing if they are tracing 
   things at a level where CPU frequency changes might otherwise matter. 

   So maybe the "local TSC" approach isn't always perfect, but I'd expect 
   that quite often people who do tracing are willing to work around it. 
   The people doing tracing are generally not doing so without being aware 
   of what they are up to..

 - While there is certainly a lot of hardware out there with flaky TSC's, 
   there's also a lot of hardware (especially upcoming) that does *not* have 
   flaky TSC's. We've been complaining to Intel about TSC behavior for 
   years, and the thing is, it actually _is_ improving. It just takes some 
   time.

 - So considering that some of the tracing will actually be very important 
   on machines that have lots of cores, and considering that a lot of the 
   issues can generally be worked around, I really do think that it's 
   worth trying to spend a bit of effort on doing the "local TSC + timely 
   corrections"

For example, you mention that interrupts can be disabled for a time, 
delaying things like regular sync events with some stable external clock 
(say the HPET). That's true, and it would even be a problem if you'd use 
the time of the interrupt itself as the source of the sync, but you don't 
really need to depend on the timing of the interrupt - just that it 
happens "reasonably often" (and now we're talking _much_ longer timeframes 
than some interrupt-disabled time - we're talking tenths of seconds or 
even more).

Then, rather than depend on the time of the interrupt, you can just purely 
check the local TSC against the HPET (or other source), and synchronize 
just _purely_ based on those. That you can do by basically doing something 
like

	do {
		start = read_tsc();
		hpet = read_hpet();
		end = read_tsc();
	} while (end - start > ERROR);

and now, even if you have interrupts enabled (or worry about NMI's), you 
now know that you have a totally _independent_ sync point, ie you know 
that your hpet read value is within ERROR cycles of the start/end values, 
so now you have a good point for doing future linear interpolation based 
on those kinds of sync points.

And if you make all these linear interpolations be per-CPU (so you have 
per-CPU offsets and frequencies) you never _ever_ need to touch any shared 
data at all, and you know you can scale basically perfectly.

Your linear interpolations may not be _perfect_, but you'll be able to get 
them pretty damn near. In fact, even if the TSC's aren't synchronized at 
all, if they are at least _individually_ stable (just running at slightly 
different frequencies because they are in different clock domains, and/or 
at different start points), you can basically perfect the precision over 
time.
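
To make that concrete, here is a sketch of what such a purely local time
base could look like (invented names throughout: read_tsc(),
read_hpet() and hpet_to_ns() are placeholders; startup, overflow limits
and mult precision are all glossed over):

	#define SHIFT	22	/* fixed-point precision of mult */
	#define ERROR	200	/* max TSC cycles around the reference read */

	struct cpu_clock {
		u64 tsc_base;	/* local TSC at the last sync point */
		u64 ref_base;	/* reference (HPET) ns at the last sync point */
		u32 mult;	/* reference ns per 2^SHIFT local cycles */
	};

	static void cpu_clock_resync(struct cpu_clock *c)
	{
		u64 start, hpet, end, ref;

		do {
			start = read_tsc();
			hpet = read_hpet();
			end = read_tsc();
		} while (end - start > ERROR);

		ref = hpet_to_ns(hpet);
		/* slope between the previous and the new sync point */
		c->mult = (u32)(((ref - c->ref_base) << SHIFT) /
				(start - c->tsc_base));
		c->tsc_base = start;
		c->ref_base = ref;
	}

	/* purely local read: no shared cache lines touched */
	static u64 cpu_clock_read(struct cpu_clock *c)
	{
		return c->ref_base +
		       (((read_tsc() - c->tsc_base) * c->mult) >> SHIFT);
	}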

			Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-18 17:35                       ` Linus Torvalds
@ 2008-10-18 17:50                         ` Ingo Molnar
  2008-10-22 16:19                           ` Mathieu Desnoyers
  2008-10-22 15:53                         ` Mathieu Desnoyers
  1 sibling, 1 reply; 66+ messages in thread
From: Ingo Molnar @ 2008-10-18 17:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Luck, Tony, Steven Rostedt, Andrew Morton,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin, ltt-dev,
	Michael Davidson


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And if you make all these linear interpolations be per-CPU (so you 
> have per-CPU offsets and frequencies) you never _ever_ need to touch 
> any shared data at all, and you know you can scale basically 
> perfectly.
> 
> Your linear interpolations may not be _perfect_, but you'll be able to 
> get them pretty damn near. In fact, even if the TSC's aren't 
> synchronized at all, if they are at least _individually_ stable (just 
> running at slightly different frequencies because they are in 
> different clock domains, and/or at different start points), you can 
> basically perfect the precision over time.

there's been code submitted by Michael Davidson recently that looked 
interesting, which turns the TSC into such an entity:

    http://lkml.org/lkml/2008/9/25/451

The periodic synchronization uses the hpet, but it thus allows lockless 
and globally correct readouts of the TSC.

And that would match the long term goal as well: the hw should do this 
all automatically. So perhaps we should have a trace_clock() after all, 
independent of sched_clock(), and derived straight from RDTSC.

The approach as proposed has a couple of practical problems, but if we 
could be one RDTSC+multiplication away from a pretty good timestamp that 
would be rather useful, very fast and very robust ...

	Ingo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [RFC patch 15/15] LTTng timestamp x86
  2008-10-18 17:01                     ` Mathieu Desnoyers
  2008-10-18 17:35                       ` Linus Torvalds
@ 2008-10-20 18:07                       ` Luck, Tony
  2008-10-22 16:51                         ` Mathieu Desnoyers
  1 sibling, 1 reply; 66+ messages in thread
From: Luck, Tony @ 2008-10-20 18:07 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin, ltt-dev

> And what do we say when we detect this ? "sorry, please upgrade your
> hardware to get a reliable trace" ? ;)

My employer might be happy with that answer ;-) ... but I think
we could tell the user to:

        1) adjust something in /sys/...
        2) boot with some special option
        3) rebuild kernel with CONFIG_INSANE_TSC=y

to switch over to a heavyweight workaround in s/w.  Systems
that require this are already in the minority ... and I
think (hope!) that current and future generations of cpus
won't have these challenges.

So this is mostly a campaign for the default code path to
be based on current (sane) TSC behaviour ... with the workarounds
for past problems kept to one side.

> Nope, this is not required. I removed the heartbeat event from LTTng two
> weeks ago, implementing detection of the delta from the last timestamp
> written into the trace. If we detect that the new timestamp is too far
> from the previous one, we write the full 64 bits TSC in an extended
> event header. Therefore, we have no dependency on interrupt latency to
> get a sane time-base.

Neat.  Could you grab the HPET value here too?


> (8 cores up)

Interesting results.  I'm not at all sure why HPET scales so badly.
Maybe some h/w throttling/synchronizing going on???

-Tony

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-17 18:42             ` Mathieu Desnoyers
  2008-10-17 18:58               ` Luck, Tony
  2008-10-17 19:17               ` Steven Rostedt
@ 2008-10-20 20:10               ` Linus Torvalds
  2008-10-20 21:38                 ` john stultz
  2008-10-22 17:05                 ` Mathieu Desnoyers
  2 siblings, 2 replies; 66+ messages in thread
From: Linus Torvalds @ 2008-10-20 20:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Luck, Tony, Steven Rostedt, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin



On Fri, 17 Oct 2008, Mathieu Desnoyers wrote:
> 
> Hrm, on such systems
> - *large* amount of cpus
> - no synchronized TSCs
> 
> What would be the best approach to order events ?

My strong opinion has been - for a longish while now, and independently of 
any timestamping code - that we should be seriously looking at basically 
doing essentially a "ntp" inside the kernel to give up the whole idiotic 
notion of "synchronized TSCs". Yes, TSC's are often synchronized, but even 
when they are, we might as well _think_ of them as not being so.

In other words, instead of expecting internal clocks to be synchronized, 
just make the clock be a clock network of independent TSC domains. The 
domains could in theory be per-package (assuming TSC is synchronized at 
that level), but even if we _could_ do that, we'd probably still be better 
off by simply always doing it per-core. If only because then the reading 
would be per-core.

I think it's a mistake for us to maintain a single clock for 
gettimeofday() (well, "getnstimeofday" and the whole "clocksource_read()" 
crud to be technically correct). And sure, I bet clocksource_read() can do 
various per-CPU things and try to do that, but it's complex and pretty 
generic code, and as far as I know none of the clocksources have even 
tried. The TSC clocksource read certainly does not (it just does a very 
similar horrible "at least don't go backwards" crud that the LTTng patch 
suggested).

So I think we should make "xtime" be a per-CPU thing, and add support for 
per-CPU clocksources. And screw that insane "mark_tsc_unstable()" thing.

And if we did it well, we might be able to get good timestamps that way 
too.

		Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 00/15] Tracer Timestamping
  2008-10-17  7:59 ` [RFC patch 00/15] Tracer Timestamping Peter Zijlstra
@ 2008-10-20 20:25   ` Mathieu Desnoyers
  2008-10-21  0:20     ` Nicolas Pitre
  0 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-20 20:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Thomas Gleixner, David Miller

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Thu, 2008-10-16 at 19:27 -0400, Mathieu Desnoyers wrote:
> > Hi,
> > 
> > Starting with the bottom of my LTTng patchset
> > (git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git)
> > I post as RFC the timestamping infrastructure I have been using for a while in
> > the tracer. It integrates get_cycles() standardization following David Miller's
> > comments I did more recently.
> > 
> > It also deals with 32 -> 64 bits timestamp counter extension with a RCU-style
> > algorithm, which is especially useful on MIPS and SuperH architectures.
> 
> Have you looked at the existing 32->63 extension code in
> include/linux/cnt32_to_63.h and considered unifying it?
> 

Yep, I felt this code was dangerous on SMP given it could suffer from
the following type of race due to lack of proper barriers :

CPU    A                                 B
       read hw cnt low
       read __m_cnt_hi
                                         read hw cnt low
       (wrap detected)
       write __m_cnt_hi (incremented)
                                         read __m_cnt_hi
                                         (wrap detected)
                                         write __m_cnt_hi (incremented)

we therefore increment the high bits twice in the given race.

On UP, the same race could happen if the code is called with preemption
enabled.

I don't think the "volatile" statement would necessarily make sure the
compiler and CPU would do the __m_cnt_hi read before the hw cnt low
read. A real memory barrier to order mmio reads wrt memory reads (or
instruction sync barrier if the value is taken from the cpu registers)
would be required to ensure such order.

I also felt it would be more solid to have per-cpu structures to keep
track of 32->64 bits TSC updates, given the TSCs can always be slightly
out-of-sync :

CPU    A                                 B
       read __m_cnt_hi
       read hw cnt low (+200 cycles)
       (wrap detected)
       write __m_cnt_hi (incremented)
                                         read __m_cnt_hi
                                         read hw cnt low (-200 cycles)
                                         (no wrap)
                                         -> bogus value returned.




> > There is also a TSC synchronization test within this patchset to detect
> > unsynchronized TSCs. 
> 
> We already have such code, no? Does this code replace that one or just
> add a second test?
> 

It adds a second test, which seems more solid to me than the existing
x86 tsc_sync detection code.

> > See comments in this specific patch to figure out the
> > difference between the current x86 tsc_sync.c and the one I propose in this
> > patch.
> 
> Right so you don't unify, that's a missed opportunity, no?
> 

Yep. If we can switch the current x86 tsc_sync code to use my
architecture-agnostic implementation, that would be a gain. We could
probably port other tsc sync detect code (ia64 ?) to use this
infrastructure too.


> > It also provides an architecture-agnostic fallback in case there is no
> > timestamp counter available : basically, it's
> > (jiffies << 13) | atomically_incremented_counter (if there are more than 8192
> > events per jiffy, time will still be monotonic, but will increment faster than
> > the actual system frequency).
> > 
> > Comments are welcome. Note that this is only the beginning of the patchset. I
> > plan to submit the event ID allocation/portable event typing aimed at exporting
> > the data to userspace and buffering mechanism as soon as I integrate a core
> > version of the LTTV userspace tools to the kernel build tree. Other than that, I
> > currently have a tracer which fulfills most of the requirements expressed
> > earlier. I just fear that if I release only the kernel part without foolproof
> > binary-to-ascii trace decoder within the kernel, people might be a bit reluctant
> > to fetch a separate userspace package.
> 
> It might be good to drop all the ltt naming and pick more generic names,
> esp. as ftrace could use a lot of this infrastructure as well.
> 

Sure. I've done all this development as part of the LTTng project, but I
don't care about renaming stuff. trace_clock() seems like a good name
for trace clock source. The unsync TSC detection and the 23->64 bits TSC
extension would also probably require more generic names (and would
benefit to be moved to kernel/).

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-20 20:10               ` Linus Torvalds
@ 2008-10-20 21:38                 ` john stultz
  2008-10-20 22:06                   ` Linus Torvalds
  2008-10-22 17:05                 ` Mathieu Desnoyers
  1 sibling, 1 reply; 66+ messages in thread
From: john stultz @ 2008-10-20 21:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Luck, Tony, Steven Rostedt, Andrew Morton,
	Ingo Molnar, linux-kernel, linux-arch, Peter Zijlstra,
	Thomas Gleixner, David Miller, Ingo Molnar, H. Peter Anvin

On Mon, Oct 20, 2008 at 1:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I think it's a mistake for us to maintain a single clock for
> gettimeofday() (well, "getnstimeofday" and the whole "clocksource_read()"
> crud to be technically correct). And sure, I bet clocksource_read() can do
> various per-CPU things and try to do that, but it's complex and pretty
> generic code, and as far as I know none of the clocksources have even
> tried. The TSC clocksource read certainly does not (it just does a very
> similar horrible "at least don't go backwards" crud that the LTTng patch
> suggested).
>
> So I think we should make "xtime" be a per-CPU thing, and add support for
> per-CPU clocksources. And screw that insane "mark_tsc_unstable()" thing.
>
> And if we did it well, we migth be able to get good timestamps that way
> too.

Personally I'd been hoping that the experiments in the trace
timestamping code would provide a safe area of experimentation before
we adapt it to the TSC clocksource implementation for
getnstimeofday(). Earlier I know Andi and Jiri were working on such a
per-cpu TSC clocksource, but I don't know where it ended up.

I'm not quite sure I followed your per-cpu xtime thoughts.  Could you
explain further your thinking as to why the entire timekeeping
subsystem should be per-cpu instead of just keeping that back in the
arch-specific clocksource implementation?  In other words, why keep
things synced at the nanosecond level instead of keeping the per-cpu
TSC synched at the cycle level?

thanks
-john

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-20 21:38                 ` john stultz
@ 2008-10-20 22:06                   ` Linus Torvalds
  2008-10-20 22:17                     ` Ingo Molnar
                                       ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Linus Torvalds @ 2008-10-20 22:06 UTC (permalink / raw)
  To: john stultz
  Cc: Mathieu Desnoyers, Luck, Tony, Steven Rostedt, Andrew Morton,
	Ingo Molnar, linux-kernel, linux-arch, Peter Zijlstra,
	Thomas Gleixner, David Miller, Ingo Molnar, H. Peter Anvin



On Mon, 20 Oct 2008, john stultz wrote:
> 
> I'm not quite sure I followed your per-cpu xtime thoughts.  Could you
> explain further your thinking as to why the entire timekeeping
> subsystem should be per-cpu instead of just keeping that back in the
> arch-specific clocksource implementation?  In other words, why keep
> things synced at the nanosecond level instead of keeping the per-cpu
> TSC synched at the cycle level?

I don't think you can keep them sync'ed without taking frequency drift into 
account. When you have multiple boards (ie big boxes), they simply _will_ 
be in different clock domains. They won't have the exact same frequency.

So the "rewrite the TSC every once in a while" approach (where "after 
coming out of idle" is just a special case of "once in a while" due to 
many CPU's losing TSC in idle) works well in the kind of situation where 
you really only have a single clock domain, and the TSC's are all 
basically from the same reference clock. And that's a common case, but it 
certainly isn't the _only_ case.

What about fundamentally different frequencies (old TSC's that change with 
cpufreq)? Or what about just subtle different ones (new TSC's but on 
separate sockets that use separate external clocks)?

But sure, I can imagine using a global xtime, but just local TSC offsets 
and frequencies, and just generating a local offset from xtime. BUT HOW DO 
YOU EXPECT TO DO THAT?

Right now, the global xtime offset thing also depends on the fact that we 
have a single global TSC offset! That whole "delta against xtime" logic 
depends very much on this:

	/* calculate the delta since the last update_wall_time: */
	cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;

and that base-time setting depends on a _global_ clock source. Why? 
Because it depends on setting that in sync with updating xtime.

And maybe I'm missing something. But I do not believe that it's easy to 
just make the TSC be per-CPU. You need per-cpu correction factors, but you 
_also_ need a per-CPU time base.

Oh, I'm sure you can do hacky things, and work around known issues, and 
consider the TSC to be globally stable in a lot of common scenarios. 
That's what you get by re-syncing after idle etc. And it's going to work 
in a lot of situations.

But it's not going to solve the "hey, I have 512 CPU's, they are all on 
different boards, and no, they are _not_ synchronized to one global 
clock!".

That's why I'd suggest making _purely_ local time, and then aiming for 
something NTP-like. But maybe there are better solutions out there.

			Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-20 22:06                   ` Linus Torvalds
@ 2008-10-20 22:17                     ` Ingo Molnar
  2008-10-20 22:29                     ` H. Peter Anvin
  2008-10-20 23:47                     ` john stultz
  2 siblings, 0 replies; 66+ messages in thread
From: Ingo Molnar @ 2008-10-20 22:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: john stultz, Mathieu Desnoyers, Luck, Tony, Steven Rostedt,
	Andrew Morton, linux-kernel, linux-arch, Peter Zijlstra,
	Thomas Gleixner, David Miller, Ingo Molnar, H. Peter Anvin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> That's why I'd suggest making _purely_ local time, and then aiming for 
> something NTP-like. But maybe there are better solutions out there.

this 'fast local time' was the rough idea we tried to implement via the 
cpu_clock(cpu) interface.

cpu_clock() results are loosely coupled to xtime in every scheduler tick 
via the scd->tick_gtod logic.

( That way in a sense it tracks NTP time as well, if NTP is fed back 
  into GTOD, such as when ntpd is running. Granted, this is not the same 
  quality at all as if it did native NTP-alike corrections, but it at 
  least has long-term stability. )

And it only ever does cross-CPU work if we specifically ask for a remote 
clock:

        if (cpu != raw_smp_processor_id()) {
                struct sched_clock_data *my_scd = this_scd();

                lock_double_clock(scd, my_scd);

it still does this "serialization-looking" locking even in the local case:

                __raw_spin_lock(&scd->lock);
                clock = __update_sched_clock(scd, now);
        }

        __raw_spin_unlock(&scd->lock);

... but that lock is strictly per CPU, so it only matters if there _is_ 
cross-CPU "interest" in that clock. Otherwise these locks are in essence 
just per CPU and cause no cacheline bouncing, etc.

... but we could try to eliminate even that potential for any locking. 
On 64-bit it's a real possibility, I think. (we need the lock for 32-bit 
mainly, the timestamps are all 64 bits)

... it also has all the tsc-stops-in-idle smarts, knows about cpufreq, 
etc. Those things are needed even on UP, to not get really bad 
transients in time.

That still leaves us with sched_clock() complexity, which has spread out 
a bit more than it should have. So it's not all as simple as you'd like 
it to be, I think, but we are trying hard ...

Ideas to simplify/robustify it are welcome.

	Ingo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-20 22:06                   ` Linus Torvalds
  2008-10-20 22:17                     ` Ingo Molnar
@ 2008-10-20 22:29                     ` H. Peter Anvin
  2008-10-21 18:10                       ` Bjorn Helgaas
  2008-10-20 23:47                     ` john stultz
  2 siblings, 1 reply; 66+ messages in thread
From: H. Peter Anvin @ 2008-10-20 22:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: john stultz, Mathieu Desnoyers, Luck, Tony, Steven Rostedt,
	Andrew Morton, Ingo Molnar, linux-kernel, linux-arch,
	Peter Zijlstra, Thomas Gleixner, David Miller, Ingo Molnar

Linus Torvalds wrote:
> 
> But it's not going to solve the "hey, I have 512 CPU's, they are all on 
> different boards, and no, they are _not_ synchronized to one global 
> clock!".
> 
> That's why I'd suggest making _purely_ local time, and then aiming for 
> something NTP-like. But maybe there are better solutions out there.
> 

At the same time, it would definitely be nice to encourage vendors of 
large SMP systems to provide a common root crystal (frequency standard) 
for a single SMP domain.  Preferably a really good one, TCXO or better.

	-hpa

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-20 22:06                   ` Linus Torvalds
  2008-10-20 22:17                     ` Ingo Molnar
  2008-10-20 22:29                     ` H. Peter Anvin
@ 2008-10-20 23:47                     ` john stultz
  2 siblings, 0 replies; 66+ messages in thread
From: john stultz @ 2008-10-20 23:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Luck, Tony, Steven Rostedt, Andrew Morton,
	Ingo Molnar, linux-kernel, linux-arch, Peter Zijlstra,
	Thomas Gleixner, David Miller, Ingo Molnar, H. Peter Anvin

On Mon, 2008-10-20 at 15:06 -0700, Linus Torvalds wrote:
> 
> On Mon, 20 Oct 2008, john stultz wrote:
> > 
> > I'm not quite sure I followed your per-cpu xtime thoughts.  Could you
> > explain further your thinking as to why the entire timekeeping
> > subsystem should be per-cpu instead of just keeping that back in the
> > arch-specific clocksource implementation?  In other words, why keep
> > things synced at the nanosecond level instead of keeping the per-cpu
> > TSC synched at the cycle level?
> 
> I don't think you can keep them sync'ed without taking frequency drift into 
> account. When you have multiple boards (ie big boxes), they simply _will_ 
> be in different clock domains. They won't have the exact same frequency.
> 
> So the "rewrite the TSC every once in a while" approach (where "after 
> coming out of idle" is just a special case of "once in a while" due to 
> many CPU's losing TSC in idle) works well in the kind of situation where 
> you really only have a single clock domain, and the TSC's are all 
> basically from the same reference clock. And that's a common case, but it 
> certainly isn't the _only_ case.
> 
> What about fundamentally different frequencies (old TSC's that change with 
> cpufreq)? Or what about just subtle different ones (new TSC's but on 
> separate sockets that use separate external clocks)?

Ok. Thanks, the clarification about dealing with the multiple frequency
domains helps me understand what you're looking for and why per-cpu time
bases would be needed.

I was assuming that we were just looking at the single frequency domain,
but unsynced TSCs due to idle halting (or maybe just very slight
frequency skew).

<snip>
> Oh, I'm sure you can do hacky things, and work around known issues, and 
> consider the TSC to be globally stable in a lot of common scenarios. 
> That's what you get by re-syncing after idle etc. And it's going to work 
> in a lot of situations.

Yea, and indeed this is the path we've been on, because folks have had quite
a bit of difficulty getting the single freq domain solution working. So
small hacks have been added over time, hoping to get there for just one
freq.

> But it's not going to solve the "hey, I have 512 CPU's, they are all on 
> different boards, and no, they are _not_ synchronized to one global 
> clock!".

Yep. And for now we dodge that by pushing to use a stable global
clocksource like HPET for these cases, at the cost of performance.


> That's why I'd suggest making _purely_ local time, and then aiming for 
> something NTP-like. But maybe there are better solutions out there.

The difficulty with NTP-like approaches is that distributed systems tend
to expect slight deltas between machines. Userland gettimeofday() users do
not expect detectable skew between cpus.

Getting that last part right without those "at least don't go backwards"
hacks is hard.

I'll keep thinking about it.

thanks
-john


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 00/15] Tracer Timestamping
  2008-10-20 20:25   ` Mathieu Desnoyers
@ 2008-10-21  0:20     ` Nicolas Pitre
  2008-10-21  1:32       ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Nicolas Pitre @ 2008-10-21  0:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Ingo Molnar, lkml,
	linux-arch, Steven Rostedt, Thomas Gleixner, David Miller

On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:

> * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > Have you looked at the existing 32->63 extension code in
> > include/linux/cnt32_to_63.h and considered unifying it?
> > 
> 
> Yep, I felt this code was dangerous on SMP given it could suffer from
> the following type of race due to lack of proper barriers :

You are wrong.

> CPU    A                                 B
>        read hw cnt low
>        read __m_cnt_hi
>                                          read hw cnt low
>        (wrap detected)
>        write __m_cnt_hi (incremented)
>                                          read __m_cnt_hi
>                                          (wrap detected)
>                                          write __m_cnt_hi (incremented)
> 
> we therefore increment the high bits twice in the given race.

No.  Either you do the _same_ incrementation twice effectively writing 
the _same_ high value twice, or you don't have simultaneous wrap 
detections.  Simulate it on paper with real numbers if you are not 
convinced.

> On UP, the same race could happen if the code is called with preemption
> enabled.

Wrong again.

> I don't think the "volatile" statement would necessarily make sure the
> compiler and CPU would do the __m_cnt_hi read before the hw cnt low
> read. A real memory barrier to order mmio reads wrt memory reads (or
> instruction sync barrier if the value is taken from the cpu registers)
> would be required to ensure such order.

That's easy enough to fix, right?

> I also felt it would be more solid to have per-cpu structures to keep
> track of 32->64 bits TSC updates, given the TSCs can always be slightly
> out-of-sync :

If the low part is a per CPU value then the high part has to be a per 
CPU value too.  There only needs to be a per-CPU variant of the same 
algorithm where the only difference is that __m_cnt_hi would be per CPU.

If the low part is global then __m_cnt_hi has to be global, even on SMP.


Nicolas

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 00/15] Tracer Timestamping
  2008-10-21  0:20     ` Nicolas Pitre
@ 2008-10-21  1:32       ` Mathieu Desnoyers
  2008-10-21  2:32         ` Nicolas Pitre
  0 siblings, 1 reply; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-21  1:32 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Ingo Molnar, lkml,
	linux-arch, Steven Rostedt, Thomas Gleixner, David Miller

* Nicolas Pitre (nico@cam.org) wrote:
> On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:
> 
> > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > Have you looked at the existing 32->63 extension code in
> > > include/linux/cnt32_to_63.h and considered unifying it?
> > > 
> > 
> > Yep, I felt this code was dangerous on SMP given it could suffer from
> > the following type of race due to lack of proper barriers :
> 
> You are wrong.
> 
> > CPU    A                                 B
> >        read hw cnt low
> >        read __m_cnt_hi
> >                                          read hw cnt low
> >        (wrap detected)
> >        write __m_cnt_hi (incremented)
> >                                          read __m_cnt_hi
> >                                          (wrap detected)
> >                                          write __m_cnt_hi (incremented)
> > 
> > we therefore increment the high bits twice in the given race.
> 
> No.  Either you do the _same_ incrementation twice effectively writing 
> the _same_ high value twice, or you don't have simultaneous wrap 
> detections.  Simulate it on paper with real numbers if you are not 
> convinced.
> 

Hi Nicolas,

Yup, you are right. However, the case where one CPU sees the clock source
a little bit off-sync (late) still poses a problem. Example follows :

CPU    A                                 B
       read __m_cnt_hi (0x80000000)
       read hw cnt low (0x00000001)
       (wrap detected :
        (s32)(0x80000000 ^ 0x1) < 0)
       write __m_cnt_hi = 0x00000001
       return 0x0000000100000001
                                            read __m_cnt_hi (0x00000001)
                                     (late) read hw cnt low (0xFFFFFFFA)
                                            (wrap detected :
                                             (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
                                            write __m_cnt_hi = 0x80000001
                                            return 0x80000001FFFFFFFA
                                                                  (time jumps)
       read __m_cnt_hi (0x80000001)
       read hw cnt low (0x00000020)
       (wrap detected :
        (s32)(0x80000001 ^ 0x20) < 0)
       write __m_cnt_hi = 0x00000002
       return 0x0000000200000020
                             (time jumps again)

A similar situation can be generated by out-of-order hi/low bits reads.

> > On UP, the same race could happen if the code is called with preemption
> > enabled.
> 
> Wrong again.
> 
> > I don't think the "volatile" statement would necessarily make sure the
> > compiler and CPU would do the __m_cnt_hi read before the hw cnt low
> > read. A real memory barrier to order mmio reads wrt memory reads (or
> > instruction sync barrier if the value is taken from the cpu registers)
> > would be required to ensure such order.
> 
> That's easy enough to fix, right?
> 

For memory read vs cycle counter, something like the "data_barrier(x)"
on powerpc would probably be enough when available. We would need
something that orders memory reads with the rest of instruction
execution. On other architectures, smp_rmb() + get_cycles_barrier() (the
latter being an instruction sync) would be required.

For memory read vs mmap io read, a smp_rmb() should be enough.

> > I also felt it would be more solid to have per-cpu structures to keep
> > track of 32->64 bits TSC updates, given the TSCs can always be slightly
> > out-of-sync :
> 
> If the low part is a per CPU value then the high part has to be a per 
> CPU value too.  There only needs to be a per-CPU variant of the same 
> algorithm where the only difference is that __m_cnt_hi would be per CPU.
> 
> If the low part is global then __m_cnt_hi has to be global, even on SMP.
> 

Hrm, given the performance impact of barriers that would need to be
added to ensure correct read order of hi and low bits, and also
considering the risk that a "synchronized" timestamp counter off by a
few cycles poses in terms of time jumping forward, I don't see why we
would ever want to use a global __m_cnt_hi ? I would therefore recommend
always using a per-cpu __m_cnt_hi.

Also, a small nit: to make sure cnt32_to_63 never gets scheduled out
with a stale __m_cnt_hi value, its comments should probably say that
it always has to be used with preemption off.
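
For illustration, a per-cpu variant could be as simple as this (a rough
sketch only, not the cnt32_to_63.h code: the bit-31 trick is replaced by
an explicit last-value compare, and read_cnt_lo() stands for the actual
hardware counter read):

	static DEFINE_PER_CPU(u32, cnt_hi);
	static DEFINE_PER_CPU(u32, cnt_last_lo);

	/*
	 * Must be called with preemption disabled, and at least once per
	 * counter period on each cpu.  All state is per-cpu, so no memory
	 * barrier is needed to order the hi/lo accesses; use from both
	 * interrupt and process context would additionally need irq
	 * disabling.
	 */
	static inline u64 percpu_cnt32_to_64(void)
	{
		u32 lo = read_cnt_lo();

		if (lo < __get_cpu_var(cnt_last_lo))	/* low part wrapped */
			__get_cpu_var(cnt_hi)++;
		__get_cpu_var(cnt_last_lo) = lo;
		return ((u64)__get_cpu_var(cnt_hi) << 32) | lo;
	}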

Mathieu

> 
> Nicolas

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 00/15] Tracer Timestamping
  2008-10-21  1:32       ` Mathieu Desnoyers
@ 2008-10-21  2:32         ` Nicolas Pitre
  2008-10-21  4:05           ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Nicolas Pitre @ 2008-10-21  2:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Ingo Molnar, lkml,
	linux-arch, Steven Rostedt, Thomas Gleixner, David Miller

On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:

> * Nicolas Pitre (nico@cam.org) wrote:
> > On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:
> > 
> > > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > > Have you looked at the existing 32->63 extension code in
> > > > include/linux/cnt32_to_63.h and considered unifying it?
> > > > 
> > > 
> > > Yep, I felt this code was dangerous on SMP given it could suffer from
> > > the following type of race due to lack of proper barriers :
> > 
> > You are wrong.
> > 
> > > CPU    A                                 B
> > >        read hw cnt low
> > >        read __m_cnt_hi
> > >                                          read hw cnt low
> > >        (wrap detected)
> > >        write __m_cnt_hi (incremented)
> > >                                          read __m_cnt_hi
> > >                                          (wrap detected)
> > >                                          write __m_cnt_hi (incremented)
> > > 
> > > we therefore increment the high bits twice in the given race.
> > 
> > No.  Either you do the _same_ incrementation twice effectively writing 
> > the _same_ high value twice, or you don't have simultaneous wrap 
> > detections.  Simulate it on paper with real numbers if you are not 
> > convinced.
> > 
> 
> Hi Nicolas,
> 
> Yup, you are right. However, the case where one CPU sees the clock source
> a little bit off-sync (late) still poses a problem. Example follows :
> 
> CPU    A                                 B
>        read __m_cnt_hi (0x80000000)
>        read hw cnt low (0x00000001)
>        (wrap detected :
>         (s32)(0x80000000 ^ 0x1) < 0)
>        write __m_cnt_hi = 0x00000001
>        return 0x0000000100000001
>                                             read __m_cnt_hi (0x00000001)
>                                      (late) read hw cnt low (0xFFFFFFFA)
>                                             (wrap detected :
>                                              (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
>                                             write __m_cnt_hi = 0x80000001
>                                             return 0x80000001FFFFFFFA
>                                                                   (time jumps)

If this is through a global counter then I have a big problem believing 
you.

If this is a per-CPU counter then you just need a per-CPU version of 
__m_cnt_hi.

> A similar situation can be generated by out-of-order hi/low bits reads.

This, of course, should and can be prevented.  No big deal.

> > If the low part is a per CPU value then the high part has to be a per 
> > CPU value too.  There only needs to be a per-CPU variant of the same 
> > algorithm where the only difference is that __m_cnt_hi would be per CPU.
> > 
> > If the low part is global then __m_cnt_hi has to be global, even on SMP.
> > 
> 
> Hrm, given the performance impact of barriers that would need to be
> added to insure correct read order of hi and low bits, and also
> considering the risk that a "synchronized" timestamp counter off by a
> few cycles poses in terms of time jumping forward, I don't see why we
> would ever want to use a global __m_cnt_hi ?

I still would like to know just _how_ you could manage to have one CPU 
perform two reads and one write and manage for another CPU to see that 
just written value but read an even older value from the global counter 
than what the first CPU saw.

> I would therefore recommend to always use a per-cpu __m_cnt_hi.

No, not until you convince me of the above nonsense.  Sure it will 
work with a per-CPU __m_cnt_hi of course, but then the timing constraint 
to keep the algorithm warm is for each CPU which, if the source counter 
is global, shouldn't be necessary.

If the 63-bit counter is queried often enough on each CPU then it is of 
course advantageous to go per-CPU to avoid excessive cache bouncing, and 
if you start worrying about cache bouncing then you certainly call this 
code often enough to keep everything in synch.  But that need not 
always be the case.

> Also, a small nit, to make sure cnt32_to_63 never gets scheduled out
> with a stale old __m_cnt_hi value, its comments should probably say that
> it has to be always used with preemption off.

NO! THAT's THE WHOLE POINT OF THIS CODE: TO NOT NEED ANY KIND OF 
LOCKING!  The only requirement is that you must not be preempted away 
for longer than half (or a quarter if you want to play real safe) the 
period of the base counter.  If you cannot guarantee that then either 
your base counter is so fast that it wraps so often to be unusable, or 
you have another and bigger problem for being preempted for so long.
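
For readers following along, the macro under discussion looks roughly
like this (a sketch reconstructed from this thread, not a verbatim copy
of include/linux/cnt32_to_63.h; it assumes a little-endian lo/hi
layout):

#define cnt32_to_63(cnt_lo) \
({ \
	static volatile u32 __m_cnt_hi; \
	union { struct { u32 lo, hi; }; u64 val; } __x; \
	__x.hi = __m_cnt_hi; \
	__x.lo = (cnt_lo); \
	/* \
	 * The MSB of __m_cnt_hi caches the expected MSB of the hw \
	 * counter.  When they differ, half a period has elapsed: \
	 * toggle the cached MSB, incrementing the high part when the \
	 * hw counter has just wrapped.  The plain 32-bit store is \
	 * atomic, so a racing update stores the very same value. \
	 */ \
	if (unlikely((s32)(__x.hi ^ __x.lo) < 0)) \
		__m_cnt_hi = __x.hi = \
			(__x.hi ^ 0x80000000) + (__x.hi >> 31); \
	__x.val; /* bit 63 is garbage; callers mask or shift it out */ \
})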


Nicolas

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 00/15] Tracer Timestamping
  2008-10-21  2:32         ` Nicolas Pitre
@ 2008-10-21  4:05           ` Mathieu Desnoyers
  0 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-21  4:05 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Ingo Molnar, lkml,
	linux-arch, Steven Rostedt, Thomas Gleixner, David Miller

* Nicolas Pitre (nico@cam.org) wrote:
> On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:
> 
> > * Nicolas Pitre (nico@cam.org) wrote:
> > > On Mon, 20 Oct 2008, Mathieu Desnoyers wrote:
> > > 
> > > > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > > > Have you looked at the existing 32->63 extention code in
> > > > > include/linux/cnt32_to_63.h and considered unifying it?
> > > > > 
> > > > 
> > > > Yep, I felt this code was dangerous on SMP given it could suffer from
> > > > the following type of race due to lack of proper barriers :
> > > 
> > > You are wrong.
> > > 
> > > > CPU    A                                 B
> > > >        read hw cnt low
> > > >        read __m_cnt_hi
> > > >                                          read hw cnt low
> > > >        (wrap detected)
> > > >        write __m_cnt_hi (incremented)
> > > >                                          read __m_cnt_hi
> > > >                                          (wrap detected)
> > > >                                          write __m_cnt_hi (incremented)
> > > > 
> > > > we therefore increment the high bits twice in the given race.
> > > 
> > > No.  Either you do the _same_ incrementation twice effectively writing 
> > > the _same_ high value twice, or you don't have simultaneous wrap 
> > > detections.  Simulate it on paper with real numbers if you are not 
> > > convinced.
> > > 
> > 
> > Hi Nicolas,
> > 
> > Yup, you are right. However, the case where one CPU sees the clock source
> > a little bit off-sync (late) still poses a problem. Example follows :
> > 
> > CPU    A                                 B
> >        read __m_cnt_hi (0x80000000)
> >        read hw cnt low (0x00000001)
> >        (wrap detected :
> >         (s32)(0x80000000 ^ 0x1) < 0)
> >        write __m_cnt_hi = 0x00000001
> >        return 0x0000000100000001
> >                                             read __m_cnt_hi (0x00000001)
> >                                      (late) read hw cnt low (0xFFFFFFFA)
> >                                             (wrap detected :
> >                                              (s32)(0x00000001 ^ 0xFFFFFFFA) < 0)
> >                                             write __m_cnt_hi = 0x80000001
> >                                             return 0x80000001FFFFFFFA
> >                                                                   (time jumps)
> 
> If this is through a global counter then I have a big problem believing 
> you.
> 

If the global counter is read through MMIO, if its values are
guaranteed to increase monotonically, and if the proper memory barriers
are there, it should be OK: two consecutive reads done on different
CPUs should never go backward.

> If this is a per-CPU counter then you just need a per-CPU version of 
> __m_cnt_hi.
> 

Counters like the MIPS 32-bit timestamp counter are supposedly
synchronized across cores, and could thus be thought of as global, but
they could suffer from such small drifts due to inexact counter
synchronization.

> > A similar situation can be generated by out-of-order hi/low bits reads.
> 
> This, of course, should and can be prevented.  No big deal.
> 
> > > If the low part is a per CPU value then the high part has to be a per 
> > > CPU value too.  There only needs to be a per-CPU variant of the same 
> > > algorithm where the only difference is that __m_cnt_hi would be per CPU.
> > > 
> > > If the low part is global then __m_cnt_hi has to be global, even on SMP.
> > > 
> > 
> > Hrm, given the performance impact of barriers that would need to be
> > added to insure correct read order of hi and low bits, and also
> > considering the risk that a "synchronized" timestamp counter off by a
> > few cycles poses in terms of time jumping forward, I don't see why we
> > would ever want to use a global __m_cnt_hi ?
> 
> I still would like to know just _how_ you could manage to have one CPU 
> perform two reads and one write and manage for another CPU to see that 
> just written value but read an even older value from the global counter 
> than what the first CPU saw.
> 

As I said above, it would be OK with a global MMIO counter, given that
the proper barriers are used.

> > I would therefore recommend to always use a per-cpu __m_cnt_hi.
> 
> No, not until you convince me of the above nonsense.  Sure it will 
> work with a per-CPU __m_cnt_hi of course, but then the timing constraint 
> to keep the algorithm warm is for each CPU which, if the source counter 
> is global, shouldn't be necessary.
> 

True, with an MMIO time source, only a single counter seems necessary.
Only with slightly-off time sources (such as "almost synchronized"
TSCs) would the per-CPU counter be needed.

> If the 63-bit counter is queried often enough on each CPU then it is of 
> course advantageous to go per-CPU to avoid excessive cache bouncing, and 
> if you start worrying about cache bouncing then you certainly call this 
> code often enough to keep everything in synch.  But that need not 
> always be the case.

This code should only cache bounce about once or twice per overflow
period; the rest of the time, it only reads the __m_cnt_hi value for
comparison. Therefore I don't think cache-line bouncing is such an issue
here. OTOH, added barriers can degrade performance a lot.

> 
> > Also, a small nit, to make sure cnt32_to_63 never gets scheduled out
> > with a stale old __m_cnt_hi value, its comments should probably say that
> > it has to be always used with preemption off.
> 
> NO! THAT's THE WHOLE POINT OF THIS CODE: TO NOT NEED ANY KIND OF 
> LOCKING!  The only requirement is that you must not be preempted away 
> for longer than half (or a quarter if you want to play real safe) the 
> period of the base counter.  If you cannot guarantee that then either 
> your base counter is so fast that it wraps so often to be unusable, or 
> you have another and bigger problem for being preempted for so long.
> 

Or maybe you are just unlucky and a user hits CTRL-Z at the wrong time?
I could be wrong, and maybe it is not even possible to "stop" a thread
while it's in kernel space, but unless there are restrictions on where
a thread can be stopped, I assume such a scenario could cause a very
long preemption if preemption is not disabled.

Even if stopping a thread like that is not possible, assuming the
scheduler will never leave _any_ thread off the runqueue for a few
seconds seems questionable.

Mathieu

> 
> Nicolas

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-20 22:29                     ` H. Peter Anvin
@ 2008-10-21 18:10                       ` Bjorn Helgaas
  2008-10-23 15:47                         ` Linus Torvalds
  0 siblings, 1 reply; 66+ messages in thread
From: Bjorn Helgaas @ 2008-10-21 18:10 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, john stultz, Mathieu Desnoyers, Luck, Tony,
	Steven Rostedt, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Peter Zijlstra, Thomas Gleixner, David Miller,
	Ingo Molnar

On Monday 20 October 2008 04:29:07 pm H. Peter Anvin wrote:
> Linus Torvalds wrote:
> > 
> > But it's not going to solve the "hey, I have 512 CPU's, they are all on 
> > different boards, and no, they are _not_ synchronized to one global 
> > clock!".
> > 
> > That's why I'd suggest making _purely_ local time, and then aiming for 
> > something NTP-like. But maybe there are better solutions out there.
> 
> At the same time, it would definitely be nice to encourage vendors of 
> large SMP systems to provide a common root crystal (frequency standard) 
> for a single SMP domain.  Preferrably a really good one, TCXO or better.

A single root crystal is nice for us software guys.  But it often
also turns into a single point of failure, which the hardware guys
are always trying to eliminate.  So I think multiple crystals are
inevitable for the really large machines.

Bjorn

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-18 17:35                       ` Linus Torvalds
  2008-10-18 17:50                         ` Ingo Molnar
@ 2008-10-22 15:53                         ` Mathieu Desnoyers
  1 sibling, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-22 15:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luck, Tony, Steven Rostedt, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin, ltt-dev

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Sat, 18 Oct 2008, Mathieu Desnoyers wrote:
> > 
> > So, the conclusion it brings about scalability of those time sources
> > regarding tracing is :
> > - local TSC read scales very well when the number of CPU increases
> >   (constant 50% overhead)
> 
> You should basically expect it to scale perfectly. Of course the tracing 
> itself adds overhead, and at some point the trace data generation may add 
> so much cache/memory traffic that you start getting worse scaling because 
> of _that_, but just a local TSC access itself will be perfect on any sane 
> setup.
> 

Given that the rest of the tracing work mainly consists of reading
cache-hot data and writing it into per-CPU buffers, it actually scales
very well.

> > - Comparing the added overhead of both get_cyles+cmpxchg and HPET to
> >   the local sync TSC :
> > 
> >   cores    get_cycles+cmpxchg    HPET
> >       1                  0.8%     10%
> >       2                  8  %     11%
> >       8                 12  %     19%
> > 
> > So, is it me, or HPET scales even more poorly than a cache-line bouncing
> > cmpxchg ? I find it a bit surprising.
> 
> I don't think that's strictly true.
> 
> The cacheline is going to generally be faster than HPET ever will, since 
> caches are really important. But as you can see, the _degradation_ is 
> actually worse for the cacheline, since the cacheline works perfectly in 
> the UP case (not surprising) and starts degrading a lot more when you 
> start getting bouncing.
> 
> And I'm not sure what the behaviour would be for many-core, but I would 
> not be surprised of the cmpxchg actually ends up losing at some point. The 
> HPET is never fast (you can think of it as "uncached access"), and it's 
> going to degrade too (contention at the IO hub level), but it's actually 
> possible that the contention at some point becomes less than wild 
> bouncing.
> 

Too bad I don't have enough cores to generate a meaningful figure, but
I've been surprised to see how badly the HPET evolved between 2 and 8
cores (11% impact -> 19% impact) vs the cacheline (8% -> 12%). The real
big step for cacheline bouncing seems to be between 1 and 2 CPUs, but
after that it seems to increase much more slowly than the HPET.

> Many cacheline bouncing issues end up being almost exponential. When you 
> *really* get bouncing, things degrade in a major way. I don't think you've 
> seen the worst of it with 8 cores ;)
> 

I wonder how this can end up being exponential, considering I could
switch this cache-line bouncing access into an uncached memory access.
This would slow down the few-CPU cases, but would likely behave much
like the HPET read, and its performance impact would be expected to
increase linearly with the number of cores. I think what makes this
linear in the number of cores is that we are dealing with a single
cache-line exchange. In standard programs, we have to bounce many
cache-lines between the CPUs, which therefore follows a time complexity
of O(nr cores * nr shared cache-lines). If the number of shared
cache-lines is big, it may look like an exponential increase.

I also wonder how many 8+ core systems with non-synchronized TSCs there
are out there, whether tracing is a vital requirement for them, and
whether it's worth all the effort we are putting into this. I am always
tempted to just detect this kind of behavior, keep a slower-but-solid
solution, and point to some documentation that says what workarounds can
be enabled.

> And that's why I'd really like to see the "only local TSC" access, even if 
> I admit that the code is going to be much more subtle, and I will also 
> admit that especially in the presense of frequency changes *and* hw with 
> unsynchronized TSC's you may be in the situation where you never get 
> exactly what you want.
> 
> But while you may not like some of the "purely local TSC" issues, I would 
> like to point out that
> 
>  - In _practice_, it's going to be essentially perfect on a lot of 
>    machines, and under a lot of loads. 
> 
>    For example, yes it's true that frequency changes will make TSC things 
>    less reliable on a number of machines, but people already end up 
>    disabling dynamic cpufreq when doing various benchmark runs, simply 
>    because people want more consistent numbers for benchmarking across 
>    different kernels etc.
> 
>    So it's entirely possible (and I'd say "likely") that most people are 
>    simply willing to do the same thing for tracing if they are tracing 
>    things at a level where CPU frequency changes might otherwise matter. 
> 
>    So maybe the "local TSC" approach isn't always perfect, but I'd expect 
>    that quite often people who do tracing are willing to work around it. 
>    The people doing tracing are generally not doing so without being aware 
>    of what they are up to..

The way I work around this issue in LTTng is to detect non-synchronized
TSCs, print a warning message saying that the time base used for
tracing is not perfect, and point the user to some documentation
which explains how to work around the problem with specific kernel
command line arguments (e.g. idle=poll and disabling frequency scaling).
There are however some people who will not be willing to reboot their
computer or change their setup, and this is the use case where I think
providing a non-perfect solution (which does not scale as well) makes
sense.


> 
>  - While there is certainly a lot of hardware out there with flaky TSC's, 
>    there's also a lot of hardware (especially upcoming) that do *not* have 
>    flaky TSC's. We've been complaining to Intel about TSC behavior for 
>    years, and the thing is, it actually _is_ improving. It just takes some 
>    time.
> 

Hurray ! :)


>  - So considering that some of the tracing will actually be very important 
>    on machines that have lots of cores, and considering that a lot of the 
>    issues can generally be worked around, I really do think that it's 
>    worth trying to spend a bit of effort on doing the "local TSC + timely 
>    corrections"
> 

As we can see in other threads, it seems to have been tried for a while
by very competent people without brilliant success. Tracing requires
timestamps that follow the event order across CPUs as closely as
possible, a stronger requirement than for the standard kernel time
sources, so I think those things should first be tried as a new kernel
time source with more relaxed constraints, and maybe eventually become
a trace_clock() source if it is judged stable and precise enough (e.g.
we could trace it with a tracer based on a cache-line bouncing clock to
see if the event order is correct under various loads).

> For example, you mention that interrupts can be disabled for a time, 
> delaying things like regular sync events with some stable external clock 
> (say the HPET). That's true, and it would even be a problem if you'd use 
> the time of the interrupt itself as the source of the sync, but you don't 
> really need to depend on the timing of the interrupt - just that it 
> happens "reasonably often" (and now we're talking _much_ longer timeframes 
> than some interrupt-disabled time - we're talking tenths of seconds or 
> even more).

Even the "reasonably often" can be a problem. Some examples :

- boot time tracing, where interrupts can be off for a while.
- some races within the kernel which could disable interrupts for a
  long while. Yes, this should _never_ happen, but when it does, a
  tracer becomes very handy.

> 
> Then, rather than depend on the time of the interrupt, you just purely can 
> check the local TSC against the HPET (or other source), and synchronize 
> just _purely_ based on those. That you can do by basically doing something 
> like
> 
> 	do {
> 		start = read_tsc();
> 		hpet = read_hpet();
> 		end = read_tsc();
> 	} while (end - start > ERROR);
> 
> and now, even if you have interrupts enabled (or worry about NMI's), you 
> now know that you have a totally _independent_ sync point, ie you know 
> that your hpet read value is withing ERROR cycles of the start/end values, 
> so now you have a good point for doing future linear interpolation based 
> on those kinds of sync points.
> 

Just hope you don't run this on uncommon hardware (e.g. virtualized
hardware) where end - start is always > ERROR, because you would be
creating an infinite loop, which might cause some problems. But yes,
that would help reading the HPET and the TSC together. We could
probably use the midpoint for the TSC value there too :
  tsc = start + ((end - start) >> 1);
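
A bounded variant of that sampling loop might look like this (a sketch
reusing the read_tsc()/read_hpet() placeholders from the pseudo-code
above; ERROR and MAX_RETRIES are hypothetical tuning parameters):

static int sample_sync_point(u64 *tsc, u64 *hpet)
{
	int retries = MAX_RETRIES;
	u64 start, end;

	do {
		start = read_tsc();
		*hpet = read_hpet();
		end = read_tsc();
		if (end - start <= ERROR) {
			/* midpoint estimate of the TSC at the HPET read */
			*tsc = start + ((end - start) >> 1);
			return 0;
		}
	} while (--retries);

	return -EAGAIN;	/* never got a tight enough sample pair */
}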

> And if you make all these linear interpolations be per-CPU (so you have 
> per-CPU offsets and frequencies) you never _ever_ need to touch any shared 
> data at all, and you know you can scale basically perfectly.
> 
> Your linear interpolations may not be _perfect_, but you'll be able to get 
> them pretty damn near. In fact, even if the TSC's aren't synchronized at 
> all, if they are at least _individually_ stable (just running at slightly 
> different frequencies because they are in different clock domains, and/or 
> at different start points), you can basically perfect the precision over 
> time.
> 

Hrm, you seem to assume the CPU frequency is nearly constant, but I
suspect AMD systems with idle cores will actually make the overall
frequency jump drastically between high and low CPU load. Therefore, I
don't even think we can really consider those to be individually stable.

The other problem with interpolation is when a CPU starts to
accelerate. In the sched_clock code, what is currently done is to cap
the TSC value to a max within the time window between two HPET reads,
so it does not appear to go backwards when the next HPET read occurs.
Such a case would clearly mess up the event synchronization in a
noticeable way.
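
The capping can be illustrated like this (hypothetical names, not the
actual sched_clock code):

/*
 * Interpolate from the local TSC between two HPET samples, but never
 * return a value past the end of the current window, so a TSC that
 * accelerates cannot make time go backwards at the next HPET resync.
 */
static u64 interpolate_ns(u64 tsc, u64 win_tsc, u64 win_ns,
			  u64 win_end_ns, u32 mult, u32 shift)
{
	u64 ns = win_ns + (((tsc - win_tsc) * mult) >> shift);

	return ns < win_end_ns ? ns : win_end_ns;	/* cap */
}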

Mathieu

> 			Linus

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-18 17:50                         ` Ingo Molnar
@ 2008-10-22 16:19                           ` Mathieu Desnoyers
  0 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-22 16:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Luck, Tony, Steven Rostedt, Andrew Morton,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin, ltt-dev,
	Michael Davidson

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > And if you make all these linear interpolations be per-CPU (so you 
> > have per-CPU offsets and frequencies) you never _ever_ need to touch 
> > any shared data at all, and you know you can scale basically 
> > perfectly.
> > 
> > Your linear interpolations may not be _perfect_, but you'll be able to 
> > get them pretty damn near. In fact, even if the TSC's aren't 
> > synchronized at all, if they are at least _individually_ stable (just 
> > running at slightly different frequencies because they are in 
> > different clock domains, and/or at different start points), you can 
> > basically perfect the precision over time.
> 
> there's been code submitted by Michael Davidson recently that looked 
> interesting, which turns the TSC into such an entity:
> 
>     http://lkml.org/lkml/2008/9/25/451
> 
> The periodic synchronization uses the HPET, but it thus allows lockless 
> and globally correct readouts of the TSC.
> 
> And that would match the long term goal as well: the hw should do this 
> all automatically. So perhaps we should have a trace_clock() after all, 
> independent of sched_clock(), and derived straight from RDTSC.
> 
> The approach as proposed has a couple of practical problems, but if we 
> could be one RDTSC+multiplication away from a pretty good timestamp that 
> would be rather useful, very fast and very robust ...
> 
> 	Ingo

Looking at this code, I wonder :

- How it would support virtualization.
- How it would scale to 512 nodes, if we consider that every idle node
  is doing an HPET readl each time it exits from safe_halt() (this can
  end up taking most of the HPET timer bandwidth). So in the case where
  we have 256 idle nodes taking all the HPET timer bandwidth and 256
  nodes doing useful work, the time these HPET reads can take on the
  useful nodes when they try to resync with the HPET could be long (they
  may need to sample it periodically or at CPU frequency changes, or they
  may simply go idle once in a while). We might end up having difficulty
  getting a CPU out of idle due to the time it takes simply to get hold
  of the HPET.

Given the bad scalability numbers I've recently posted for the HPET, I
doubt this is a workable solution to the scalability issue.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-20 18:07                       ` Luck, Tony
@ 2008-10-22 16:51                         ` Mathieu Desnoyers
  0 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-22 16:51 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Steven Rostedt, Linus Torvalds, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin, ltt-dev

* Luck, Tony (tony.luck@intel.com) wrote:
> > And what do we say when we detect this ? "sorry, please upgrade your
> > hardware to get a reliable trace" ? ;)
> 
> My employer might be happy with that answer ;-) ... but I think
> we could tell the user to:
> 
>         1) adjust something in /sys/...
>         2) boot with some special option
>         3) rebuild kernel with CONFIG_INSANE_TSC=y
> 
> to switch over to a heavyweight workaround in s/w.  Systems
> that require this are already in the minority ... and I
> think (hope!) that current and future generations of cpus
> won't have these challenges.
> 
> So this is mostly a campaign for the default code path to
> be based on current (sane) TSC behaviour ... with the workarounds
> for past problems kept to one side.
> 

This is exactly what I do in this patchset actually :) The common case,
when a synchronized TSC is detected, is to do a plain TSC read. However,
if a non-synchronized TSC is detected, a warning message is written to
the console (pointing to some documentation to get precise timestamping)
and the heavyweight cmpxchg-based workaround is enabled.


> > Nope, this is not required. I removed the heartbeat event from LTTng two
> > weeks ago, implementing detection of the delta from the last timestamp
> > written into the trace. If we detect that the new timestamp is too far
> > from the previous one, we write the full 64 bits TSC in an extended
> > event header. Therefore, we have no dependency on interrupt latency to
> > get a sane time-base.
> 
> Neat.  Could you grab the HPET value here too?
> 

Yes, I could. When I detect that the TSC value is too far apart from the
previous one, I reserve extra space for the header (this could include
an extra 64 bits for the HPET). At that moment, I could also sample the
HPET, given this happens relatively rarely.

Given the frequency is expected to be about 1GHz, the 27 bits would
overflow 7-8 times per second. The only thing is that I only need this
extended field when there are absolutely no events in the stream for a
whole overflow period, which is the only case that makes the overflow
impossible to detect without having more TSC bits. In the common case,
where there is a steady flow of events, we would never emit such a
"large TSC header" event.

However, I could do something slightly different from the large TSC
header detection. I could make sure an HPET sample is taken
"periodically", or at least periodically when there are events saved to
the buffer, by recording, for each buffer, the last TSC value at which
the HPET was sampled. When we log subsequent events, we take an HPET
sample (and write an extended event header) if we are too far apart
from the previous sample.

We would probably need to sample the HPET at subbuffer switch too to
allow fast time-based seek on the trace when we read it.

We could then do a pre-processing pass on the trace buffers which would
calculate the linear interpolation of cycle counters between the
per-buffer HPET values. The nice thing is that we know the _next_ value
coming _after_ an event (which is not the case for a standard kernel
time base), so we can be a bit more precise, and we do not suffer from
things like "the TSC of a given CPU accelerates and time appears to go
backwards when the next HPET sample is taken".

But I am not sure this would be sufficient to ensure generally correct
event order; the maximum interpolation error can become quite large on
systems with different clock speeds in halt states which happen to have
a non-steady flow of events.
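
A rough sketch of the decision described above, with all names and
constants hypothetical:

#define TSC_HEADER_BITS	27
#define TSC_HEADER_MASK	((1ULL << TSC_HEADER_BITS) - 1)

struct ltt_buf_state {
	u64 last_event_tsc;	/* TSC of the last event written */
	u64 last_hpet_tsc;	/* TSC when the HPET was last sampled */
};

static bool event_needs_extended_header(struct ltt_buf_state *buf,
					u64 tsc, u64 hpet_sample_period)
{
	/* gap too large for a 27-bit delta to be unambiguous? */
	if (tsc - buf->last_event_tsc > (TSC_HEADER_MASK >> 1))
		return true;
	/* time to take a new HPET sample for this buffer? */
	return tsc - buf->last_hpet_tsc > hpet_sample_period;
}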

> 
> > (8 cores up)
> 
> Interesting results.  I'm not at all sure why HPET scales so badly.
> Maybe some h/w throttling/synchronizing going on???
> 

As Linus said, there is probably some contention at the IO hub level.
But it implies that we have to be careful about the frequency at which
we sample the HPET, otherwise it wouldn't scale.

Mathieu

> -Tony
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-20 20:10               ` Linus Torvalds
  2008-10-20 21:38                 ` john stultz
@ 2008-10-22 17:05                 ` Mathieu Desnoyers
  1 sibling, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-22 17:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luck, Tony, Steven Rostedt, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar, H. Peter Anvin

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Fri, 17 Oct 2008, Mathieu Desnoyers wrote:
> > 
> > Hrm, on such systems
> > - *large* amount of cpus
> > - no synchronized TSCs
> > 
> > What would be the best approach to order events ?
> 
> My strong opinion has been - for a longish while now, and independently of 
> any timestamping code - that we should be seriously looking at basically 
> doing essentially a "ntp" inside the kernel to give up the whole idiotic 
> notion of "synchronized TSCs". Yes, TSC's are often synchronized, but even 
> when they are, we might as well _think_ of them as not being so.
> 
> In other words, instead of expecting internal clocks to be synchronized, 
> just make the clock be a clock network of independent TSC domains. The 
> domains could in theory be per-package (assuming TSC is synchronized at 
> that level), but even if we _could_ do that, we'd probably still be better 
> off by simply always doing it per-core. If only because then the reading 
> would be per-core.
> 
> I think it's a mistake for us to maintain a single clock for 
> gettimeofday() (well, "getnstimeofday" and the whole "clocksource_read()" 
> crud to be technically correct). And sure, I bet clocksource_read() can do 
> various per-CPU things and try to do that, but it's complex and pretty 
> generic code, and as far as I know none of the clocksources have even 
> tried. The TSC clocksource read certainly does not (it just does a very 
> similar horrible "at least don't go backwards" crud that the LTTng patch 
> suggested).
> 
> So I think we should make "xtime" be a per-CPU thing, and add support for 
> per-CPU clocksources. And screw that insane "mark_tsc_unstable()" thing.
> 
> And if we did it well, we migth be able to get good timestamps that way 
> too.
> 
> 		Linus

Yep, it looks like a promising area to look into. I think, however,
that it would be good to first experiment with it as an in-kernel time
source rather than as a tracing time source, so we can use a tracer to
make sure it is stable enough. :-)

Also, we have to wonder if it's worth stalling tracing development
on what I consider to be a "special case for buggy hardware". If we let
development on this specific problem go on its own at the kernel level
and decide to use it for tracing when it's judged good enough, we
(tracing people) can focus on the next steps needed to get a tracer
into Linux, namely buffering, event ID management, etc. Given that I
feel the need for tracing is relatively urgent for the community, I'd
recommend getting a basic, non-perfect timestamping solution in first,
and keeping room for improvement.

I prefer to provide tracing for 98% of the machines out there and point
to some documentation explaining how to configure the other 1.95% (and
feel sorry for the people who fall in the inevitable 0.05%) than to
spend years trying to come up with a complex scheme aiming precisely at
this 1.95%.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-21 18:10                       ` Bjorn Helgaas
@ 2008-10-23 15:47                         ` Linus Torvalds
  2008-10-23 16:39                           ` H. Peter Anvin
  2008-10-23 21:54                           ` Paul Mackerras
  0 siblings, 2 replies; 66+ messages in thread
From: Linus Torvalds @ 2008-10-23 15:47 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: H. Peter Anvin, john stultz, Mathieu Desnoyers, Luck, Tony,
	Steven Rostedt, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Peter Zijlstra, Thomas Gleixner, David Miller,
	Ingo Molnar



On Tue, 21 Oct 2008, Bjorn Helgaas wrote:

> On Monday 20 October 2008 04:29:07 pm H. Peter Anvin wrote:
> > Linus Torvalds wrote:
> > > 
> > > But it's not going to solve the "hey, I have 512 CPU's, they are all on 
> > > different boards, and no, they are _not_ synchronized to one global 
> > > clock!".
> > > 
> > > That's why I'd suggest making _purely_ local time, and then aiming for 
> > > something NTP-like. But maybe there are better solutions out there.
> > 
> > At the same time, it would definitely be nice to encourage vendors of 
> > large SMP systems to provide a common root crystal (frequency standard) 
> > for a single SMP domain.  Preferrably a really good one, TCXO or better.
> 
> A single root crystal is nice for us software guys.  But it often
> also turns into a single point of failure, which the hardware guys
> are always trying to eliminate.  So I think multiple crystals are
> inevitable for the really large machines.

They are almost inevitable for another reason too: the interconnect 
seldom has a concept of "clock signal" other than for the signalling 
itself, and the signal clock is designed for the signal itself, for 
signal integrity rather than for a "stable clock".

Does _any_ common interconnect have integral support for clock 
distribution?

And no, nobody is going to add another clock network for just clock 
distribution. 

So even ignoring redundancy issues, and the fact that people want to 
hot-plug things (and yes, that would make a central clock interesting), I 
doubt any hw manufacturer really looks at it the way we do.

The best we could hope for is some hardware assist for helping distribute 
a common clock. Ie not a real single crystal, but having time-packets in 
the interconnect that are used to synchronize nodes whenever there is 
communication between them. It's hard to do that in software, because the 
overhead is fairly high, but if hardware does at least some of it you 
could probably get a damn fine distributed clock source.

But I don't know if any hw people are worried enough about it to do it...

		Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-23 15:47                         ` Linus Torvalds
@ 2008-10-23 16:39                           ` H. Peter Anvin
  2008-10-23 21:54                           ` Paul Mackerras
  1 sibling, 0 replies; 66+ messages in thread
From: H. Peter Anvin @ 2008-10-23 16:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bjorn Helgaas, john stultz, Mathieu Desnoyers, Luck, Tony,
	Steven Rostedt, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Peter Zijlstra, Thomas Gleixner, David Miller,
	Ingo Molnar

Linus Torvalds wrote:
> 
> They are almost inevitable for another reason too: the interconnect seldom 
> has a concept of "clock signal" other than for the signalling itself, and 
> the signal clock is designed for the signal itself and is designed for 
> signal integrity rather than "stable clock".
> 
> Does _any_ common interconnect have integral support for clock 
> distribution?
> 

How do you mean "integral"?  All it really would end up being would be a 
separate wire anyway carrying the 14.318 MHz clock, so the only way it 
would ever be "integral" is as part of the slot connector definition.

Note that this is a *frequency standard*.  This is a much simpler task 
than distributing a clock that has to be in phase with a bunch of data 
wires.

> And no, nobody is going to add another clock network for just clock 
> distribution. 

I have definitely seen machines with a common backplane clock source. 
That does not mean that it is common.  I can certainly see the 
redundancy issues being a big deal there.

> So even ignoring redundancy issues, and the fact that people want to 
> hot-plug things (and yes, that would make a central clock interesting), I 
> doubt any hw manufacturer really looks at it the way we do.

Hotplugging isn't so bad (the clock source is tiny, and goes on the 
backplane).  Redundancy is harder, but not impossible -- even cheap 
TCXOs can usually operate either self-running or as slaves to an 
incoming signal.  The hard part is to avoid a partition on failure, 
since in that case you have to handle the unsynchronized case correctly 
anyway.

> The best we could hope for is some hardware assist for helping distribute 
> a common clock. Ie not a real single crystal, but having time-packets in 
> the interconnect that are used to synchronize nodes whenever there is 
> communication between them. It's hard to do that in software, because the 
> overhead is fairly high, but if hardware does at least some of it you 
> could probably get a damn fine distributed clock source.
> 
> But I don't know if any hw people are worried enough about it to do it...

Most likely not, which means that any solutions we as software guys 
propose are probably pointless.

	-hpa


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 15/15] LTTng timestamp x86
  2008-10-23 15:47                         ` Linus Torvalds
  2008-10-23 16:39                           ` H. Peter Anvin
@ 2008-10-23 21:54                           ` Paul Mackerras
  1 sibling, 0 replies; 66+ messages in thread
From: Paul Mackerras @ 2008-10-23 21:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bjorn Helgaas, H. Peter Anvin, john stultz, Mathieu Desnoyers,
	Luck, Tony, Steven Rostedt, Andrew Morton, Ingo Molnar,
	linux-kernel, linux-arch, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar

Linus Torvalds writes:

> They are almost inevitable for another reason too: the interconnect seldom 
> has a concept of "clock signal" other than for the signalling itself, and 
> the signal clock is designed for the signal itself and is designed for 
> signal integrity rather than "stable clock".
> 
> Does _any_ common interconnect have integral support for clock 
> distribution?

I realize you're asking about x86, but just for interest, this is what
POWER6 does to give us a timebase register that increments at 512MHz
and is synchronized across the machine (i.e. sufficiently well
synchronized that the difference between the timebases on any two
cores is less than the time taken for them to synchronize via a memory
location).

The hardware distributes a 32MHz clock pulse to all nodes, which
increments the upper 60 bits of the timebase and clears the bottom 4
bits.  The bottom 4 bits are then incremented at a rate that is the
processor clock speed divided by some number N set by the hypervisor.
The bottom 4 bits also stop incrementing once they reach 0xf (16 x
32MHz = 512MHz, so the bottom 4 bits have just enough room to count
between two pulses).

This seems to work pretty well in practice and avoids the need for
hardware to distribute a synchronous 512MHz clock everywhere.

Paul.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 05/15] get_cycles() : MIPS HAVE_GET_CYCLES_32
  2008-10-16 23:27 ` [RFC patch 05/15] get_cycles() : MIPS HAVE_GET_CYCLES_32 Mathieu Desnoyers
@ 2008-10-26 10:18   ` Ralf Baechle
  2008-10-26 20:39     ` Mathieu Desnoyers
  0 siblings, 1 reply; 66+ messages in thread
From: Ralf Baechle @ 2008-10-26 10:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar

On Thu, Oct 16, 2008 at 07:27:34PM -0400, Mathieu Desnoyers wrote:

> partly reverts commit efb9ca08b5a2374b29938cdcab417ce4feb14b54. Selects
> HAVE_GET_CYCLES_32 only on CPUs where it is safe to use it.
> 
> Currently consider the "_WORKAROUND" cases for 4000 and 4400 to be unsafe, but
> should probably add other sub-architecture to the blacklist.
> 
> Do not define HAVE_GET_CYCLES because MIPS does not provide 64-bit tsc (only
> 32-bits).

[...]

> Index: linux-2.6-lttng/include/asm-mips/timex.h
> ===================================================================
> --- linux-2.6-lttng.orig/include/asm-mips/timex.h	2008-10-16 12:25:47.000000000 -0400
> +++ linux-2.6-lttng/include/asm-mips/timex.h	2008-10-16 12:34:18.000000000 -0400
> @@ -29,14 +29,39 @@
>   * which isn't an evil thing.
>   *
>   * We know that all SMP capable CPUs have cycle counters.
> + *
> + * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> + * HAVE_GET_CYCLES makes sure that this case is handled properly :
> + *
> + * Ralf Baechle <ralf@linux-mips.org> :
> + * This avoids us executing an mfc0 c0_count instruction on processors which
> + * don't have but also on certain R4000 and R4400 versions where reading from
> + * the count register just in the very moment when its value equals c0_compare
> + * will result in the timer interrupt getting lost.
>   */

The usual workaround for this processor bug is to check whether the
values of the c0_count and c0_compare registers are close.  Close, not
identical, to allow for the time skew in the pipeline.  If they are
close, then execute the timer interrupt handler.  See also the
R4000/R4400 errata.
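
In code, that workaround might look something like the following sketch
(read_c0_count()/read_c0_compare() are the usual mipsregs.h accessors;
the slack value and the dispatch helper are hypothetical):

static inline u32 read_count_r4k_war(void)
{
	u32 count = read_c0_count();
	u32 compare = read_c0_compare();

	/*
	 * If count just reached or passed compare, the timer interrupt
	 * may have been swallowed by this read: run the timer handler
	 * by hand.  A few cycles of slack absorb the pipeline skew.
	 */
	if (count - compare < 8)
		handle_swallowed_timer_irq();	/* hypothetical helper */

	return count;
}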

  Ralf

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC patch 05/15] get_cycles() : MIPS HAVE_GET_CYCLES_32
  2008-10-26 10:18   ` Ralf Baechle
@ 2008-10-26 20:39     ` Mathieu Desnoyers
  0 siblings, 0 replies; 66+ messages in thread
From: Mathieu Desnoyers @ 2008-10-26 20:39 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel,
	linux-arch, Steven Rostedt, Peter Zijlstra, Thomas Gleixner,
	David Miller, Ingo Molnar

* Ralf Baechle (ralf@linux-mips.org) wrote:
> On Thu, Oct 16, 2008 at 07:27:34PM -0400, Mathieu Desnoyers wrote:
> 
> > partly reverts commit efb9ca08b5a2374b29938cdcab417ce4feb14b54. Selects
> > HAVE_GET_CYCLES_32 only on CPUs where it is safe to use it.
> > 
> > Currently consider the "_WORKAROUND" cases for 4000 and 4400 to be unsafe, but
> > should probably add other sub-architecture to the blacklist.
> > 
> > Do not define HAVE_GET_CYCLES because MIPS does not provide 64-bit tsc (only
> > 32-bits).
> 
> [...]
> 
> > Index: linux-2.6-lttng/include/asm-mips/timex.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/include/asm-mips/timex.h	2008-10-16 12:25:47.000000000 -0400
> > +++ linux-2.6-lttng/include/asm-mips/timex.h	2008-10-16 12:34:18.000000000 -0400
> > @@ -29,14 +29,39 @@
> >   * which isn't an evil thing.
> >   *
> >   * We know that all SMP capable CPUs have cycle counters.
> > + *
> > + * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > + * HAVE_GET_CYCLES makes sure that this case is handled properly :
> > + *
> > + * Ralf Baechle <ralf@linux-mips.org> :
> > + * This avoids us executing an mfc0 c0_count instruction on processors which
> > + * don't have but also on certain R4000 and R4400 versions where reading from
> > + * the count register just in the very moment when its value equals c0_compare
> > + * will result in the timer interrupt getting lost.
> >   */
> 
> The usual workaround for this processor bug is to check if the value of
> the c0_count and c0_compare registers are close.  Clone, not identical to
> allow for the time skew in the pipeline.  If they are close, then
> execute the timer interrupt handler.  See also the R4000/R4400 errata.
> 
>   Ralf

Hrm, this sounds fragile. What happens if the instrumentation is added
in an irq-disabled code path? We would be executing the timer interrupt
handler at this site synchronously when it should be deferred until the
irq-off section is done. That would therefore behave like an interrupt
occurring in an irq-off section.

OTOH, if we detect that get_cycles() is called with irqs off and decide
not to execute the irq handler, then the handler that would be expected
to run when the irq-off section ends will simply be discarded and a
timer interrupt will be lost.

Or is there some way to raise the interrupt line from the CPU so it
behaves like a standard asynchronous interrupt ?

The other way around would be to add a check in local_irq_enable() to
verify that such work is pending, but it might have a significant
performance impact.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2008-10-26 20:39 UTC | newest]

Thread overview: 66+ messages
2008-10-16 23:27 [RFC patch 00/15] Tracer Timestamping Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 01/15] get_cycles() : kconfig HAVE_GET_CYCLES Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 02/15] get_cycles() : x86 HAVE_GET_CYCLES Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 03/15] get_cycles() : sparc64 HAVE_GET_CYCLES Mathieu Desnoyers
2008-10-17  2:48   ` [RFC patch 03/15] get_cycles() : sparc64 HAVE_GET_CYCLES (update) Mathieu Desnoyers
2008-10-17  2:57     ` David Miller
2008-10-16 23:27 ` [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES Mathieu Desnoyers
2008-10-17  0:26   ` Paul Mackerras
2008-10-17  0:43     ` [RFC patch 04/15] get_cycles() : powerpc64 HAVE_GET_CYCLES (update) Mathieu Desnoyers
2008-10-17  0:54       ` Paul Mackerras
2008-10-17  1:42       ` David Miller
2008-10-17  2:08         ` Mathieu Desnoyers
2008-10-17  2:33           ` David Miller
2008-10-16 23:27 ` [RFC patch 05/15] get_cycles() : MIPS HAVE_GET_CYCLES_32 Mathieu Desnoyers
2008-10-26 10:18   ` Ralf Baechle
2008-10-26 20:39     ` Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 06/15] LTTng build Mathieu Desnoyers
2008-10-17  8:10   ` KOSAKI Motohiro
2008-10-17 16:18     ` Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 07/15] LTTng timestamp Mathieu Desnoyers
2008-10-17  8:15   ` KOSAKI Motohiro
2008-10-17 16:23     ` Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 08/15] LTTng - Timestamping Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 09/15] LTTng mips export hpt frequency Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 10/15] LTTng timestamp mips Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 11/15] LTTng timestamp powerpc Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 12/15] LTTng timestamp sparc64 Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 13/15] LTTng timestamp sh Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 14/15] LTTng - TSC synchronicity test Mathieu Desnoyers
2008-10-16 23:27 ` [RFC patch 15/15] LTTng timestamp x86 Mathieu Desnoyers
2008-10-17  0:08   ` Linus Torvalds
2008-10-17  0:12     ` Linus Torvalds
2008-10-17  1:28     ` Mathieu Desnoyers
2008-10-17  2:19       ` Luck, Tony
2008-10-17 17:25         ` Steven Rostedt
2008-10-17 18:08           ` Luck, Tony
2008-10-17 18:42             ` Mathieu Desnoyers
2008-10-17 18:58               ` Luck, Tony
2008-10-17 20:23                 ` Mathieu Desnoyers
2008-10-17 23:52                   ` Luck, Tony
2008-10-18 17:01                     ` Mathieu Desnoyers
2008-10-18 17:35                       ` Linus Torvalds
2008-10-18 17:50                         ` Ingo Molnar
2008-10-22 16:19                           ` Mathieu Desnoyers
2008-10-22 15:53                         ` Mathieu Desnoyers
2008-10-20 18:07                       ` Luck, Tony
2008-10-22 16:51                         ` Mathieu Desnoyers
2008-10-17 19:17               ` Steven Rostedt
2008-10-20 20:10               ` Linus Torvalds
2008-10-20 21:38                 ` john stultz
2008-10-20 22:06                   ` Linus Torvalds
2008-10-20 22:17                     ` Ingo Molnar
2008-10-20 22:29                     ` H. Peter Anvin
2008-10-21 18:10                       ` Bjorn Helgaas
2008-10-23 15:47                         ` Linus Torvalds
2008-10-23 16:39                           ` H. Peter Anvin
2008-10-23 21:54                           ` Paul Mackerras
2008-10-20 23:47                     ` john stultz
2008-10-22 17:05                 ` Mathieu Desnoyers
2008-10-17 19:36         ` Christoph Lameter
2008-10-17  7:59 ` [RFC patch 00/15] Tracer Timestamping Peter Zijlstra
2008-10-20 20:25   ` Mathieu Desnoyers
2008-10-21  0:20     ` Nicolas Pitre
2008-10-21  1:32       ` Mathieu Desnoyers
2008-10-21  2:32         ` Nicolas Pitre
2008-10-21  4:05           ` Mathieu Desnoyers
