All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 0/9] LTTng core kernel instrumentation
@ 2009-03-24 15:56 Mathieu Desnoyers
  2009-03-24 15:56 ` [patch 1/9] IRQ handle prepare for instrumentation Mathieu Desnoyers
                   ` (8 more replies)
  0 siblings, 9 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev

Hi,

This is the core of the Linux kernel LTTng instrumentation. I left out what I
considered that would be eventually covered by a wide syscall instrumentation
and I also picked the most important pieces of instrumentation. The LTTng tree
contains more tracepoints, some of which (e.g. mm instrumentation) will be
proposed later once we agree on the tracepoints proposed in this patchset.

Ingo, the irq and softirq instrumentation patches will probably clash with the
patches from Jason you currently have in your tree. Please see the individual
changelogs for details.

This patchset applies over 2.6.29 in the following order :

irq-handle-prepare-for-instrumentation.patch
lttng-instrumentation-irq.patch
lttng-instrumentation-tasklets.patch
lttng-instrumentation-softirq.patch
lttng-instrumentation-scheduler-fixup-task-migration.patch
lttng-instrumentation-timer.patch
lttng-instrumentation-kernel.patch
lttng-instrumentation-filemap.patch
lttng-instrumentation-swap.patch

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 1/9] IRQ handle prepare for instrumentation
  2009-03-24 15:56 [patch 0/9] LTTng core kernel instrumentation Mathieu Desnoyers
@ 2009-03-24 15:56 ` Mathieu Desnoyers
  2009-03-24 15:56 ` [patch 2/9] LTTng instrumentation - irq Mathieu Desnoyers
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev; +Cc: Mathieu Desnoyers

[-- Attachment #1: irq-handle-prepare-for-instrumentation.patch --]
[-- Type: text/plain, Size: 2348 bytes --]

Rename handle_IRQ_event to _handle_IRQ_event and declare it static so it can be
embedded in its caller. Create a handle_IRQ_event wrapper (non-static) to still
show the symbol to arch-specific code still using it.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 kernel/irq/handle.c |   25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

Index: linux-2.6-lttng/kernel/irq/handle.c
===================================================================
--- linux-2.6-lttng.orig/kernel/irq/handle.c	2009-01-30 09:47:59.000000000 -0500
+++ linux-2.6-lttng/kernel/irq/handle.c	2009-01-30 10:38:14.000000000 -0500
@@ -316,14 +316,7 @@ irqreturn_t no_action(int cpl, void *dev
 	return IRQ_NONE;
 }
 
-/**
- * handle_IRQ_event - irq action chain handler
- * @irq:	the interrupt number
- * @action:	the interrupt action chain for this irq
- *
- * Handles the action chain of an irq event
- */
-irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
+static irqreturn_t _handle_IRQ_event(unsigned int irq, struct irqaction *action)
 {
 	irqreturn_t ret, retval = IRQ_NONE;
 	unsigned int status = 0;
@@ -346,6 +339,18 @@ irqreturn_t handle_IRQ_event(unsigned in
 	return retval;
 }
 
+/**
+ * handle_IRQ_event - irq action chain handler
+ * @irq:	the interrupt number
+ * @action:	the interrupt action chain for this irq
+ *
+ * Handles the action chain of an irq event
+ */
+irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
+{
+	return _handle_IRQ_event(irq, action);
+}
+
 #ifndef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ
 /**
  * __do_IRQ - original all in one highlevel IRQ handler
@@ -378,7 +383,7 @@ unsigned int __do_IRQ(unsigned int irq)
 			desc = irq_remap_to_desc(irq, desc);
 		}
 		if (likely(!(desc->status & IRQ_DISABLED))) {
-			action_ret = handle_IRQ_event(irq, desc->action);
+			action_ret = _handle_IRQ_event(irq, desc->action);
 			if (!noirqdebug)
 				note_interrupt(irq, desc, action_ret);
 		}
@@ -434,7 +439,7 @@ unsigned int __do_IRQ(unsigned int irq)
 
 		spin_unlock(&desc->lock);
 
-		action_ret = handle_IRQ_event(irq, action);
+		action_ret = _handle_IRQ_event(irq, action);
 		if (!noirqdebug)
 			note_interrupt(irq, desc, action_ret);
 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 2/9] LTTng instrumentation - irq
  2009-03-24 15:56 [patch 0/9] LTTng core kernel instrumentation Mathieu Desnoyers
  2009-03-24 15:56 ` [patch 1/9] IRQ handle prepare for instrumentation Mathieu Desnoyers
@ 2009-03-24 15:56 ` Mathieu Desnoyers
  2009-03-24 17:33   ` Jason Baron
                     ` (2 more replies)
  2009-03-24 15:56 ` [patch 3/9] LTTng instrumentation tasklets Mathieu Desnoyers
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Jason Baron,
	Peter Zijlstra, Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-irq.patch --]
[-- Type: text/plain, Size: 6582 bytes --]

Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.

It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in code scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.

irq_entry and irq_exit not declared static because they appear in x86 arch code.

The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.

The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.

I propose we save the "action" in the irq_entry, and use the irq exit "retval"
to know the return value of the last interrupt handler. So in common cases where
only one interrupt handler is connected to an interrupt line, we only have 2
events. Then we also add a irq_next_handler, which saves the previous interrupt
handler return value and the next handler action when there are more than 1
handler called. That would generate the minimum amount of traffic to save all
the information both Jason and I need.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Jason Baron <jbaron@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Russell King <rmk+lkml@arm.linux.org.uk>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/trace/irq.h |   22 ++++++++++++++++++++++
 kernel/irq/handle.c |   45 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 59 insertions(+), 8 deletions(-)

Index: linux-2.6-lttng/kernel/irq/handle.c
===================================================================
--- linux-2.6-lttng.orig/kernel/irq/handle.c	2009-03-24 10:10:36.000000000 -0400
+++ linux-2.6-lttng/kernel/irq/handle.c	2009-03-24 11:35:55.000000000 -0400
@@ -17,10 +17,18 @@
 #include <linux/kernel_stat.h>
 #include <linux/rculist.h>
 #include <linux/hash.h>
+#include <trace/irq.h>
 
 #include "internals.h"
 
 /*
+ * Also used in architecture-specific code.
+ */
+DEFINE_TRACE(irq_entry);
+DEFINE_TRACE(irq_exit);
+DEFINE_TRACE(irq_next_handler);
+
+/*
  * lockdep: we want to handle all irq_desc locks as a single lock-class:
  */
 struct lock_class_key irq_desc_lock_class;
@@ -316,6 +324,19 @@ irqreturn_t no_action(int cpl, void *dev
 	return IRQ_NONE;
 }
 
+static irqreturn_t __handle_irq_next_handler(unsigned int irq,
+	struct irqaction **action, irqreturn_t *retval, unsigned int *status)
+{
+	irqreturn_t ret;
+
+	ret = (*action)->handler(irq, (*action)->dev_id);
+	if (ret == IRQ_HANDLED)
+		*status |= (*action)->flags;
+	*retval |= ret;
+	*action = (*action)->next;
+	return ret;
+}
+
 static irqreturn_t _handle_IRQ_event(unsigned int irq, struct irqaction *action)
 {
 	irqreturn_t ret, retval = IRQ_NONE;
@@ -324,13 +345,12 @@ static irqreturn_t _handle_IRQ_event(uns
 	if (!(action->flags & IRQF_DISABLED))
 		local_irq_enable_in_hardirq();
 
-	do {
-		ret = action->handler(irq, action->dev_id);
-		if (ret == IRQ_HANDLED)
-			status |= action->flags;
-		retval |= ret;
-		action = action->next;
-	} while (action);
+	ret = __handle_irq_next_handler(irq, &action, &retval, &status);
+
+	while (action) {
+		trace_irq_next_handler(irq, action, ret);
+		ret = __handle_irq_next_handler(irq, &action, &retval, &status);
+	}
 
 	if (status & IRQF_SAMPLE_RANDOM)
 		add_interrupt_randomness(irq);
@@ -348,7 +368,12 @@ static irqreturn_t _handle_IRQ_event(uns
  */
 irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
 {
-	return _handle_IRQ_event(irq, action);
+	irqreturn_t ret;
+
+	trace_irq_entry(irq, NULL, action);
+	ret = _handle_IRQ_event(irq, action);
+	trace_irq_exit(ret);
+	return ret;
 }
 
 #ifndef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ
@@ -383,7 +408,9 @@ unsigned int __do_IRQ(unsigned int irq)
 			desc = irq_remap_to_desc(irq, desc);
 		}
 		if (likely(!(desc->status & IRQ_DISABLED))) {
+			trace_irq_entry(irq, NULL, desc->action);
 			action_ret = _handle_IRQ_event(irq, desc->action);
+			trace_irq_exit(action_ret);
 			if (!noirqdebug)
 				note_interrupt(irq, desc, action_ret);
 		}
@@ -439,7 +466,9 @@ unsigned int __do_IRQ(unsigned int irq)
 
 		spin_unlock(&desc->lock);
 
+		trace_irq_entry(irq, NULL, action);
 		action_ret = _handle_IRQ_event(irq, action);
+		trace_irq_exit(action_ret);
 		if (!noirqdebug)
 			note_interrupt(irq, desc, action_ret);
 
Index: linux-2.6-lttng/include/trace/irq.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/trace/irq.h	2009-03-24 11:48:58.000000000 -0400
@@ -0,0 +1,22 @@
+#ifndef _TRACE_IRQ_H
+#define _TRACE_IRQ_H
+
+#include <linux/kdebug.h>
+#include <linux/interrupt.h>
+#include <linux/tracepoint.h>
+
+/*
+ * action can be NULL if not available.
+ */
+DECLARE_TRACE(irq_entry,
+	TPPROTO(unsigned int id, struct pt_regs *regs,
+			struct irqaction *action),
+		TPARGS(id, regs, action));
+DECLARE_TRACE(irq_exit,
+	TPPROTO(irqreturn_t retval),
+		TPARGS(retval));
+DECLARE_TRACE(irq_next_handler,
+	TPPROTO(unsigned int id, struct irqaction *action, irqreturn_t ret),
+		TPARGS(id, action, ret));
+
+#endif

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 3/9] LTTng instrumentation tasklets
  2009-03-24 15:56 [patch 0/9] LTTng core kernel instrumentation Mathieu Desnoyers
  2009-03-24 15:56 ` [patch 1/9] IRQ handle prepare for instrumentation Mathieu Desnoyers
  2009-03-24 15:56 ` [patch 2/9] LTTng instrumentation - irq Mathieu Desnoyers
@ 2009-03-24 15:56 ` Mathieu Desnoyers
  2009-03-24 17:56   ` Ingo Molnar
  2009-03-24 15:56 ` [patch 4/9] LTTng instrumentation softirq Mathieu Desnoyers
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Jason Baron,
	Peter Zijlstra, Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-tasklets.patch --]
[-- Type: text/plain, Size: 3072 bytes --]

tasklet entry and exit events.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Jason Baron <jbaron@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Russell King <rmk+lkml@arm.linux.org.uk>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/trace/irq.h |   13 +++++++++++++
 kernel/softirq.c    |   10 ++++++++++
 2 files changed, 23 insertions(+)

Index: linux-2.6-lttng/kernel/softirq.c
===================================================================
--- linux-2.6-lttng.orig/kernel/softirq.c	2009-03-24 11:46:28.000000000 -0400
+++ linux-2.6-lttng/kernel/softirq.c	2009-03-24 11:48:36.000000000 -0400
@@ -23,6 +23,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/tick.h>
+#include <trace/irq.h>
 
 #include <asm/irq.h>
 /*
@@ -52,6 +53,11 @@ static struct softirq_action softirq_vec
 
 static DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
 
+DEFINE_TRACE(irq_tasklet_high_entry);
+DEFINE_TRACE(irq_tasklet_high_exit);
+DEFINE_TRACE(irq_tasklet_low_entry);
+DEFINE_TRACE(irq_tasklet_low_exit);
+
 /*
  * we cannot loop indefinitely here to avoid userspace starvation,
  * but we also don't want to introduce a worst case 1/HZ latency
@@ -380,7 +386,9 @@ static void tasklet_action(struct softir
 			if (!atomic_read(&t->count)) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
 					BUG();
+				trace_irq_tasklet_low_entry(t);
 				t->func(t->data);
+				trace_irq_tasklet_low_exit(t);
 				tasklet_unlock(t);
 				continue;
 			}
@@ -415,7 +423,9 @@ static void tasklet_hi_action(struct sof
 			if (!atomic_read(&t->count)) {
 				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
 					BUG();
+				trace_irq_tasklet_high_entry(t);
 				t->func(t->data);
+				trace_irq_tasklet_high_exit(t);
 				tasklet_unlock(t);
 				continue;
 			}
Index: linux-2.6-lttng/include/trace/irq.h
===================================================================
--- linux-2.6-lttng.orig/include/trace/irq.h	2009-03-24 11:46:29.000000000 -0400
+++ linux-2.6-lttng/include/trace/irq.h	2009-03-24 11:48:36.000000000 -0400
@@ -19,4 +19,17 @@ DECLARE_TRACE(irq_next_handler,
 	TPPROTO(unsigned int id, struct irqaction *action, irqreturn_t ret),
 		TPARGS(id, action, ret));
 
+DECLARE_TRACE(irq_tasklet_low_entry,
+	TPPROTO(struct tasklet_struct *t),
+		TPARGS(t));
+DECLARE_TRACE(irq_tasklet_low_exit,
+	TPPROTO(struct tasklet_struct *t),
+		TPARGS(t));
+DECLARE_TRACE(irq_tasklet_high_entry,
+	TPPROTO(struct tasklet_struct *t),
+		TPARGS(t));
+DECLARE_TRACE(irq_tasklet_high_exit,
+	TPPROTO(struct tasklet_struct *t),
+		TPARGS(t));
+
 #endif

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 4/9] LTTng instrumentation softirq
  2009-03-24 15:56 [patch 0/9] LTTng core kernel instrumentation Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2009-03-24 15:56 ` [patch 3/9] LTTng instrumentation tasklets Mathieu Desnoyers
@ 2009-03-24 15:56 ` Mathieu Desnoyers
  2009-03-24 18:01   ` Ingo Molnar
  2009-03-24 15:56 ` [patch 5/9] LTTng instrumentation scheduler fix task migration Mathieu Desnoyers
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Jason Baron,
	Peter Zijlstra, Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-softirq.patch --]
[-- Type: text/plain, Size: 2711 bytes --]

Compared to Jason's patch, this patch also instruments softirq raise event.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Jason Baron <jbaron@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Russell King <rmk+lkml@arm.linux.org.uk>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/trace/irq.h |   10 ++++++++++
 kernel/softirq.c    |    7 +++++++
 2 files changed, 17 insertions(+)

Index: linux-2.6-lttng/include/trace/irq.h
===================================================================
--- linux-2.6-lttng.orig/include/trace/irq.h	2009-03-24 11:48:36.000000000 -0400
+++ linux-2.6-lttng/include/trace/irq.h	2009-03-24 11:48:56.000000000 -0400
@@ -32,4 +32,14 @@ DECLARE_TRACE(irq_tasklet_high_exit,
 	TPPROTO(struct tasklet_struct *t),
 		TPARGS(t));
 
+DECLARE_TRACE(irq_softirq_entry,
+	TPPROTO(struct softirq_action *h, struct softirq_action *softirq_vec),
+		TPARGS(h, softirq_vec));
+DECLARE_TRACE(irq_softirq_exit,
+	TPPROTO(struct softirq_action *h, struct softirq_action *softirq_vec),
+		TPARGS(h, softirq_vec));
+DECLARE_TRACE(irq_softirq_raise,
+	TPPROTO(unsigned int nr),
+		TPARGS(nr));
+
 #endif
Index: linux-2.6-lttng/kernel/softirq.c
===================================================================
--- linux-2.6-lttng.orig/kernel/softirq.c	2009-03-24 11:48:36.000000000 -0400
+++ linux-2.6-lttng/kernel/softirq.c	2009-03-24 11:48:56.000000000 -0400
@@ -58,6 +58,10 @@ DEFINE_TRACE(irq_tasklet_high_exit);
 DEFINE_TRACE(irq_tasklet_low_entry);
 DEFINE_TRACE(irq_tasklet_low_exit);
 
+DEFINE_TRACE(irq_softirq_entry);
+DEFINE_TRACE(irq_softirq_exit);
+DEFINE_TRACE(irq_softirq_raise);
+
 /*
  * we cannot loop indefinitely here to avoid userspace starvation,
  * but we also don't want to introduce a worst case 1/HZ latency
@@ -201,7 +205,9 @@ restart:
 		if (pending & 1) {
 			int prev_count = preempt_count();
 
+			trace_irq_softirq_entry(h, softirq_vec);
 			h->action(h);
+			trace_irq_softirq_exit(h, softirq_vec);
 
 			if (unlikely(prev_count != preempt_count())) {
 				printk(KERN_ERR "huh, entered softirq %td %p"
@@ -300,6 +306,7 @@ void irq_exit(void)
  */
 inline void raise_softirq_irqoff(unsigned int nr)
 {
+	trace_irq_softirq_raise(nr);
 	__raise_softirq_irqoff(nr);
 
 	/*

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 5/9] LTTng instrumentation scheduler fix task migration
  2009-03-24 15:56 [patch 0/9] LTTng core kernel instrumentation Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2009-03-24 15:56 ` [patch 4/9] LTTng instrumentation softirq Mathieu Desnoyers
@ 2009-03-24 15:56 ` Mathieu Desnoyers
  2009-03-24 17:53   ` Ingo Molnar
  2009-03-24 15:56 ` [patch 6/9] LTTng instrumentation - timer Mathieu Desnoyers
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Jason Baron, Peter Zijlstra

[-- Attachment #1: lttng-instrumentation-scheduler-fixup-task-migration.patch --]
[-- Type: text/plain, Size: 1740 bytes --]

The scheduler migration only needs to pass the task struct pointer and the
destination CPU. The source CPU can be found by using task_cpu(p) in the probe.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Jason Baron <jbaron@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
---
 include/trace/sched.h |    4 ++--
 kernel/sched.c        |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6-lttng/include/trace/sched.h
===================================================================
--- linux-2.6-lttng.orig/include/trace/sched.h	2009-01-30 10:42:51.000000000 -0500
+++ linux-2.6-lttng/include/trace/sched.h	2009-01-30 10:43:19.000000000 -0500
@@ -30,8 +30,8 @@ DECLARE_TRACE(sched_switch,
 		TPARGS(rq, prev, next));
 
 DECLARE_TRACE(sched_migrate_task,
-	TPPROTO(struct task_struct *p, int orig_cpu, int dest_cpu),
-		TPARGS(p, orig_cpu, dest_cpu));
+	TPPROTO(struct task_struct *p, int dest_cpu),
+		TPARGS(p, dest_cpu));
 
 DECLARE_TRACE(sched_process_free,
 	TPPROTO(struct task_struct *p),
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2009-01-30 10:43:07.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c	2009-01-30 10:43:38.000000000 -0500
@@ -1876,7 +1876,7 @@ void set_task_cpu(struct task_struct *p,
 
 	clock_offset = old_rq->clock - new_rq->clock;
 
-	trace_sched_migrate_task(p, task_cpu(p), new_cpu);
+	trace_sched_migrate_task(p, new_cpu);
 
 #ifdef CONFIG_SCHEDSTATS
 	if (p->se.wait_start)

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 6/9] LTTng instrumentation - timer
  2009-03-24 15:56 [patch 0/9] LTTng core kernel instrumentation Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2009-03-24 15:56 ` [patch 5/9] LTTng instrumentation scheduler fix task migration Mathieu Desnoyers
@ 2009-03-24 15:56 ` Mathieu Desnoyers
  2009-03-24 18:21   ` Ingo Molnar
  2009-03-24 15:56 ` [patch 7/9] LTTng instrumentation - kernel Mathieu Desnoyers
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev
  Cc: Mathieu Desnoyers, Frederic Weisbecker, David S. Miller,
	Masami Hiramatsu, Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-timer.patch --]
[-- Type: text/plain, Size: 5020 bytes --]

Instrument timer activity (timer set, expired, current time updates) to keep
information about the "real time" flow within the kernel. It can be used by a
trace analysis tool to synchronize information coming from various sources, e.g.
to merge traces with system logs.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in code scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.

Note : do_setitimer instrumentation could potentially be done with a more
generic system call instrumentation.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/trace/timer.h |   24 ++++++++++++++++++++++++
 kernel/itimer.c       |    8 ++++++++
 kernel/timer.c        |   12 +++++++++++-
 3 files changed, 43 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/kernel/itimer.c
===================================================================
--- linux-2.6-lttng.orig/kernel/itimer.c	2009-03-24 09:30:50.000000000 -0400
+++ linux-2.6-lttng/kernel/itimer.c	2009-03-24 09:31:39.000000000 -0400
@@ -12,9 +12,13 @@
 #include <linux/time.h>
 #include <linux/posix-timers.h>
 #include <linux/hrtimer.h>
+#include <trace/timer.h>
 
 #include <asm/uaccess.h>
 
+DEFINE_TRACE(timer_itimer_expired);
+DEFINE_TRACE(timer_itimer_set);
+
 /**
  * itimer_get_remtime - get remaining time for the timer
  *
@@ -123,6 +127,8 @@ enum hrtimer_restart it_real_fn(struct h
 	struct signal_struct *sig =
 		container_of(timer, struct signal_struct, real_timer);
 
+	trace_timer_itimer_expired(sig);
+
 	kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->leader_pid);
 
 	return HRTIMER_NORESTART;
@@ -148,6 +154,8 @@ int do_setitimer(int which, struct itime
 	    !timeval_valid(&value->it_interval))
 		return -EINVAL;
 
+	trace_timer_itimer_set(which, value);
+
 	switch (which) {
 	case ITIMER_REAL:
 again:
Index: linux-2.6-lttng/kernel/timer.c
===================================================================
--- linux-2.6-lttng.orig/kernel/timer.c	2009-03-24 09:30:50.000000000 -0400
+++ linux-2.6-lttng/kernel/timer.c	2009-03-24 09:31:39.000000000 -0400
@@ -37,17 +37,23 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include <trace/timer.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include <asm/div64.h>
 #include <asm/timex.h>
 #include <asm/io.h>
+#include <asm/irq_regs.h>
 
 u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
 
 EXPORT_SYMBOL(jiffies_64);
 
+DEFINE_TRACE(timer_set);
+DEFINE_TRACE(timer_update_time);
+DEFINE_TRACE(timer_timeout);
+
 /*
  * per-CPU timer vector definitions:
  */
@@ -357,6 +363,7 @@ static void internal_add_timer(struct tv
 		i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
 		vec = base->tv5.vec + i;
 	}
+	trace_timer_set(timer);
 	/*
 	 * Timers are FIFO:
 	 */
@@ -1121,6 +1128,7 @@ void do_timer(unsigned long ticks)
 {
 	jiffies_64 += ticks;
 	update_times(ticks);
+	trace_timer_update_time(&xtime, &wall_to_monotonic);
 }
 
 #ifdef __ARCH_WANT_SYS_ALARM
@@ -1202,7 +1210,9 @@ SYSCALL_DEFINE0(getegid)
 
 static void process_timeout(unsigned long __data)
 {
-	wake_up_process((struct task_struct *)__data);
+	struct task_struct *task = (struct task_struct *)__data;
+	trace_timer_timeout(task);
+	wake_up_process(task);
 }
 
 /**
Index: linux-2.6-lttng/include/trace/timer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/trace/timer.h	2009-03-24 09:31:51.000000000 -0400
@@ -0,0 +1,24 @@
+#ifndef _TRACE_TIMER_H
+#define _TRACE_TIMER_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_TRACE(timer_itimer_expired,
+	TPPROTO(struct signal_struct *sig),
+		TPARGS(sig));
+DECLARE_TRACE(timer_itimer_set,
+	TPPROTO(int which, struct itimerval *value),
+		TPARGS(which, value));
+DECLARE_TRACE(timer_set,
+	TPPROTO(struct timer_list *timer),
+		TPARGS(timer));
+/*
+ * xtime_lock is taken when kernel_timer_update_time tracepoint is reached.
+ */
+DECLARE_TRACE(timer_update_time,
+	TPPROTO(struct timespec *_xtime, struct timespec *_wall_to_monotonic),
+		TPARGS(_xtime, _wall_to_monotonic));
+DECLARE_TRACE(timer_timeout,
+	TPPROTO(struct task_struct *p),
+		TPARGS(p));
+#endif

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 7/9] LTTng instrumentation - kernel
  2009-03-24 15:56 [patch 0/9] LTTng core kernel instrumentation Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2009-03-24 15:56 ` [patch 6/9] LTTng instrumentation - timer Mathieu Desnoyers
@ 2009-03-24 15:56 ` Mathieu Desnoyers
  2009-03-24 18:33   ` Ingo Molnar
  2009-03-24 15:56   ` Mathieu Desnoyers
  2009-03-24 15:56   ` Mathieu Desnoyers
  8 siblings, 1 reply; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev
  Cc: Mathieu Desnoyers, Frederic Weisbecker, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-kernel.patch --]
[-- Type: text/plain, Size: 4832 bytes --]

Instrument the core kernel : module load/free and printk events. It helps the
tracer to keep track of module related events and to export valuable printk
information into the traces.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in code scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/trace/kernel.h |   19 +++++++++++++++++++
 kernel/module.c        |    8 ++++++++
 kernel/printk.c        |    7 +++++++
 3 files changed, 34 insertions(+)

Index: linux-2.6-lttng/kernel/printk.c
===================================================================
--- linux-2.6-lttng.orig/kernel/printk.c	2009-03-24 09:09:52.000000000 -0400
+++ linux-2.6-lttng/kernel/printk.c	2009-03-24 09:31:53.000000000 -0400
@@ -32,6 +32,7 @@
 #include <linux/security.h>
 #include <linux/bootmem.h>
 #include <linux/syscalls.h>
+#include <trace/kernel.h>
 
 #include <asm/uaccess.h>
 
@@ -59,6 +60,7 @@ int console_printk[4] = {
 	MINIMUM_CONSOLE_LOGLEVEL,	/* minimum_console_loglevel */
 	DEFAULT_CONSOLE_LOGLEVEL,	/* default_console_loglevel */
 };
+EXPORT_SYMBOL_GPL(console_printk);
 
 /*
  * Low level drivers may need that to know if they can schedule in
@@ -128,6 +130,9 @@ EXPORT_SYMBOL(console_set_on_cmdline);
 /* Flag: console code may call schedule() */
 static int console_may_schedule;
 
+DEFINE_TRACE(kernel_printk);
+DEFINE_TRACE(kernel_vprintk);
+
 #ifdef CONFIG_PRINTK
 
 static char __log_buf[__LOG_BUF_LEN];
@@ -560,6 +565,7 @@ asmlinkage int printk(const char *fmt, .
 	int r;
 
 	va_start(args, fmt);
+	trace_kernel_printk(_RET_IP_);
 	r = vprintk(fmt, args);
 	va_end(args);
 
@@ -667,6 +673,7 @@ asmlinkage int vprintk(const char *fmt, 
 	printed_len += vscnprintf(printk_buf + printed_len,
 				  sizeof(printk_buf) - printed_len, fmt, args);
 
+	trace_kernel_vprintk(_RET_IP_, printk_buf, printed_len);
 
 	/*
 	 * Copy the output into log_buf.  If the caller didn't provide
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c	2009-03-24 09:09:59.000000000 -0400
+++ linux-2.6-lttng/kernel/module.c	2009-03-24 09:31:53.000000000 -0400
@@ -51,6 +51,7 @@
 #include <linux/tracepoint.h>
 #include <linux/ftrace.h>
 #include <linux/async.h>
+#include <trace/kernel.h>
 
 #if 0
 #define DEBUGP printk
@@ -78,6 +79,9 @@ static BLOCKING_NOTIFIER_HEAD(module_not
 /* Bounds of module allocation, for speeding __module_text_address */
 static unsigned long module_addr_min = -1UL, module_addr_max = 0;
 
+DEFINE_TRACE(kernel_module_load);
+DEFINE_TRACE(kernel_module_free);
+
 int register_module_notifier(struct notifier_block * nb)
 {
 	return blocking_notifier_chain_register(&module_notify_list, nb);
@@ -1445,6 +1449,8 @@ static int __unlink_module(void *_mod)
 /* Free a module, remove from lists, etc (must hold module_mutex). */
 static void free_module(struct module *mod)
 {
+	trace_kernel_module_free(mod);
+
 	/* Delete from various lists */
 	stop_machine(__unlink_module, mod, NULL);
 	remove_notes_attrs(mod);
@@ -2276,6 +2282,8 @@ static noinline struct module *load_modu
 	vfree(hdr);
 
 	stop_machine_destroy();
+
+	trace_kernel_module_load(mod);
 	/* Done! */
 	return mod;
 
Index: linux-2.6-lttng/include/trace/kernel.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/trace/kernel.h	2009-03-24 09:32:03.000000000 -0400
@@ -0,0 +1,19 @@
+#ifndef _TRACE_KERNEL_H
+#define _TRACE_KERNEL_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_TRACE(kernel_printk,
+	TPPROTO(unsigned long retaddr),
+		TPARGS(retaddr));
+DECLARE_TRACE(kernel_vprintk,
+	TPPROTO(unsigned long retaddr, char *buf, int len),
+		TPARGS(retaddr, buf, len));
+DECLARE_TRACE(kernel_module_free,
+	TPPROTO(struct module *mod),
+		TPARGS(mod));
+DECLARE_TRACE(kernel_module_load,
+	TPPROTO(struct module *mod),
+		TPARGS(mod));
+
+#endif

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 8/9] LTTng instrumentation - filemap
  2009-03-24 15:56 [patch 0/9] LTTng core kernel instrumentation Mathieu Desnoyers
@ 2009-03-24 15:56   ` Mathieu Desnoyers
  2009-03-24 15:56 ` [patch 2/9] LTTng instrumentation - irq Mathieu Desnoyers
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev
  Cc: Mathieu Desnoyers, linux-mm, Dave Hansen, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Frederic Weisbecker,
	Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-filemap.patch --]
[-- Type: text/plain, Size: 2817 bytes --]

Instrumentation of waits caused by memory accesses on mmap regions.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in code scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: linux-mm@kvack.org
CC: Dave Hansen <haveblue@us.ibm.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/trace/filemap.h |   13 +++++++++++++
 mm/filemap.c            |    5 +++++
 2 files changed, 18 insertions(+)

Index: linux-2.6-lttng/mm/filemap.c
===================================================================
--- linux-2.6-lttng.orig/mm/filemap.c	2009-03-24 09:09:52.000000000 -0400
+++ linux-2.6-lttng/mm/filemap.c	2009-03-24 09:32:05.000000000 -0400
@@ -34,6 +34,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <trace/filemap.h>
 #include "internal.h"
 
 /*
@@ -43,6 +44,8 @@
 
 #include <asm/mman.h>
 
+DEFINE_TRACE(wait_on_page_start);
+DEFINE_TRACE(wait_on_page_end);
 
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
@@ -558,9 +561,11 @@ void wait_on_page_bit(struct page *page,
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
+	trace_wait_on_page_start(page, bit_nr);
 	if (test_bit(bit_nr, &page->flags))
 		__wait_on_bit(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
+	trace_wait_on_page_end(page, bit_nr);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
Index: linux-2.6-lttng/include/trace/filemap.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/trace/filemap.h	2009-03-24 09:32:13.000000000 -0400
@@ -0,0 +1,13 @@
+#ifndef _TRACE_FILEMAP_H
+#define _TRACE_FILEMAP_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_TRACE(wait_on_page_start,
+	TPPROTO(struct page *page, int bit_nr),
+		TPARGS(page, bit_nr));
+DECLARE_TRACE(wait_on_page_end,
+	TPPROTO(struct page *page, int bit_nr),
+		TPARGS(page, bit_nr));
+
+#endif

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 8/9] LTTng instrumentation - filemap
@ 2009-03-24 15:56   ` Mathieu Desnoyers
  0 siblings, 0 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev
  Cc: Mathieu Desnoyers, linux-mm, Dave Hansen, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Frederic Weisbecker,
	Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-filemap.patch --]
[-- Type: text/plain, Size: 3043 bytes --]

Instrumentation of waits caused by memory accesses on mmap regions.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in code scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: linux-mm@kvack.org
CC: Dave Hansen <haveblue@us.ibm.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/trace/filemap.h |   13 +++++++++++++
 mm/filemap.c            |    5 +++++
 2 files changed, 18 insertions(+)

Index: linux-2.6-lttng/mm/filemap.c
===================================================================
--- linux-2.6-lttng.orig/mm/filemap.c	2009-03-24 09:09:52.000000000 -0400
+++ linux-2.6-lttng/mm/filemap.c	2009-03-24 09:32:05.000000000 -0400
@@ -34,6 +34,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <trace/filemap.h>
 #include "internal.h"
 
 /*
@@ -43,6 +44,8 @@
 
 #include <asm/mman.h>
 
+DEFINE_TRACE(wait_on_page_start);
+DEFINE_TRACE(wait_on_page_end);
 
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
@@ -558,9 +561,11 @@ void wait_on_page_bit(struct page *page,
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
+	trace_wait_on_page_start(page, bit_nr);
 	if (test_bit(bit_nr, &page->flags))
 		__wait_on_bit(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
+	trace_wait_on_page_end(page, bit_nr);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
Index: linux-2.6-lttng/include/trace/filemap.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/trace/filemap.h	2009-03-24 09:32:13.000000000 -0400
@@ -0,0 +1,13 @@
+#ifndef _TRACE_FILEMAP_H
+#define _TRACE_FILEMAP_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_TRACE(wait_on_page_start,
+	TPPROTO(struct page *page, int bit_nr),
+		TPARGS(page, bit_nr));
+DECLARE_TRACE(wait_on_page_end,
+	TPPROTO(struct page *page, int bit_nr),
+		TPARGS(page, bit_nr));
+
+#endif

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 9/9] LTTng instrumentation - swap
  2009-03-24 15:56 [patch 0/9] LTTng core kernel instrumentation Mathieu Desnoyers
@ 2009-03-24 15:56   ` Mathieu Desnoyers
  2009-03-24 15:56 ` [patch 2/9] LTTng instrumentation - irq Mathieu Desnoyers
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev
  Cc: Mathieu Desnoyers, linux-mm, Dave Hansen, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Frederic Weisbecker,
	Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-swap.patch --]
[-- Type: text/plain, Size: 4865 bytes --]

Instrumentation of waits caused by swap activity. Also instrumentation
swapon/swapoff events to keep track of active swap partitions.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in code scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: linux-mm@kvack.org
CC: Dave Hansen <haveblue@us.ibm.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/trace/swap.h |   20 ++++++++++++++++++++
 mm/memory.c          |    4 ++++
 mm/page_io.c         |    4 ++++
 mm/swapfile.c        |    6 ++++++
 4 files changed, 34 insertions(+)

Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c	2009-03-24 09:09:55.000000000 -0400
+++ linux-2.6-lttng/mm/memory.c	2009-03-24 09:32:15.000000000 -0400
@@ -55,6 +55,7 @@
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
+#include <trace/swap.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -64,6 +65,8 @@
 
 #include "internal.h"
 
+DEFINE_TRACE(swap_in);
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 /* use the per-pgdat data instead for discontigmem - mbligh */
 unsigned long max_mapnr;
@@ -2431,6 +2434,7 @@ static int do_swap_page(struct mm_struct
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+		trace_swap_in(page, entry);
 	}
 
 	mark_page_accessed(page);
Index: linux-2.6-lttng/mm/page_io.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_io.c	2009-03-24 09:09:52.000000000 -0400
+++ linux-2.6-lttng/mm/page_io.c	2009-03-24 09:32:15.000000000 -0400
@@ -17,8 +17,11 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <trace/swap.h>
 #include <asm/pgtable.h>
 
+DEFINE_TRACE(swap_out);
+
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
 				struct page *page, bio_end_io_t end_io)
 {
@@ -114,6 +117,7 @@ int swap_writepage(struct page *page, st
 		rw |= (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG);
 	count_vm_event(PSWPOUT);
 	set_page_writeback(page);
+	trace_swap_out(page);
 	unlock_page(page);
 	submit_bio(rw, bio);
 out:
Index: linux-2.6-lttng/mm/swapfile.c
===================================================================
--- linux-2.6-lttng.orig/mm/swapfile.c	2009-03-24 09:09:52.000000000 -0400
+++ linux-2.6-lttng/mm/swapfile.c	2009-03-24 09:32:15.000000000 -0400
@@ -29,12 +29,16 @@
 #include <linux/capability.h>
 #include <linux/syscalls.h>
 #include <linux/memcontrol.h>
+#include <trace/swap.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
 #include <linux/page_cgroup.h>
 
+DEFINE_TRACE(swap_file_open);
+DEFINE_TRACE(swap_file_close);
+
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
 long nr_swap_pages;
@@ -1497,6 +1501,7 @@ SYSCALL_DEFINE1(swapoff, const char __us
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
 	p->flags = 0;
+	trace_swap_file_close(swap_file);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	vfree(swap_map);
@@ -1886,6 +1891,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 	} else {
 		swap_info[prev].next = p - swap_info;
 	}
+	trace_swap_file_open(swap_file, name);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	error = 0;
Index: linux-2.6-lttng/include/trace/swap.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/trace/swap.h	2009-03-24 09:32:26.000000000 -0400
@@ -0,0 +1,20 @@
+#ifndef _TRACE_SWAP_H
+#define _TRACE_SWAP_H
+
+#include <linux/swap.h>
+#include <linux/tracepoint.h>
+
+DECLARE_TRACE(swap_in,
+	TPPROTO(struct page *page, swp_entry_t entry),
+		TPARGS(page, entry));
+DECLARE_TRACE(swap_out,
+	TPPROTO(struct page *page),
+		TPARGS(page));
+DECLARE_TRACE(swap_file_open,
+	TPPROTO(struct file *file, char *filename),
+		TPARGS(file, filename));
+DECLARE_TRACE(swap_file_close,
+	TPPROTO(struct file *file),
+		TPARGS(file));
+
+#endif

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [patch 9/9] LTTng instrumentation - swap
@ 2009-03-24 15:56   ` Mathieu Desnoyers
  0 siblings, 0 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 15:56 UTC (permalink / raw)
  To: akpm, Ingo Molnar, linux-kernel, ltt-dev
  Cc: Mathieu Desnoyers, linux-mm, Dave Hansen, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Frederic Weisbecker,
	Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

[-- Attachment #1: lttng-instrumentation-swap.patch --]
[-- Type: text/plain, Size: 5091 bytes --]

Instrumentation of waits caused by swap activity. Also instrumentation
swapon/swapoff events to keep track of active swap partitions.

Those tracepoints are used by LTTng.

About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in code scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: linux-mm@kvack.org
CC: Dave Hansen <haveblue@us.ibm.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: 'Peter Zijlstra' <peterz@infradead.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: 'Ingo Molnar' <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: 'Hideo AOKI' <haoki@redhat.com>
CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt@goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
---
 include/trace/swap.h |   20 ++++++++++++++++++++
 mm/memory.c          |    4 ++++
 mm/page_io.c         |    4 ++++
 mm/swapfile.c        |    6 ++++++
 4 files changed, 34 insertions(+)

Index: linux-2.6-lttng/mm/memory.c
===================================================================
--- linux-2.6-lttng.orig/mm/memory.c	2009-03-24 09:09:55.000000000 -0400
+++ linux-2.6-lttng/mm/memory.c	2009-03-24 09:32:15.000000000 -0400
@@ -55,6 +55,7 @@
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
+#include <trace/swap.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -64,6 +65,8 @@
 
 #include "internal.h"
 
+DEFINE_TRACE(swap_in);
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 /* use the per-pgdat data instead for discontigmem - mbligh */
 unsigned long max_mapnr;
@@ -2431,6 +2434,7 @@ static int do_swap_page(struct mm_struct
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+		trace_swap_in(page, entry);
 	}
 
 	mark_page_accessed(page);
Index: linux-2.6-lttng/mm/page_io.c
===================================================================
--- linux-2.6-lttng.orig/mm/page_io.c	2009-03-24 09:09:52.000000000 -0400
+++ linux-2.6-lttng/mm/page_io.c	2009-03-24 09:32:15.000000000 -0400
@@ -17,8 +17,11 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <trace/swap.h>
 #include <asm/pgtable.h>
 
+DEFINE_TRACE(swap_out);
+
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
 				struct page *page, bio_end_io_t end_io)
 {
@@ -114,6 +117,7 @@ int swap_writepage(struct page *page, st
 		rw |= (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG);
 	count_vm_event(PSWPOUT);
 	set_page_writeback(page);
+	trace_swap_out(page);
 	unlock_page(page);
 	submit_bio(rw, bio);
 out:
Index: linux-2.6-lttng/mm/swapfile.c
===================================================================
--- linux-2.6-lttng.orig/mm/swapfile.c	2009-03-24 09:09:52.000000000 -0400
+++ linux-2.6-lttng/mm/swapfile.c	2009-03-24 09:32:15.000000000 -0400
@@ -29,12 +29,16 @@
 #include <linux/capability.h>
 #include <linux/syscalls.h>
 #include <linux/memcontrol.h>
+#include <trace/swap.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
 #include <linux/page_cgroup.h>
 
+DEFINE_TRACE(swap_file_open);
+DEFINE_TRACE(swap_file_close);
+
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
 long nr_swap_pages;
@@ -1497,6 +1501,7 @@ SYSCALL_DEFINE1(swapoff, const char __us
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
 	p->flags = 0;
+	trace_swap_file_close(swap_file);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	vfree(swap_map);
@@ -1886,6 +1891,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 	} else {
 		swap_info[prev].next = p - swap_info;
 	}
+	trace_swap_file_open(swap_file, name);
 	spin_unlock(&swap_lock);
 	mutex_unlock(&swapon_mutex);
 	error = 0;
Index: linux-2.6-lttng/include/trace/swap.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/trace/swap.h	2009-03-24 09:32:26.000000000 -0400
@@ -0,0 +1,20 @@
+#ifndef _TRACE_SWAP_H
+#define _TRACE_SWAP_H
+
+#include <linux/swap.h>
+#include <linux/tracepoint.h>
+
+DECLARE_TRACE(swap_in,
+	TPPROTO(struct page *page, swp_entry_t entry),
+		TPARGS(page, entry));
+DECLARE_TRACE(swap_out,
+	TPPROTO(struct page *page),
+		TPARGS(page));
+DECLARE_TRACE(swap_file_open,
+	TPPROTO(struct file *file, char *filename),
+		TPARGS(file, filename));
+DECLARE_TRACE(swap_file_close,
+	TPPROTO(struct file *file),
+		TPARGS(file));
+
+#endif

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 15:56 ` [patch 2/9] LTTng instrumentation - irq Mathieu Desnoyers
@ 2009-03-24 17:33   ` Jason Baron
  2009-03-24 17:50     ` Ingo Molnar
  2009-03-24 19:14   ` Ingo Molnar
  2009-03-27 22:12   ` Thomas Gleixner
  2 siblings, 1 reply; 48+ messages in thread
From: Jason Baron @ 2009-03-24 17:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, ltt-dev, Frederic Weisbecker,
	Peter Zijlstra, Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

On Tue, Mar 24, 2009 at 11:56:27AM -0400, Mathieu Desnoyers wrote:
> Instrumentation of IRQ related events : irq_entry, irq_exit and
> irq_next_handler.
> 
> It allows tracers to perform latency analysis on those various types of
> interrupts and to detect interrupts with max/min/avg duration. It helps
> detecting driver or hardware problems which cause an ISR to take ages to
> execute. It has been shown to be the case with bogus hardware causing an mmio
> read to take a few milliseconds.
> 
> Those tracepoints are used by LTTng.
> 
> About the performance impact of tracepoints (which is comparable to markers),
> even without immediate values optimizations, tests done by Hideo Aoki on ia64
> show no regression. His test case was using hackbench on a kernel where
> scheduler instrumentation (about 5 events in code scheduler code) was added.
> See the "Tracepoints" patch header for performance result detail.
> 
> irq_entry and irq_exit not declared static because they appear in x86 arch code.
> 
> The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
> exit events is to be able to recreate the kernel execution state at a given
> point in time. Knowing which execution context is responsible for a given trace
> event is _very_ valuable in trace data analysis.
> 
> The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
> instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
> provides information about which handler is being called, but does not map
> correctly to the fact that _multiple_ handlers are being called from within the
> same interrupt handler. From an interrupt latency analysis POV, this is
> incorrect.
> 

Since we are passing back the irq number, and we can not be interrupted
by the same irq, I think it should be pretty clear we are in the same
handler. That said, the extra entry/exit tracepoints could make the
sequence of events simpler to decipher, which is important. The code
looks good, and provides at least as much information as the patch that
I proposed. So i'll be happy either way :)

thanks,

-Jason


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 17:33   ` Jason Baron
@ 2009-03-24 17:50     ` Ingo Molnar
  2009-03-24 17:57       ` Jason Baron
  0 siblings, 1 reply; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 17:50 UTC (permalink / raw)
  To: Jason Baron
  Cc: Mathieu Desnoyers, akpm, linux-kernel, ltt-dev,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Russell King, Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu


* Jason Baron <jbaron@redhat.com> wrote:

> On Tue, Mar 24, 2009 at 11:56:27AM -0400, Mathieu Desnoyers wrote:
> > Instrumentation of IRQ related events : irq_entry, irq_exit and
> > irq_next_handler.
> > 
> > It allows tracers to perform latency analysis on those various types of
> > interrupts and to detect interrupts with max/min/avg duration. It helps
> > detecting driver or hardware problems which cause an ISR to take ages to
> > execute. It has been shown to be the case with bogus hardware causing an mmio
> > read to take a few milliseconds.
> > 
> > Those tracepoints are used by LTTng.
> > 
> > About the performance impact of tracepoints (which is comparable to markers),
> > even without immediate values optimizations, tests done by Hideo Aoki on ia64
> > show no regression. His test case was using hackbench on a kernel where
> > scheduler instrumentation (about 5 events in code scheduler code) was added.
> > See the "Tracepoints" patch header for performance result detail.
> > 
> > irq_entry and irq_exit not declared static because they appear in x86 arch code.
> > 
> > The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
> > exit events is to be able to recreate the kernel execution state at a given
> > point in time. Knowing which execution context is responsible for a given trace
> > event is _very_ valuable in trace data analysis.
> > 
> > The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
> > instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
> > provides information about which handler is being called, but does not map
> > correctly to the fact that _multiple_ handlers are being called from within the
> > same interrupt handler. From an interrupt latency analysis POV, this is
> > incorrect.
> > 
> 
> Since we are passing back the irq number, and we can not be 
> interrupted by the same irq, I think it should be pretty clear we 
> are in the same handler. That said, the extra entry/exit 
> tracepoints could make the sequence of events simpler to decipher, 
> which is important. The code looks good, and provides at least as 
> much information as the patch that I proposed. So i'll be happy 
> either way :)

We already have your patch merged up in the tracing tree and it 
gives entry+exit tracepoints.

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 5/9] LTTng instrumentation scheduler fix task migration
  2009-03-24 15:56 ` [patch 5/9] LTTng instrumentation scheduler fix task migration Mathieu Desnoyers
@ 2009-03-24 17:53   ` Ingo Molnar
  0 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 17:53 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, ltt-dev, Frederic Weisbecker, Jason Baron,
	Peter Zijlstra


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> The scheduler migration only needs to pass the task struct pointer 
> and the destination CPU. The source CPU can be found by using 
> task_cpu(p) in the probe.

> +++ linux-2.6-lttng/include/trace/sched.h	2009-01-30 10:43:19.000000000 -0500
> @@ -30,8 +30,8 @@ DECLARE_TRACE(sched_switch,
>  		TPARGS(rq, prev, next));
>  
>  DECLARE_TRACE(sched_migrate_task,
> -	TPPROTO(struct task_struct *p, int orig_cpu, int dest_cpu),
> -		TPARGS(p, orig_cpu, dest_cpu));
> +	TPPROTO(struct task_struct *p, int dest_cpu),
> +		TPARGS(p, dest_cpu));

Makes sense. Since this tracepoint has been changed in the tracing 
tree already, mind doing this change in that context?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 3/9] LTTng instrumentation tasklets
  2009-03-24 15:56 ` [patch 3/9] LTTng instrumentation tasklets Mathieu Desnoyers
@ 2009-03-24 17:56   ` Ingo Molnar
  2009-03-25 13:52     ` Chetan.Loke
  0 siblings, 1 reply; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 17:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, ltt-dev, Frederic Weisbecker, Jason Baron,
	Peter Zijlstra, Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> tasklet entry and exit events.

> +DEFINE_TRACE(irq_tasklet_high_entry);
> +DEFINE_TRACE(irq_tasklet_high_exit);
> +DEFINE_TRACE(irq_tasklet_low_entry);
> +DEFINE_TRACE(irq_tasklet_low_exit);

Dunno - tasklets are a legacy mechanism, not sure we want to 
instrument them. That some tasklet activity happens will be obvious 
from the softirq tracepoints already.

Anyway ... i guess we could do it if it's done via the more complete 
TRACE_EVENT() mechanism in the tracing tree.

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 17:50     ` Ingo Molnar
@ 2009-03-24 17:57       ` Jason Baron
  2009-03-24 19:12         ` Ingo Molnar
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Baron @ 2009-03-24 17:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, akpm, linux-kernel, ltt-dev,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Russell King, Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu

On Tue, Mar 24, 2009 at 06:50:49PM +0100, Ingo Molnar wrote:
> * Jason Baron <jbaron@redhat.com> wrote:
> 
> > On Tue, Mar 24, 2009 at 11:56:27AM -0400, Mathieu Desnoyers wrote:
> > > Instrumentation of IRQ related events : irq_entry, irq_exit and
> > > irq_next_handler.
> > > 
> > > It allows tracers to perform latency analysis on those various types of
> > > interrupts and to detect interrupts with max/min/avg duration. It helps
> > > detecting driver or hardware problems which cause an ISR to take ages to
> > > execute. It has been shown to be the case with bogus hardware causing an mmio
> > > read to take a few milliseconds.
> > > 
> > > Those tracepoints are used by LTTng.
> > > 
> > > About the performance impact of tracepoints (which is comparable to markers),
> > > even without immediate values optimizations, tests done by Hideo Aoki on ia64
> > > show no regression. His test case was using hackbench on a kernel where
> > > scheduler instrumentation (about 5 events in code scheduler code) was added.
> > > See the "Tracepoints" patch header for performance result detail.
> > > 
> > > irq_entry and irq_exit not declared static because they appear in x86 arch code.
> > > 
> > > The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
> > > exit events is to be able to recreate the kernel execution state at a given
> > > point in time. Knowing which execution context is responsible for a given trace
> > > event is _very_ valuable in trace data analysis.
> > > 
> > > The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
> > > instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
> > > provides information about which handler is being called, but does not map
> > > correctly to the fact that _multiple_ handlers are being called from within the
> > > same interrupt handler. From an interrupt latency analysis POV, this is
> > > incorrect.
> > > 
> > 
> > Since we are passing back the irq number, and we can not be 
> > interrupted by the same irq, I think it should be pretty clear we 
> > are in the same handler. That said, the extra entry/exit 
> > tracepoints could make the sequence of events simpler to decipher, 
> > which is important. The code looks good, and provides at least as 
> > much information as the patch that I proposed. So i'll be happy 
> > either way :)
> 
> We already have your patch merged up in the tracing tree and it 
> gives entry+exit tracepoints.
> 
> 	Ingo

maybe i wasn't clear. Entry and exit as I proposed and as in the tracing
tree are for entry and exit into each handler per irq. Mathieu is
proposing an entry/exit tracepoint per irq, and a 3rd tracepoint to
tell us which handler is being called and its return code. hope this is
clear.

thanks,

-Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 4/9] LTTng instrumentation softirq
  2009-03-24 15:56 ` [patch 4/9] LTTng instrumentation softirq Mathieu Desnoyers
@ 2009-03-24 18:01   ` Ingo Molnar
  0 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 18:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, ltt-dev, Frederic Weisbecker, Jason Baron,
	Peter Zijlstra, Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Compared to Jason's patch, this patch also instruments softirq raise event.

Mind sending this on top of Jason's patch, which is already applied 
to the tracing tree?

Note, your instrumentation misses one source of softirq raise 
events:

> @@ -300,6 +306,7 @@ void irq_exit(void)
>   */
>  inline void raise_softirq_irqoff(unsigned int nr)
>  {
> +	trace_irq_softirq_raise(nr);
>  	__raise_softirq_irqoff(nr);

__raise_softirq_irqoff() is used in the networking tree. It makes 
little sense to instrument raise_softirq_irqoff() while not 
instrumenting __raise_softirq_irqoff().

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 6/9] LTTng instrumentation - timer
  2009-03-24 15:56 ` [patch 6/9] LTTng instrumentation - timer Mathieu Desnoyers
@ 2009-03-24 18:21   ` Ingo Molnar
  2009-03-24 19:14     ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 18:21 UTC (permalink / raw)
  To: Mathieu Desnoyers, Thomas Gleixner, Peter Zijlstra
  Cc: akpm, linux-kernel, ltt-dev, Frederic Weisbecker,
	David S. Miller, Masami Hiramatsu, Peter Zijlstra,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Instrument timer activity (timer set, expired, current time 
> updates) to keep information about the "real time" flow within the 
> kernel. It can be used by a trace analysis tool to synchronize 
> information coming from various sources, e.g. to merge traces with 
> system logs.
> 
> Those tracepoints are used by LTTng.
> 
> About the performance impact of tracepoints (which is comparable 
> to markers), even without immediate values optimizations, tests 
> done by Hideo Aoki on ia64 show no regression. His test case was 
> using hackbench on a kernel where scheduler instrumentation (about 
> 5 events in code scheduler code) was added. See the "Tracepoints" 
> patch header for performance result detail.
> 
> Note : do_setitimer instrumentation could potentially be done with 
> a more generic system call instrumentation.
>
> CC: 'Steven Rostedt' <rostedt@goodmis.org>
> CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
> ---
>  include/trace/timer.h |   24 ++++++++++++++++++++++++
>  kernel/itimer.c       |    8 ++++++++
>  kernel/timer.c        |   12 +++++++++++-
>  3 files changed, 43 insertions(+), 1 deletion(-)

For complete timer instrumentation, hrtimers should be instrumented 
as well.

>  	return HRTIMER_NORESTART;
> @@ -148,6 +154,8 @@ int do_setitimer(int which, struct itime
>  	    !timeval_valid(&value->it_interval))
>  		return -EINVAL;
>  
> +	trace_timer_itimer_set(which, value);

for consistency, i'd name the timer start tracepoints the following 
way:

  trace_timer_init()
  trace_itimer_init()
  trace_hrtimer_init()

  trace_timer_start()
  trace_itimer_start()
  trace_hrtimer_start()

  trace_timer_expire()
  trace_itimer_expire()
  trace_hrtimer_expire()

  trace_timer_cancel()
  trace_itimer_cancel()
  trace_hrtimer_cancel()

the init methods are missing from your patch entirely - and the rest 
is partially incomplete as well.

Instrumentation of the del_timer() variants is missing. For a 
complete lifetime analysis of timers this cannot be left out.

> +DEFINE_TRACE(timer_set);
> +DEFINE_TRACE(timer_timeout);

these two should be 'timer_start' and 'timer_expire'.

> +DEFINE_TRACE(timer_update_time);

This is a misnomer. This is in timer.c but is not a 'timer' 
tracepoint - this is a time of day tracepoint - and this should be 
reflected in its naming - so it should be: trace_gtod_update_xtime() 
or so.

> @@ -357,6 +363,7 @@ static void internal_add_timer(struct tv
>  		i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
>  		vec = base->tv5.vec + i;
>  	}
> +	trace_timer_set(timer);
>  	/*
>  	 * Timers are FIFO:
>  	 */

this should be trace_timer_start().

> @@ -1121,6 +1128,7 @@ void do_timer(unsigned long ticks)
>  {
>  	jiffies_64 += ticks;
>  	update_times(ticks);
> +	trace_timer_update_time(&xtime, &wall_to_monotonic);
>  }

gtod_update_xtime().

The GTOD instrumentation should move into a separate patch, and 
should probably be extended with NTP correction events as well.

>  #ifdef __ARCH_WANT_SYS_ALARM
> @@ -1202,7 +1210,9 @@ SYSCALL_DEFINE0(getegid)
>  
>  static void process_timeout(unsigned long __data)
>  {
> -	wake_up_process((struct task_struct *)__data);
> +	struct task_struct *task = (struct task_struct *)__data;
> +	trace_timer_timeout(task);
> +	wake_up_process(task);

[ nit: misssing new line after local variable declaration. ]

this should be trace_timer_expire().

> Index: linux-2.6-lttng/include/trace/timer.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-lttng/include/trace/timer.h	2009-03-24 09:31:51.000000000 -0400
> @@ -0,0 +1,24 @@
> +#ifndef _TRACE_TIMER_H
> +#define _TRACE_TIMER_H
> +
> +#include <linux/tracepoint.h>
> +
> +DECLARE_TRACE(timer_itimer_expired,
> +	TPPROTO(struct signal_struct *sig),
> +		TPARGS(sig));
> +DECLARE_TRACE(timer_itimer_set,
> +	TPPROTO(int which, struct itimerval *value),
> +		TPARGS(which, value));
> +DECLARE_TRACE(timer_set,
> +	TPPROTO(struct timer_list *timer),
> +		TPARGS(timer));
> +/*
> + * xtime_lock is taken when kernel_timer_update_time tracepoint is reached.
> + */
> +DECLARE_TRACE(timer_update_time,
> +	TPPROTO(struct timespec *_xtime, struct timespec *_wall_to_monotonic),
> +		TPARGS(_xtime, _wall_to_monotonic));
> +DECLARE_TRACE(timer_timeout,
> +	TPPROTO(struct task_struct *p),
> +		TPARGS(p));
> +#endif

For completeness and consistency, trace_itimer_start() should be 
split into two cases:

	if (value)
		trace_itimer_start(which, value);
	else
		trace_itimer_cancel(which);

That extra branch does not matter much and the resulting 
instrumentaton is more consistent across all the timer types.

So this patch needs more work, but it's a good first step.

It would also be fantastic if you did them via the TRACE_EVENT() 
mechanism in the tracing tree. It would still be just as useful to 
LTTNG - but it would also be useful to ftrace that way.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 7/9] LTTng instrumentation - kernel
  2009-03-24 15:56 ` [patch 7/9] LTTng instrumentation - kernel Mathieu Desnoyers
@ 2009-03-24 18:33   ` Ingo Molnar
  2009-03-25  1:13     ` Rusty Russell
  2009-03-25 13:06     ` Frederic Weisbecker
  0 siblings, 2 replies; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 18:33 UTC (permalink / raw)
  To: Mathieu Desnoyers, Rusty Russell
  Cc: akpm, linux-kernel, ltt-dev, Frederic Weisbecker,
	Masami Hiramatsu, Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu


(Rusty Cc:-ed - for the module.c tracepoints below)

* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Instrument the core kernel : module load/free and printk events. It helps the
> tracer to keep track of module related events and to export valuable printk
> information into the traces.
> 
> Those tracepoints are used by LTTng.
> 
> About the performance impact of tracepoints (which is comparable to markers),
> even without immediate values optimizations, tests done by Hideo Aoki on ia64
> show no regression. His test case was using hackbench on a kernel where
> scheduler instrumentation (about 5 events in code scheduler code) was added.
> See the "Tracepoints" patch header for performance result detail.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: 'Ingo Molnar' <mingo@elte.hu>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Masami Hiramatsu <mhiramat@redhat.com>
> CC: 'Peter Zijlstra' <peterz@infradead.org>
> CC: "Frank Ch. Eigler" <fche@redhat.com>
> CC: 'Hideo AOKI' <haoki@redhat.com>
> CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
> CC: 'Steven Rostedt' <rostedt@goodmis.org>
> CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
> ---
>  include/trace/kernel.h |   19 +++++++++++++++++++
>  kernel/module.c        |    8 ++++++++
>  kernel/printk.c        |    7 +++++++
>  3 files changed, 34 insertions(+)
> 
> Index: linux-2.6-lttng/kernel/printk.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/printk.c	2009-03-24 09:09:52.000000000 -0400
> +++ linux-2.6-lttng/kernel/printk.c	2009-03-24 09:31:53.000000000 -0400
> @@ -32,6 +32,7 @@
>  #include <linux/security.h>
>  #include <linux/bootmem.h>
>  #include <linux/syscalls.h>
> +#include <trace/kernel.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -59,6 +60,7 @@ int console_printk[4] = {
>  	MINIMUM_CONSOLE_LOGLEVEL,	/* minimum_console_loglevel */
>  	DEFAULT_CONSOLE_LOGLEVEL,	/* default_console_loglevel */
>  };
> +EXPORT_SYMBOL_GPL(console_printk);
>  
>  /*
>   * Low level drivers may need that to know if they can schedule in
> @@ -128,6 +130,9 @@ EXPORT_SYMBOL(console_set_on_cmdline);
>  /* Flag: console code may call schedule() */
>  static int console_may_schedule;
>  
> +DEFINE_TRACE(kernel_printk);
> +DEFINE_TRACE(kernel_vprintk);
> +
>  #ifdef CONFIG_PRINTK
>  
>  static char __log_buf[__LOG_BUF_LEN];
> @@ -560,6 +565,7 @@ asmlinkage int printk(const char *fmt, .
>  	int r;
>  
>  	va_start(args, fmt);
> +	trace_kernel_printk(_RET_IP_);
>  	r = vprintk(fmt, args);
>  	va_end(args);
>
> @@ -667,6 +673,7 @@ asmlinkage int vprintk(const char *fmt, 
>  	printed_len += vscnprintf(printk_buf + printed_len,
>  				  sizeof(printk_buf) - printed_len, fmt, args);
>  
> +	trace_kernel_vprintk(_RET_IP_, printk_buf, printed_len);

So here we pass in the formatted output. What sense does it make to 
have the above printk tracepoint? Little i think.

Also, i'm not entirely convinced about the wiseness of instrumenting 
an essential debug facility like printk(). Lets keep this one at the 
tail portion of the patch-queue, ok?

> Index: linux-2.6-lttng/kernel/module.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/module.c	2009-03-24 09:09:59.000000000 -0400
> +++ linux-2.6-lttng/kernel/module.c	2009-03-24 09:31:53.000000000 -0400
> @@ -51,6 +51,7 @@
>  #include <linux/tracepoint.h>
>  #include <linux/ftrace.h>
>  #include <linux/async.h>
> +#include <trace/kernel.h>
>  
>  #if 0
>  #define DEBUGP printk
> @@ -78,6 +79,9 @@ static BLOCKING_NOTIFIER_HEAD(module_not
>  /* Bounds of module allocation, for speeding __module_text_address */
>  static unsigned long module_addr_min = -1UL, module_addr_max = 0;
>  
> +DEFINE_TRACE(kernel_module_load);
> +DEFINE_TRACE(kernel_module_free);

I believe that to have a complete picture of module usage, module 
refcount get/put events should be included as well, beyond the basic 
load/free events.

These both have performance impact (a module get/put in a fastpath 
hurts scalability), and are informative in terms of establishing the 
module dependency graph.

Another thing that is iteresting and which is not covered here are 
module request events and usermode helper callouts. These too have 
instrumentation and performance analysis uses.

Plus, here too it would be desired to put in default probes as well, 
via TRACE_EVENT().

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 8/9] LTTng instrumentation - filemap
  2009-03-24 15:56   ` Mathieu Desnoyers
@ 2009-03-24 18:39     ` Ingo Molnar
  -1 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 18:39 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, ltt-dev, linux-mm, Dave Hansen,
	Masami Hiramatsu, Peter Zijlstra, Frank Ch. Eigler,
	Frederic Weisbecker, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Index: linux-2.6-lttng/mm/filemap.c

> +DEFINE_TRACE(wait_on_page_start);
> +DEFINE_TRACE(wait_on_page_end);

These are extremely incomplete - to the level of being useless.

To understand the lifetime of the pagecache, the following basic 
events have to be observed and instrumented:

 - create a new page
 - fill in a new page
 - dirty a page [when we know this]
 - request writeout of a page
 - clean a page / complete writeout
 - free a page due to MM pressure
 - free a page due to truncation/delete

The following additional events are useful as well:

 - mmap a page to a user-space address
 - copy a page to a user-space address (read)
 - write to a page from a user-space address (write)
 - unmap a page from a user-space address
 - fault in a user-space mapped pagecache page

optional:
   - shmem attach/detach events
   - shmem map/unmap events
   - hugetlb map/unmap events

I'm sure i havent listed them all. Have a look at the function-graph 
tracer output to see what kind of basic events can happen to a 
pagecache page.

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 8/9] LTTng instrumentation - filemap
@ 2009-03-24 18:39     ` Ingo Molnar
  0 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 18:39 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, ltt-dev, linux-mm, Dave Hansen,
	Masami Hiramatsu, Peter Zijlstra, Frank Ch. Eigler,
	Frederic Weisbecker, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> Index: linux-2.6-lttng/mm/filemap.c

> +DEFINE_TRACE(wait_on_page_start);
> +DEFINE_TRACE(wait_on_page_end);

These are extremely incomplete - to the level of being useless.

To understand the lifetime of the pagecache, the following basic 
events have to be observed and instrumented:

 - create a new page
 - fill in a new page
 - dirty a page [when we know this]
 - request writeout of a page
 - clean a page / complete writeout
 - free a page due to MM pressure
 - free a page due to truncation/delete

The following additional events are useful as well:

 - mmap a page to a user-space address
 - copy a page to a user-space address (read)
 - write to a page from a user-space address (write)
 - unmap a page from a user-space address
 - fault in a user-space mapped pagecache page

optional:
   - shmem attach/detach events
   - shmem map/unmap events
   - hugetlb map/unmap events

I'm sure i havent listed them all. Have a look at the function-graph 
tracer output to see what kind of basic events can happen to a 
pagecache page.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 9/9] LTTng instrumentation - swap
  2009-03-24 15:56   ` Mathieu Desnoyers
@ 2009-03-24 18:51     ` Ingo Molnar
  -1 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 18:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, ltt-dev, linux-mm, Dave Hansen,
	Masami Hiramatsu, Peter Zijlstra, Frank Ch. Eigler,
	Frederic Weisbecker, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> +DECLARE_TRACE(swap_in,
> +	TPPROTO(struct page *page, swp_entry_t entry),
> +		TPARGS(page, entry));
> +DECLARE_TRACE(swap_out,
> +	TPPROTO(struct page *page),
> +		TPARGS(page));
> +DECLARE_TRACE(swap_file_open,
> +	TPPROTO(struct file *file, char *filename),
> +		TPARGS(file, filename));
> +DECLARE_TRACE(swap_file_close,
> +	TPPROTO(struct file *file),
> +		TPARGS(file));

These are more complete than the pagecache tracepoints, but still 
incomplete to make a comprehensive picture about swap activities.

Firstly, the swap_file_open/close events seem quite pointless. Most 
systems enable swap during bootup and never close it. These 
tracepoints just wont be excercised in practice.

Also, to _really_ help with debugging VM pressure problems, the 
whole LRU state-machine should be instrumented, and linked up with 
pagecache instrumentation via page frame numbers and (inode,offset) 
[file] and (pgd,addr) [anon] pairs.

Not just the fact that something got swapped out is interesting, but 
also the whole decision chain that leads up to it. The lifetime of a 
page how it jumps between the various stages of eviction and LRU 
scores.

a minor nit:

> +DECLARE_TRACE(swap_file_open,
> +	TPPROTO(struct file *file, char *filename),
> +		TPARGS(file, filename));

there's no need to pass in the filename - it can be deducted in the 
probe from struct file.

a small inconsistency:

> +DECLARE_TRACE(swap_in,
> +	TPPROTO(struct page *page, swp_entry_t entry),
> +		TPARGS(page, entry));
> +DECLARE_TRACE(swap_out,
> +	TPPROTO(struct page *page),
> +		TPARGS(page));

you pass in swp_entry to trace_swap_in(), which encodes the offset - 
but that parameter is not needed, the page already represents the 
offset at that stage in do_swap_page(). (the actual data is not read 
in yet from swap, but the page is already linked up in the 
swap-cache and has the offset available - which a probe can 
recover.)

So this suffices:

 DECLARE_TRACE(swap_in,
	TPPROTO(struct page *page),
		TPARGS(page));

 DECLARE_TRACE(swap_out,
	TPPROTO(struct page *page),
		TPARGS(page));

And here again i'd like to see actual meaningful probe contents via 
a TRACE_EVENT() construct. That shows and proves that it's all part 
of a comprehensive framework, and the data that is recovered is 
understood and put into a coherent whole - upstream. That makes it 
immediately useful to the built-in tracers, and will also cause 
fewer surprises downstream.

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 9/9] LTTng instrumentation - swap
@ 2009-03-24 18:51     ` Ingo Molnar
  0 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 18:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, ltt-dev, linux-mm, Dave Hansen,
	Masami Hiramatsu, Peter Zijlstra, Frank Ch. Eigler,
	Frederic Weisbecker, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> +DECLARE_TRACE(swap_in,
> +	TPPROTO(struct page *page, swp_entry_t entry),
> +		TPARGS(page, entry));
> +DECLARE_TRACE(swap_out,
> +	TPPROTO(struct page *page),
> +		TPARGS(page));
> +DECLARE_TRACE(swap_file_open,
> +	TPPROTO(struct file *file, char *filename),
> +		TPARGS(file, filename));
> +DECLARE_TRACE(swap_file_close,
> +	TPPROTO(struct file *file),
> +		TPARGS(file));

These are more complete than the pagecache tracepoints, but still 
incomplete to make a comprehensive picture about swap activities.

Firstly, the swap_file_open/close events seem quite pointless. Most 
systems enable swap during bootup and never close it. These 
tracepoints just wont be excercised in practice.

Also, to _really_ help with debugging VM pressure problems, the 
whole LRU state-machine should be instrumented, and linked up with 
pagecache instrumentation via page frame numbers and (inode,offset) 
[file] and (pgd,addr) [anon] pairs.

Not just the fact that something got swapped out is interesting, but 
also the whole decision chain that leads up to it. The lifetime of a 
page how it jumps between the various stages of eviction and LRU 
scores.

a minor nit:

> +DECLARE_TRACE(swap_file_open,
> +	TPPROTO(struct file *file, char *filename),
> +		TPARGS(file, filename));

there's no need to pass in the filename - it can be deducted in the 
probe from struct file.

a small inconsistency:

> +DECLARE_TRACE(swap_in,
> +	TPPROTO(struct page *page, swp_entry_t entry),
> +		TPARGS(page, entry));
> +DECLARE_TRACE(swap_out,
> +	TPPROTO(struct page *page),
> +		TPARGS(page));

you pass in swp_entry to trace_swap_in(), which encodes the offset - 
but that parameter is not needed, the page already represents the 
offset at that stage in do_swap_page(). (the actual data is not read 
in yet from swap, but the page is already linked up in the 
swap-cache and has the offset available - which a probe can 
recover.)

So this suffices:

 DECLARE_TRACE(swap_in,
	TPPROTO(struct page *page),
		TPARGS(page));

 DECLARE_TRACE(swap_out,
	TPPROTO(struct page *page),
		TPARGS(page));

And here again i'd like to see actual meaningful probe contents via 
a TRACE_EVENT() construct. That shows and proves that it's all part 
of a comprehensive framework, and the data that is recovered is 
understood and put into a coherent whole - upstream. That makes it 
immediately useful to the built-in tracers, and will also cause 
fewer surprises downstream.

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 17:57       ` Jason Baron
@ 2009-03-24 19:12         ` Ingo Molnar
  2009-03-24 20:11           ` Mathieu Desnoyers
  2009-03-27 19:18           ` Jason Baron
  0 siblings, 2 replies; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 19:12 UTC (permalink / raw)
  To: Jason Baron
  Cc: Mathieu Desnoyers, akpm, linux-kernel, ltt-dev,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Russell King, Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu


* Jason Baron <jbaron@redhat.com> wrote:

> On Tue, Mar 24, 2009 at 06:50:49PM +0100, Ingo Molnar wrote:
> > * Jason Baron <jbaron@redhat.com> wrote:
> > 
> > > On Tue, Mar 24, 2009 at 11:56:27AM -0400, Mathieu Desnoyers wrote:
> > > > Instrumentation of IRQ related events : irq_entry, irq_exit and
> > > > irq_next_handler.
> > > > 
> > > > It allows tracers to perform latency analysis on those various types of
> > > > interrupts and to detect interrupts with max/min/avg duration. It helps
> > > > detecting driver or hardware problems which cause an ISR to take ages to
> > > > execute. It has been shown to be the case with bogus hardware causing an mmio
> > > > read to take a few milliseconds.
> > > > 
> > > > Those tracepoints are used by LTTng.
> > > > 
> > > > About the performance impact of tracepoints (which is comparable to markers),
> > > > even without immediate values optimizations, tests done by Hideo Aoki on ia64
> > > > show no regression. His test case was using hackbench on a kernel where
> > > > scheduler instrumentation (about 5 events in code scheduler code) was added.
> > > > See the "Tracepoints" patch header for performance result detail.
> > > > 
> > > > irq_entry and irq_exit not declared static because they appear in x86 arch code.
> > > > 
> > > > The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
> > > > exit events is to be able to recreate the kernel execution state at a given
> > > > point in time. Knowing which execution context is responsible for a given trace
> > > > event is _very_ valuable in trace data analysis.
> > > > 
> > > > The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
> > > > instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
> > > > provides information about which handler is being called, but does not map
> > > > correctly to the fact that _multiple_ handlers are being called from within the
> > > > same interrupt handler. From an interrupt latency analysis POV, this is
> > > > incorrect.
> > > > 
> > > 
> > > Since we are passing back the irq number, and we can not be 
> > > interrupted by the same irq, I think it should be pretty clear we 
> > > are in the same handler. That said, the extra entry/exit 
> > > tracepoints could make the sequence of events simpler to decipher, 
> > > which is important. The code looks good, and provides at least as 
> > > much information as the patch that I proposed. So i'll be happy 
> > > either way :)
> > 
> > We already have your patch merged up in the tracing tree and it 
> > gives entry+exit tracepoints.
> > 
> > 	Ingo
> 
> maybe i wasn't clear. Entry and exit as I proposed and as in the 
> tracing tree are for entry and exit into each handler per irq. 
> Mathieu is proposing an entry/exit tracepoint per irq, and a 3rd 
> tracepoint to tell us which handler is being called and its return 
> code. hope this is clear.

Ok, i misunderstood that.

Mathieu's is slightly more compact, but yours is more logical.

I believe your pre/post IRQ handler callback is the right model - it 
decouples device IRQ handling from any notion of 'IRQ'.

For example, we could correctly express "handler got executed by an 
IRQ thread" via it - while via Mathieu's scheme it does not really 
map to that.

So if then i think there should be a third tracepoint in addition to 
your two existing tracepoints: a 'raw vector' type of tracepoint. 
It's added both to do_IRQ() entry point, but also to the various 
common SMP IPI entry points: reschedule, TLB flush and local timer 
IRQ tick.

The best information there to pass to the probe is the raw vector 
number, and the ptregs structure.

Hm?

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 6/9] LTTng instrumentation - timer
  2009-03-24 18:21   ` Ingo Molnar
@ 2009-03-24 19:14     ` Thomas Gleixner
  2009-03-24 20:47       ` Ingo Molnar
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2009-03-24 19:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Peter Zijlstra, akpm, linux-kernel, ltt-dev,
	Frederic Weisbecker, David S. Miller, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu

On Tue, 24 Mar 2009, Ingo Molnar wrote:
> the init methods are missing from your patch entirely - and the rest 
> is partially incomplete as well.
> 
> Instrumentation of the del_timer() variants is missing. For a 
> complete lifetime analysis of timers this cannot be left out.

Look at the debugobjects hooks. They cover the complete timer life
time already.

Piggypacking tracepoints on those would simplify maintenance and
reduce the number of extra lines in the code.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 15:56 ` [patch 2/9] LTTng instrumentation - irq Mathieu Desnoyers
  2009-03-24 17:33   ` Jason Baron
@ 2009-03-24 19:14   ` Ingo Molnar
  2009-03-27 22:12   ` Thomas Gleixner
  2 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 19:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, ltt-dev, Frederic Weisbecker, Jason Baron,
	Peter Zijlstra, Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> +static irqreturn_t __handle_irq_next_handler(unsigned int irq,
> +	struct irqaction **action, irqreturn_t *retval, unsigned int *status)
> +{
> +	irqreturn_t ret;
> +
> +	ret = (*action)->handler(irq, (*action)->dev_id);
> +	if (ret == IRQ_HANDLED)
> +		*status |= (*action)->flags;
> +	*retval |= ret;
> +	*action = (*action)->next;
> +	return ret;
> +}
> +
>  static irqreturn_t _handle_IRQ_event(unsigned int irq, struct irqaction *action)
>  {
>  	irqreturn_t ret, retval = IRQ_NONE;
> @@ -324,13 +345,12 @@ static irqreturn_t _handle_IRQ_event(uns
>  	if (!(action->flags & IRQF_DISABLED))
>  		local_irq_enable_in_hardirq();
>  
> -	do {
> -		ret = action->handler(irq, action->dev_id);
> -		if (ret == IRQ_HANDLED)
> -			status |= action->flags;
> -		retval |= ret;
> -		action = action->next;
> -	} while (action);
> +	ret = __handle_irq_next_handler(irq, &action, &retval, &status);
> +
> +	while (action) {
> +		trace_irq_next_handler(irq, action, ret);
> +		ret = __handle_irq_next_handler(irq, &action, &retval, &status);
> +	}

Hm, this is rather unclean. Why open-code the first handler 
execution?

This is a sign (and side effect) of the logical model being slightly 
incorrect - see my previous mail to Jason.

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 19:12         ` Ingo Molnar
@ 2009-03-24 20:11           ` Mathieu Desnoyers
  2009-03-24 20:51             ` Ingo Molnar
                               ` (2 more replies)
  2009-03-27 19:18           ` Jason Baron
  1 sibling, 3 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-24 20:11 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Christoph Hellwig
  Cc: Jason Baron, akpm, linux-kernel, ltt-dev, Frederic Weisbecker,
	Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Jason Baron <jbaron@redhat.com> wrote:
> 
> > On Tue, Mar 24, 2009 at 06:50:49PM +0100, Ingo Molnar wrote:
> > > * Jason Baron <jbaron@redhat.com> wrote:
> > > 
> > > > On Tue, Mar 24, 2009 at 11:56:27AM -0400, Mathieu Desnoyers wrote:
> > > > > Instrumentation of IRQ related events : irq_entry, irq_exit and
> > > > > irq_next_handler.
> > > > > 
> > > > > It allows tracers to perform latency analysis on those various types of
> > > > > interrupts and to detect interrupts with max/min/avg duration. It helps
> > > > > detecting driver or hardware problems which cause an ISR to take ages to
> > > > > execute. It has been shown to be the case with bogus hardware causing an mmio
> > > > > read to take a few milliseconds.
> > > > > 
> > > > > Those tracepoints are used by LTTng.
> > > > > 
> > > > > About the performance impact of tracepoints (which is comparable to markers),
> > > > > even without immediate values optimizations, tests done by Hideo Aoki on ia64
> > > > > show no regression. His test case was using hackbench on a kernel where
> > > > > scheduler instrumentation (about 5 events in code scheduler code) was added.
> > > > > See the "Tracepoints" patch header for performance result detail.
> > > > > 
> > > > > irq_entry and irq_exit not declared static because they appear in x86 arch code.
> > > > > 
> > > > > The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
> > > > > exit events is to be able to recreate the kernel execution state at a given
> > > > > point in time. Knowing which execution context is responsible for a given trace
> > > > > event is _very_ valuable in trace data analysis.
> > > > > 
> > > > > The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
> > > > > instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
> > > > > provides information about which handler is being called, but does not map
> > > > > correctly to the fact that _multiple_ handlers are being called from within the
> > > > > same interrupt handler. From an interrupt latency analysis POV, this is
> > > > > incorrect.
> > > > > 
> > > > 
> > > > Since we are passing back the irq number, and we can not be 
> > > > interrupted by the same irq, I think it should be pretty clear we 
> > > > are in the same handler. That said, the extra entry/exit 
> > > > tracepoints could make the sequence of events simpler to decipher, 
> > > > which is important. The code looks good, and provides at least as 
> > > > much information as the patch that I proposed. So i'll be happy 
> > > > either way :)
> > > 
> > > We already have your patch merged up in the tracing tree and it 
> > > gives entry+exit tracepoints.
> > > 
> > > 	Ingo
> > 
> > maybe i wasn't clear. Entry and exit as I proposed and as in the 
> > tracing tree are for entry and exit into each handler per irq. 
> > Mathieu is proposing an entry/exit tracepoint per irq, and a 3rd 
> > tracepoint to tell us which handler is being called and its return 
> > code. hope this is clear.
> 
> Ok, i misunderstood that.
> 
> Mathieu's is slightly more compact, but yours is more logical.
> 
> I believe your pre/post IRQ handler callback is the right model - it 
> decouples device IRQ handling from any notion of 'IRQ'.
> 
> For example, we could correctly express "handler got executed by an 
> IRQ thread" via it - while via Mathieu's scheme it does not really 
> map to that.
> 
> So if then i think there should be a third tracepoint in addition to 
> your two existing tracepoints: a 'raw vector' type of tracepoint. 
> It's added both to do_IRQ() entry point, but also to the various 
> common SMP IPI entry points: reschedule, TLB flush and local timer 
> IRQ tick.
> 
> The best information there to pass to the probe is the raw vector 
> number, and the ptregs structure.
> 
> Hm?
> 

This third type of tracepoint for IRQs you are talking about is actually
what I had in LTTng. I decided to add the irq_next_handler and to add a
action field to irq_entry to include the irq handler information needed
by Jason.

If we want to do this logically, without thinking about tracer
performance impact, we could/should do :

trace_irq_entry(irqno, pt_regs)
  for_each_handler() {
    trace_irq_handler_entry(action)
    action->handler()
    trace_irq_handler_exit(ret)
  }
trace_irq_exit(retval)

And add the irq_entry/irq_exit events to the arch-specific reschedule,
tlb flush, local timer irq, as I have in my lttng tree already.

But given the trace_irq_handler_entry/trace_irq_handler_exit events
could be combined, given we can record action and ret in the
irq_entry/exit events, I decided to remove 2 tracepoints (out of 4) from
the single-handler fast path by adding this information to the irq
entry/exit events, and decided to combine the irq_handler entry/exit
into a single next_handler event, which records the previous ret value
and the next action to execute.

On an interrupt-driven workload, it will have a significant impact.
(2 events vs 4).

If we add interrupt threads to the kernel, then we can switch to the
following scheme :

* instrumentation of the real interrupt handler :

trace_irq_entry(irqno, pt_regs)

trace_irq_exit(ret)

* instrumentation of the irq threads :

trace_irq_thread_entry()

trace_irq_thread_exit()

I don't see why we should mind trying to make the tracepoints "logical",
especially if it hurts performance considerably. Doing these
implementation-specific versions of irq tracepoints would provide the
best performance we can get when tracing. It's up to the tracers to
specialize their analysis based on the underlying IRQ mechanism
(non-threaded vs threaded).

This is actually a very good example of what Christoph Hellwig, Peter
Zijlstra and myself have been trying to warn you about the TRACE_EVENT
macro : it exports the tracepoints to userspace, and thus makes them a
userspace-visible API, when those tracepoints should be tied to the
kernel code and nothing else. An adaptation layer should provide the
abstractions that makes the information presented to the user more
"logical".

Mathieu

> 	Ingo

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 6/9] LTTng instrumentation - timer
  2009-03-24 19:14     ` Thomas Gleixner
@ 2009-03-24 20:47       ` Ingo Molnar
  2009-03-27 22:05         ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 20:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, Peter Zijlstra, akpm, linux-kernel, ltt-dev,
	Frederic Weisbecker, David S. Miller, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Tue, 24 Mar 2009, Ingo Molnar wrote:
> > the init methods are missing from your patch entirely - and the rest 
> > is partially incomplete as well.
> > 
> > Instrumentation of the del_timer() variants is missing. For a 
> > complete lifetime analysis of timers this cannot be left out.
> 
> Look at the debugobjects hooks. They cover the complete timer life 
> time already.
> 
> Piggypacking tracepoints on those would simplify maintenance and 
> reduce the number of extra lines in the code.

makes sense. They dont cover itimers nor hrtimers though.

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 20:11           ` Mathieu Desnoyers
@ 2009-03-24 20:51             ` Ingo Molnar
  2009-03-25  8:47               ` Ingo Molnar
  2009-03-25  2:00             ` Steven Rostedt
  2009-03-25  2:09             ` Steven Rostedt
  2 siblings, 1 reply; 48+ messages in thread
From: Ingo Molnar @ 2009-03-24 20:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Christoph Hellwig, Jason Baron, akpm,
	linux-kernel, ltt-dev, Frederic Weisbecker, Thomas Gleixner,
	Russell King, Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu


* Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> If we want to do this logically, without thinking about tracer 
> performance impact, we could/should do :
> 
> trace_irq_entry(irqno, pt_regs)
>   for_each_handler() {
>     trace_irq_handler_entry(action)
>     action->handler()
>     trace_irq_handler_exit(ret)
>   }
> trace_irq_exit(retval)

Not really.

As i said, the handler invocation should be thought of separately 
from vectored IRQs. A device IRQ handler can be invoked in a number 
of non-vectored ways: it can be called in an IRQ thread for example. 
(there are other invocation modes possible too)

For IRQ vectors, the 'retval' makes little sense - so the exit event 
can be left out.

Which leaves us with what i suggested: to add an IRQ vector entry 
event on top of Jason's (already integrated) patches.

Hm?

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 7/9] LTTng instrumentation - kernel
  2009-03-24 18:33   ` Ingo Molnar
@ 2009-03-25  1:13     ` Rusty Russell
  2009-03-25  8:40       ` Ingo Molnar
  2009-03-25 13:06     ` Frederic Weisbecker
  1 sibling, 1 reply; 48+ messages in thread
From: Rusty Russell @ 2009-03-25  1:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, akpm, linux-kernel, ltt-dev,
	Frederic Weisbecker, Masami Hiramatsu, Peter Zijlstra,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

On Wednesday 25 March 2009 05:03:13 Ingo Molnar wrote:
> 
> (Rusty Cc:-ed - for the module.c tracepoints below)

Thanks, tho they look fine and non-intrusive to me.

> I believe that to have a complete picture of module usage, module 
> refcount get/put events should be included as well, beyond the basic 
> load/free events.
> 
> These both have performance impact (a module get/put in a fastpath 
> hurts scalability), and are informative in terms of establishing the 
> module dependency graph.

A module_get()/put() should not hurt scalability at all!  I went to great and
horrible lengths to ensure that was the case since the rewrite in 2.4.

But a module dependency graph et. al. would be kind of cool.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 20:11           ` Mathieu Desnoyers
  2009-03-24 20:51             ` Ingo Molnar
@ 2009-03-25  2:00             ` Steven Rostedt
  2009-03-26 18:27               ` Mathieu Desnoyers
  2009-03-25  2:09             ` Steven Rostedt
  2 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2009-03-25  2:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Peter Zijlstra, Christoph Hellwig, Jason Baron,
	akpm, linux-kernel, ltt-dev, Frederic Weisbecker,
	Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Eduard - Gabriel Munteanu


On Tue, 24 Mar 2009, Mathieu Desnoyers wrote:
> 
> This is actually a very good example of what Christoph Hellwig, Peter
> Zijlstra and myself have been trying to warn you about the TRACE_EVENT
> macro : it exports the tracepoints to userspace, and thus makes them a
> userspace-visible API, when those tracepoints should be tied to the
> kernel code and nothing else. An adaptation layer should provide the
> abstractions that makes the information presented to the user more
> "logical".

Let me correct you here. TRACE_EVENT does ***NOT*** export anything to 
userspace. There is no code what so ever in TRACE_EVENT that does so.

Now, ftrace does export information using TRACE_EVENT to userspace. But 
that is the way ftrace wants to handle it. There's nothing needed to 
export to userspace. What is exported, is exported ***BECAUSE*** it can 
change. I'll only try to keep the format that is exported the same. But 
nothing should rely on what the format represents staying the same.

If someone adds a TRACE_EVENT, you can uses it to record you data, anyway
you like. Ftrace will use it to show how to read the binary data, which
is only needed if you want to do that. It uses the print format to print
to the console in case of failure. Or to the trace file, which by the way
can also change without notice.


-- Steve


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 20:11           ` Mathieu Desnoyers
  2009-03-24 20:51             ` Ingo Molnar
  2009-03-25  2:00             ` Steven Rostedt
@ 2009-03-25  2:09             ` Steven Rostedt
  2009-03-26 18:28               ` Mathieu Desnoyers
  2 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2009-03-25  2:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Peter Zijlstra, Christoph Hellwig, Jason Baron,
	akpm, linux-kernel, ltt-dev, Frederic Weisbecker,
	Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Eduard - Gabriel Munteanu


On Tue, 24 Mar 2009, Mathieu Desnoyers wrote:
> 
> This third type of tracepoint for IRQs you are talking about is actually
> what I had in LTTng. I decided to add the irq_next_handler and to add a
> action field to irq_entry to include the irq handler information needed
> by Jason.
> 
> If we want to do this logically, without thinking about tracer
> performance impact, we could/should do :
> 
> trace_irq_entry(irqno, pt_regs)
>   for_each_handler() {
>     trace_irq_handler_entry(action)
>     action->handler()
>     trace_irq_handler_exit(ret)
>   }
> trace_irq_exit(retval)
> 
> And add the irq_entry/irq_exit events to the arch-specific reschedule,
> tlb flush, local timer irq, as I have in my lttng tree already.
> 
> But given the trace_irq_handler_entry/trace_irq_handler_exit events
> could be combined, given we can record action and ret in the
> irq_entry/exit events, I decided to remove 2 tracepoints (out of 4) from
> the single-handler fast path by adding this information to the irq
> entry/exit events, and decided to combine the irq_handler entry/exit
> into a single next_handler event, which records the previous ret value
> and the next action to execute.
> 
> On an interrupt-driven workload, it will have a significant impact.
> (2 events vs 4).

I thought tracepoints while not active are very low overhead.

If you only want to use 2 of the 4, would that be just as fast?

> 
> If we add interrupt threads to the kernel, then we can switch to the
> following scheme :
> 
> * instrumentation of the real interrupt handler :
> 
> trace_irq_entry(irqno, pt_regs)
> 
> trace_irq_exit(ret)
> 
> * instrumentation of the irq threads :
> 
> trace_irq_thread_entry()
> 
> trace_irq_thread_exit()
> 
> I don't see why we should mind trying to make the tracepoints "logical",
> especially if it hurts performance considerably. Doing these
> implementation-specific versions of irq tracepoints would provide the
> best performance we can get when tracing. It's up to the tracers to
> specialize their analysis based on the underlying IRQ mechanism
> (non-threaded vs threaded).

Perhaps we want to make them logical so that other things might hook into 
these trace points besides a tracer. I do not agree that the code should 
be modified just to make the trace points faster. The trace points are 
just hooks into code, and should have no effect when disabled. Once the 
code starts to change due to better placement of tracepoints for tracers, 
that's when those trace points should be NACKed.

-- Steve


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 7/9] LTTng instrumentation - kernel
  2009-03-25  1:13     ` Rusty Russell
@ 2009-03-25  8:40       ` Ingo Molnar
  0 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2009-03-25  8:40 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Mathieu Desnoyers, akpm, linux-kernel, ltt-dev,
	Frederic Weisbecker, Masami Hiramatsu, Peter Zijlstra,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu


* Rusty Russell <rusty@rustcorp.com.au> wrote:

> On Wednesday 25 March 2009 05:03:13 Ingo Molnar wrote:
> > 
> > (Rusty Cc:-ed - for the module.c tracepoints below)
> 
> Thanks, tho they look fine and non-intrusive to me.

Thanks - i'll take this as an Acked-by :-)

( Mathieu, mind re-sending a variant that does them via
  TRACE_EVENT(), against the tracing tree? That makes it useful not 
  just to LTTng but the default mainline kernel as well. Thanks! )

> > I believe that to have a complete picture of module usage, module 
> > refcount get/put events should be included as well, beyond the basic 
> > load/free events.
> > 
> > These both have performance impact (a module get/put in a fastpath 
> > hurts scalability), and are informative in terms of establishing the 
> > module dependency graph.
> 
> A module_get()/put() should not hurt scalability at all!  I went 
> to great and horrible lengths to ensure that was the case since 
> the rewrite in 2.4.

i know, it uses percpu refcounts :) Still it can be somewhat 
non-trivial in a fastpath:

static inline int try_module_get(struct module *module)
{
        int ret = 1;

        if (module) {
                unsigned int cpu = get_cpu();
                if (likely(module_is_live(module)))
                        local_inc(__module_ref_addr(module, cpu));
                else
                        ret = 0;
                put_cpu();
        }
        return ret;
}

So we want to reduce excessive uses of it.

[ And please forget i mentioned scalability - that bit is fine and 
  you were offended rightfully :) ]

This reminds me. Couldnt we now:

 #ifdef CONFIG_SMP
         char *refptr;
 #else
         local_t ref;
 #endif

... unify these bits to just standardize on a per-cpu refptr all the 
time, with Tejun's ueber-cool percpu changes in place?

> But a module dependency graph et. al. would be kind of cool.

Yeah. Can be in a separate patch as well - load/unload events are 
meaningful in isolation as well.

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 20:51             ` Ingo Molnar
@ 2009-03-25  8:47               ` Ingo Molnar
  2009-03-25 18:30                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 48+ messages in thread
From: Ingo Molnar @ 2009-03-25  8:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Christoph Hellwig, Jason Baron, akpm,
	linux-kernel, ltt-dev, Frederic Weisbecker, Thomas Gleixner,
	Russell King, Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > If we want to do this logically, without thinking about tracer 
> > performance impact, we could/should do :
> > 
> > trace_irq_entry(irqno, pt_regs)
> >   for_each_handler() {
> >     trace_irq_handler_entry(action)
> >     action->handler()
> >     trace_irq_handler_exit(ret)
> >   }
> > trace_irq_exit(retval)
> 
> Not really.

Put differently: we seem to agree on handler invocation entry/exit 
events (and those are handled by Jason's patch already), and that's 
good.

I dont think we need two events for physical IRQ events though. One 
is enough: entry event. What meaning does 'retval' have in your 
pseudo-code above? None typically.

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 7/9] LTTng instrumentation - kernel
  2009-03-24 18:33   ` Ingo Molnar
  2009-03-25  1:13     ` Rusty Russell
@ 2009-03-25 13:06     ` Frederic Weisbecker
  1 sibling, 0 replies; 48+ messages in thread
From: Frederic Weisbecker @ 2009-03-25 13:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Rusty Russell, akpm, linux-kernel, ltt-dev,
	Masami Hiramatsu, Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu

On Tue, Mar 24, 2009 at 07:33:13PM +0100, Ingo Molnar wrote:
> 
> (Rusty Cc:-ed - for the module.c tracepoints below)
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Instrument the core kernel : module load/free and printk events. It helps the
> > tracer to keep track of module related events and to export valuable printk
> > information into the traces.
> > 
> > Those tracepoints are used by LTTng.
> > 
> > About the performance impact of tracepoints (which is comparable to markers),
> > even without immediate values optimizations, tests done by Hideo Aoki on ia64
> > show no regression. His test case was using hackbench on a kernel where
> > scheduler instrumentation (about 5 events in code scheduler code) was added.
> > See the "Tracepoints" patch header for performance result detail.
> > 
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > CC: 'Ingo Molnar' <mingo@elte.hu>
> > CC: Frederic Weisbecker <fweisbec@gmail.com>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> > CC: Masami Hiramatsu <mhiramat@redhat.com>
> > CC: 'Peter Zijlstra' <peterz@infradead.org>
> > CC: "Frank Ch. Eigler" <fche@redhat.com>
> > CC: 'Hideo AOKI' <haoki@redhat.com>
> > CC: Takashi Nishiie <t-nishiie@np.css.fujitsu.com>
> > CC: 'Steven Rostedt' <rostedt@goodmis.org>
> > CC: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
> > ---
> >  include/trace/kernel.h |   19 +++++++++++++++++++
> >  kernel/module.c        |    8 ++++++++
> >  kernel/printk.c        |    7 +++++++
> >  3 files changed, 34 insertions(+)
> > 
> > Index: linux-2.6-lttng/kernel/printk.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/printk.c	2009-03-24 09:09:52.000000000 -0400
> > +++ linux-2.6-lttng/kernel/printk.c	2009-03-24 09:31:53.000000000 -0400
> > @@ -32,6 +32,7 @@
> >  #include <linux/security.h>
> >  #include <linux/bootmem.h>
> >  #include <linux/syscalls.h>
> > +#include <trace/kernel.h>
> >  
> >  #include <asm/uaccess.h>
> >  
> > @@ -59,6 +60,7 @@ int console_printk[4] = {
> >  	MINIMUM_CONSOLE_LOGLEVEL,	/* minimum_console_loglevel */
> >  	DEFAULT_CONSOLE_LOGLEVEL,	/* default_console_loglevel */
> >  };
> > +EXPORT_SYMBOL_GPL(console_printk);
> >  
> >  /*
> >   * Low level drivers may need that to know if they can schedule in
> > @@ -128,6 +130,9 @@ EXPORT_SYMBOL(console_set_on_cmdline);
> >  /* Flag: console code may call schedule() */
> >  static int console_may_schedule;
> >  
> > +DEFINE_TRACE(kernel_printk);
> > +DEFINE_TRACE(kernel_vprintk);
> > +
> >  #ifdef CONFIG_PRINTK
> >  
> >  static char __log_buf[__LOG_BUF_LEN];
> > @@ -560,6 +565,7 @@ asmlinkage int printk(const char *fmt, .
> >  	int r;
> >  
> >  	va_start(args, fmt);
> > +	trace_kernel_printk(_RET_IP_);
> >  	r = vprintk(fmt, args);
> >  	va_end(args);
> >
> > @@ -667,6 +673,7 @@ asmlinkage int vprintk(const char *fmt, 
> >  	printed_len += vscnprintf(printk_buf + printed_len,
> >  				  sizeof(printk_buf) - printed_len, fmt, args);
> >  
> > +	trace_kernel_vprintk(_RET_IP_, printk_buf, printed_len);
> 
> So here we pass in the formatted output. What sense does it make to 
> have the above printk tracepoint? Little i think.
> 
> Also, i'm not entirely convinced about the wiseness of instrumenting 
> an essential debug facility like printk(). Lets keep this one at the 
> tail portion of the patch-queue, ok?


Especially the trace_kernel_printk hook which only probes the printk callers.
I don't think a performance measurement of a printk call in that relevant.

Concerning trace_kernel_vprintk(), if the goal is to capture the printk messages,
I would rather see it through the dynamic printk facility or setting a console which
route printk output to trace_printk(). If that is useful for someone.

Thanks,
Frederic.

 
> > Index: linux-2.6-lttng/kernel/module.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/module.c	2009-03-24 09:09:59.000000000 -0400
> > +++ linux-2.6-lttng/kernel/module.c	2009-03-24 09:31:53.000000000 -0400
> > @@ -51,6 +51,7 @@
> >  #include <linux/tracepoint.h>
> >  #include <linux/ftrace.h>
> >  #include <linux/async.h>
> > +#include <trace/kernel.h>
> >  
> >  #if 0
> >  #define DEBUGP printk
> > @@ -78,6 +79,9 @@ static BLOCKING_NOTIFIER_HEAD(module_not
> >  /* Bounds of module allocation, for speeding __module_text_address */
> >  static unsigned long module_addr_min = -1UL, module_addr_max = 0;
> >  
> > +DEFINE_TRACE(kernel_module_load);
> > +DEFINE_TRACE(kernel_module_free);
> 
> I believe that to have a complete picture of module usage, module 
> refcount get/put events should be included as well, beyond the basic 
> load/free events.
> 
> These both have performance impact (a module get/put in a fastpath 
> hurts scalability), and are informative in terms of establishing the 
> module dependency graph.
> 
> Another thing that is iteresting and which is not covered here are 
> module request events and usermode helper callouts. These too have 
> instrumentation and performance analysis uses.
> 
> Plus, here too it would be desired to put in default probes as well, 
> via TRACE_EVENT().
> 
> Thanks,
> 
> 	Ingo


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [patch 3/9] LTTng instrumentation tasklets
  2009-03-24 17:56   ` Ingo Molnar
@ 2009-03-25 13:52     ` Chetan.Loke
  2009-03-25 14:17       ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Chetan.Loke @ 2009-03-25 13:52 UTC (permalink / raw)
  To: mingo, mathieu.desnoyers
  Cc: akpm, linux-kernel, ltt-dev, fweisbec, jbaron, peterz, tglx,
	rmk+lkml, mhiramat, fche, haoki, t-nishiie, rostedt,
	eduard.munteanu

 

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Ingo Molnar
> Sent: Tuesday, March 24, 2009 1:56 PM
> To: Mathieu Desnoyers
> Cc: akpm@linux-foundation.org; linux-kernel@vger.kernel.org; 
> ltt-dev@lists.casi.polymtl.ca; Frederic Weisbecker; Jason 
> Baron; Peter Zijlstra; Thomas Gleixner; Russell King; Masami 
> Hiramatsu; Frank Ch. Eigler; Hideo AOKI; Takashi Nishiie; 
> Steven Rostedt; Eduard - Gabriel Munteanu
> Subject: Re: [patch 3/9] LTTng instrumentation tasklets
> 
> 
> * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > tasklet entry and exit events.
> 
> > +DEFINE_TRACE(irq_tasklet_high_entry);
> > +DEFINE_TRACE(irq_tasklet_high_exit);
> > +DEFINE_TRACE(irq_tasklet_low_entry);
> > +DEFINE_TRACE(irq_tasklet_low_exit);
> 
> Dunno - tasklets are a legacy mechanism, not sure we want to 
> instrument them. 


Quick question. I understand this is unrelated to this patch. So I apologize in advance.
Ingo - you mentioned "tasklets are a legacy mechanism". Is there a plan to phase them out ? Let me draw a small picture as to what's bothering me.

With the SR-IOV support if there are 'N' virtual functions then there will be 'N' driver instances(actually N+1, 1 for the PF). If that driver drains the responses in the interrupt context then all such VF-instances could virtually block everyone else(because SR-IOV guys might also have MSI-X enabled).
So now all such drivers should alter their Rx path.Driver's can queue tasklets and can also get the performance they want.

Any suggestions?

thanks
Chetan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [patch 3/9] LTTng instrumentation tasklets
  2009-03-25 13:52     ` Chetan.Loke
@ 2009-03-25 14:17       ` Peter Zijlstra
  2009-03-25 17:37         ` Chetan.Loke
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2009-03-25 14:17 UTC (permalink / raw)
  To: Chetan.Loke
  Cc: mingo, mathieu.desnoyers, akpm, linux-kernel, ltt-dev, fweisbec,
	jbaron, tglx, rmk+lkml, mhiramat, fche, haoki, t-nishiie,
	rostedt, eduard.munteanu

On Wed, 2009-03-25 at 06:52 -0700, Chetan.Loke@Emulex.Com wrote:
> Quick question. I understand this is unrelated to this patch. So I
> apologize in advance.
> Ingo - you mentioned "tasklets are a legacy mechanism". Is there a
> plan to phase them out ? Let me draw a small picture as to what's
> bothering me.
> 
> With the SR-IOV support if there are 'N' virtual functions then there
> will be 'N' driver instances(actually N+1, 1 for the PF). If that
> driver drains the responses in the interrupt context then all such
> VF-instances could virtually block everyone else(because SR-IOV guys
> might also have MSI-X enabled).
> So now all such drivers should alter their Rx path.Driver's can queue
> tasklets and can also get the performance they want.
> 
> Any suggestions?

Threaded interrupts?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [patch 3/9] LTTng instrumentation tasklets
  2009-03-25 14:17       ` Peter Zijlstra
@ 2009-03-25 17:37         ` Chetan.Loke
  2009-03-25 17:52           ` Steven Rostedt
  0 siblings, 1 reply; 48+ messages in thread
From: Chetan.Loke @ 2009-03-25 17:37 UTC (permalink / raw)
  To: peterz
  Cc: mingo, mathieu.desnoyers, akpm, linux-kernel, ltt-dev, fweisbec,
	jbaron, tglx, rmk+lkml, mhiramat, fche, haoki, t-nishiie,
	rostedt, eduard.munteanu

 

> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org] 
> Sent: Wednesday, March 25, 2009 10:18 AM
> To: Loke,Chetan
> Cc: mingo@elte.hu; mathieu.desnoyers@polymtl.ca; 
> akpm@linux-foundation.org; linux-kernel@vger.kernel.org; 
> ltt-dev@lists.casi.polymtl.ca; fweisbec@gmail.com; 
> jbaron@redhat.com; tglx@linutronix.de; 
> rmk+lkml@arm.linux.org.uk; mhiramat@redhat.com; 
> fche@redhat.com; haoki@redhat.com; 
> t-nishiie@np.css.fujitsu.com; rostedt@goodmis.org; 
> eduard.munteanu@linux360.ro
> Subject: RE: [patch 3/9] LTTng instrumentation tasklets
> 
> On Wed, 2009-03-25 at 06:52 -0700, Chetan.Loke@Emulex.Com wrote:
> > Quick question. I understand this is unrelated to this patch. So I 
> > apologize in advance.
> > Ingo - you mentioned "tasklets are a legacy mechanism". Is there a 
> > plan to phase them out ? Let me draw a small picture as to what's 
> > bothering me.
> > 
> > With the SR-IOV support if there are 'N' virtual functions 
> then there 
> > will be 'N' driver instances(actually N+1, 1 for the PF). If that 
> > driver drains the responses in the interrupt context then all such 
> > VF-instances could virtually block everyone else(because 
> SR-IOV guys 
> > might also have MSI-X enabled).
> > So now all such drivers should alter their Rx path.Driver's 
> can queue 
> > tasklets and can also get the performance they want.
> > 
> > Any suggestions?
> 
> Threaded interrupts?
> 

If we truly need to address performance and scalability for the SCSI-subsystem then we need something lightweight.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [patch 3/9] LTTng instrumentation tasklets
  2009-03-25 17:37         ` Chetan.Loke
@ 2009-03-25 17:52           ` Steven Rostedt
  0 siblings, 0 replies; 48+ messages in thread
From: Steven Rostedt @ 2009-03-25 17:52 UTC (permalink / raw)
  To: Chetan.Loke
  Cc: peterz, mingo, mathieu.desnoyers, akpm, linux-kernel, ltt-dev,
	fweisbec, jbaron, tglx, rmk+lkml, mhiramat, fche, haoki,
	t-nishiie, eduard.munteanu


On Wed, 25 Mar 2009, Chetan.Loke@Emulex.Com wrote:

>  
> 
> > -----Original Message-----
> > From: Peter Zijlstra [mailto:peterz@infradead.org] 
> > Sent: Wednesday, March 25, 2009 10:18 AM
> > To: Loke,Chetan
> > Cc: mingo@elte.hu; mathieu.desnoyers@polymtl.ca; 
> > akpm@linux-foundation.org; linux-kernel@vger.kernel.org; 
> > ltt-dev@lists.casi.polymtl.ca; fweisbec@gmail.com; 
> > jbaron@redhat.com; tglx@linutronix.de; 
> > rmk+lkml@arm.linux.org.uk; mhiramat@redhat.com; 
> > fche@redhat.com; haoki@redhat.com; 
> > t-nishiie@np.css.fujitsu.com; rostedt@goodmis.org; 
> > eduard.munteanu@linux360.ro
> > Subject: RE: [patch 3/9] LTTng instrumentation tasklets
> > 
> > On Wed, 2009-03-25 at 06:52 -0700, Chetan.Loke@Emulex.Com wrote:
> > > Quick question. I understand this is unrelated to this patch. So I 
> > > apologize in advance.
> > > Ingo - you mentioned "tasklets are a legacy mechanism". Is there a 
> > > plan to phase them out ? Let me draw a small picture as to what's 
> > > bothering me.
> > > 
> > > With the SR-IOV support if there are 'N' virtual functions 
> > then there 
> > > will be 'N' driver instances(actually N+1, 1 for the PF). If that 
> > > driver drains the responses in the interrupt context then all such 
> > > VF-instances could virtually block everyone else(because 
> > SR-IOV guys 
> > > might also have MSI-X enabled).
> > > So now all such drivers should alter their Rx path.Driver's 
> > can queue 
> > > tasklets and can also get the performance they want.
> > > 
> > > Any suggestions?
> > 
> > Threaded interrupts?
> > 
> 
> If we truly need to address performance and scalability for the SCSI-subsystem then we need something lightweight.

Threaded interrupts are actually quite light.

-- Steve


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-25  8:47               ` Ingo Molnar
@ 2009-03-25 18:30                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-25 18:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Christoph Hellwig, Jason Baron, akpm,
	linux-kernel, ltt-dev, Frederic Weisbecker, Thomas Gleixner,
	Russell King, Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > 
> > > If we want to do this logically, without thinking about tracer 
> > > performance impact, we could/should do :
> > > 
> > > trace_irq_entry(irqno, pt_regs)
> > >   for_each_handler() {
> > >     trace_irq_handler_entry(action)
> > >     action->handler()
> > >     trace_irq_handler_exit(ret)
> > >   }
> > > trace_irq_exit(retval)
> > 
> > Not really.
> 
> Put differently: we seem to agree on handler invocation entry/exit 
> events (and those are handled by Jason's patch already), and that's 
> good.
> 

Yes, we need to get this information into the trace stream somehow.

> I dont think we need two events for physical IRQ events though. One 
> is enough: entry event. What meaning does 'retval' have in your 
> pseudo-code above? None typically.
> 

Hrm, an event means more that the "payload" it has. Just saving "here is
a return from interrupt handler" without any return value is valuable
per se because it lets the trace analyzer know when the hardirq handler
ended.

e.g., if we instrument the x86 local apic irq handler only with an "irq
entry" event, we would not be able to detect if the following events are
nested on top of the IRQ handler or are executed in thread context.

Having those two hardirq entry/exit events is also especially useful to
figure out the maximum interrupt latency caused by long interrupt
handlers (or that could be many handlers hooked on the same interrupt
line).

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-25  2:00             ` Steven Rostedt
@ 2009-03-26 18:27               ` Mathieu Desnoyers
  2009-03-27 22:53                 ` Steven Rostedt
  0 siblings, 1 reply; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-26 18:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, Christoph Hellwig, Jason Baron,
	akpm, linux-kernel, ltt-dev, Frederic Weisbecker,
	Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Eduard - Gabriel Munteanu

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Tue, 24 Mar 2009, Mathieu Desnoyers wrote:
> > 
> > This is actually a very good example of what Christoph Hellwig, Peter
> > Zijlstra and myself have been trying to warn you about the TRACE_EVENT
> > macro : it exports the tracepoints to userspace, and thus makes them a
> > userspace-visible API, when those tracepoints should be tied to the
> > kernel code and nothing else. An adaptation layer should provide the
> > abstractions that makes the information presented to the user more
> > "logical".
> 
> Let me correct you here. TRACE_EVENT does ***NOT*** export anything to 
> userspace. There is no code what so ever in TRACE_EVENT that does so.
> 

Except the comment on top of TRACE_EVENT() in tracepoint.h maybe ?

 *      *
 *      * Fast binary tracing: define the trace record via
 *      * TP_STRUCT__entry(). You can think about it like a
 *      * regular C structure local variable definition.
 *      *
 *      * This is how the trace record is structured and will
 *      * be saved into the ring buffer. These are the fields
 *      * that will be exposed to user-space in
 *      * /debug/tracing/events/<*>/format.

You don't need to have code within the infrastructure to actually export
stuff to userspace. Stating in the API that you some information will be
presented to userspace counts.

> Now, ftrace does export information using TRACE_EVENT to userspace. But 
> that is the way ftrace wants to handle it. There's nothing needed to 
> export to userspace. What is exported, is exported ***BECAUSE*** it can 
> change. I'll only try to keep the format that is exported the same. But 
> nothing should rely on what the format represents staying the same.
> 
> If someone adds a TRACE_EVENT, you can uses it to record you data, anyway
> you like. Ftrace will use it to show how to read the binary data, which
> is only needed if you want to do that. It uses the print format to print
> to the console in case of failure. Or to the trace file, which by the way
> can also change without notice.

Hrm, so you are planning to add, in
include/trace/sched_event_types.h, things like :

TRACE_EVENT(sched_kthread_stop,

        TP_PROTO(struct task_struct *t),

        TP_ARGS(t),

        TP_STRUCT__entry(
                __array(        char,   comm,   TASK_COMM_LEN   )
                __field(        pid_t,  pid                     )
        ),

        TP_fast_assign(
                memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
                __entry->pid    = t->pid;
        ),

        TP_printk("task %s:%d", __entry->comm, __entry->pid)
);

Actually specifying that the TP_STRUCT__entry, TP_fast_assign and
TP_printk are specific to ftrace and should not be considered as a
stable API, am I correct ?

Then only when ftrace is built in the kernel do we have 
kernel/trace/events.c including the holy include/trace/trace_events.h,
which includes all the trace_events headers, and then including
kernel/trace/trace_events_stage_{1,2,3}.h to override the TRACE_EVENT
macro and create the callbacks for each TRACE_EVENT statements.

Then big problem with this is that, whether you like it or not, you
*are* adding an API to the tracepoints, but just validating the types
when ftrace is being built. If you want to add an API to the
tracepoints, then the type verification should be done _even when ftrace
is disabled_.

Therefore, the TRACE_EVENT in tracepoint.h should map to macros that
would verify the argument types.

I think it's ok to specify that the arguments of a given TRACE_EVENT may
change without notice. We have to say it explicitly in the TRACE_EVENT
header though.

And while we talk about this, I also wonder why we won't simply embed
the result of the TP_fast_assign and TP_printk in the tracepoint
unlikely branch ? This would skip a function call in the tracing fast
path, and would save the overhead of a whole function call when tracing
is active. But that implies more bounds between tracepoints and data
output, but given you are already declaring this in the same API, I
don't see the problem anymore.

About the struct ftrace_raw_##name created in stage 1, I think the
padding at the end of the structure is a complete waste of space. You
should probably have a look at the ltt-type-serializer.[ch] in LTTng.

I am still unsure that the approach you take with TRACE_EVENT, which I
would call "automatically generating code using header and macro
tricks", is in the end easier to review than the simple C callback
approach I have taken in LTTng, adding the "glue" in the
DEFINE_MARKER_TP() macros to connect a specific C callback to the actual
tracepoint.

The good thing about your approach is that everyting about a trace event
can be declared within the same macro. In LTTng, I have to create probe
modules and write stuff like :

void probe_irq_softirq_exit(struct softirq_action *h,
        struct softirq_action *softirq_vec);

DEFINE_MARKER_TP(kernel, softirq_exit, irq_softirq_exit,
        probe_irq_softirq_exit, "softirq_id #1u%lu");

void probe_irq_softirq_exit(struct softirq_action *h,
        struct softirq_action *softirq_vec)
{
        struct marker *marker;
        unsigned char data;

        data = ((unsigned long)h - (unsigned long)softirq_vec) / sizeof(*h);

        marker = &GET_MARKER(kernel, softirq_exit);
        ltt_specialized_trace(marker, marker->single.probe_private,
                &data, sizeof(data), sizeof(data));
}

by hand, and then the kernel softirq_exit event is shown in
debugfs/ltt/markers/kernel/softirq_exit/. But I don't see the big win
you get by doing it in TRACE_EVENT, especially if it is not
type-verified when ftrace is disabled, compared to adding those
callbacks in standard kernel modules, which is a coding-style we have
been used to for years.

I haven't seen much automatically generated code around in the kernel
tree, maybe there is a good reason ? I can't wait to see the first
kernel JIT based on ftrace. ;-)

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-25  2:09             ` Steven Rostedt
@ 2009-03-26 18:28               ` Mathieu Desnoyers
  0 siblings, 0 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-03-26 18:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, Christoph Hellwig, Jason Baron,
	akpm, linux-kernel, ltt-dev, Frederic Weisbecker,
	Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Eduard - Gabriel Munteanu

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Tue, 24 Mar 2009, Mathieu Desnoyers wrote:
> > 
> > This third type of tracepoint for IRQs you are talking about is actually
> > what I had in LTTng. I decided to add the irq_next_handler and to add a
> > action field to irq_entry to include the irq handler information needed
> > by Jason.
> > 
> > If we want to do this logically, without thinking about tracer
> > performance impact, we could/should do :
> > 
> > trace_irq_entry(irqno, pt_regs)
> >   for_each_handler() {
> >     trace_irq_handler_entry(action)
> >     action->handler()
> >     trace_irq_handler_exit(ret)
> >   }
> > trace_irq_exit(retval)
> > 
> > And add the irq_entry/irq_exit events to the arch-specific reschedule,
> > tlb flush, local timer irq, as I have in my lttng tree already.
> > 
> > But given the trace_irq_handler_entry/trace_irq_handler_exit events
> > could be combined, given we can record action and ret in the
> > irq_entry/exit events, I decided to remove 2 tracepoints (out of 4) from
> > the single-handler fast path by adding this information to the irq
> > entry/exit events, and decided to combine the irq_handler entry/exit
> > into a single next_handler event, which records the previous ret value
> > and the next action to execute.
> > 
> > On an interrupt-driven workload, it will have a significant impact.
> > (2 events vs 4).
> 
> I thought tracepoints while not active are very low overhead.
> 
> If you only want to use 2 of the 4, would that be just as fast?
> 

Probably, but I was talking about active tracing overhead here.

> > 
> > If we add interrupt threads to the kernel, then we can switch to the
> > following scheme :
> > 
> > * instrumentation of the real interrupt handler :
> > 
> > trace_irq_entry(irqno, pt_regs)
> > 
> > trace_irq_exit(ret)
> > 
> > * instrumentation of the irq threads :
> > 
> > trace_irq_thread_entry()
> > 
> > trace_irq_thread_exit()
> > 
> > I don't see why we should mind trying to make the tracepoints "logical",
> > especially if it hurts performance considerably. Doing these
> > implementation-specific versions of irq tracepoints would provide the
> > best performance we can get when tracing. It's up to the tracers to
> > specialize their analysis based on the underlying IRQ mechanism
> > (non-threaded vs threaded).
> 
> Perhaps we want to make them logical so that other things might hook into 
> these trace points besides a tracer. I do not agree that the code should 
> be modified just to make the trace points faster. The trace points are 
> just hooks into code, and should have no effect when disabled. Once the 
> code starts to change due to better placement of tracepoints for tracers, 
> that's when those trace points should be NACKed.
> 
> -- Steve
> 

If it makes the code messy, then yes, I agree that those tracepoints
should not go in.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 19:12         ` Ingo Molnar
  2009-03-24 20:11           ` Mathieu Desnoyers
@ 2009-03-27 19:18           ` Jason Baron
  1 sibling, 0 replies; 48+ messages in thread
From: Jason Baron @ 2009-03-27 19:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, akpm, linux-kernel, ltt-dev,
	Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Russell King, Masami Hiramatsu, Frank Ch. Eigler, Hideo AOKI,
	Takashi Nishiie, Steven Rostedt, Eduard - Gabriel Munteanu

On Tue, Mar 24, 2009 at 08:12:52PM +0100, Ingo Molnar wrote:
> * Jason Baron <jbaron@redhat.com> wrote:
> 
> > On Tue, Mar 24, 2009 at 06:50:49PM +0100, Ingo Molnar wrote:
> > > * Jason Baron <jbaron@redhat.com> wrote:
> > > 
> > > > On Tue, Mar 24, 2009 at 11:56:27AM -0400, Mathieu Desnoyers wrote:
> > > > > Instrumentation of IRQ related events : irq_entry, irq_exit and
> > > > > irq_next_handler.
> > > > > 
> > > > > It allows tracers to perform latency analysis on those various types of
> > > > > interrupts and to detect interrupts with max/min/avg duration. It helps
> > > > > detecting driver or hardware problems which cause an ISR to take ages to
> > > > > execute. It has been shown to be the case with bogus hardware causing an mmio
> > > > > read to take a few milliseconds.
> > > > > 
> > > > > Those tracepoints are used by LTTng.
> > > > > 
> > > > > About the performance impact of tracepoints (which is comparable to markers),
> > > > > even without immediate values optimizations, tests done by Hideo Aoki on ia64
> > > > > show no regression. His test case was using hackbench on a kernel where
> > > > > scheduler instrumentation (about 5 events in code scheduler code) was added.
> > > > > See the "Tracepoints" patch header for performance result detail.
> > > > > 
> > > > > irq_entry and irq_exit not declared static because they appear in x86 arch code.
> > > > > 
> > > > > The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
> > > > > exit events is to be able to recreate the kernel execution state at a given
> > > > > point in time. Knowing which execution context is responsible for a given trace
> > > > > event is _very_ valuable in trace data analysis.
> > > > > 
> > > > > The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
> > > > > instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
> > > > > provides information about which handler is being called, but does not map
> > > > > correctly to the fact that _multiple_ handlers are being called from within the
> > > > > same interrupt handler. From an interrupt latency analysis POV, this is
> > > > > incorrect.
> > > > > 
> > > > 
> > > > Since we are passing back the irq number, and we can not be 
> > > > interrupted by the same irq, I think it should be pretty clear we 
> > > > are in the same handler. That said, the extra entry/exit 
> > > > tracepoints could make the sequence of events simpler to decipher, 
> > > > which is important. The code looks good, and provides at least as 
> > > > much information as the patch that I proposed. So i'll be happy 
> > > > either way :)
> > > 
> > > We already have your patch merged up in the tracing tree and it 
> > > gives entry+exit tracepoints.
> > > 
> > > 	Ingo
> > 
> > maybe i wasn't clear. Entry and exit as I proposed and as in the 
> > tracing tree are for entry and exit into each handler per irq. 
> > Mathieu is proposing an entry/exit tracepoint per irq, and a 3rd 
> > tracepoint to tell us which handler is being called and its return 
> > code. hope this is clear.
> 
> Ok, i misunderstood that.
> 
> Mathieu's is slightly more compact, but yours is more logical.
> 
> I believe your pre/post IRQ handler callback is the right model - it 
> decouples device IRQ handling from any notion of 'IRQ'.
> 
> For example, we could correctly express "handler got executed by an 
> IRQ thread" via it - while via Mathieu's scheme it does not really 
> map to that.
> 
> So if then i think there should be a third tracepoint in addition to 
> your two existing tracepoints: a 'raw vector' type of tracepoint. 
> It's added both to do_IRQ() entry point, but also to the various 
> common SMP IPI entry points: reschedule, TLB flush and local timer 
> IRQ tick.
> 
> The best information there to pass to the probe is the raw vector 
> number, and the ptregs structure.
> 
> Hm?
> 

yes, this was my thinking too. The handler tracepoints are useful in and
of themselves...and I was planning to do a separate patch, instrumenting what
you are calling the 'raw vector'...

thanks,

-Jason



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 6/9] LTTng instrumentation - timer
  2009-03-24 20:47       ` Ingo Molnar
@ 2009-03-27 22:05         ` Thomas Gleixner
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Gleixner @ 2009-03-27 22:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mathieu Desnoyers, Peter Zijlstra, akpm, linux-kernel, ltt-dev,
	Frederic Weisbecker, David S. Miller, Masami Hiramatsu,
	Peter Zijlstra, Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Steven Rostedt, Eduard - Gabriel Munteanu



On Tue, 24 Mar 2009, Ingo Molnar wrote:

> 
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Tue, 24 Mar 2009, Ingo Molnar wrote:
> > > the init methods are missing from your patch entirely - and the rest 
> > > is partially incomplete as well.
> > > 
> > > Instrumentation of the del_timer() variants is missing. For a 
> > > complete lifetime analysis of timers this cannot be left out.
> > 
> > Look at the debugobjects hooks. They cover the complete timer life 
> > time already.
> > 
> > Piggypacking tracepoints on those would simplify maintenance and 
> > reduce the number of extra lines in the code.
> 
> makes sense. They dont cover itimers nor hrtimers though.

They cover all (hr)timers independent of the call site. And we really
do not need any freaking call site instrumented.

The CPU* timer stuff is a different playground and needs to be
instrumented separate.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-24 15:56 ` [patch 2/9] LTTng instrumentation - irq Mathieu Desnoyers
  2009-03-24 17:33   ` Jason Baron
  2009-03-24 19:14   ` Ingo Molnar
@ 2009-03-27 22:12   ` Thomas Gleixner
  2 siblings, 0 replies; 48+ messages in thread
From: Thomas Gleixner @ 2009-03-27 22:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, Ingo Molnar, linux-kernel, ltt-dev, Frederic Weisbecker,
	Jason Baron, Peter Zijlstra, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie, Steven Rostedt,
	Eduard - Gabriel Munteanu

On Tue, 24 Mar 2009, Mathieu Desnoyers wrote:
>   * lockdep: we want to handle all irq_desc locks as a single lock-class:
>   */
>  struct lock_class_key irq_desc_lock_class;
> @@ -316,6 +324,19 @@ irqreturn_t no_action(int cpl, void *dev
>  	return IRQ_NONE;
>  }
>  
> +static irqreturn_t __handle_irq_next_handler(unsigned int irq,
> +	struct irqaction **action, irqreturn_t *retval, unsigned int *status)
> +{
> +	irqreturn_t ret;
> +
> +	ret = (*action)->handler(irq, (*action)->dev_id);
> +	if (ret == IRQ_HANDLED)
> +		*status |= (*action)->flags;
> +	*retval |= ret;
> +	*action = (*action)->next;
> +	return ret;
> +}
> +
>  static irqreturn_t _handle_IRQ_event(unsigned int irq, struct irqaction *action)
>  {
>  	irqreturn_t ret, retval = IRQ_NONE;
> @@ -324,13 +345,12 @@ static irqreturn_t _handle_IRQ_event(uns
>  	if (!(action->flags & IRQF_DISABLED))
>  		local_irq_enable_in_hardirq();
>  
> -	do {
> -		ret = action->handler(irq, action->dev_id);
> -		if (ret == IRQ_HANDLED)
> -			status |= action->flags;
> -		retval |= ret;
> -		action = action->next;
> -	} while (action);
> +	ret = __handle_irq_next_handler(irq, &action, &retval, &status);
> +
> +	while (action) {
> +		trace_irq_next_handler(irq, action, ret);
> +		ret = __handle_irq_next_handler(irq, &action, &retval, &status);
> +	}

This one is the next candidate for the "ugly patch of the week contest".

It does not trace the return value of the first handler and just moves
code around for no good reason. Darn why can' t you simply do:

	do {
		ret = action->handler(irq, action->dev_id);
+		trace_action_handler(....);
	}

and keep the rest of the code unchanged ?    

>  
>  #ifndef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ

This code is going away. No tracepoint needed here.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-26 18:27               ` Mathieu Desnoyers
@ 2009-03-27 22:53                 ` Steven Rostedt
  2009-04-02  2:42                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2009-03-27 22:53 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Peter Zijlstra, Christoph Hellwig, Jason Baron,
	Andrew Morton, LKML, ltt-dev, Frederic Weisbecker,
	Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Eduard - Gabriel Munteanu


On Thu, 26 Mar 2009, Mathieu Desnoyers wrote:

> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > 
> > On Tue, 24 Mar 2009, Mathieu Desnoyers wrote:
> > > 
> > > This is actually a very good example of what Christoph Hellwig, Peter
> > > Zijlstra and myself have been trying to warn you about the TRACE_EVENT
> > > macro : it exports the tracepoints to userspace, and thus makes them a
> > > userspace-visible API, when those tracepoints should be tied to the
> > > kernel code and nothing else. An adaptation layer should provide the
> > > abstractions that makes the information presented to the user more
> > > "logical".
> > 
> > Let me correct you here. TRACE_EVENT does ***NOT*** export anything to 
> > userspace. There is no code what so ever in TRACE_EVENT that does so.
> > 
> 
> Except the comment on top of TRACE_EVENT() in tracepoint.h maybe ?
> 
>  *      *
>  *      * Fast binary tracing: define the trace record via
>  *      * TP_STRUCT__entry(). You can think about it like a
>  *      * regular C structure local variable definition.
>  *      *
>  *      * This is how the trace record is structured and will
>  *      * be saved into the ring buffer. These are the fields
>  *      * that will be exposed to user-space in
>  *      * /debug/tracing/events/<*>/format.
> 
> You don't need to have code within the infrastructure to actually export
> stuff to userspace. Stating in the API that you some information will be
> presented to userspace counts.

That is done by ftrace not TRACE_EVENT.

> 
> > Now, ftrace does export information using TRACE_EVENT to userspace. But 
> > that is the way ftrace wants to handle it. There's nothing needed to 
> > export to userspace. What is exported, is exported ***BECAUSE*** it can 
> > change. I'll only try to keep the format that is exported the same. But 
> > nothing should rely on what the format represents staying the same.
> > 
> > If someone adds a TRACE_EVENT, you can uses it to record you data, anyway
> > you like. Ftrace will use it to show how to read the binary data, which
> > is only needed if you want to do that. It uses the print format to print
> > to the console in case of failure. Or to the trace file, which by the way
> > can also change without notice.
> 
> Hrm, so you are planning to add, in
> include/trace/sched_event_types.h, things like :
> 
> TRACE_EVENT(sched_kthread_stop,
> 
>         TP_PROTO(struct task_struct *t),
> 
>         TP_ARGS(t),
> 
>         TP_STRUCT__entry(
>                 __array(        char,   comm,   TASK_COMM_LEN   )
>                 __field(        pid_t,  pid                     )
>         ),
> 
>         TP_fast_assign(
>                 memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
>                 __entry->pid    = t->pid;
>         ),
> 
>         TP_printk("task %s:%d", __entry->comm, __entry->pid)
> );
> 
> Actually specifying that the TP_STRUCT__entry, TP_fast_assign and
> TP_printk are specific to ftrace and should not be considered as a
> stable API, am I correct ?

No they are not specific to ftrace. I said they are used by ftrace.
They can be modified at will by the maintainer.  The macros makes the 
updates automatic for ftrace.

> 
> Then only when ftrace is built in the kernel do we have 
> kernel/trace/events.c including the holy include/trace/trace_events.h,
> which includes all the trace_events headers, and then including
> kernel/trace/trace_events_stage_{1,2,3}.h to override the TRACE_EVENT
> macro and create the callbacks for each TRACE_EVENT statements.

Currently ftrace is the only user of the include/trace/trace_events.h 
Others are welcome to do what they wish.

> 
> Then big problem with this is that, whether you like it or not, you
> *are* adding an API to the tracepoints, but just validating the types
> when ftrace is being built. If you want to add an API to the
> tracepoints, then the type verification should be done _even when ftrace
> is disabled_.

It's an internal API not a user space one. Internal APIs are fine, and can 
change whenever we want.

> 
> Therefore, the TRACE_EVENT in tracepoint.h should map to macros that
> would verify the argument types.

We could do that too.

> 
> I think it's ok to specify that the arguments of a given TRACE_EVENT may
> change without notice. We have to say it explicitly in the TRACE_EVENT
> header though.

Sure, I totally agree.

> 
> And while we talk about this, I also wonder why we won't simply embed
> the result of the TP_fast_assign and TP_printk in the tracepoint
> unlikely branch ? This would skip a function call in the tracing fast
> path, and would save the overhead of a whole function call when tracing
> is active. But that implies more bounds between tracepoints and data
> output, but given you are already declaring this in the same API, I
> don't see the problem anymore.

I'm not exactly what you mean here?  You mean to add it at the location in 
the code? If that is the case, I disagree. Since we don't want to clutter 
code with trace point data. It is nice to keep it out in separate headers.

This is what we also think about #ifdefs, it is better to have static 
inlines in headers that to clutter the code with them.

> 
> About the struct ftrace_raw_##name created in stage 1, I think the
> padding at the end of the structure is a complete waste of space. You
> should probably have a look at the ltt-type-serializer.[ch] in LTTng.

I don't know what padding you are talking about.

> 
> I am still unsure that the approach you take with TRACE_EVENT, which I
> would call "automatically generating code using header and macro
> tricks", is in the end easier to review than the simple C callback
> approach I have taken in LTTng, adding the "glue" in the
> DEFINE_MARKER_TP() macros to connect a specific C callback to the actual
> tracepoint.
> 
> The good thing about your approach is that everyting about a trace event
> can be declared within the same macro. In LTTng, I have to create probe
> modules and write stuff like :
> 
> void probe_irq_softirq_exit(struct softirq_action *h,
>         struct softirq_action *softirq_vec);
> 
> DEFINE_MARKER_TP(kernel, softirq_exit, irq_softirq_exit,
>         probe_irq_softirq_exit, "softirq_id #1u%lu");
> 
> void probe_irq_softirq_exit(struct softirq_action *h,
>         struct softirq_action *softirq_vec)
> {
>         struct marker *marker;
>         unsigned char data;
> 
>         data = ((unsigned long)h - (unsigned long)softirq_vec) / sizeof(*h);
> 
>         marker = &GET_MARKER(kernel, softirq_exit);
>         ltt_specialized_trace(marker, marker->single.probe_private,
>                 &data, sizeof(data), sizeof(data));
> }
> 
> by hand, and then the kernel softirq_exit event is shown in
> debugfs/ltt/markers/kernel/softirq_exit/. But I don't see the big win
> you get by doing it in TRACE_EVENT, especially if it is not
> type-verified when ftrace is disabled, compared to adding those
> callbacks in standard kernel modules, which is a coding-style we have
> been used to for years.

Well, I originally had it like you did above. But I found that if I did
that, the number of events would be greatly limited. Once I added this
infrastructure, a bunch of events were able to be made quickly.

As for the type checking, that should be easy to add. Don't need ftrace to 
do it.

-- Steve


> 
> I haven't seen much automatically generated code around in the kernel
> tree, maybe there is a good reason ? I can't wait to see the first
> kernel JIT based on ftrace. ;-)
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [patch 2/9] LTTng instrumentation - irq
  2009-03-27 22:53                 ` Steven Rostedt
@ 2009-04-02  2:42                   ` Mathieu Desnoyers
  0 siblings, 0 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2009-04-02  2:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, Christoph Hellwig, Jason Baron,
	Andrew Morton, LKML, ltt-dev, Frederic Weisbecker,
	Thomas Gleixner, Russell King, Masami Hiramatsu,
	Frank Ch. Eigler, Hideo AOKI, Takashi Nishiie,
	Eduard - Gabriel Munteanu

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Thu, 26 Mar 2009, Mathieu Desnoyers wrote:
> 
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > 
> > > On Tue, 24 Mar 2009, Mathieu Desnoyers wrote:
> > > > 
> > > > This is actually a very good example of what Christoph Hellwig, Peter
> > > > Zijlstra and myself have been trying to warn you about the TRACE_EVENT
> > > > macro : it exports the tracepoints to userspace, and thus makes them a
> > > > userspace-visible API, when those tracepoints should be tied to the
> > > > kernel code and nothing else. An adaptation layer should provide the
> > > > abstractions that makes the information presented to the user more
> > > > "logical".
> > > 
> > > Let me correct you here. TRACE_EVENT does ***NOT*** export anything to 
> > > userspace. There is no code what so ever in TRACE_EVENT that does so.
> > > 
> > 
> > Except the comment on top of TRACE_EVENT() in tracepoint.h maybe ?
> > 
> >  *      *
> >  *      * Fast binary tracing: define the trace record via
> >  *      * TP_STRUCT__entry(). You can think about it like a
> >  *      * regular C structure local variable definition.
> >  *      *
> >  *      * This is how the trace record is structured and will
> >  *      * be saved into the ring buffer. These are the fields
> >  *      * that will be exposed to user-space in
> >  *      * /debug/tracing/events/<*>/format.
> > 
> > You don't need to have code within the infrastructure to actually export
> > stuff to userspace. Stating in the API that you some information will be
> > presented to userspace counts.
> 
> That is done by ftrace not TRACE_EVENT.
> 

This comment and API is exposed within the kernel by TRACE_EVENT, and
its content is used by ftrace actually. But anyway, if it gets us
tracepoints into the kernel, and we can find a way that eventually LTTng
could piggyback on that information, I'm for it.

> > 
> > > Now, ftrace does export information using TRACE_EVENT to userspace. But 
> > > that is the way ftrace wants to handle it. There's nothing needed to 
> > > export to userspace. What is exported, is exported ***BECAUSE*** it can 
> > > change. I'll only try to keep the format that is exported the same. But 
> > > nothing should rely on what the format represents staying the same.
> > > 
> > > If someone adds a TRACE_EVENT, you can uses it to record you data, anyway
> > > you like. Ftrace will use it to show how to read the binary data, which
> > > is only needed if you want to do that. It uses the print format to print
> > > to the console in case of failure. Or to the trace file, which by the way
> > > can also change without notice.
> > 
> > Hrm, so you are planning to add, in
> > include/trace/sched_event_types.h, things like :
> > 
> > TRACE_EVENT(sched_kthread_stop,
> > 
> >         TP_PROTO(struct task_struct *t),
> > 
> >         TP_ARGS(t),
> > 
> >         TP_STRUCT__entry(
> >                 __array(        char,   comm,   TASK_COMM_LEN   )
> >                 __field(        pid_t,  pid                     )
> >         ),
> > 
> >         TP_fast_assign(
> >                 memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
> >                 __entry->pid    = t->pid;
> >         ),
> > 
> >         TP_printk("task %s:%d", __entry->comm, __entry->pid)
> > );
> > 
> > Actually specifying that the TP_STRUCT__entry, TP_fast_assign and
> > TP_printk are specific to ftrace and should not be considered as a
> > stable API, am I correct ?
> 
> No they are not specific to ftrace. I said they are used by ftrace.
> They can be modified at will by the maintainer.  The macros makes the 
> updates automatic for ftrace.
> 

Eventually we might want to combine the lttng probes I have and these
TP_STRUCT__entry and TP_fast_assign.

> > 
> > Then only when ftrace is built in the kernel do we have 
> > kernel/trace/events.c including the holy include/trace/trace_events.h,
> > which includes all the trace_events headers, and then including
> > kernel/trace/trace_events_stage_{1,2,3}.h to override the TRACE_EVENT
> > macro and create the callbacks for each TRACE_EVENT statements.
> 
> Currently ftrace is the only user of the include/trace/trace_events.h 
> Others are welcome to do what they wish.
> 
> > 
> > Then big problem with this is that, whether you like it or not, you
> > *are* adding an API to the tracepoints, but just validating the types
> > when ftrace is being built. If you want to add an API to the
> > tracepoints, then the type verification should be done _even when ftrace
> > is disabled_.
> 
> It's an internal API not a user space one. Internal APIs are fine, and can 
> change whenever we want.
> 

Well, I think we should consider the ftrace user-space data consumer as
part of this "internal" api then, event if it crosses the
kernel-userspace boundary. If we agree that this is what need and want
to do, I'm fine with it.

> > 
> > Therefore, the TRACE_EVENT in tracepoint.h should map to macros that
> > would verify the argument types.
> 
> We could do that too.
> 
> > 
> > I think it's ok to specify that the arguments of a given TRACE_EVENT may
> > change without notice. We have to say it explicitly in the TRACE_EVENT
> > header though.
> 
> Sure, I totally agree.
> 
> > 
> > And while we talk about this, I also wonder why we won't simply embed
> > the result of the TP_fast_assign and TP_printk in the tracepoint
> > unlikely branch ? This would skip a function call in the tracing fast
> > path, and would save the overhead of a whole function call when tracing
> > is active. But that implies more bounds between tracepoints and data
> > output, but given you are already declaring this in the same API, I
> > don't see the problem anymore.
> 
> I'm not exactly what you mean here?  You mean to add it at the location in 
> the code? If that is the case, I disagree. Since we don't want to clutter 
> code with trace point data. It is nice to keep it out in separate headers.
> 
> This is what we also think about #ifdefs, it is better to have static 
> inlines in headers that to clutter the code with them.
> 

Yes, leaving the body of the tracing out of the kernel objects as much
as possible is a good thing, because it won't pollute the original
kernel functions too much and will leave the compiler do its
optimisations in peace. So we agree that we should not go as far as to
embed the result of TP_fast_assign and TP_printk into the caller.

> > 
> > About the struct ftrace_raw_##name created in stage 1, I think the
> > padding at the end of the structure is a complete waste of space. You
> > should probably have a look at the ltt-type-serializer.[ch] in LTTng.
> 
> I don't know what padding you are talking about.

If you declare a struct with :

struct {
	void *a;
	char b;
};

the structure is then 2 * sizeof(void *) in size, because the end of the
structure is aligned on the largest type within the structure. In a
standard C program memory layout, this makes perfect sense, because
we often declare arrays of structures and we expect the next item to be
aligned on the largest type. Therefore, the beginning and the end of the
structure are aligned on such largest type.

However, in a trace, given we will typically write a compact header
(lttng event headers are 32-bits) after the structure, it makes no
sense to re-align on sizeof(void*) after the data.

Therefore, in LTTng, I have declared special structures for event
payloads, e.g. :

(include/linux/ltt-type-serializer.h from the LTTng tree)

struct serialize_long_char {
        unsigned long f1;
        unsigned char f2;
        unsigned char end_field[0];
} LTT_ALIGN;

Where LTT_ALIGN maps either to nothing or __attribute__((packed)),
depending if the architecture supports efficient unaligned writes or
not.

The size of the structure can then be obtained with :

#define serialize_sizeof(type)  offsetof(typeof(type), end_field)

I hope this explanation makes it clearer.

> 
> > 
> > I am still unsure that the approach you take with TRACE_EVENT, which I
> > would call "automatically generating code using header and macro
> > tricks", is in the end easier to review than the simple C callback
> > approach I have taken in LTTng, adding the "glue" in the
> > DEFINE_MARKER_TP() macros to connect a specific C callback to the actual
> > tracepoint.
> > 
> > The good thing about your approach is that everyting about a trace event
> > can be declared within the same macro. In LTTng, I have to create probe
> > modules and write stuff like :
> > 
> > void probe_irq_softirq_exit(struct softirq_action *h,
> >         struct softirq_action *softirq_vec);
> > 
> > DEFINE_MARKER_TP(kernel, softirq_exit, irq_softirq_exit,
> >         probe_irq_softirq_exit, "softirq_id #1u%lu");
> > 
> > void probe_irq_softirq_exit(struct softirq_action *h,
> >         struct softirq_action *softirq_vec)
> > {
> >         struct marker *marker;
> >         unsigned char data;
> > 
> >         data = ((unsigned long)h - (unsigned long)softirq_vec) / sizeof(*h);
> > 
> >         marker = &GET_MARKER(kernel, softirq_exit);
> >         ltt_specialized_trace(marker, marker->single.probe_private,
> >                 &data, sizeof(data), sizeof(data));
> > }
> > 
> > by hand, and then the kernel softirq_exit event is shown in
> > debugfs/ltt/markers/kernel/softirq_exit/. But I don't see the big win
> > you get by doing it in TRACE_EVENT, especially if it is not
> > type-verified when ftrace is disabled, compared to adding those
> > callbacks in standard kernel modules, which is a coding-style we have
> > been used to for years.
> 
> Well, I originally had it like you did above. But I found that if I did
> that, the number of events would be greatly limited. Once I added this
> infrastructure, a bunch of events were able to be made quickly.
> 
> As for the type checking, that should be easy to add. Don't need ftrace to 
> do it.
> 

Ideally we should find a way to integrate the new instrumentation so
it's useful for both ftrace and lttng. I have already started to discuss
with the Fujitsu team to ask them if they have time to extend the
LTTng instrumentation following recommendations from Ingo. We plan to
duplicate a bit of work for now, adding the LTTng probes and also the
ftrace format strings, until we can bring the two infrastructures
closer.

I should be able to get back to more serious coding in about 2 months,
after the core of my Ph.D. thesis is finished. I will then be able to
concentrate on LTTng integration as much as I want to. Meanwhile, I am
ready to work with you and the Fujitsu team to gradually improve LTTng
instrumentation, while I switch my main output language from C to
English. :-)

Thanks,

Mathieu

> -- Steve
> 
> 
> > 
> > I haven't seen much automatically generated code around in the kernel
> > tree, maybe there is a good reason ? I can't wait to see the first
> > kernel JIT based on ftrace. ;-)
> > 
> > Mathieu
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2009-04-02  2:47 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-24 15:56 [patch 0/9] LTTng core kernel instrumentation Mathieu Desnoyers
2009-03-24 15:56 ` [patch 1/9] IRQ handle prepare for instrumentation Mathieu Desnoyers
2009-03-24 15:56 ` [patch 2/9] LTTng instrumentation - irq Mathieu Desnoyers
2009-03-24 17:33   ` Jason Baron
2009-03-24 17:50     ` Ingo Molnar
2009-03-24 17:57       ` Jason Baron
2009-03-24 19:12         ` Ingo Molnar
2009-03-24 20:11           ` Mathieu Desnoyers
2009-03-24 20:51             ` Ingo Molnar
2009-03-25  8:47               ` Ingo Molnar
2009-03-25 18:30                 ` Mathieu Desnoyers
2009-03-25  2:00             ` Steven Rostedt
2009-03-26 18:27               ` Mathieu Desnoyers
2009-03-27 22:53                 ` Steven Rostedt
2009-04-02  2:42                   ` Mathieu Desnoyers
2009-03-25  2:09             ` Steven Rostedt
2009-03-26 18:28               ` Mathieu Desnoyers
2009-03-27 19:18           ` Jason Baron
2009-03-24 19:14   ` Ingo Molnar
2009-03-27 22:12   ` Thomas Gleixner
2009-03-24 15:56 ` [patch 3/9] LTTng instrumentation tasklets Mathieu Desnoyers
2009-03-24 17:56   ` Ingo Molnar
2009-03-25 13:52     ` Chetan.Loke
2009-03-25 14:17       ` Peter Zijlstra
2009-03-25 17:37         ` Chetan.Loke
2009-03-25 17:52           ` Steven Rostedt
2009-03-24 15:56 ` [patch 4/9] LTTng instrumentation softirq Mathieu Desnoyers
2009-03-24 18:01   ` Ingo Molnar
2009-03-24 15:56 ` [patch 5/9] LTTng instrumentation scheduler fix task migration Mathieu Desnoyers
2009-03-24 17:53   ` Ingo Molnar
2009-03-24 15:56 ` [patch 6/9] LTTng instrumentation - timer Mathieu Desnoyers
2009-03-24 18:21   ` Ingo Molnar
2009-03-24 19:14     ` Thomas Gleixner
2009-03-24 20:47       ` Ingo Molnar
2009-03-27 22:05         ` Thomas Gleixner
2009-03-24 15:56 ` [patch 7/9] LTTng instrumentation - kernel Mathieu Desnoyers
2009-03-24 18:33   ` Ingo Molnar
2009-03-25  1:13     ` Rusty Russell
2009-03-25  8:40       ` Ingo Molnar
2009-03-25 13:06     ` Frederic Weisbecker
2009-03-24 15:56 ` [patch 8/9] LTTng instrumentation - filemap Mathieu Desnoyers
2009-03-24 15:56   ` Mathieu Desnoyers
2009-03-24 18:39   ` Ingo Molnar
2009-03-24 18:39     ` Ingo Molnar
2009-03-24 15:56 ` [patch 9/9] LTTng instrumentation - swap Mathieu Desnoyers
2009-03-24 15:56   ` Mathieu Desnoyers
2009-03-24 18:51   ` Ingo Molnar
2009-03-24 18:51     ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.